
ElasticJob 2.0.4 Sharding Infinite Loop Issue Report #2488

@zjx990

Description

Symptoms Observed
In our production environment, we observed the following symptoms:

All shards stopped executing: the 4-shard job halted completely
ZooKeeper operations spike: the ZK operation rate held steady at ~700 high-frequency operations
Normal leader election: election logs appeared only once, and the election succeeded
Healthy application processes: all 4 scheduling machines were running normally
Recoverable by restart: job execution resumes normally after an application restart
Root Cause Analysis
Core Issue
During network failures, sharding transaction failures are silently ignored by the exception-handling mechanism. The sharding state nodes are left behind as "zombie nodes", trapping all non-leader nodes in an infinite waiting loop.

Technical Details
Exception Handling Flaw (Critical code location)

// RegExceptionHandler.java:53-55
private static boolean isIgnoredException(final Throwable cause) {
    return null != cause && (cause instanceof ConnectionLossException
            || cause instanceof NoNodeException
            || cause instanceof NodeExistsException);
}
Transaction Failure Masked (Critical code location)

// JobNodeStorage.java:168-177
public void executeInTransaction(final TransactionExecutionCallback callback) {
    try {
        // ... transaction operations
        curatorTransactionFinal.commit();
    } catch (final Exception ex) {
        RegExceptionHandler.handleException(ex); // network exceptions silently ignored
    }
}
Infinite Wait Loop (Critical code location)

// ShardingService.java:132-138
while (!leaderElectionService.isLeader()
        && (jobNodeStorage.isJobNodeExisted(ShardingNode.NECESSARY)
            || jobNodeStorage.isJobNodeExisted(ShardingNode.PROCESSING))) {
    BlockUtils.waitingShortTime(); // sleeps ~100ms per iteration; loops forever while the nodes remain
}
Failure Chain
Network Jitter → ZK Connection Loss → Sharding Transaction Commit Failure → ConnectionLossException →
Exception Silently Ignored → Sharding State Nodes Remain → Non-leader Nodes Infinite Loop →
Complete Job Execution Halt + High-frequency ZK Operations
Impact Assessment
Business Interruption: distributed jobs stop executing entirely until a manual restart
Resource Waste: CPU is burned in busy-wait loops, and ZooKeeper absorbs a high rate of pointless operations
Operational Cost: manual monitoring and restart intervention are required; there is no automatic recovery
Stability Risk: common failures such as network jitter can trigger the issue, degrading system availability
Reproduction Steps
Start a multi-shard ElasticJob cluster
Disconnect the network connection to ZooKeeper while the leader is performing sharding
Observe the symptoms: sharding execution stops, ZK operations spike, and non-leader nodes show high CPU usage
Verify via restart: the application resumes normal operation after a restart
Proposed Fix Solutions
Short-term Solutions
Improve Exception Handling: transaction operations must not silently ignore ConnectionLossException (see the sketch after this list)
Add Timeout Mechanism: add timeout-based exit logic to the sharding wait loop (also sketched below)
Enhance Logging: raise the log level for ignored network exceptions to WARN
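
A minimal sketch of the first two fixes, reusing the names from the snippets above. RegException is assumed to be the project's existing registry-layer exception wrapper, and MAX_WAIT_MILLIS is a hypothetical constant; treat this as an illustration rather than the actual patch:

// Fix 1 (JobNodeStorage): surface connection loss from transaction commits
// instead of swallowing it, so the sharding attempt fails fast and can be retried.
public void executeInTransaction(final TransactionExecutionCallback callback) {
    try {
        // ... build the transaction via the callback, then commit
        curatorTransactionFinal.commit();
    } catch (final KeeperException.ConnectionLossException ex) {
        throw new RegException(ex); // propagate instead of ignoring
    } catch (final Exception ex) {
        RegExceptionHandler.handleException(ex);
    }
}

// Fix 2 (ShardingService): bound the non-leader wait so zombie state nodes
// cannot trap it forever.
private static final long MAX_WAIT_MILLIS = 60 * 1000L; // hypothetical timeout

long deadline = System.currentTimeMillis() + MAX_WAIT_MILLIS;
while (!leaderElectionService.isLeader()
        && (jobNodeStorage.isJobNodeExisted(ShardingNode.NECESSARY)
            || jobNodeStorage.isJobNodeExisted(ShardingNode.PROCESSING))) {
    if (System.currentTimeMillis() > deadline) {
        // suspected zombie sharding nodes: log at WARN and stop waiting
        break;
    }
    BlockUtils.waitingShortTime();
}

On timeout, a node could additionally delete the stale NECESSARY/PROCESSING nodes or re-trigger sharding; the right recovery action is a design decision beyond this sketch.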
Long-term Solutions
Unify Node Types: make the sharding state nodes ephemeral, so ZooKeeper's session timeout cleans them up automatically (see the sketch after this list)
Add Health Checks: periodically detect and clean up zombie nodes
Improve Monitoring: add metrics for sharding wait time and ZK operation frequency
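
As a hedged illustration of the ephemeral-node idea, using only stock Curator 2.x calls (the connect string and node path below are placeholders, not ElasticJob's actual node layout):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public final class EphemeralShardingFlagDemo {

    public static void main(final String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        // An ephemeral node is deleted by ZooKeeper itself when the creating
        // session expires, so a leader that dies or is partitioned mid-sharding
        // cannot leave a zombie "processing" flag behind.
        client.create()
                .creatingParentsIfNeeded()
                .withMode(CreateMode.EPHEMERAL)
                .forPath("/demo-job/leader/sharding/processing", new byte[0]);
        // ... perform sharding; on success, remove the flag explicitly.
        client.delete().forPath("/demo-job/leader/sharding/processing");
        client.close();
    }
}

The trade-off is that the flag also vanishes on any transient session expiry, so consumers must interpret its disappearance as "retry sharding", not "sharding finished".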
Affected Code Files
elastic-job-common/elastic-job-common-core/src/main/java/com/dangdang/ddframe/job/reg/exception/RegExceptionHandler.java
elastic-job-lite/elastic-job-lite-core/src/main/java/com/dangdang/ddframe/job/lite/internal/storage/JobNodeStorage.java
elastic-job-lite/elastic-job-lite-core/src/main/java/com/dangdang/ddframe/job/lite/internal/sharding/ShardingService.java
Configuration Details
monitorExecution: false (confirmed not an execution monitoring issue; a configuration sketch follows after this list)
Shard Count: 4
Node Count: 4
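
A sketch of a job configuration matching the settings above, using the ElasticJob-Lite 2.x builder API as I recall it (the job name, cron expression, and job class are placeholders; verify against your version):

import com.dangdang.ddframe.job.config.JobCoreConfiguration;
import com.dangdang.ddframe.job.config.simple.SimpleJobConfiguration;
import com.dangdang.ddframe.job.lite.config.LiteJobConfiguration;

public final class JobConfigSketch {

    static LiteJobConfiguration build() {
        JobCoreConfiguration core = JobCoreConfiguration
                .newBuilder("demoJob", "0/10 * * * * ?", 4) // 4 shards, as in this report
                .build();
        SimpleJobConfiguration simple =
                new SimpleJobConfiguration(core, "com.example.DemoJob");
        return LiteJobConfiguration.newBuilder(simple)
                .monitorExecution(false) // matches the reported configuration
                .build();
    }
}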
Environment Information
ElasticJob Version: 2.0.4
ZooKeeper Version: 3.4.6
Curator Version: 2.10.0
Java Version: [Please specify your Java version]
Additional Context
This issue reflects a critical design flaw: the system lacks runtime self-healing and relies on application restarts to recover. The silent exception-handling mechanism masks transaction failures, so problems stay hidden until they cause system-wide impact.

The problem is particularly concerning in production environments, where network instability is common, because it can cause extended service outages that require manual intervention.
