feat(core): add BackoffManager for transient error backoff (2/3)#1152

Closed
abueide wants to merge 4 commits into master from feature/tapi-batch-manager

Conversation

@abueide (Contributor) commented Mar 6, 2026

Summary

  • Adds BackoffManager for global transient error backoff (5xx, 408, 410, 460)
  • Adds BackoffConfig, BackoffStateData types to types.ts
  • Expands HttpConfig to include backoffConfig
  • Adds validateBackoffConfig in config-validation.ts with SDD-specified bounds
  • Core test suite (12 tests)

Merge order

Depends on #1153. Rebase onto master after #1153 lands.

  1. feat(core): add UploadStateMachine for rate limiting (1/3) #1153 — UploadStateMachine + rate limit types
  2. feat(core): add BackoffManager for transient error backoff (2/3) #1152 (this) — BackoffManager + backoff types
  3. feat(core): add error classification and default HTTP config (3/3) #1150 — Error classification + defaultHttpConfig

Design decision: Global vs per-batch backoff

The SDD assumes file-based batches with stable identifiers ({writeKey}-{fileIndex}). The RN SDK uses a queue model where chunk() in SegmentDestination.sendEvents() produces new arrays each flush — there is no stable batch identity across flushes.

Solution: BackoffManager tracks global transient error state instead of per-batch metadata. Same exponential backoff formula, but state is global. This is equivalent in practice because during a TAPI outage all batches fail anyway. Within a single flush, all batches are still attempted sequentially per the SDD.
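
To make the global gating concrete, here is a rough sketch of how a flush could consult a single backoff object while still attempting every batch in order. It is not the code in this PR: `TransientBackoff`, `isTransient`, `flushBatches`, and the `sendBatch` callback are hypothetical names, and resetting after each successful batch is an assumption.

```typescript
// Hypothetical minimal interface, mirroring the method names listed in this PR.
interface TransientBackoff {
  canRetry(): boolean;
  handleTransientError(statusCode: number): void;
  reset(): void;
}

// Status codes treated as transient, per the summary (5xx, 408, 410, 460).
function isTransient(status: number): boolean {
  return (
    (status >= 500 && status < 600) ||
    status === 408 ||
    status === 410 ||
    status === 460
  );
}

// Attempts every batch sequentially. A transient failure updates the global
// backoff state but does not stop the remaining batches in the same flush.
// Returns the number of batches that were accepted (2xx).
async function flushBatches<T>(
  batches: T[][],
  backoff: TransientBackoff,
  sendBatch: (batch: T[]) => Promise<number> // hypothetical: resolves to the HTTP status
): Promise<number> {
  if (!backoff.canRetry()) {
    return 0; // still backing off: skip this flush entirely
  }
  let sent = 0;
  for (const batch of batches) {
    const status = await sendBatch(batch);
    if (isTransient(status)) {
      backoff.handleTransientError(status);
    } else if (status >= 200 && status < 300) {
      backoff.reset(); // assumption: any success clears the global state
      sent += 1;
    }
    // Other statuses (for example 400 or 429) are out of scope for this sketch.
  }
  return sent;
}
```

Because the gate is checked once per flush rather than per batch, no batch identifier is needed, which is the point of the design decision above.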

Components

BackoffManager (148 lines)

Global transient error backoff manager.

Methods:

  • canRetry() — gates the flush (like UploadStateMachine's canUpload())
  • handleTransientError(statusCode) — sets exponential backoff
  • reset() — clears on success
  • getRetryCount() — returns current count for X-Retry-Count header

States: READY | BACKING_OFF
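
A minimal sketch of the two-state machine behind these methods, for illustration only. The class name `GlobalBackoffSketch`, its constructor parameters, and the injectable clock are assumptions; jitter, the max retry/duration limits, and persistence are omitted, so this is not the 148-line implementation in this PR.

```typescript
type BackoffState = 'READY' | 'BACKING_OFF';

class GlobalBackoffSketch {
  private state: BackoffState = 'READY';
  private retryCount = 0;
  private nextRetryTimeMs = 0;
  private readonly baseSeconds: number;
  private readonly maxSeconds: number;
  private readonly now: () => number;

  // `now` is injectable so the clock can be controlled in tests.
  constructor(
    baseSeconds: number,
    maxSeconds: number,
    now: () => number = () => Date.now()
  ) {
    this.baseSeconds = baseSeconds;
    this.maxSeconds = maxSeconds;
    this.now = now;
  }

  // Gates the flush: true when READY, or when the backoff window has elapsed.
  canRetry(): boolean {
    if (this.state === 'BACKING_OFF' && this.now() >= this.nextRetryTimeMs) {
      this.state = 'READY'; // retryCount is kept until reset()
    }
    return this.state === 'READY';
  }

  // The delay uses the count before incrementing, so the first failure
  // waits exactly baseSeconds. The status code is unused in this sketch.
  handleTransientError(_statusCode: number): void {
    const delaySeconds = Math.min(
      this.baseSeconds * Math.pow(2, this.retryCount),
      this.maxSeconds
    );
    this.retryCount += 1;
    this.nextRetryTimeMs = this.now() + delaySeconds * 1000;
    this.state = 'BACKING_OFF';
  }

  // Clears all backoff state after a successful upload.
  reset(): void {
    this.state = 'READY';
    this.retryCount = 0;
    this.nextRetryTimeMs = 0;
  }

  getRetryCount(): number {
    return this.retryCount;
  }
}
```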

Backoff formula: min(baseBackoffInterval * 2^retryCount, maxBackoffInterval) + jitter
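
The formula can be sketched as a pure function. The field names follow the bounds listed in this PR, but the `BackoffParams` shape, the function name, the 300s cap used in the example, and the reading of jitter as an additive amount between 0 and `jitterPercent` percent of the capped delay are all assumptions rather than the PR's actual code.

```typescript
// Hypothetical parameter shape; the real BackoffConfig may differ.
interface BackoffParams {
  baseBackoffInterval: number; // seconds
  maxBackoffInterval: number; // seconds
  jitterPercent: number; // 0 to 100
}

// min(base * 2^retryCount, max) + jitter, in seconds.
// `random` is injectable so the result is deterministic in tests.
function computeBackoffSeconds(
  retryCount: number,
  params: BackoffParams,
  random: () => number = Math.random
): number {
  const exponential = params.baseBackoffInterval * Math.pow(2, retryCount);
  const capped = Math.min(exponential, params.maxBackoffInterval);
  // Assumption: jitter adds between 0 and jitterPercent% of the capped delay.
  const jitter = capped * (params.jitterPercent / 100) * random();
  return capped + jitter;
}

// Example values (assumed): 0.5s base, 300s cap, jitter disabled.
const params: BackoffParams = {
  baseBackoffInterval: 0.5,
  maxBackoffInterval: 300,
  jitterPercent: 0,
};

// retryCount 0..6 yields 0.5, 1, 2, 4, 8, 16, 32 seconds.
const progression = [0, 1, 2, 3, 4, 5, 6].map((n) =>
  computeBackoffSeconds(n, params)
);
```

Note that the exponent is the retry count before the current failure is counted, so the first failure waits exactly the base interval.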

Config Validation (61 lines added)

Validates and clamps backoff configuration to safe ranges:

  • maxBackoffInterval: 0.1s – 86,400s
  • baseBackoffInterval: 0.1s – 300s
  • maxTotalBackoffDuration: 60s – 604,800s
  • jitterPercent: 0 – 100
  • maxRetryCount: 1 – 100
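
As an illustration of the clamping, a sketch using the bounds above. `BOUNDS`, `BoundedFields`, and `clampBackoffConfig` are hypothetical names, not the `validateBackoffConfig` added in this PR, which may also log or handle non-numeric input (this sketch does neither).

```typescript
// [min, max] per field, copied from the list above (intervals in seconds).
const BOUNDS = {
  maxBackoffInterval: [0.1, 86400],
  baseBackoffInterval: [0.1, 300],
  maxTotalBackoffDuration: [60, 604800],
  jitterPercent: [0, 100],
  maxRetryCount: [1, 100],
} as const;

// Hypothetical config shape: one numeric field per bounded setting.
type BoundedFields = { [K in keyof typeof BOUNDS]: number };

// Returns a copy of the config with every field clamped into its range.
function clampBackoffConfig(config: BoundedFields): BoundedFields {
  const result: BoundedFields = { ...config };
  for (const key of Object.keys(BOUNDS) as Array<keyof typeof BOUNDS>) {
    const [min, max] = BOUNDS[key];
    result[key] = Math.min(Math.max(config[key], min), max);
  }
  return result;
}
```

Clamping rather than rejecting means a malformed remote config degrades to the nearest safe value instead of disabling uploads.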

Type Definitions (20 lines added to types.ts)

  • BackoffConfig — backoff settings from Settings CDN
  • BackoffStateData — backoff state persistence shape
  • Expanded HttpConfig to include backoffConfig

Test plan

  • 12 core tests covering canRetry gate, exponential progression, backoff cap, jitter, max retry/duration limits, disabled config, reset
  • devbox run test-unit passes
  • No unrelated diffs

abueide added a commit that referenced this pull request Mar 6, 2026
Adds comprehensive extended test suites for edge cases and implementation
details that supplement the core tests in the feature PRs:

- UploadStateMachine.extended.test.ts: disabled config, multiple retries,
  getter tests, persistence tests
- BatchUploadManager.extended.test.ts: unique IDs, disabled config, getter
  tests, detailed exponential backoff algorithm tests, persistence tests
- SegmentDestination.extended.test.ts: state reset after success, legacy
  behavior, Retry-After header parsing

These tests provide thorough coverage of edge cases, backwards compatibility,
and implementation details without adding bulk to the core feature PRs.

Related to PRs: #1150, #1151, #1152, #1153

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

abueide and others added 2 commits March 6, 2026 16:13

Add the UploadStateMachine component for managing global rate limiting
state for 429 responses, along with supporting types and config validation.

Components:
- RateLimitConfig, UploadStateData, HttpConfig types
- validateRateLimitConfig with SDD-specified bounds
- UploadStateMachine with canUpload/handle429/reset/getGlobalRetryCount
- Core test suite (10 tests) and test helpers
- backoff/index.ts barrel export

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add global transient error backoff manager that replaces the per-batch
BatchUploadManager. Uses same exponential backoff formula from the SDD
but tracks state globally rather than per-batch, which aligns with the
RN SDK's queue model where batch identities are ephemeral.

Components:
- BackoffConfig, BackoffStateData types
- Expanded HttpConfig to include backoffConfig
- validateBackoffConfig with SDD-specified bounds
- BackoffManager with canRetry/handleTransientError/reset/getRetryCount
- Core test suite (12 tests)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

abueide force-pushed the feature/tapi-batch-manager branch from fdee650 to 70679d7 on March 6, 2026 22:16
abueide changed the title from "feat(core): add BatchUploadManager with exponential backoff (1b/4)" to "feat(core): add BackoffManager for transient error backoff (2/3)" on Mar 6, 2026
…Manager

Improvements to BackoffManager:
- Add comprehensive JSDoc comments for all public methods
- Add logging when transitioning from BACKING_OFF to READY
- Fix template literal expression errors with unknown types
- Add edge case tests for multiple status codes and long durations
- Fix linting issues (unsafe assignments in mocks)

All 14 tests pass. Ready for review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

CRITICAL FIX: Fixed off-by-one error in exponential backoff calculation
- Was using newRetryCount (retry 1 → 0.5s × 2^1 = 1s), now uses state.retryCount (retry 1 → 0.5s × 2^0 = 0.5s)
- Now correctly implements SDD progression: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s

DOCUMENTATION: Added design rationale for global vs per-batch backoff
- Documented deviation from SDD's per-batch approach
- Explained RN SDK architecture constraints (no stable batch identities)
- Provided rationale: equivalent in practice during TAPI outages

TESTS: Updated all tests to verify SDD-compliant behavior
- Fixed exponential progression test (was testing wrong values)
- Added comprehensive SDD formula validation test (7 retry progression)
- Fixed jitter test to match new 0.5s base delay
- All 15 tests pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@abueide (Contributor, Author) commented Mar 10, 2026

Superseded by #1154

abueide closed this on Mar 10, 2026
abueide deleted the feature/tapi-batch-manager branch on March 12, 2026 14:42