Track external accumulators in tracer instead of using SparkInfo values #10553
Merged: gh-worker-dd-mergequeue-cf854d merged 7 commits into master from charles.yu/djm-0000/fix-spark-plan-metrics on Mar 10, 2026
Commits (7):
86e3185: Track external accumulators in tracer instead of using SparkInfo values (charlesmyu)
3128df8: Create and implement getSum (charlesmyu)
bba18eb: Send summed SQL plan metric values (charlesmyu)
7e4b7de: Limit external accumulators to 5,000 per stage (charlesmyu)
f313fa1: Use compensated sum to limit rounding errors (charlesmyu)
b203263: Cast to Number type instead of Long (charlesmyu)
a18088b: Merge branch 'master' into charles.yu/djm-0000/fix-spark-plan-metrics (charlesmyu)
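Commit f313fa1 above mentions using a compensated sum to limit rounding errors when adding up floating-point metric values. A minimal sketch of the standard technique (Kahan summation) is below; the class and method names are illustrative, not the actual tracer implementation.

```java
// Sketch of compensated (Kahan) summation: a second variable carries the
// low-order bits that a plain double addition would discard, so the error
// stays near one ulp instead of growing with the number of additions.
public final class CompensatedSum {
  private double sum = 0.0;
  private double compensation = 0.0; // running low-order error

  public void add(double value) {
    double y = value - compensation; // correct for the previously lost bits
    double t = sum + y;              // low-order bits of y may be lost here
    compensation = (t - sum) - y;    // recover the bits that were lost
    sum = t;
  }

  public double getSum() {
    return sum;
  }

  public static void main(String[] args) {
    CompensatedSum compensated = new CompensatedSum();
    double naive = 0.0;
    for (int i = 0; i < 10_000_000; i++) {
      compensated.add(0.1);
      naive += 0.1;
    }
    // The compensated total stays very close to 1.0e6; the naive sum drifts.
    System.out.println("compensated: " + compensated.getSum());
    System.out.println("naive:       " + naive);
  }
}
```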
Have you benchmarked to see how much memory overhead this could add if there happened to be 5000 accumulators? Do we know what the expected number of accumulators is typically?
Just trying to get a sense of whether this is a reasonable limit - i.e. does it allow for the most common number of accumulators while safeguarding memory.
(note this is not a blocker to the PR - just interested in where the 5000 limit came from)
No worries, it was good to think through this properly!
Honestly, this was a conservative guesstimate based on the previous limit of 50k accumulators, intended to provide at least some limit so a runaway Spark app can't blow through all the memory.
Evaluating a bit more critically, we could probably increase this, since we discard the SparkAggregatedTaskMetrics object once a stage is completed, as opposed to keeping it in memory for the entire duration of the Spark job. Since Spark should really only be working on a single stage at a time, a limit of 5,000 should be fairly conservative from a memory usage standpoint. It also works in our favor that the previous value was based on an object that should be larger than the AccumulatorV2 we're now tracking.

Some quick napkin math to make sure this is reasonable from a usage standpoint as well: each stage is composed of multiple operations (anecdotally, the most I've seen is 10 operations per stage, so an upper bound of 50 operations should be fairly generous), and we see the most metrics per operation in Spark jobs run by Databricks, which has maybe ~50 metrics per operation in the worst case. That would give 2,500 accumulators to track, which leaves a decent amount of headroom under the 5,000 limit.
A lot of anecdotal numbers unfortunately, but hopefully this is already an improvement on the previous 50k limit.
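The behavior discussed above (cap external accumulators at 5,000 per stage, and release each stage's state on completion so only in-flight stages occupy memory) can be sketched as follows. This is an illustrative sketch, not the tracer's actual code; the class and method names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of per-stage external-accumulator tracking with a
// memory-safety cap, per the PR discussion. Not the actual dd-trace code.
public final class StageAccumulatorTracker {
  // Assumed limit from the PR: 5,000 external accumulators per stage.
  private static final int MAX_ACCUMULATORS_PER_STAGE = 5_000;

  // stageId -> (accumulatorId -> latest reported value)
  private final Map<Integer, Map<Long, Number>> byStage = new HashMap<>();

  public void onAccumulatorUpdate(int stageId, long accumulatorId, Number value) {
    Map<Long, Number> accs = byStage.computeIfAbsent(stageId, k -> new HashMap<>());
    // Ignore new accumulators past the cap to bound memory for runaway jobs;
    // updates to already-tracked accumulators are still applied.
    if (accs.size() >= MAX_ACCUMULATORS_PER_STAGE && !accs.containsKey(accumulatorId)) {
      return;
    }
    accs.put(accumulatorId, value);
  }

  // Called on stage completion: hand back the stage's accumulators and drop
  // them from memory, so only in-flight stages are retained.
  public Map<Long, Number> onStageCompleted(int stageId) {
    Map<Long, Number> done = byStage.remove(stageId);
    return done == null ? Map.of() : done;
  }
}
```

The key design point from the thread is the second method: because state is released when a stage completes, the cap only needs to cover concurrently running stages rather than the whole job.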