Skip to content

Track external accumulators in tracer instead of using SparkInfo values#10553

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 7 commits intomasterfrom
charles.yu/djm-0000/fix-spark-plan-metrics
Mar 10, 2026
Merged

Track external accumulators in tracer instead of using SparkInfo values#10553
gh-worker-dd-mergequeue-cf854d[bot] merged 7 commits intomasterfrom
charles.yu/djm-0000/fix-spark-plan-metrics

Conversation

@charlesmyu
Copy link
Contributor

@charlesmyu charlesmyu commented Feb 9, 2026

What Does This Do

Updates the metrics in the _dd.spark.sql_plan meta field to use distributions calculated from individual task metrics, rather than the naively summed metrics provided by the StageInfo objects from Spark. This is because StageInfo naively sums all accumulators, even though that may not make sense for certain Spark SQL metrics (e.g. avg hash probes per key for aggr operations). Instead, we should accumulate those ourselves into distribution metrics and emit them accordingly.

Currently in the UI, this is only used in one place (in the Spark SQL metrics in the DJM product), so we're not too worried about changing the format here. UI update to follow.

If any issues arise with sending traces with a larger number of histograms, we can disable it using the DD_SPARK_TASK_HISTOGRAM_ENABLED feature flag.

Motivation

We'd like accurate metrics for Spark SQL operations that can reflect task-level characteristics as a distribution. This brings us more in line with what is shown in the Spark UI:
image

Additional Notes

We can't get rid of the original map that tracks accumulators to stages as we still use that to associate Spark SQL operations to stages. However, we can avoid storing the entire accumulator now, and instead just store a simple map of accumulator ID to stage ID. This will be done in a followup PR: #10645

Contributor Checklist

Jira ticket: [PROJ-IDENT]

Note: Once your PR is ready to merge, add it to the merge queue by commenting /merge. /merge -c cancels the queue request. /merge -f --reason "reason" skips all merge queue checks; please use this judiciously, as some checks do not run at the PR-level. For more information, see this doc.

@pr-commenter
Copy link

pr-commenter bot commented Feb 9, 2026

Benchmarks

Startup

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master charles.yu/djm-0000/fix-spark-plan-metrics
git_commit_date 1773155932 1773159485
git_commit_sha c04d61b a18088b
release_version 1.61.0-SNAPSHOT~c04d61b318 1.61.0-SNAPSHOT~a18088bb01
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1773161439 1773161439
ci_job_id 1493369015 1493369015
ci_pipeline_id 101664668 101664668
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
kernel_version Linux runner-zfyrx7zua-project-304-concurrent-0-rkq0r8cz 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Linux runner-zfyrx7zua-project-304-concurrent-0-rkq0r8cz 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
module Agent Agent
parent None None

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 62 metrics, 9 unstable metrics.

Startup time reports for petclinic
gantt
    title petclinic - global startup overhead: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.069 s) : 0, 1068724
Total [baseline] (11.092 s) : 0, 11092471
Agent [candidate] (1.057 s) : 0, 1056871
Total [candidate] (10.974 s) : 0, 10973604
section appsec
Agent [baseline] (1.246 s) : 0, 1245605
Total [baseline] (11.103 s) : 0, 11102971
Agent [candidate] (1.253 s) : 0, 1253216
Total [candidate] (11.177 s) : 0, 11176908
section iast
Agent [baseline] (1.244 s) : 0, 1243515
Total [baseline] (11.365 s) : 0, 11364953
Agent [candidate] (1.229 s) : 0, 1228916
Total [candidate] (11.319 s) : 0, 11318599
section profiling
Agent [baseline] (1.179 s) : 0, 1178655
Total [baseline] (10.949 s) : 0, 10949107
Agent [candidate] (1.179 s) : 0, 1179444
Total [candidate] (10.908 s) : 0, 10907681
Loading
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.069 s -
Agent appsec 1.246 s 176.882 ms (16.6%)
Agent iast 1.244 s 174.792 ms (16.4%)
Agent profiling 1.179 s 109.932 ms (10.3%)
Total tracing 11.092 s -
Total appsec 11.103 s 10.5 ms (0.1%)
Total iast 11.365 s 272.482 ms (2.5%)
Total profiling 10.949 s -143.364 ms (-1.3%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.057 s -
Agent appsec 1.253 s 196.345 ms (18.6%)
Agent iast 1.229 s 172.045 ms (16.3%)
Agent profiling 1.179 s 122.573 ms (11.6%)
Total tracing 10.974 s -
Total appsec 11.177 s 203.304 ms (1.9%)
Total iast 11.319 s 344.995 ms (3.1%)
Total profiling 10.908 s -65.924 ms (-0.6%)
gantt
    title petclinic - break down per module: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318

    dateFormat X
    axisFormat %s
section tracing
crashtracking [baseline] (1.212 ms) : 0, 1212
crashtracking [candidate] (1.185 ms) : 0, 1185
BytebuddyAgent [baseline] (634.921 ms) : 0, 634921
BytebuddyAgent [candidate] (627.236 ms) : 0, 627236
AgentMeter [baseline] (29.518 ms) : 0, 29518
AgentMeter [candidate] (29.05 ms) : 0, 29050
GlobalTracer [baseline] (259.427 ms) : 0, 259427
GlobalTracer [candidate] (256.238 ms) : 0, 256238
AppSec [baseline] (32.06 ms) : 0, 32060
AppSec [candidate] (31.446 ms) : 0, 31446
Debugger [baseline] (60.234 ms) : 0, 60234
Debugger [candidate] (59.341 ms) : 0, 59341
Remote Config [baseline] (604.793 µs) : 0, 605
Remote Config [candidate] (588.737 µs) : 0, 589
Telemetry [baseline] (8.718 ms) : 0, 8718
Telemetry [candidate] (8.58 ms) : 0, 8580
Flare Poller [baseline] (5.824 ms) : 0, 5824
Flare Poller [candidate] (7.233 ms) : 0, 7233
section appsec
crashtracking [baseline] (1.188 ms) : 0, 1188
crashtracking [candidate] (1.188 ms) : 0, 1188
BytebuddyAgent [baseline] (657.956 ms) : 0, 657956
BytebuddyAgent [candidate] (663.208 ms) : 0, 663208
AgentMeter [baseline] (12.031 ms) : 0, 12031
AgentMeter [candidate] (12.044 ms) : 0, 12044
GlobalTracer [baseline] (257.849 ms) : 0, 257849
GlobalTracer [candidate] (259.082 ms) : 0, 259082
AppSec [baseline] (177.231 ms) : 0, 177231
AppSec [candidate] (177.895 ms) : 0, 177895
Debugger [baseline] (65.742 ms) : 0, 65742
Debugger [candidate] (66.091 ms) : 0, 66091
Remote Config [baseline] (573.424 µs) : 0, 573
Remote Config [candidate] (564.707 µs) : 0, 565
Telemetry [baseline] (9.07 ms) : 0, 9070
Telemetry [candidate] (9.058 ms) : 0, 9058
Flare Poller [baseline] (3.587 ms) : 0, 3587
Flare Poller [candidate] (3.636 ms) : 0, 3636
IAST [baseline] (24.001 ms) : 0, 24001
IAST [candidate] (24.136 ms) : 0, 24136
section iast
crashtracking [baseline] (1.213 ms) : 0, 1213
crashtracking [candidate] (1.183 ms) : 0, 1183
BytebuddyAgent [baseline] (809.277 ms) : 0, 809277
BytebuddyAgent [candidate] (796.242 ms) : 0, 796242
AgentMeter [baseline] (11.875 ms) : 0, 11875
AgentMeter [candidate] (11.327 ms) : 0, 11327
GlobalTracer [baseline] (248.872 ms) : 0, 248872
GlobalTracer [candidate] (248.572 ms) : 0, 248572
AppSec [baseline] (27.415 ms) : 0, 27415
AppSec [candidate] (26.591 ms) : 0, 26591
Debugger [baseline] (62.904 ms) : 0, 62904
Debugger [candidate] (64.146 ms) : 0, 64146
Remote Config [baseline] (528.992 µs) : 0, 529
Remote Config [candidate] (536.244 µs) : 0, 536
Telemetry [baseline] (14.843 ms) : 0, 14843
Telemetry [candidate] (14.279 ms) : 0, 14279
Flare Poller [baseline] (4.892 ms) : 0, 4892
Flare Poller [candidate] (4.695 ms) : 0, 4695
IAST [baseline] (25.385 ms) : 0, 25385
IAST [candidate] (25.348 ms) : 0, 25348
section profiling
ProfilingAgent [baseline] (93.278 ms) : 0, 93278
ProfilingAgent [candidate] (93.724 ms) : 0, 93724
crashtracking [baseline] (1.172 ms) : 0, 1172
crashtracking [candidate] (1.155 ms) : 0, 1155
BytebuddyAgent [baseline] (680.748 ms) : 0, 680748
BytebuddyAgent [candidate] (680.925 ms) : 0, 680925
AgentMeter [baseline] (8.589 ms) : 0, 8589
AgentMeter [candidate] (8.567 ms) : 0, 8567
GlobalTracer [baseline] (215.079 ms) : 0, 215079
GlobalTracer [candidate] (215.074 ms) : 0, 215074
AppSec [baseline] (31.868 ms) : 0, 31868
AppSec [candidate] (31.983 ms) : 0, 31983
Debugger [baseline] (63.471 ms) : 0, 63471
Debugger [candidate] (63.44 ms) : 0, 63440
Remote Config [baseline] (580.388 µs) : 0, 580
Remote Config [candidate] (583.577 µs) : 0, 584
Telemetry [baseline] (8.901 ms) : 0, 8901
Telemetry [candidate] (9.736 ms) : 0, 9736
Flare Poller [baseline] (4.261 ms) : 0, 4261
Flare Poller [candidate] (3.487 ms) : 0, 3487
Profiling [baseline] (93.841 ms) : 0, 93841
Profiling [candidate] (94.283 ms) : 0, 94283
Loading
Startup time reports for insecure-bank
gantt
    title insecure-bank - global startup overhead: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.056 s) : 0, 1056448
Total [baseline] (8.83 s) : 0, 8829551
Agent [candidate] (1.059 s) : 0, 1058583
Total [candidate] (8.838 s) : 0, 8837505
section iast
Agent [baseline] (1.224 s) : 0, 1224235
Total [baseline] (9.582 s) : 0, 9581549
Agent [candidate] (1.224 s) : 0, 1224058
Total [candidate] (9.491 s) : 0, 9490599
Loading
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.056 s -
Agent iast 1.224 s 167.787 ms (15.9%)
Total tracing 8.83 s -
Total iast 9.582 s 751.998 ms (8.5%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.059 s -
Agent iast 1.224 s 165.476 ms (15.6%)
Total tracing 8.838 s -
Total iast 9.491 s 653.094 ms (7.4%)
gantt
    title insecure-bank - break down per module: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318

    dateFormat X
    axisFormat %s
section tracing
crashtracking [baseline] (1.191 ms) : 0, 1191
crashtracking [candidate] (1.195 ms) : 0, 1195
BytebuddyAgent [baseline] (626.863 ms) : 0, 626863
BytebuddyAgent [candidate] (627.754 ms) : 0, 627754
AgentMeter [baseline] (28.952 ms) : 0, 28952
AgentMeter [candidate] (29.07 ms) : 0, 29070
GlobalTracer [baseline] (255.539 ms) : 0, 255539
GlobalTracer [candidate] (256.534 ms) : 0, 256534
AppSec [baseline] (31.475 ms) : 0, 31475
AppSec [candidate] (31.501 ms) : 0, 31501
Debugger [baseline] (58.563 ms) : 0, 58563
Debugger [candidate] (58.603 ms) : 0, 58603
Remote Config [baseline] (592.375 µs) : 0, 592
Remote Config [candidate] (597.938 µs) : 0, 598
Telemetry [baseline] (8.633 ms) : 0, 8633
Telemetry [candidate] (8.638 ms) : 0, 8638
Flare Poller [baseline] (8.707 ms) : 0, 8707
Flare Poller [candidate] (8.717 ms) : 0, 8717
section iast
crashtracking [baseline] (1.19 ms) : 0, 1190
crashtracking [candidate] (1.192 ms) : 0, 1192
BytebuddyAgent [baseline] (794.413 ms) : 0, 794413
BytebuddyAgent [candidate] (794.618 ms) : 0, 794618
AgentMeter [baseline] (11.293 ms) : 0, 11293
AgentMeter [candidate] (11.293 ms) : 0, 11293
GlobalTracer [baseline] (246.561 ms) : 0, 246561
GlobalTracer [candidate] (246.698 ms) : 0, 246698
AppSec [baseline] (26.385 ms) : 0, 26385
AppSec [candidate] (26.499 ms) : 0, 26499
Debugger [baseline] (63.153 ms) : 0, 63153
Debugger [candidate] (62.394 ms) : 0, 62394
Remote Config [baseline] (545.376 µs) : 0, 545
Remote Config [candidate] (522.734 µs) : 0, 523
Telemetry [baseline] (14.886 ms) : 0, 14886
Telemetry [candidate] (14.788 ms) : 0, 14788
Flare Poller [baseline] (4.74 ms) : 0, 4740
Flare Poller [candidate] (4.91 ms) : 0, 4910
IAST [baseline] (25.081 ms) : 0, 25081
IAST [candidate] (25.15 ms) : 0, 25150
Loading

Load

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master charles.yu/djm-0000/fix-spark-plan-metrics
git_commit_date 1773155932 1773159485
git_commit_sha c04d61b a18088b
release_version 1.61.0-SNAPSHOT~c04d61b318 1.61.0-SNAPSHOT~a18088bb01
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1773161843 1773161843
ci_job_id 1493369016 1493369016
ci_pipeline_id 101664668 101664668
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
kernel_version Linux runner-zfyrx7zua-project-304-concurrent-1-kezndvbb 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Linux runner-zfyrx7zua-project-304-concurrent-1-kezndvbb 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Summary

Found 4 performance improvements and 1 performance regressions! Performance is the same for 13 metrics, 18 unstable metrics.

scenario Δ mean agg_http_req_duration_p50 Δ mean agg_http_req_duration_p95 Δ mean throughput candidate mean agg_http_req_duration_p50 candidate mean agg_http_req_duration_p95 candidate mean throughput baseline mean agg_http_req_duration_p50 baseline mean agg_http_req_duration_p95 baseline mean throughput
scenario:load:insecure-bank:iast_GLOBAL:high_load better
[-203.078µs; -109.973µs] or [-7.016%; -3.799%]
unsure
[-590.899µs; -146.072µs] or [-7.319%; -1.809%]
unstable
[-90.558op/s; +200.058op/s] or [-7.203%; +15.913%]
2.738ms 7.705ms 1311.938op/s 2.895ms 8.073ms 1257.188op/s
scenario:load:petclinic:appsec:high_load worse
[+0.738ms; +1.454ms] or [+4.060%; +7.995%]
unsure
[+0.487ms; +1.763ms] or [+1.622%; +5.868%]
unstable
[-36.673op/s; +10.736op/s] or [-14.598%; +4.273%]
19.280ms 31.167ms 238.250op/s 18.184ms 30.042ms 251.219op/s
scenario:load:petclinic:no_agent:high_load better
[-1.912ms; -0.412ms] or [-10.097%; -2.178%]
unsure
[-2.959ms; -0.325ms] or [-9.461%; -1.038%]
unstable
[-9.997op/s; +43.247op/s] or [-4.154%; +17.971%]
17.774ms 29.629ms 257.281op/s 18.937ms 31.271ms 240.656op/s
scenario:load:petclinic:tracing:high_load better
[-2.278ms; -1.114ms] or [-11.971%; -5.856%]
better
[-3.132ms; -1.596ms] or [-10.097%; -5.144%]
unstable
[-2.690op/s; +47.440op/s] or [-1.116%; +19.682%]
17.332ms 28.656ms 263.406op/s 19.028ms 31.020ms 241.031op/s
Request duration reports for insecure-bank
gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.209 ms) : 1197, 1221
.   : milestone, 1209,
iast (3.226 ms) : 3185, 3267
.   : milestone, 3226,
iast_FULL (5.859 ms) : 5800, 5918
.   : milestone, 5859,
iast_GLOBAL (3.649 ms) : 3589, 3709
.   : milestone, 3649,
profiling (1.999 ms) : 1980, 2017
.   : milestone, 1999,
tracing (1.792 ms) : 1776, 1807
.   : milestone, 1792,
section candidate
no_agent (1.181 ms) : 1169, 1193
.   : milestone, 1181,
iast (3.3 ms) : 3252, 3348
.   : milestone, 3300,
iast_FULL (5.999 ms) : 5937, 6061
.   : milestone, 5999,
iast_GLOBAL (3.495 ms) : 3436, 3554
.   : milestone, 3495,
profiling (2.094 ms) : 2074, 2115
.   : milestone, 2094,
tracing (1.837 ms) : 1822, 1852
.   : milestone, 1837,
Loading
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.209 ms [1.197 ms, 1.221 ms] -
iast 3.226 ms [3.185 ms, 3.267 ms] 2.018 ms (166.9%)
iast_FULL 5.859 ms [5.8 ms, 5.918 ms] 4.65 ms (384.7%)
iast_GLOBAL 3.649 ms [3.589 ms, 3.709 ms] 2.44 ms (201.8%)
profiling 1.999 ms [1.98 ms, 2.017 ms] 790.122 µs (65.4%)
tracing 1.792 ms [1.776 ms, 1.807 ms] 582.72 µs (48.2%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.181 ms [1.169 ms, 1.193 ms] -
iast 3.3 ms [3.252 ms, 3.348 ms] 2.119 ms (179.4%)
iast_FULL 5.999 ms [5.937 ms, 6.061 ms] 4.818 ms (407.9%)
iast_GLOBAL 3.495 ms [3.436 ms, 3.554 ms] 2.314 ms (195.9%)
profiling 2.094 ms [2.074 ms, 2.115 ms] 913.146 µs (77.3%)
tracing 1.837 ms [1.822 ms, 1.852 ms] 656.228 µs (55.6%)
Request duration reports for petclinic
gantt
    title petclinic - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
    dateFormat X
    axisFormat %s
section baseline
no_agent (19.397 ms) : 19202, 19592
.   : milestone, 19397,
appsec (18.576 ms) : 18387, 18765
.   : milestone, 18576,
code_origins (17.858 ms) : 17680, 18035
.   : milestone, 17858,
iast (17.913 ms) : 17735, 18092
.   : milestone, 17913,
profiling (18.731 ms) : 18544, 18918
.   : milestone, 18731,
tracing (19.366 ms) : 19167, 19566
.   : milestone, 19366,
section candidate
no_agent (18.138 ms) : 17955, 18321
.   : milestone, 18138,
appsec (19.595 ms) : 19394, 19797
.   : milestone, 19595,
code_origins (18.035 ms) : 17854, 18216
.   : milestone, 18035,
iast (18.115 ms) : 17933, 18297
.   : milestone, 18115,
profiling (18.651 ms) : 18467, 18835
.   : milestone, 18651,
tracing (17.714 ms) : 17538, 17889
.   : milestone, 17714,
Loading
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 19.397 ms [19.202 ms, 19.592 ms] -
appsec 18.576 ms [18.387 ms, 18.765 ms] -820.661 µs (-4.2%)
code_origins 17.858 ms [17.68 ms, 18.035 ms] -1.539 ms (-7.9%)
iast 17.913 ms [17.735 ms, 18.092 ms] -1.483 ms (-7.6%)
profiling 18.731 ms [18.544 ms, 18.918 ms] -665.918 µs (-3.4%)
tracing 19.366 ms [19.167 ms, 19.566 ms] -30.238 µs (-0.2%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 18.138 ms [17.955 ms, 18.321 ms] -
appsec 19.595 ms [19.394 ms, 19.797 ms] 1.458 ms (8.0%)
code_origins 18.035 ms [17.854 ms, 18.216 ms] -102.577 µs (-0.6%)
iast 18.115 ms [17.933 ms, 18.297 ms] -22.706 µs (-0.1%)
profiling 18.651 ms [18.467 ms, 18.835 ms] 513.128 µs (2.8%)
tracing 17.714 ms [17.538 ms, 17.889 ms] -424.255 µs (-2.3%)

Dacapo

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master charles.yu/djm-0000/fix-spark-plan-metrics
git_commit_date 1773155932 1773159485
git_commit_sha c04d61b a18088b
release_version 1.61.0-SNAPSHOT~c04d61b318 1.61.0-SNAPSHOT~a18088bb01
See matching parameters
Baseline Candidate
application biojava biojava
ci_job_date 1773161634 1773161634
ci_job_id 1493369017 1493369017
ci_pipeline_id 101664668 101664668
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
kernel_version Linux runner-zfyrx7zua-project-304-concurrent-1-h8sf4sly 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Linux runner-zfyrx7zua-project-304-concurrent-1-h8sf4sly 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 1 unstable metrics.

Execution time for biojava
gantt
    title biojava - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
    dateFormat X
    axisFormat %s
section baseline
no_agent (15.001 s) : 15001000, 15001000
.   : milestone, 15001000,
appsec (14.895 s) : 14895000, 14895000
.   : milestone, 14895000,
iast (18.104 s) : 18104000, 18104000
.   : milestone, 18104000,
iast_GLOBAL (17.891 s) : 17891000, 17891000
.   : milestone, 17891000,
profiling (15.45 s) : 15450000, 15450000
.   : milestone, 15450000,
tracing (15.044 s) : 15044000, 15044000
.   : milestone, 15044000,
section candidate
no_agent (15.696 s) : 15696000, 15696000
.   : milestone, 15696000,
appsec (14.933 s) : 14933000, 14933000
.   : milestone, 14933000,
iast (19.079 s) : 19079000, 19079000
.   : milestone, 19079000,
iast_GLOBAL (17.714 s) : 17714000, 17714000
.   : milestone, 17714000,
profiling (14.775 s) : 14775000, 14775000
.   : milestone, 14775000,
tracing (15.16 s) : 15160000, 15160000
.   : milestone, 15160000,
Loading
  • baseline results
Variant Execution Time [CI 0.99] Δ no_agent
no_agent 15.001 s [15.001 s, 15.001 s] -
appsec 14.895 s [14.895 s, 14.895 s] -106.0 ms (-0.7%)
iast 18.104 s [18.104 s, 18.104 s] 3.103 s (20.7%)
iast_GLOBAL 17.891 s [17.891 s, 17.891 s] 2.89 s (19.3%)
profiling 15.45 s [15.45 s, 15.45 s] 449.0 ms (3.0%)
tracing 15.044 s [15.044 s, 15.044 s] 43.0 ms (0.3%)
  • candidate results
Variant Execution Time [CI 0.99] Δ no_agent
no_agent 15.696 s [15.696 s, 15.696 s] -
appsec 14.933 s [14.933 s, 14.933 s] -763.0 ms (-4.9%)
iast 19.079 s [19.079 s, 19.079 s] 3.383 s (21.6%)
iast_GLOBAL 17.714 s [17.714 s, 17.714 s] 2.018 s (12.9%)
profiling 14.775 s [14.775 s, 14.775 s] -921.0 ms (-5.9%)
tracing 15.16 s [15.16 s, 15.16 s] -536.0 ms (-3.4%)
Execution time for tomcat
gantt
    title tomcat - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.472 ms) : 1461, 1484
.   : milestone, 1472,
appsec (3.776 ms) : 3557, 3996
.   : milestone, 3776,
iast (2.253 ms) : 2184, 2322
.   : milestone, 2253,
iast_GLOBAL (2.291 ms) : 2222, 2360
.   : milestone, 2291,
profiling (2.075 ms) : 2021, 2129
.   : milestone, 2075,
tracing (2.067 ms) : 2014, 2120
.   : milestone, 2067,
section candidate
no_agent (1.473 ms) : 1462, 1485
.   : milestone, 1473,
appsec (3.8 ms) : 3580, 4020
.   : milestone, 3800,
iast (2.248 ms) : 2180, 2317
.   : milestone, 2248,
iast_GLOBAL (2.288 ms) : 2219, 2357
.   : milestone, 2288,
profiling (2.108 ms) : 2052, 2165
.   : milestone, 2108,
tracing (2.051 ms) : 1998, 2104
.   : milestone, 2051,
Loading
  • baseline results
Variant Execution Time [CI 0.99] Δ no_agent
no_agent 1.472 ms [1.461 ms, 1.484 ms] -
appsec 3.776 ms [3.557 ms, 3.996 ms] 2.304 ms (156.5%)
iast 2.253 ms [2.184 ms, 2.322 ms] 780.342 µs (53.0%)
iast_GLOBAL 2.291 ms [2.222 ms, 2.36 ms] 819.071 µs (55.6%)
profiling 2.075 ms [2.021 ms, 2.129 ms] 602.894 µs (40.9%)
tracing 2.067 ms [2.014 ms, 2.12 ms] 594.708 µs (40.4%)
  • candidate results
Variant Execution Time [CI 0.99] Δ no_agent
no_agent 1.473 ms [1.462 ms, 1.485 ms] -
appsec 3.8 ms [3.58 ms, 4.02 ms] 2.327 ms (157.9%)
iast 2.248 ms [2.18 ms, 2.317 ms] 775.143 µs (52.6%)
iast_GLOBAL 2.288 ms [2.219 ms, 2.357 ms] 814.545 µs (55.3%)
profiling 2.108 ms [2.052 ms, 2.165 ms] 635.131 µs (43.1%)
tracing 2.051 ms [1.998 ms, 2.104 ms] 577.608 µs (39.2%)

@charlesmyu charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch from 4e5bdc7 to ba09c80 Compare February 9, 2026 14:48
@charlesmyu charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch 5 times, most recently from cde7981 to e52fbc5 Compare February 19, 2026 21:41
@charlesmyu charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch from e52fbc5 to e413d1d Compare February 19, 2026 22:02
@charlesmyu charlesmyu added inst: apache spark Apache Spark instrumentation type: enhancement Enhancements and improvements labels Feb 19, 2026
@charlesmyu charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch 2 times, most recently from 89df516 to 8651527 Compare February 24, 2026 20:06
Copy link
Contributor Author

charlesmyu commented Feb 24, 2026

@charlesmyu charlesmyu marked this pull request as ready for review March 4, 2026 15:18
@charlesmyu charlesmyu requested review from a team as code owners March 4, 2026 15:18
@charlesmyu charlesmyu requested a review from mcculls March 4, 2026 15:18
@charlesmyu charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch from 8651527 to 7e4b7de Compare March 4, 2026 15:37
Copy link
Contributor

@pawel-big-lebowski pawel-big-lebowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, elegant implementation tackling a complex problem — I only left a small comment.

private static final MethodHandles methodLoader =
new MethodHandles(ClassLoader.getSystemClassLoader());
private static final MethodHandle externalAccums =
methodLoader.method(TaskMetrics.class, "externalAccums");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you provide some doc on why do we need reflection and which Spark version support externalAccums/withExternalAccums?

Copy link
Contributor Author

@charlesmyu charlesmyu Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any good public-facing docs for this (probably since it's an internal API), but it seems like the relevant commit is here: apache/spark@b33a3ee

Somewhere in Spark v3.5.2, there was a change to move from directly accessing externalAccums to using the withExternalAccums pattern. Unfortunately it seems like it was to remediate a performance regression so there wasn't any backwards compatibility provided with that change, and as a result we need reflection to figure out which method to use when pulling the accumulators.

/** Wrapper around the DDSketch library so that it can be used in an instrumentation */
public class DDSketchHistogram implements Histogram {
private final DDSketch sketch;
private double sum;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@charlesmyu charlesmyu Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good call - added!

f313fa1 (this PR)

try {
// As of spark 3.5, all SQL metrics are Long, safeguard if it changes in new
// versions
hist.accept((Long) acc.value());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could consider casting to Number which would then support all built-in number types:

Suggested change
hist.accept((Long) acc.value());
hist.accept(((Number) acc.value()).doubleValue());

I used doubleValue() to get the value as that's what the histogram API expects - this will automatically map the given number to the double type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense!

b203263 (this PR)

class SparkAggregatedTaskMetrics {
private static final double HISTOGRAM_RELATIVE_ACCURACY = 1 / 32.0;
private static final int HISTOGRAM_MAX_NUM_BINS = 512;
private static final int MAX_ACCUMULATOR_SIZE = 5000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you benchmarked to see how much memory overhead this could add if there happened to be 5000 accumulators? Do we know what the expected number of accumulators is typically?

Just trying to get a sense of whether this is a reasonable limit - i.e. does it allow for the most common number of accumulators while safeguarding memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(note this is not a blocker to the PR - just interested in where the 5000 limit came from)

Copy link
Contributor Author

@charlesmyu charlesmyu Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, it was good to think through this properly!

Honestly, this was a bit of a conservative guesstimate based on the previous limit of 50k accumulators, to provide at least a limit of some sort to prevent any runaway Spark apps from blowing through all the memory.

Evaluating a bit more critically, we could probably increase this since we discard the SparkAggregatedTaskMetrics object once a stage is completed as opposed to keeping them in memory for the entire duration of the Spark job. Since Spark should really only be working on a single stage at a time, a limit of 5,000 should be fairly conservative from a memory usage standpoint. It also in our benefit that the previous value is based on an object that should be larger than the AccumulatorV2 we're now using.

Some quick napkin math to make sure this is reasonable from a usage standpoint as well - each stage is composed of multiple operations (speaking anecdotally the most I've seen is 10 operations per stage, so an upper bound of 50 operations should be fairly generous), and we see the most metrics per operation in Spark jobs run by Databricks, which has maybe ~50 metrics per stage in the worst case? This would give 2,500 accumulators we have to track, which gives us a decent amount of overhead.

A lot of anecdotal numbers unfortunately, but hopefully this is already an improvement on the previous 50k limit.

Copy link
Contributor

@mcculls mcculls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments - main Q is whether we should use a compensated sum in the histogram, like in https://github.com/DataDog/sketches-java/blob/master/src/main/java/com/datadoghq/sketch/WithExactSummaryStatistics.java#L24 or is it enough to do just a simple sum?

@charlesmyu charlesmyu added this pull request to the merge queue Mar 10, 2026
@dd-octo-sts
Copy link
Contributor

dd-octo-sts bot commented Mar 10, 2026

/merge

@gh-worker-devflow-routing-ef8351
Copy link

gh-worker-devflow-routing-ef8351 bot commented Mar 10, 2026

View all feedbacks in Devflow UI.

2026-03-10 18:03:11 UTC ℹ️ Start processing command /merge


2026-03-10 18:03:16 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in master is approximately 1h (p90).


2026-03-10 19:06:41 UTC ℹ️ MergeQueue: This merge request was merged

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 10, 2026
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot merged commit cc12228 into master Mar 10, 2026
574 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot deleted the charles.yu/djm-0000/fix-spark-plan-metrics branch March 10, 2026 19:06
@github-actions github-actions bot added this to the 1.61.0 milestone Mar 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

inst: apache spark Apache Spark instrumentation type: enhancement Enhancements and improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants