Track external accumulators in tracer instead of using SparkInfo values by charlesmyu · Pull Request #10553 · DataDog/dd-trace-java

charlesmyu · 2026-02-09T05:33:49Z

What Does This Do

Updates the metrics in the _dd.spark.sql_plan meta field to use distributions calculated from individual task metrics, rather than the summed metrics provided by Spark's StageInfo objects. StageInfo naively sums all accumulators, even though a sum may not make sense for certain Spark SQL metrics (e.g. avg hash probes per key for aggregate operations). Instead, we accumulate those values ourselves into distribution metrics and emit them accordingly.
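To illustrate why a plain sum can mislead for a metric like avg hash probes per key, here is a minimal sketch that summarizes per-task values as a small distribution next to the naive sum. `TaskMetricSummary` is a hypothetical name used only for illustration; the tracer itself relies on DDSketch-backed histograms rather than sorting raw values.

```java
import java.util.Arrays;

/** Hypothetical illustration: summarize per-task values instead of only summing them. */
final class TaskMetricSummary {
  final double min;
  final double median;
  final double max;
  final double sum;

  private TaskMetricSummary(double min, double median, double max, double sum) {
    this.min = min;
    this.median = median;
    this.max = max;
    this.sum = sum;
  }

  /** Builds a summary from one value per task; the input array is not modified. */
  static TaskMetricSummary of(double[] perTaskValues) {
    if (perTaskValues.length == 0) {
      throw new IllegalArgumentException("at least one task value is required");
    }
    double[] sorted = perTaskValues.clone();
    Arrays.sort(sorted);
    double sum = 0;
    for (double v : sorted) {
      sum += v;
    }
    int n = sorted.length;
    // Middle element for odd counts, mean of the two middle elements for even counts.
    double median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    return new TaskMetricSummary(sorted[0], median, sorted[n - 1], sum);
  }
}
```

With four tasks reporting averages of 1.0, 1.5, 2.0 and 1.5, the sum is 6.0, a value no task observed, while min, median and max stay within the range the tasks actually reported.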

Currently in the UI, this is only used in one place (in the Spark SQL metrics in the DJM product), so we're not too worried about changing the format here. UI update to follow.

If sending traces with a larger number of histograms causes any issues, we can disable this behavior using the DD_SPARK_TASK_HISTOGRAM_ENABLED feature flag.

Motivation

We'd like accurate metrics for Spark SQL operations that can reflect task-level characteristics as a distribution. This brings us more in line with what is shown in the Spark UI.

Additional Notes

We can't get rid of the original map that tracks accumulators to stages as we still use that to associate Spark SQL operations to stages. However, we can avoid storing the entire accumulator now, and instead just store a simple map of accumulator ID to stage ID. This will be done in a followup PR: #10645
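As a rough sketch of the simpler lookup described above, the structure could be as small as a map from accumulator ID to stage ID. The class and method names below are hypothetical and are not taken from #10645.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: remember which stage an accumulator belongs to, not the accumulator itself. */
final class AccumulatorStageIndex {
  private final Map<Long, Integer> stageByAccumulatorId = new HashMap<>();

  /** Associates an accumulator ID with the stage that reported it. */
  void register(int stageId, long accumulatorId) {
    stageByAccumulatorId.put(accumulatorId, stageId);
  }

  /** Returns the stage ID for the accumulator, or null if it was never registered. */
  Integer stageFor(long accumulatorId) {
    return stageByAccumulatorId.get(accumulatorId);
  }

  /** Drops every entry that points at the given stage. */
  void removeStage(int stageId) {
    stageByAccumulatorId.values().removeIf(id -> id == stageId);
  }
}
```

Storing two primitives per entry instead of a full accumulator object is the memory saving this note is after; whether and when entries are removed is an assumption of the sketch.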

Contributor Checklist

Format the title according to the contribution guidelines
Assign the type: and (comp: or inst:) labels in addition to any other useful labels
Avoid using close, fix, or any linking keywords when referencing an issue
Use solves instead, and assign the PR milestone to the issue
Update the CODEOWNERS file on source file addition, migration, or deletion
Update public documentation with any new configuration flags or behaviors

Jira ticket: [PROJ-IDENT]

Note: Once your PR is ready to merge, add it to the merge queue by commenting /merge. /merge -c cancels the queue request. /merge -f --reason "reason" skips all merge queue checks; please use this judiciously, as some checks do not run at the PR-level. For more information, see this doc.

pr-commenter · 2026-02-09T06:22:38Z

Benchmarks

Startup

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	charles.yu/djm-0000/fix-spark-plan-metrics
git_commit_date	1773155932	1773159485
git_commit_sha	`c04d61b`	`a18088b`
release_version	1.61.0-SNAPSHOT~c04d61b318	1.61.0-SNAPSHOT~a18088bb01

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1773161439	1773161439
ci_job_id	1493369015	1493369015
ci_pipeline_id	101664668	101664668
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
kernel_version	Linux runner-zfyrx7zua-project-304-concurrent-0-rkq0r8cz 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux	Linux runner-zfyrx7zua-project-304-concurrent-0-rkq0r8cz 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
module	Agent	Agent
parent	None	None

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 62 metrics, 9 unstable metrics.

Startup time reports for petclinic

gantt
    title petclinic - global startup overhead: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.069 s) : 0, 1068724
Total [baseline] (11.092 s) : 0, 11092471
Agent [candidate] (1.057 s) : 0, 1056871
Total [candidate] (10.974 s) : 0, 10973604
section appsec
Agent [baseline] (1.246 s) : 0, 1245605
Total [baseline] (11.103 s) : 0, 11102971
Agent [candidate] (1.253 s) : 0, 1253216
Total [candidate] (11.177 s) : 0, 11176908
section iast
Agent [baseline] (1.244 s) : 0, 1243515
Total [baseline] (11.365 s) : 0, 11364953
Agent [candidate] (1.229 s) : 0, 1228916
Total [candidate] (11.319 s) : 0, 11318599
section profiling
Agent [baseline] (1.179 s) : 0, 1178655
Total [baseline] (10.949 s) : 0, 10949107
Agent [candidate] (1.179 s) : 0, 1179444
Total [candidate] (10.908 s) : 0, 10907681

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.069 s	-
Agent	appsec	1.246 s	176.882 ms (16.6%)
Agent	iast	1.244 s	174.792 ms (16.4%)
Agent	profiling	1.179 s	109.932 ms (10.3%)
Total	tracing	11.092 s	-
Total	appsec	11.103 s	10.5 ms (0.1%)
Total	iast	11.365 s	272.482 ms (2.5%)
Total	profiling	10.949 s	-143.364 ms (-1.3%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.057 s	-
Agent	appsec	1.253 s	196.345 ms (18.6%)
Agent	iast	1.229 s	172.045 ms (16.3%)
Agent	profiling	1.179 s	122.573 ms (11.6%)
Total	tracing	10.974 s	-
Total	appsec	11.177 s	203.304 ms (1.9%)
Total	iast	11.319 s	344.995 ms (3.1%)
Total	profiling	10.908 s	-65.924 ms (-0.6%)

gantt
    title petclinic - break down per module: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318

    dateFormat X
    axisFormat %s
section tracing
crashtracking [baseline] (1.212 ms) : 0, 1212
crashtracking [candidate] (1.185 ms) : 0, 1185
BytebuddyAgent [baseline] (634.921 ms) : 0, 634921
BytebuddyAgent [candidate] (627.236 ms) : 0, 627236
AgentMeter [baseline] (29.518 ms) : 0, 29518
AgentMeter [candidate] (29.05 ms) : 0, 29050
GlobalTracer [baseline] (259.427 ms) : 0, 259427
GlobalTracer [candidate] (256.238 ms) : 0, 256238
AppSec [baseline] (32.06 ms) : 0, 32060
AppSec [candidate] (31.446 ms) : 0, 31446
Debugger [baseline] (60.234 ms) : 0, 60234
Debugger [candidate] (59.341 ms) : 0, 59341
Remote Config [baseline] (604.793 µs) : 0, 605
Remote Config [candidate] (588.737 µs) : 0, 589
Telemetry [baseline] (8.718 ms) : 0, 8718
Telemetry [candidate] (8.58 ms) : 0, 8580
Flare Poller [baseline] (5.824 ms) : 0, 5824
Flare Poller [candidate] (7.233 ms) : 0, 7233
section appsec
crashtracking [baseline] (1.188 ms) : 0, 1188
crashtracking [candidate] (1.188 ms) : 0, 1188
BytebuddyAgent [baseline] (657.956 ms) : 0, 657956
BytebuddyAgent [candidate] (663.208 ms) : 0, 663208
AgentMeter [baseline] (12.031 ms) : 0, 12031
AgentMeter [candidate] (12.044 ms) : 0, 12044
GlobalTracer [baseline] (257.849 ms) : 0, 257849
GlobalTracer [candidate] (259.082 ms) : 0, 259082
AppSec [baseline] (177.231 ms) : 0, 177231
AppSec [candidate] (177.895 ms) : 0, 177895
Debugger [baseline] (65.742 ms) : 0, 65742
Debugger [candidate] (66.091 ms) : 0, 66091
Remote Config [baseline] (573.424 µs) : 0, 573
Remote Config [candidate] (564.707 µs) : 0, 565
Telemetry [baseline] (9.07 ms) : 0, 9070
Telemetry [candidate] (9.058 ms) : 0, 9058
Flare Poller [baseline] (3.587 ms) : 0, 3587
Flare Poller [candidate] (3.636 ms) : 0, 3636
IAST [baseline] (24.001 ms) : 0, 24001
IAST [candidate] (24.136 ms) : 0, 24136
section iast
crashtracking [baseline] (1.213 ms) : 0, 1213
crashtracking [candidate] (1.183 ms) : 0, 1183
BytebuddyAgent [baseline] (809.277 ms) : 0, 809277
BytebuddyAgent [candidate] (796.242 ms) : 0, 796242
AgentMeter [baseline] (11.875 ms) : 0, 11875
AgentMeter [candidate] (11.327 ms) : 0, 11327
GlobalTracer [baseline] (248.872 ms) : 0, 248872
GlobalTracer [candidate] (248.572 ms) : 0, 248572
AppSec [baseline] (27.415 ms) : 0, 27415
AppSec [candidate] (26.591 ms) : 0, 26591
Debugger [baseline] (62.904 ms) : 0, 62904
Debugger [candidate] (64.146 ms) : 0, 64146
Remote Config [baseline] (528.992 µs) : 0, 529
Remote Config [candidate] (536.244 µs) : 0, 536
Telemetry [baseline] (14.843 ms) : 0, 14843
Telemetry [candidate] (14.279 ms) : 0, 14279
Flare Poller [baseline] (4.892 ms) : 0, 4892
Flare Poller [candidate] (4.695 ms) : 0, 4695
IAST [baseline] (25.385 ms) : 0, 25385
IAST [candidate] (25.348 ms) : 0, 25348
section profiling
ProfilingAgent [baseline] (93.278 ms) : 0, 93278
ProfilingAgent [candidate] (93.724 ms) : 0, 93724
crashtracking [baseline] (1.172 ms) : 0, 1172
crashtracking [candidate] (1.155 ms) : 0, 1155
BytebuddyAgent [baseline] (680.748 ms) : 0, 680748
BytebuddyAgent [candidate] (680.925 ms) : 0, 680925
AgentMeter [baseline] (8.589 ms) : 0, 8589
AgentMeter [candidate] (8.567 ms) : 0, 8567
GlobalTracer [baseline] (215.079 ms) : 0, 215079
GlobalTracer [candidate] (215.074 ms) : 0, 215074
AppSec [baseline] (31.868 ms) : 0, 31868
AppSec [candidate] (31.983 ms) : 0, 31983
Debugger [baseline] (63.471 ms) : 0, 63471
Debugger [candidate] (63.44 ms) : 0, 63440
Remote Config [baseline] (580.388 µs) : 0, 580
Remote Config [candidate] (583.577 µs) : 0, 584
Telemetry [baseline] (8.901 ms) : 0, 8901
Telemetry [candidate] (9.736 ms) : 0, 9736
Flare Poller [baseline] (4.261 ms) : 0, 4261
Flare Poller [candidate] (3.487 ms) : 0, 3487
Profiling [baseline] (93.841 ms) : 0, 93841
Profiling [candidate] (94.283 ms) : 0, 94283

Startup time reports for insecure-bank

gantt
    title insecure-bank - global startup overhead: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.056 s) : 0, 1056448
Total [baseline] (8.83 s) : 0, 8829551
Agent [candidate] (1.059 s) : 0, 1058583
Total [candidate] (8.838 s) : 0, 8837505
section iast
Agent [baseline] (1.224 s) : 0, 1224235
Total [baseline] (9.582 s) : 0, 9581549
Agent [candidate] (1.224 s) : 0, 1224058
Total [candidate] (9.491 s) : 0, 9490599

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.056 s	-
Agent	iast	1.224 s	167.787 ms (15.9%)
Total	tracing	8.83 s	-
Total	iast	9.582 s	751.998 ms (8.5%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.059 s	-
Agent	iast	1.224 s	165.476 ms (15.6%)
Total	tracing	8.838 s	-
Total	iast	9.491 s	653.094 ms (7.4%)

gantt
    title insecure-bank - break down per module: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318

    dateFormat X
    axisFormat %s
section tracing
crashtracking [baseline] (1.191 ms) : 0, 1191
crashtracking [candidate] (1.195 ms) : 0, 1195
BytebuddyAgent [baseline] (626.863 ms) : 0, 626863
BytebuddyAgent [candidate] (627.754 ms) : 0, 627754
AgentMeter [baseline] (28.952 ms) : 0, 28952
AgentMeter [candidate] (29.07 ms) : 0, 29070
GlobalTracer [baseline] (255.539 ms) : 0, 255539
GlobalTracer [candidate] (256.534 ms) : 0, 256534
AppSec [baseline] (31.475 ms) : 0, 31475
AppSec [candidate] (31.501 ms) : 0, 31501
Debugger [baseline] (58.563 ms) : 0, 58563
Debugger [candidate] (58.603 ms) : 0, 58603
Remote Config [baseline] (592.375 µs) : 0, 592
Remote Config [candidate] (597.938 µs) : 0, 598
Telemetry [baseline] (8.633 ms) : 0, 8633
Telemetry [candidate] (8.638 ms) : 0, 8638
Flare Poller [baseline] (8.707 ms) : 0, 8707
Flare Poller [candidate] (8.717 ms) : 0, 8717
section iast
crashtracking [baseline] (1.19 ms) : 0, 1190
crashtracking [candidate] (1.192 ms) : 0, 1192
BytebuddyAgent [baseline] (794.413 ms) : 0, 794413
BytebuddyAgent [candidate] (794.618 ms) : 0, 794618
AgentMeter [baseline] (11.293 ms) : 0, 11293
AgentMeter [candidate] (11.293 ms) : 0, 11293
GlobalTracer [baseline] (246.561 ms) : 0, 246561
GlobalTracer [candidate] (246.698 ms) : 0, 246698
AppSec [baseline] (26.385 ms) : 0, 26385
AppSec [candidate] (26.499 ms) : 0, 26499
Debugger [baseline] (63.153 ms) : 0, 63153
Debugger [candidate] (62.394 ms) : 0, 62394
Remote Config [baseline] (545.376 µs) : 0, 545
Remote Config [candidate] (522.734 µs) : 0, 523
Telemetry [baseline] (14.886 ms) : 0, 14886
Telemetry [candidate] (14.788 ms) : 0, 14788
Flare Poller [baseline] (4.74 ms) : 0, 4740
Flare Poller [candidate] (4.91 ms) : 0, 4910
IAST [baseline] (25.081 ms) : 0, 25081
IAST [candidate] (25.15 ms) : 0, 25150

Load

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	charles.yu/djm-0000/fix-spark-plan-metrics
git_commit_date	1773155932	1773159485
git_commit_sha	`c04d61b`	`a18088b`
release_version	1.61.0-SNAPSHOT~c04d61b318	1.61.0-SNAPSHOT~a18088bb01

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1773161843	1773161843
ci_job_id	1493369016	1493369016
ci_pipeline_id	101664668	101664668
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
kernel_version	Linux runner-zfyrx7zua-project-304-concurrent-1-kezndvbb 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux	Linux runner-zfyrx7zua-project-304-concurrent-1-kezndvbb 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Summary

Found 4 performance improvements and 1 performance regressions! Performance is the same for 13 metrics, 18 unstable metrics.

scenario	Δ mean agg_http_req_duration_p50	Δ mean agg_http_req_duration_p95	Δ mean throughput	candidate mean agg_http_req_duration_p50	candidate mean agg_http_req_duration_p95	candidate mean throughput	baseline mean agg_http_req_duration_p50	baseline mean agg_http_req_duration_p95	baseline mean throughput
scenario:load:insecure-bank:iast_GLOBAL:high_load	better [-203.078µs; -109.973µs] or [-7.016%; -3.799%]	unsure [-590.899µs; -146.072µs] or [-7.319%; -1.809%]	unstable [-90.558op/s; +200.058op/s] or [-7.203%; +15.913%]	2.738ms	7.705ms	1311.938op/s	2.895ms	8.073ms	1257.188op/s
scenario:load:petclinic:appsec:high_load	worse [+0.738ms; +1.454ms] or [+4.060%; +7.995%]	unsure [+0.487ms; +1.763ms] or [+1.622%; +5.868%]	unstable [-36.673op/s; +10.736op/s] or [-14.598%; +4.273%]	19.280ms	31.167ms	238.250op/s	18.184ms	30.042ms	251.219op/s
scenario:load:petclinic:no_agent:high_load	better [-1.912ms; -0.412ms] or [-10.097%; -2.178%]	unsure [-2.959ms; -0.325ms] or [-9.461%; -1.038%]	unstable [-9.997op/s; +43.247op/s] or [-4.154%; +17.971%]	17.774ms	29.629ms	257.281op/s	18.937ms	31.271ms	240.656op/s
scenario:load:petclinic:tracing:high_load	better [-2.278ms; -1.114ms] or [-11.971%; -5.856%]	better [-3.132ms; -1.596ms] or [-10.097%; -5.144%]	unstable [-2.690op/s; +47.440op/s] or [-1.116%; +19.682%]	17.332ms	28.656ms	263.406op/s	19.028ms	31.020ms	241.031op/s

Request duration reports for insecure-bank

gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.209 ms) : 1197, 1221
.   : milestone, 1209,
iast (3.226 ms) : 3185, 3267
.   : milestone, 3226,
iast_FULL (5.859 ms) : 5800, 5918
.   : milestone, 5859,
iast_GLOBAL (3.649 ms) : 3589, 3709
.   : milestone, 3649,
profiling (1.999 ms) : 1980, 2017
.   : milestone, 1999,
tracing (1.792 ms) : 1776, 1807
.   : milestone, 1792,
section candidate
no_agent (1.181 ms) : 1169, 1193
.   : milestone, 1181,
iast (3.3 ms) : 3252, 3348
.   : milestone, 3300,
iast_FULL (5.999 ms) : 5937, 6061
.   : milestone, 5999,
iast_GLOBAL (3.495 ms) : 3436, 3554
.   : milestone, 3495,
profiling (2.094 ms) : 2074, 2115
.   : milestone, 2094,
tracing (1.837 ms) : 1822, 1852
.   : milestone, 1837,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.209 ms [1.197 ms, 1.221 ms]	-
iast	3.226 ms [3.185 ms, 3.267 ms]	2.018 ms (166.9%)
iast_FULL	5.859 ms [5.8 ms, 5.918 ms]	4.65 ms (384.7%)
iast_GLOBAL	3.649 ms [3.589 ms, 3.709 ms]	2.44 ms (201.8%)
profiling	1.999 ms [1.98 ms, 2.017 ms]	790.122 µs (65.4%)
tracing	1.792 ms [1.776 ms, 1.807 ms]	582.72 µs (48.2%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.181 ms [1.169 ms, 1.193 ms]	-
iast	3.3 ms [3.252 ms, 3.348 ms]	2.119 ms (179.4%)
iast_FULL	5.999 ms [5.937 ms, 6.061 ms]	4.818 ms (407.9%)
iast_GLOBAL	3.495 ms [3.436 ms, 3.554 ms]	2.314 ms (195.9%)
profiling	2.094 ms [2.074 ms, 2.115 ms]	913.146 µs (77.3%)
tracing	1.837 ms [1.822 ms, 1.852 ms]	656.228 µs (55.6%)

Request duration reports for petclinic

gantt
    title petclinic - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
    dateFormat X
    axisFormat %s
section baseline
no_agent (19.397 ms) : 19202, 19592
.   : milestone, 19397,
appsec (18.576 ms) : 18387, 18765
.   : milestone, 18576,
code_origins (17.858 ms) : 17680, 18035
.   : milestone, 17858,
iast (17.913 ms) : 17735, 18092
.   : milestone, 17913,
profiling (18.731 ms) : 18544, 18918
.   : milestone, 18731,
tracing (19.366 ms) : 19167, 19566
.   : milestone, 19366,
section candidate
no_agent (18.138 ms) : 17955, 18321
.   : milestone, 18138,
appsec (19.595 ms) : 19394, 19797
.   : milestone, 19595,
code_origins (18.035 ms) : 17854, 18216
.   : milestone, 18035,
iast (18.115 ms) : 17933, 18297
.   : milestone, 18115,
profiling (18.651 ms) : 18467, 18835
.   : milestone, 18651,
tracing (17.714 ms) : 17538, 17889
.   : milestone, 17714,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	19.397 ms [19.202 ms, 19.592 ms]	-
appsec	18.576 ms [18.387 ms, 18.765 ms]	-820.661 µs (-4.2%)
code_origins	17.858 ms [17.68 ms, 18.035 ms]	-1.539 ms (-7.9%)
iast	17.913 ms [17.735 ms, 18.092 ms]	-1.483 ms (-7.6%)
profiling	18.731 ms [18.544 ms, 18.918 ms]	-665.918 µs (-3.4%)
tracing	19.366 ms [19.167 ms, 19.566 ms]	-30.238 µs (-0.2%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	18.138 ms [17.955 ms, 18.321 ms]	-
appsec	19.595 ms [19.394 ms, 19.797 ms]	1.458 ms (8.0%)
code_origins	18.035 ms [17.854 ms, 18.216 ms]	-102.577 µs (-0.6%)
iast	18.115 ms [17.933 ms, 18.297 ms]	-22.706 µs (-0.1%)
profiling	18.651 ms [18.467 ms, 18.835 ms]	513.128 µs (2.8%)
tracing	17.714 ms [17.538 ms, 17.889 ms]	-424.255 µs (-2.3%)

Dacapo

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	charles.yu/djm-0000/fix-spark-plan-metrics
git_commit_date	1773155932	1773159485
git_commit_sha	`c04d61b`	`a18088b`
release_version	1.61.0-SNAPSHOT~c04d61b318	1.61.0-SNAPSHOT~a18088bb01

See matching parameters

	Baseline	Candidate
application	biojava	biojava
ci_job_date	1773161634	1773161634
ci_job_id	1493369017	1493369017
ci_pipeline_id	101664668	101664668
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
kernel_version	Linux runner-zfyrx7zua-project-304-concurrent-1-h8sf4sly 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux	Linux runner-zfyrx7zua-project-304-concurrent-1-h8sf4sly 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 1 unstable metrics.

Execution time for biojava

gantt
    title biojava - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
    dateFormat X
    axisFormat %s
section baseline
no_agent (15.001 s) : 15001000, 15001000
.   : milestone, 15001000,
appsec (14.895 s) : 14895000, 14895000
.   : milestone, 14895000,
iast (18.104 s) : 18104000, 18104000
.   : milestone, 18104000,
iast_GLOBAL (17.891 s) : 17891000, 17891000
.   : milestone, 17891000,
profiling (15.45 s) : 15450000, 15450000
.   : milestone, 15450000,
tracing (15.044 s) : 15044000, 15044000
.   : milestone, 15044000,
section candidate
no_agent (15.696 s) : 15696000, 15696000
.   : milestone, 15696000,
appsec (14.933 s) : 14933000, 14933000
.   : milestone, 14933000,
iast (19.079 s) : 19079000, 19079000
.   : milestone, 19079000,
iast_GLOBAL (17.714 s) : 17714000, 17714000
.   : milestone, 17714000,
profiling (14.775 s) : 14775000, 14775000
.   : milestone, 14775000,
tracing (15.16 s) : 15160000, 15160000
.   : milestone, 15160000,

baseline results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	15.001 s [15.001 s, 15.001 s]	-
appsec	14.895 s [14.895 s, 14.895 s]	-106.0 ms (-0.7%)
iast	18.104 s [18.104 s, 18.104 s]	3.103 s (20.7%)
iast_GLOBAL	17.891 s [17.891 s, 17.891 s]	2.89 s (19.3%)
profiling	15.45 s [15.45 s, 15.45 s]	449.0 ms (3.0%)
tracing	15.044 s [15.044 s, 15.044 s]	43.0 ms (0.3%)

candidate results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	15.696 s [15.696 s, 15.696 s]	-
appsec	14.933 s [14.933 s, 14.933 s]	-763.0 ms (-4.9%)
iast	19.079 s [19.079 s, 19.079 s]	3.383 s (21.6%)
iast_GLOBAL	17.714 s [17.714 s, 17.714 s]	2.018 s (12.9%)
profiling	14.775 s [14.775 s, 14.775 s]	-921.0 ms (-5.9%)
tracing	15.16 s [15.16 s, 15.16 s]	-536.0 ms (-3.4%)

Execution time for tomcat

gantt
    title tomcat - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.472 ms) : 1461, 1484
.   : milestone, 1472,
appsec (3.776 ms) : 3557, 3996
.   : milestone, 3776,
iast (2.253 ms) : 2184, 2322
.   : milestone, 2253,
iast_GLOBAL (2.291 ms) : 2222, 2360
.   : milestone, 2291,
profiling (2.075 ms) : 2021, 2129
.   : milestone, 2075,
tracing (2.067 ms) : 2014, 2120
.   : milestone, 2067,
section candidate
no_agent (1.473 ms) : 1462, 1485
.   : milestone, 1473,
appsec (3.8 ms) : 3580, 4020
.   : milestone, 3800,
iast (2.248 ms) : 2180, 2317
.   : milestone, 2248,
iast_GLOBAL (2.288 ms) : 2219, 2357
.   : milestone, 2288,
profiling (2.108 ms) : 2052, 2165
.   : milestone, 2108,
tracing (2.051 ms) : 1998, 2104
.   : milestone, 2051,

baseline results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	1.472 ms [1.461 ms, 1.484 ms]	-
appsec	3.776 ms [3.557 ms, 3.996 ms]	2.304 ms (156.5%)
iast	2.253 ms [2.184 ms, 2.322 ms]	780.342 µs (53.0%)
iast_GLOBAL	2.291 ms [2.222 ms, 2.36 ms]	819.071 µs (55.6%)
profiling	2.075 ms [2.021 ms, 2.129 ms]	602.894 µs (40.9%)
tracing	2.067 ms [2.014 ms, 2.12 ms]	594.708 µs (40.4%)

candidate results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	1.473 ms [1.462 ms, 1.485 ms]	-
appsec	3.8 ms [3.58 ms, 4.02 ms]	2.327 ms (157.9%)
iast	2.248 ms [2.18 ms, 2.317 ms]	775.143 µs (52.6%)
iast_GLOBAL	2.288 ms [2.219 ms, 2.357 ms]	814.545 µs (55.3%)
profiling	2.108 ms [2.052 ms, 2.165 ms]	635.131 µs (43.1%)
tracing	2.051 ms [1.998 ms, 2.104 ms]	577.608 µs (39.2%)

charlesmyu · 2026-02-24T20:07:05Z

Store accumulator-stage lookups directly #10645
Track external accumulators in tracer instead of using SparkInfo values #10553 👈 (View in Graphite)
master

This stack of pull requests is managed by Graphite. Learn more about stacking.

pawel-big-lebowski

Nice, elegant implementation tackling a complex problem — I only left a small comment.

pawel-big-lebowski · 2026-03-05T12:55:33Z

...rk/spark_2.12/src/main/java/datadog/trace/instrumentation/spark/DatadogSpark212Listener.java

+  private static final MethodHandles methodLoader =
+      new MethodHandles(ClassLoader.getSystemClassLoader());
+  private static final MethodHandle externalAccums =
+      methodLoader.method(TaskMetrics.class, "externalAccums");


Could you provide some documentation on why we need reflection, and which Spark versions support externalAccums/withExternalAccums?

I can't find any good public-facing docs for this (probably since it's an internal API), but it seems like the relevant commit is here: apache/spark@b33a3ee

Somewhere in Spark v3.5.2, there was a change to move from directly accessing externalAccums to using the withExternalAccums pattern. Unfortunately, it seems the change was made to remediate a performance regression, so no backwards compatibility was provided. As a result, we need reflection to figure out which method to use when pulling the accumulators.
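The fallback described here can be sketched with the JDK's own `java.lang.invoke` API. Everything below is a stand-in: `DirectMetrics` and `CallbackMetrics` are hypothetical classes that only mimic the two access styles, their signatures are not Spark's, and the PR uses the tracer's own `MethodHandles` helper with handles resolved once in static fields rather than on every call.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

final class ExternalAccumsLookup {
  /** Stand-in for an older API that exposes the collection directly. */
  public static final class DirectMetrics {
    public List<String> externalAccums() {
      return Arrays.asList("a", "b");
    }
  }

  /** Stand-in for a newer API that only exposes the collection through a callback. */
  public static final class CallbackMetrics {
    public Object withExternalAccums(Function<List<String>, Object> op) {
      return op.apply(Arrays.asList("c"));
    }
  }

  /** Resolves a virtual method, or returns null when this class does not have it. */
  private static MethodHandle findOrNull(Class<?> owner, String name, MethodType type) {
    try {
      return MethodHandles.lookup().findVirtual(owner, name, type);
    } catch (NoSuchMethodException | IllegalAccessException e) {
      return null;
    }
  }

  /** Tries the direct accessor first, then the callback style, else returns an empty list. */
  @SuppressWarnings("unchecked")
  static List<String> readAccums(Object metrics) {
    try {
      MethodHandle direct =
          findOrNull(metrics.getClass(), "externalAccums", MethodType.methodType(List.class));
      if (direct != null) {
        return (List<String>) direct.invoke(metrics);
      }
      MethodHandle callback =
          findOrNull(
              metrics.getClass(),
              "withExternalAccums",
              MethodType.methodType(Object.class, Function.class));
      if (callback != null) {
        // Pass an identity function so the callback hands the collection back to us.
        Function<List<String>, Object> identity = accums -> accums;
        return (List<String>) callback.invoke(metrics, identity);
      }
      return Collections.emptyList();
    } catch (Throwable t) {
      throw new IllegalStateException("Failed to read accumulators", t);
    }
  }
}
```

Returning an empty list when neither method exists keeps the caller working on versions that expose neither accessor; that choice is an assumption of this sketch.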

mcculls · 2026-03-10T10:22:21Z

products/metrics/metrics-lib/src/main/java/datadog/metrics/impl/DDSketchHistogram.java

 /** Wrapper around the DDSketch library so that it can be used in an instrumentation */
 public class DDSketchHistogram implements Histogram {
  private final DDSketch sketch;
+  private double sum;


Should we use a compensated sum to limit rounding errors like in https://github.com/DataDog/sketches-java/blob/master/src/main/java/com/datadoghq/sketch/WithExactSummaryStatistics.java#L24 ?

Ah, good call - added!

f313fa1 (this PR)
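For readers unfamiliar with the term, a compensated sum keeps a second variable that carries the low-order bits lost on each addition. The sketch below is a textbook Kahan summation under a hypothetical class name; it is not the code added in f313fa1.

```java
/** Textbook Kahan summation: tracks the rounding error of each addition and feeds it back in. */
final class CompensatedSum {
  private double sum;
  private double compensation; // running estimate of the error lost so far

  void add(double value) {
    double corrected = value - compensation; // re-apply the error from earlier additions
    double updated = sum + corrected;        // low-order bits of corrected may be lost here
    compensation = (updated - sum) - corrected; // recover what was lost
    sum = updated;
  }

  double value() {
    return sum;
  }
}
```

A case where the difference is visible: starting from 1e16 and adding 1.0 ten times, a plain `double` accumulator stays at 1e16 because each 1.0 is below the rounding granularity at that magnitude, while the compensated version reaches 1e16 + 10.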

...rk/spark_2.13/src/main/java/datadog/trace/instrumentation/spark/DatadogSpark213Listener.java

mcculls · 2026-03-10T10:49:18Z

...ark-common/src/main/java/datadog/trace/instrumentation/spark/SparkAggregatedTaskMetrics.java

+                try {
+                  // As of spark 3.5, all SQL metrics are Long, safeguard if it changes in new
+                  // versions
+                  hist.accept((Long) acc.value());


You could consider casting to Number which would then support all built-in number types:

Suggested change

hist.accept((Long) acc.value());

hist.accept(((Number) acc.value()).doubleValue());

I used doubleValue() to get the value as that's what the histogram API expects - this will automatically map the given number to the double type.

Makes sense!

b203263 (this PR)
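A small sketch of the difference between the two casts, using a hypothetical helper: a cast to `Long` throws `ClassCastException` for an `Integer` or `Double` value, whereas going through `Number` accepts any boxed numeric type. The `OptionalDouble` return for non-numeric values is an assumption of this sketch; the PR wraps the call in a try/catch instead.

```java
import java.util.OptionalDouble;

/** Hypothetical helper: convert an accumulator value of unknown boxed type to a double. */
final class MetricValues {
  private MetricValues() {}

  /** Returns the numeric value as a double, or empty when the value is not a Number. */
  static OptionalDouble asDouble(Object accumulatorValue) {
    if (accumulatorValue instanceof Number) {
      return OptionalDouble.of(((Number) accumulatorValue).doubleValue());
    }
    return OptionalDouble.empty();
  }
}
```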

mcculls · 2026-03-10T10:52:29Z

...ark-common/src/main/java/datadog/trace/instrumentation/spark/SparkAggregatedTaskMetrics.java

 class SparkAggregatedTaskMetrics {
  private static final double HISTOGRAM_RELATIVE_ACCURACY = 1 / 32.0;
  private static final int HISTOGRAM_MAX_NUM_BINS = 512;
+  private static final int MAX_ACCUMULATOR_SIZE = 5000;


Have you benchmarked to see how much memory overhead this could add if there happened to be 5000 accumulators? Do we know what the expected number of accumulators is typically?

Just trying to get a sense of whether this is a reasonable limit - i.e. does it allow for the most common number of accumulators while safeguarding memory.

(note this is not a blocker to the PR - just interested in where the 5000 limit came from)

No worries, it was good to think through this properly!

Honestly, this was a bit of a conservative guesstimate based on the previous limit of 50k accumulators, to provide at least a limit of some sort to prevent any runaway Spark apps from blowing through all the memory.

Evaluating a bit more critically, we could probably increase this since we discard the SparkAggregatedTaskMetrics object once a stage is completed, as opposed to keeping them in memory for the entire duration of the Spark job. Since Spark should really only be working on a single stage at a time, a limit of 5,000 should be fairly conservative from a memory usage standpoint. It is also to our benefit that the previous value was based on an object that should be larger than the AccumulatorV2 we're now using.

Some quick napkin math to make sure this is reasonable from a usage standpoint as well: each stage is composed of multiple operations (speaking anecdotally, the most I've seen is 10 operations per stage, so an upper bound of 50 operations should be fairly generous), and we see the most metrics per operation in Spark jobs run by Databricks, which have maybe ~50 metrics per operation in the worst case. This would give 2,500 accumulators we have to track, which leaves a decent amount of headroom under the 5,000 limit.

A lot of anecdotal numbers unfortunately, but hopefully this is already an improvement on the previous 50k limit.
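To make the limit concrete: 50 operations times roughly 50 metrics each is 2,500 tracked accumulators, half of the 5,000 cap. A per-stage cap of this kind could look like the sketch below. The class name, the count/sum pair standing in for a histogram, and the choice to silently skip new accumulators once the cap is reached are all assumptions, not the PR's actual behavior.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: track at most maxSize distinct accumulators per stage. */
final class BoundedAccumulatorTracker {
  private final int maxSize;
  // Each entry holds {count, sum} as a lightweight stand-in for a real histogram.
  private final Map<Long, double[]> statsById = new HashMap<>();

  BoundedAccumulatorTracker(int maxSize) {
    this.maxSize = maxSize;
  }

  /** Records a value; returns false when the accumulator is new and the cap is already reached. */
  boolean record(long accumulatorId, double value) {
    double[] stats = statsById.get(accumulatorId);
    if (stats == null) {
      if (statsById.size() >= maxSize) {
        return false; // cap reached: skip new accumulators to bound memory
      }
      stats = new double[2];
      statsById.put(accumulatorId, stats);
    }
    stats[0] += 1;
    stats[1] += value;
    return true;
  }

  int size() {
    return statsById.size();
  }
}
```

Already-tracked accumulators keep receiving values after the cap is hit, so the metrics that are reported stay complete for the stage.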

mcculls

Left a few comments - main Q is whether we should use a compensated sum in the histogram, like in https://github.com/DataDog/sketches-java/blob/master/src/main/java/com/datadoghq/sketch/WithExactSummaryStatistics.java#L24 or is it enough to do just a simple sum?

dd-octo-sts · 2026-03-10T18:03:02Z

/merge

gh-worker-devflow-routing-ef8351 · 2026-03-10T18:03:12Z

2026-03-10 18:03:11 UTC ℹ️ Start processing command /merge

2026-03-10 18:03:16 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in master is approximately 1h (p90).

2026-03-10 19:06:41 UTC ℹ️ MergeQueue: This merge request was merged

charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch from 4e5bdc7 to ba09c80 Compare February 9, 2026 14:48

charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch 5 times, most recently from cde7981 to e52fbc5 Compare February 19, 2026 21:41

charlesmyu mentioned this pull request Feb 19, 2026

Store accumulator-stage lookups directly #10645

Merged

charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch from e52fbc5 to e413d1d Compare February 19, 2026 22:02

charlesmyu added inst: apache spark Apache Spark instrumentation type: enhancement Enhancements and improvements labels Feb 19, 2026

charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch 2 times, most recently from 89df516 to 8651527 Compare February 24, 2026 20:06

charlesmyu marked this pull request as ready for review March 4, 2026 15:18

charlesmyu requested review from a team as code owners March 4, 2026 15:18

charlesmyu requested a review from mcculls March 4, 2026 15:18

charlesmyu added 4 commits March 4, 2026 10:19

Track external accumulators in tracer instead of using SparkInfo values

86e3185

Create and implement getSum

3128df8

Send summed SQL plan metric values

bba18eb

Limit external accumulators to 5,000 per stage

7e4b7de

charlesmyu force-pushed the charles.yu/djm-0000/fix-spark-plan-metrics branch from 8651527 to 7e4b7de Compare March 4, 2026 15:37

pawel-big-lebowski approved these changes Mar 5, 2026

mcculls reviewed Mar 10, 2026

mcculls reviewed Mar 10, 2026

mcculls approved these changes Mar 10, 2026

Use compensated sum to limit rounding errors

f313fa1

charlesmyu added 2 commits March 10, 2026 11:01

Cast to Number type instead of Long

b203263

Merge branch 'master' into charles.yu/djm-0000/fix-spark-plan-metrics

a18088b

charlesmyu added this pull request to the merge queue Mar 10, 2026

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 10, 2026

gh-worker-dd-mergequeue-cf854d bot merged commit cc12228 into master Mar 10, 2026
574 checks passed

gh-worker-dd-mergequeue-cf854d bot deleted the charles.yu/djm-0000/fix-spark-plan-metrics branch March 10, 2026 19:06

github-actions bot added this to the 1.61.0 milestone Mar 10, 2026

3 participants
