Track external accumulators in tracer instead of using SparkInfo values#10553
Conversation
BenchmarksStartupParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 62 metrics, 9 unstable metrics. Startup time reports for petclinicgantt
title petclinic - global startup overhead: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.069 s) : 0, 1068724
Total [baseline] (11.092 s) : 0, 11092471
Agent [candidate] (1.057 s) : 0, 1056871
Total [candidate] (10.974 s) : 0, 10973604
section appsec
Agent [baseline] (1.246 s) : 0, 1245605
Total [baseline] (11.103 s) : 0, 11102971
Agent [candidate] (1.253 s) : 0, 1253216
Total [candidate] (11.177 s) : 0, 11176908
section iast
Agent [baseline] (1.244 s) : 0, 1243515
Total [baseline] (11.365 s) : 0, 11364953
Agent [candidate] (1.229 s) : 0, 1228916
Total [candidate] (11.319 s) : 0, 11318599
section profiling
Agent [baseline] (1.179 s) : 0, 1178655
Total [baseline] (10.949 s) : 0, 10949107
Agent [candidate] (1.179 s) : 0, 1179444
Total [candidate] (10.908 s) : 0, 10907681
gantt
title petclinic - break down per module: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.212 ms) : 0, 1212
crashtracking [candidate] (1.185 ms) : 0, 1185
BytebuddyAgent [baseline] (634.921 ms) : 0, 634921
BytebuddyAgent [candidate] (627.236 ms) : 0, 627236
AgentMeter [baseline] (29.518 ms) : 0, 29518
AgentMeter [candidate] (29.05 ms) : 0, 29050
GlobalTracer [baseline] (259.427 ms) : 0, 259427
GlobalTracer [candidate] (256.238 ms) : 0, 256238
AppSec [baseline] (32.06 ms) : 0, 32060
AppSec [candidate] (31.446 ms) : 0, 31446
Debugger [baseline] (60.234 ms) : 0, 60234
Debugger [candidate] (59.341 ms) : 0, 59341
Remote Config [baseline] (604.793 µs) : 0, 605
Remote Config [candidate] (588.737 µs) : 0, 589
Telemetry [baseline] (8.718 ms) : 0, 8718
Telemetry [candidate] (8.58 ms) : 0, 8580
Flare Poller [baseline] (5.824 ms) : 0, 5824
Flare Poller [candidate] (7.233 ms) : 0, 7233
section appsec
crashtracking [baseline] (1.188 ms) : 0, 1188
crashtracking [candidate] (1.188 ms) : 0, 1188
BytebuddyAgent [baseline] (657.956 ms) : 0, 657956
BytebuddyAgent [candidate] (663.208 ms) : 0, 663208
AgentMeter [baseline] (12.031 ms) : 0, 12031
AgentMeter [candidate] (12.044 ms) : 0, 12044
GlobalTracer [baseline] (257.849 ms) : 0, 257849
GlobalTracer [candidate] (259.082 ms) : 0, 259082
AppSec [baseline] (177.231 ms) : 0, 177231
AppSec [candidate] (177.895 ms) : 0, 177895
Debugger [baseline] (65.742 ms) : 0, 65742
Debugger [candidate] (66.091 ms) : 0, 66091
Remote Config [baseline] (573.424 µs) : 0, 573
Remote Config [candidate] (564.707 µs) : 0, 565
Telemetry [baseline] (9.07 ms) : 0, 9070
Telemetry [candidate] (9.058 ms) : 0, 9058
Flare Poller [baseline] (3.587 ms) : 0, 3587
Flare Poller [candidate] (3.636 ms) : 0, 3636
IAST [baseline] (24.001 ms) : 0, 24001
IAST [candidate] (24.136 ms) : 0, 24136
section iast
crashtracking [baseline] (1.213 ms) : 0, 1213
crashtracking [candidate] (1.183 ms) : 0, 1183
BytebuddyAgent [baseline] (809.277 ms) : 0, 809277
BytebuddyAgent [candidate] (796.242 ms) : 0, 796242
AgentMeter [baseline] (11.875 ms) : 0, 11875
AgentMeter [candidate] (11.327 ms) : 0, 11327
GlobalTracer [baseline] (248.872 ms) : 0, 248872
GlobalTracer [candidate] (248.572 ms) : 0, 248572
AppSec [baseline] (27.415 ms) : 0, 27415
AppSec [candidate] (26.591 ms) : 0, 26591
Debugger [baseline] (62.904 ms) : 0, 62904
Debugger [candidate] (64.146 ms) : 0, 64146
Remote Config [baseline] (528.992 µs) : 0, 529
Remote Config [candidate] (536.244 µs) : 0, 536
Telemetry [baseline] (14.843 ms) : 0, 14843
Telemetry [candidate] (14.279 ms) : 0, 14279
Flare Poller [baseline] (4.892 ms) : 0, 4892
Flare Poller [candidate] (4.695 ms) : 0, 4695
IAST [baseline] (25.385 ms) : 0, 25385
IAST [candidate] (25.348 ms) : 0, 25348
section profiling
ProfilingAgent [baseline] (93.278 ms) : 0, 93278
ProfilingAgent [candidate] (93.724 ms) : 0, 93724
crashtracking [baseline] (1.172 ms) : 0, 1172
crashtracking [candidate] (1.155 ms) : 0, 1155
BytebuddyAgent [baseline] (680.748 ms) : 0, 680748
BytebuddyAgent [candidate] (680.925 ms) : 0, 680925
AgentMeter [baseline] (8.589 ms) : 0, 8589
AgentMeter [candidate] (8.567 ms) : 0, 8567
GlobalTracer [baseline] (215.079 ms) : 0, 215079
GlobalTracer [candidate] (215.074 ms) : 0, 215074
AppSec [baseline] (31.868 ms) : 0, 31868
AppSec [candidate] (31.983 ms) : 0, 31983
Debugger [baseline] (63.471 ms) : 0, 63471
Debugger [candidate] (63.44 ms) : 0, 63440
Remote Config [baseline] (580.388 µs) : 0, 580
Remote Config [candidate] (583.577 µs) : 0, 584
Telemetry [baseline] (8.901 ms) : 0, 8901
Telemetry [candidate] (9.736 ms) : 0, 9736
Flare Poller [baseline] (4.261 ms) : 0, 4261
Flare Poller [candidate] (3.487 ms) : 0, 3487
Profiling [baseline] (93.841 ms) : 0, 93841
Profiling [candidate] (94.283 ms) : 0, 94283
Startup time reports for insecure-bankgantt
title insecure-bank - global startup overhead: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.056 s) : 0, 1056448
Total [baseline] (8.83 s) : 0, 8829551
Agent [candidate] (1.059 s) : 0, 1058583
Total [candidate] (8.838 s) : 0, 8837505
section iast
Agent [baseline] (1.224 s) : 0, 1224235
Total [baseline] (9.582 s) : 0, 9581549
Agent [candidate] (1.224 s) : 0, 1224058
Total [candidate] (9.491 s) : 0, 9490599
gantt
title insecure-bank - break down per module: candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.191 ms) : 0, 1191
crashtracking [candidate] (1.195 ms) : 0, 1195
BytebuddyAgent [baseline] (626.863 ms) : 0, 626863
BytebuddyAgent [candidate] (627.754 ms) : 0, 627754
AgentMeter [baseline] (28.952 ms) : 0, 28952
AgentMeter [candidate] (29.07 ms) : 0, 29070
GlobalTracer [baseline] (255.539 ms) : 0, 255539
GlobalTracer [candidate] (256.534 ms) : 0, 256534
AppSec [baseline] (31.475 ms) : 0, 31475
AppSec [candidate] (31.501 ms) : 0, 31501
Debugger [baseline] (58.563 ms) : 0, 58563
Debugger [candidate] (58.603 ms) : 0, 58603
Remote Config [baseline] (592.375 µs) : 0, 592
Remote Config [candidate] (597.938 µs) : 0, 598
Telemetry [baseline] (8.633 ms) : 0, 8633
Telemetry [candidate] (8.638 ms) : 0, 8638
Flare Poller [baseline] (8.707 ms) : 0, 8707
Flare Poller [candidate] (8.717 ms) : 0, 8717
section iast
crashtracking [baseline] (1.19 ms) : 0, 1190
crashtracking [candidate] (1.192 ms) : 0, 1192
BytebuddyAgent [baseline] (794.413 ms) : 0, 794413
BytebuddyAgent [candidate] (794.618 ms) : 0, 794618
AgentMeter [baseline] (11.293 ms) : 0, 11293
AgentMeter [candidate] (11.293 ms) : 0, 11293
GlobalTracer [baseline] (246.561 ms) : 0, 246561
GlobalTracer [candidate] (246.698 ms) : 0, 246698
AppSec [baseline] (26.385 ms) : 0, 26385
AppSec [candidate] (26.499 ms) : 0, 26499
Debugger [baseline] (63.153 ms) : 0, 63153
Debugger [candidate] (62.394 ms) : 0, 62394
Remote Config [baseline] (545.376 µs) : 0, 545
Remote Config [candidate] (522.734 µs) : 0, 523
Telemetry [baseline] (14.886 ms) : 0, 14886
Telemetry [candidate] (14.788 ms) : 0, 14788
Flare Poller [baseline] (4.74 ms) : 0, 4740
Flare Poller [candidate] (4.91 ms) : 0, 4910
IAST [baseline] (25.081 ms) : 0, 25081
IAST [candidate] (25.15 ms) : 0, 25150
LoadParameters
See matching parameters
SummaryFound 4 performance improvements and 1 performance regressions! Performance is the same for 13 metrics, 18 unstable metrics.
Request duration reports for insecure-bankgantt
title insecure-bank - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
dateFormat X
axisFormat %s
section baseline
no_agent (1.209 ms) : 1197, 1221
. : milestone, 1209,
iast (3.226 ms) : 3185, 3267
. : milestone, 3226,
iast_FULL (5.859 ms) : 5800, 5918
. : milestone, 5859,
iast_GLOBAL (3.649 ms) : 3589, 3709
. : milestone, 3649,
profiling (1.999 ms) : 1980, 2017
. : milestone, 1999,
tracing (1.792 ms) : 1776, 1807
. : milestone, 1792,
section candidate
no_agent (1.181 ms) : 1169, 1193
. : milestone, 1181,
iast (3.3 ms) : 3252, 3348
. : milestone, 3300,
iast_FULL (5.999 ms) : 5937, 6061
. : milestone, 5999,
iast_GLOBAL (3.495 ms) : 3436, 3554
. : milestone, 3495,
profiling (2.094 ms) : 2074, 2115
. : milestone, 2094,
tracing (1.837 ms) : 1822, 1852
. : milestone, 1837,
Request duration reports for petclinicgantt
title petclinic - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
dateFormat X
axisFormat %s
section baseline
no_agent (19.397 ms) : 19202, 19592
. : milestone, 19397,
appsec (18.576 ms) : 18387, 18765
. : milestone, 18576,
code_origins (17.858 ms) : 17680, 18035
. : milestone, 17858,
iast (17.913 ms) : 17735, 18092
. : milestone, 17913,
profiling (18.731 ms) : 18544, 18918
. : milestone, 18731,
tracing (19.366 ms) : 19167, 19566
. : milestone, 19366,
section candidate
no_agent (18.138 ms) : 17955, 18321
. : milestone, 18138,
appsec (19.595 ms) : 19394, 19797
. : milestone, 19595,
code_origins (18.035 ms) : 17854, 18216
. : milestone, 18035,
iast (18.115 ms) : 17933, 18297
. : milestone, 18115,
profiling (18.651 ms) : 18467, 18835
. : milestone, 18651,
tracing (17.714 ms) : 17538, 17889
. : milestone, 17714,
DacapoParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 1 unstable metrics. Execution time for biojavagantt
title biojava - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
dateFormat X
axisFormat %s
section baseline
no_agent (15.001 s) : 15001000, 15001000
. : milestone, 15001000,
appsec (14.895 s) : 14895000, 14895000
. : milestone, 14895000,
iast (18.104 s) : 18104000, 18104000
. : milestone, 18104000,
iast_GLOBAL (17.891 s) : 17891000, 17891000
. : milestone, 17891000,
profiling (15.45 s) : 15450000, 15450000
. : milestone, 15450000,
tracing (15.044 s) : 15044000, 15044000
. : milestone, 15044000,
section candidate
no_agent (15.696 s) : 15696000, 15696000
. : milestone, 15696000,
appsec (14.933 s) : 14933000, 14933000
. : milestone, 14933000,
iast (19.079 s) : 19079000, 19079000
. : milestone, 19079000,
iast_GLOBAL (17.714 s) : 17714000, 17714000
. : milestone, 17714000,
profiling (14.775 s) : 14775000, 14775000
. : milestone, 14775000,
tracing (15.16 s) : 15160000, 15160000
. : milestone, 15160000,
Execution time for tomcatgantt
title tomcat - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~a18088bb01, baseline=1.61.0-SNAPSHOT~c04d61b318
dateFormat X
axisFormat %s
section baseline
no_agent (1.472 ms) : 1461, 1484
. : milestone, 1472,
appsec (3.776 ms) : 3557, 3996
. : milestone, 3776,
iast (2.253 ms) : 2184, 2322
. : milestone, 2253,
iast_GLOBAL (2.291 ms) : 2222, 2360
. : milestone, 2291,
profiling (2.075 ms) : 2021, 2129
. : milestone, 2075,
tracing (2.067 ms) : 2014, 2120
. : milestone, 2067,
section candidate
no_agent (1.473 ms) : 1462, 1485
. : milestone, 1473,
appsec (3.8 ms) : 3580, 4020
. : milestone, 3800,
iast (2.248 ms) : 2180, 2317
. : milestone, 2248,
iast_GLOBAL (2.288 ms) : 2219, 2357
. : milestone, 2288,
profiling (2.108 ms) : 2052, 2165
. : milestone, 2108,
tracing (2.051 ms) : 1998, 2104
. : milestone, 2051,
|
4e5bdc7 to
ba09c80
Compare
cde7981 to
e52fbc5
Compare
e52fbc5 to
e413d1d
Compare
89df516 to
8651527
Compare
This stack of pull requests is managed by Graphite. Learn more about stacking. |
8651527 to
7e4b7de
Compare
pawel-big-lebowski
left a comment
There was a problem hiding this comment.
Nice, elegant implementation tackling a complex problem — I only left a small comment.
| private static final MethodHandles methodLoader = | ||
| new MethodHandles(ClassLoader.getSystemClassLoader()); | ||
| private static final MethodHandle externalAccums = | ||
| methodLoader.method(TaskMetrics.class, "externalAccums"); |
There was a problem hiding this comment.
could you provide some doc on why do we need reflection and which Spark version support externalAccums/withExternalAccums?
There was a problem hiding this comment.
I can't find any good public-facing docs for this (probably since it's an internal API), but it seems like the relevant commit is here: apache/spark@b33a3ee
Somewhere in Spark v3.5.2, there was a change to move from directly accessing externalAccums to using the withExternalAccums pattern. Unfortunately it seems like it was to remediate a performance regression so there wasn't any backwards compatibility provided with that change, and as a result we need reflection to figure out which method to use when pulling the accumulators.
| /** Wrapper around the DDSketch library so that it can be used in an instrumentation */ | ||
| public class DDSketchHistogram implements Histogram { | ||
| private final DDSketch sketch; | ||
| private double sum; |
There was a problem hiding this comment.
Should we use a compensated sum to limit rounding errors like in https://github.com/DataDog/sketches-java/blob/master/src/main/java/com/datadoghq/sketch/WithExactSummaryStatistics.java#L24 ?
There was a problem hiding this comment.
Ah, good call - added!
...rk/spark_2.13/src/main/java/datadog/trace/instrumentation/spark/DatadogSpark213Listener.java
Show resolved
Hide resolved
| try { | ||
| // As of spark 3.5, all SQL metrics are Long, safeguard if it changes in new | ||
| // versions | ||
| hist.accept((Long) acc.value()); |
There was a problem hiding this comment.
You could consider casting to Number which would then support all built-in number types:
| hist.accept((Long) acc.value()); | |
| hist.accept(((Number) acc.value()).doubleValue()); |
I used doubleValue() to get the value as that's what the histogram API expects - this will automatically map the given number to the double type.
There was a problem hiding this comment.
Makes sense!
| class SparkAggregatedTaskMetrics { | ||
| private static final double HISTOGRAM_RELATIVE_ACCURACY = 1 / 32.0; | ||
| private static final int HISTOGRAM_MAX_NUM_BINS = 512; | ||
| private static final int MAX_ACCUMULATOR_SIZE = 5000; |
There was a problem hiding this comment.
Have you benchmarked to see how much memory overhead this could add if there happened to be 5000 accumulators? Do we know what the expected number of accumulators is typically?
Just trying to get a sense of whether this is a reasonable limit - i.e. does it allow for the most common number of accumulators while safeguarding memory.
There was a problem hiding this comment.
(note this is not a blocker to the PR - just interested in where the 5000 limit came from)
There was a problem hiding this comment.
No worries, it was good to think through this properly!
Honestly, this was a bit of a conservative guesstimate based on the previous limit of 50k accumulators, to provide at least a limit of some sort to prevent any runaway Spark apps from blowing through all the memory.
Evaluating a bit more critically, we could probably increase this since we discard the SparkAggregatedTaskMetrics object once a stage is completed as opposed to keeping them in memory for the entire duration of the Spark job. Since Spark should really only be working on a single stage at a time, a limit of 5,000 should be fairly conservative from a memory usage standpoint. It also in our benefit that the previous value is based on an object that should be larger than the AccumulatorV2 we're now using.
Some quick napkin math to make sure this is reasonable from a usage standpoint as well - each stage is composed of multiple operations (speaking anecdotally the most I've seen is 10 operations per stage, so an upper bound of 50 operations should be fairly generous), and we see the most metrics per operation in Spark jobs run by Databricks, which has maybe ~50 metrics per stage in the worst case? This would give 2,500 accumulators we have to track, which gives us a decent amount of overhead.
A lot of anecdotal numbers unfortunately, but hopefully this is already an improvement on the previous 50k limit.
mcculls
left a comment
There was a problem hiding this comment.
Left a few comments - main Q is whether we should use a compensated sum in the histogram, like in https://github.com/DataDog/sketches-java/blob/master/src/main/java/com/datadoghq/sketch/WithExactSummaryStatistics.java#L24 or is it enough to do just a simple sum?
|
/merge |
|
View all feedbacks in Devflow UI.
The expected merge time in
|

What Does This Do
Updates the metrics in the
_dd.spark.sql_planmeta field to use distributions calculated from individual task metrics, rather than the naively summed metrics provided by theStageInfoobjects from Spark. This is becauseStageInfonaively sums all accumulators, even though that may not make sense for certain Spark SQL metrics (e.g. avg hash probes per key for aggr operations). Instead, we should accumulate those ourselves into distribution metrics and emit them accordingly.Currently in the UI, this is only used in one place (in the Spark SQL metrics in the DJM product), so we're not too worried about changing the format here. UI update to follow.
If any issues arise with sending traces with a larger number of histograms, we can disable it using the
DD_SPARK_TASK_HISTOGRAM_ENABLEDfeature flag.Motivation
We'd like accurate metrics for Spark SQL operations that can reflect task-level characteristics as a distribution. This brings us more in line with what is shown in the Spark UI:

Additional Notes
We can't get rid of the original map that tracks accumulators to stages as we still use that to associate Spark SQL operations to stages. However, we can avoid storing the entire accumulator now, and instead just store a simple map of accumulator ID to stage ID. This will be done in a followup PR: #10645
Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issueJira ticket: [PROJ-IDENT]
Note: Once your PR is ready to merge, add it to the merge queue by commenting
/merge./merge -ccancels the queue request./merge -f --reason "reason"skips all merge queue checks; please use this judiciously, as some checks do not run at the PR-level. For more information, see this doc.