[Feature Request] Distributions should track percentiles #11

MaxGabriel · 2016-04-10T14:58:57Z

A nice feature of StatsD's timers is that they track percentiles, in addition to things that Distribution already tracks like the min, max, mean, etc. Having e.g. 99th and 95th percentile data is useful when you know the mean could be biased by outliers. For example:

- At work we track the maximum size of cookies to make sure we aren't getting too close to the 4K limit. Having one user (the maximum) at the limit is very different from having the 90th percentile near the limit.
- Distributions tracking the time to complete a task might typically be quite fast, but edge cases might pull the mean upwards (e.g. most users in your social network have 100 friends, a few users have 100,000; most runs of a script are very fast, but running it with verbose logging to debug it might slow it to a crawl).
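
To make the friends example concrete, here is a small sketch with made-up numbers (99 users with 100 friends, one with 100,000). The `nearest_rank` helper is a hypothetical name, not part of ekg or StatsD, and it keeps every sample in memory, which is the naive approach:

```python
import math
import statistics

# Made-up data: 99 typical users and a single outlier.
friends = [100] * 99 + [100_000]

def nearest_rank(samples, p):
    """Exact p-th percentile (nearest-rank method) over all stored samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

mean = statistics.mean(friends)   # pulled far upwards by the one outlier
p95 = nearest_rank(friends, 95)   # still reflects the typical user

print(mean, p95)
```

With this data the mean is 1099 while the 95th percentile is 100, so the mean alone says little about the typical user.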

@tibbe had this to say about the implementation:

> Context on why the Distribution metric works as it does: statsd takes the approach of computing percentiles by storing every single data point. This doesn't scale well (e.g. we don't do it this way at Google for that reason). Since I wrote ekg with the Google use case in mind I didn't want to copy this limitation into ekg. The best way I know how to do this is by storing histograms (not the theoretically best approach, but the best engineering approach) and computing approximate percentiles from that. I planned to implement that, but never got around to it. It might be possible to make ekg work with the statsd approach, but it would require reworking the internals (e.g. so that each metric update results in a callback, from which we could send a packet to statsd).
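
For illustration only, the histogram idea in the quote could look roughly like the sketch below. This is not ekg's implementation (the quote says it was never written). The `Histogram` class, its bucket bounds, and the cookie sizes are all assumptions chosen for the example:

```python
import math
from bisect import bisect_left

class Histogram:
    """Fixed-bucket histogram: stores one counter per bucket, never raw samples."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                # inclusive upper bound of each bucket
        self.counts = [0] * (len(self.bounds) + 1)  # final slot is the overflow bucket

    def add(self, value):
        # Index of the first bound >= value; len(bounds) means overflow.
        self.counts[bisect_left(self.bounds, value)] += 1

    def percentile(self, p):
        """Approximate p-th percentile: upper bound of the bucket holding that rank."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("empty histogram")
        rank = max(1, math.ceil(p * total / 100))
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                return self.bounds[i] if i < len(self.bounds) else math.inf

# Hypothetical cookie sizes in bytes.
h = Histogram([512, 1024, 2048, 4096])
for size in [300] * 90 + [1500] * 9 + [4000]:
    h.add(size)

print(h.percentile(50), h.percentile(95), h.percentile(100))
```

Memory use is fixed by the number of buckets rather than the number of updates. The cost is precision: the answer is only as fine-grained as the bucket boundaries.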


jpfuentes2 · 2016-04-14T03:52:10Z

Edited: Ignore me! I see there's already a client you've written.

> It might be possible to make ekg work with the statsd approach, but it would require reworking the internals (e.g. so that each metric update results in a callback, from which we could send a packet to statsd).

~~Were you thinking of just implementing the callback and no StatsD client?~~

> The best way I know how to do this is by storing histograms (not the theoretically best approach, but the best engineering approach) and computing approximate percentiles from that.

Have you seen HdrHistogram? It's a preferred alternative to an ordinary histogram since it avoids coordinated omission.

tibbe · 2016-04-14T06:45:24Z

> Have you seen HdrHistogram? It's a preferred alternative to an ordinary histogram since it avoids coordinated omission.

I have. Unfortunately I believe they suffer from the same problem as most "theoretically better than a histogram" approaches, namely that you cannot merge two HdrHistograms created on different machines. Being able to gather statistics separately and then compute global quantiles is a really important property.
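
As a rough sketch of the mergeability property for plain fixed-bucket histograms: if every machine uses the same bucket bounds, merging is an element-wise sum of the counters, and global quantiles can be read off the merged counts. The bounds, counts, and function names below are assumptions for illustration, not an existing API:

```python
import math

# Hypothetical bucket upper bounds (ms) shared by every machine; the extra
# trailing counter in each list is the overflow bucket.
BOUNDS = [10, 50, 100, 500]

def merge(a, b):
    """Merge two per-machine histograms by summing matching bucket counts."""
    if len(a) != len(b):
        raise ValueError("histograms must use the same buckets")
    return [x + y for x, y in zip(a, b)]

def percentile(counts, p):
    """Approximate p-th percentile: upper bound of the bucket holding that rank."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, math.ceil(p * total / 100))
    seen = 0
    for i, count in enumerate(counts):
        seen += count
        if seen >= rank:
            return BOUNDS[i] if i < len(BOUNDS) else math.inf

machine_a = [80, 15, 5, 0, 0]     # mostly fast requests
machine_b = [10, 20, 40, 25, 5]   # a slower machine
combined = merge(machine_a, machine_b)

print(percentile(machine_a, 95), percentile(machine_b, 95), percentile(combined, 95))
```

Here the global 95th percentile comes from the merged counts. It could not be derived from the two per-machine percentiles alone.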

jpfuentes2 · 2016-04-14T19:20:11Z

> Being able to gather statistics separately and then compute global quantiles is a really important property.

Absolutely agree here as well. I've seen both approaches used in tandem to good effect.

Thanks!
