-
Notifications
You must be signed in to change notification settings - Fork 327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding global_cache_ttl feature #622
base: master
Are you sure you want to change the base?
adding global_cache_ttl feature #622
Conversation
Signed-off-by: Victor Amorim <victor.amorim@nubank.com.br>
6a82858
to
d97d910
Compare
@matthiasr Please, any feedback is welcome 😊 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, work keeps me busy 😔
I like the idea! I am not so sure about the name… "global cache" describes the implementation more than the effect, and the addition to the README doesn't really explain when one would want it. Your PR description does, but a user won't see that. What else could we call the setting to make it more about the effect for the user? Something with refresh time? Or we leave it as is, but expand on it more in the documentation?
How will this interact with timestamps? What do users need to be aware of when they set this and enable timestamps? At what point will it start to impact queries due to the lookback delta?
In addition to the gauge showing whether this request was served from cache or not, I would also like to have a counter for cache hits. This would help quantify the impact, which is not reliably possible with the gauge alone, especially in an HA setup where one Prometheus may always be the one triggering the refresh. The cache hit counter should have the same labels as the metric for the requests overall so it is easy to get a ratio in PromQL.
public List<MetricFamilySamples> collect() { | ||
long start = System.nanoTime(); | ||
double error = 0; | ||
List<MetricFamilySamples> mfs = new ArrayList<>(); | ||
|
||
if (shouldCache() && shouldReturnFromCache()) { | ||
LOGGER.log(Level.INFO, "Returning from cache"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be debug level, otherwise it will be very noisy
Hi, thanks for the feedbacks!
I'm trying to think in something 😅 . I'll also update the documentation.
Sorry, I didn't understand these questions :/ What do you mean with
Agree, I'll update it. I'm not good with test, so do you think we need more or what we have is enough? |
Most of the time, Prometheus records samples at the time of scrape. However, this does not work with CloudWatch – because metric values converge, we need to look back in time. By default, the exporter will get metrics from 10 minutes ago. Now that could be terribly confusing because you would see a change in a metric 10 minutes later in Prometheus than in CloudWatch. To compensate, by default, the exporter also reports timestamps. This means even though Prometheus scrapes the exporter at, say, 08:42, it knows that the actual metric value is from 08:32, and saves the sample as such. However, when you run a query in Prometheus, it only looks back 5 minutes by default (the lookback delta). Thus, when you run an instant query (e.g. the "Table" tab in the Prometheus web interface), at at 08:42, it won't show anything. This is a frequent source of "bug" reports 😬 unfortunately it's a consequence of the model mismatch between CloudWatch and Prometheus, and not something the exporter can really do anything about. I am wondering (and would like documentation on) the effects of the cache on timestamps. Since it caches the response, presumably (and probably correctly), the timestamps it reports will be even further in the past. Now that I think of it, the problem may go beyond the lookback delta – if the same Prometheus scrapes the exporter again, it will see duplicate samples. I am not 100% sure how it handles those; it will definitely affect queries though. For example, if one were to set a global cache TTL of 5 minutes, a I think a user can avoid this, and still get the benefit you are after, by never setting this value larger than the smallest scrape interval of any Prometheus server that scrapes it. That way, each Prometheus will always see new data between scrapes. We could even use this as inspiration for the naming: call it |
This change add the
global_cache_ttl
feature. When it's value is greater than0
,cloudwatch-exporter
will return the last returned/metrics
call (just updatingcloudwatch_*
and java metrics), until the current time pass theglobal_cache_ttl
seconds.Why?
When working with Prometheus running in HA, we can have multiple calls for
cloudwatch-exporter
that can increase the AWS APIs calls. With this feature we can cache the last answer and guarantee that during theglobal_cache_ttl
seconds, which is indicate to be the same asscrape_interval
that prometheus use for scrapecloudwatch-exporter
, we will not call AWS again.Changes
global_cache_ttl
is greater than0
, AWS metrics are cachedcloudwatch_exporter_cached_answer
will return1
if that answer is cached or0
if notCloudWatchCollector
different times simulating time passing and returning cached and non cached values.