#1340 better metric support #1343
base: master
Conversation
Use https://github.com/jelmd/libprom instead of the buggy digitalocean/prometheus-client-c, including its compact feature.
Build fails because CI tries to link against the wrong/buggy prometheus-client-c lib.
Force-pushed from 99ff06b to 9969511
These tests are an absolute nightmare! They seem to be pretty bogus. Please advise: what's going wrong, what needs to be fixed?
Force-pushed from 089ebdc to 6f737da
Force-pushed from 26c7a72 to 9e49ddc
Actually the s390x build never succeeds and a priori prevents all PRs from being accepted. I've never heard of anybody running coturn on a non-{Linux,Solaris,BSD} platform. Having tests for all possible OS/arch combinations is one thing; actually testing against all of them is a completely different thing - not green at all, and it wastes a lot of resources. So IMHO the test matrix should be revised to reflect what is really needed.
Force-pushed from 9e49ddc to 6980c5b
Test bugs (home-grown install wrt. libprom libs and includes, and different clang-format versions depending on the distro in use) fixed.
Fixes #1337, too.
Very good work!
- Why use https://github.com/jelmd/libprom instead of digitalocean/prometheus-client-c?
src/apps/relay/mainrelay.h
Outdated
The recommendation is to define it as a structure:
struct _PROMETHEUS
{
  int prom;
  int prom_port;
  int prom_compact;
  int prom_realm;
  int prom_usernames;
  int prom_usid;
  int prom_rsid;
  int prom_sid_retain;
  int log_ip;
};
@@ -1,8 +1,8 @@
-# Prometheus setup
+# Metrics setup
Add in here:
Metrics introduction
- https://docs.victoriametrics.com/keyConcepts.html
- https://docs.victoriametrics.com/FAQ.html#what-is-high-cardinality
- https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate
pms_t *sample_session_state;
pms_t *sample_stun_req;
pms_t *sample_stun_resp;
#endif
The recommendation is to define it as a structure
Thank you @jelmd for this monumental piece of work. But I do not think we can accept this PR in its current form:
Though this list is long and negative, there are many changes we can adopt, but as separate PRs. I'm trying to be practical: if something gets broken it is going to be very hard to track and fix (broken as in addition to the list above).
Sending back to author to split into separate PRs
Basically the Readme gives you a coarse overview. Also it exposes a stringbuffer and a simple log interface, so that one does not need to re-invent the wheel just to add metrics and stuff to one's "project" (but can if needed). So a way better user experience than the religious (machine translated Go to C?) interface.
Thanx @eakraly for your time and having a look at it!
Not sure what you mean. Is it possible to pinpoint the sections which are a problem? Basically my focus is not Prometheus, but metrics. They finally get exposed in OpenMetrics/Prometheus exposition format, yes - but I wouldn't insist on this (a lot of other formats are possible). The point is that the metrics get stored somehow, and to not re-invent the wheel, libprom is good enough to store the samples and as a side effect can easily be used to expose the data. However, the OpenMetrics/Prometheus exposition format is so simple that one could generate a report simply via
Yes, if no-one uses it, vendors certainly are not going to include it in their distros (is there really a difference between un- and under-maintained?). I'm not sure what "potentially undermaintained" means in this context. Do you mean it needs to be bumped every month or year to give people the impression that some stuff really got changed and urge them to update ASAP? Hmm, I maintain stuff for free, so I update on demand and not to earn money for questionable stuff. However, if there is a good sponsor, I could increment the version on demand each month ;-).
What? The problem right now is that "metric" data get updated only when an allocation gets closed. So one can really forget it - the data are worth nothing at all, just paperweight, and do not provide what a provider really needs. Similar is the redis stuff - besides not providing lifetime data but only coarse-grained samples, it is probably more resource demanding than the prom stuff (IIRC I've seen some issues regarding high CPU usage wrt. redis). So right now coturn really lacks a proper monitoring facility which allows operators to identify leakages and bottlenecks, monitor bandwidth and resource usage, and to size the envs as needed.
You mean dropping the s390x stuff? Well, actually I just ditched it because the tests always failed on it: not because the software had an error, but because the test platform was so unreliable that the tests couldn't even be started. =8-( A very, very bad test environment, which prevents PRs from succeeding just because the test platform cannot be kept in a stable state to start any test at all. This furthermore raises the question: for what purpose does the software get tested on every known platform on earth? IMHO not green at all, a huge waste of resources for nothing but "compiles on every platform on earth". Who cares? Who needs it? x86_64 is sufficient for our purposes; if it works on arm64 or sparc64 and friends, fine. Everything else: please give a hint!
The PR rules want
Ehmm, nope. Can you elaborate please? Right now, as said, metrics are completely useless. They do not even deserve the term monitoring, it is just a junk yard of meaningless data. With the PR one gets for the first time the opportunity to really monitor the coturn application on the fly and to be able to turn the knobs as needed. Depending on how much you want to play with your knobs, you may monitor coarse-grained at relay-thread level or even more fine-grained at session/allocation level. So everyone's mileage may vary, but the default level should be sufficient for production envs and the overhead negligible (I would be scared much more if redis is enabled). Of course, if one uses one's old C64, all bets are off ;-).
Thanks for nitpicking this one :). Actually, paranoid people would download the source via github using HTTPS and compile it by themselves. Really paranoid people would download, review and probably file an issue if they discover something that is probably not ok (whatever this means). All others do not really care at all and would be happy even with HTTP, too (my experience so far) - they just want to get it working. Anybody can host/provide it for their customers/clients. So "lack of trust" or "stop using coturn" is IMHO a really questionable assumption without any proof or relation to reality (usually distro vendors provide packages, not the users themselves, and in turn if they do not trust their vendors ...), especially because it is OpenSource! Last but not least: yes, people trust any software much more just because they can download it via OpenSSL 1000+.x.y instead of 1.0.2? Yepp, that's for sure, this software must be safe ;-).
Ehhhm, realm is not my idea, but I can see a use for it (e.g. one may group metrics by realm). Therefore I just kept it, but only as a session_state attribute - it would otherwise blow up the metrics numbers if it really gets used with different values (threadID:sessionID is already unique, so other labels are just additional info - noise - one usually does not need). Wrt. performance: I doubt general assumptions and prefer some numbers. Actually I think, if one disables redis and just uses the prom stuff, it would be much, much better wrt. performance and less resource demanding. Unfortunately I haven't had time to waste on redis - just had a look at the code :(.
I'm not sure whether I understand what you mean. For each metric you need to keep counters somewhere. If one does it file-based, no one needs to wonder about performance. So the obvious thing to do is to keep the counters in memory. Simple apps do it by maintaining global vars, others do it by using a lib like libprom to keep counters (key:value pairs alias samples) in a hash map and provide an API which makes it thread-safe (thread-safety is not really an issue for coturn, because the related counters are always touched by the same thread anyway). So basically, one could completely ditch libprom et al., if one bloats one's own code with counters (like
What tests? Right now it is by far much better than before. Wondering whether there are any tests for the current useless metrics implementation (or even the redis things?)?
Not really, and I'm not sure whether it makes any sense to break it into separate PRs - actually I think it would make it much harder to understand the feature's implementation. But let's see, I'm open to any suggestions :)
- prometheus-client-c is pre-distributed on most systems.
- It took me two days to look at the source code (prometheus-client-c and your libprom) and learn Prometheus.
- I think the interface of prometheus-client-c is well defined and simple to use. Although its implementation is a bit buggy, that does not affect our use.
- IMHO, the interface defined by libprom is a bit messy (machine translated). Some of the internal interfaces are exposed. If used in a project, it can cause a performance penalty.
So I recommend going ahead with prometheus-client-c.
If you fixed bugs, please send a pull request to prometheus-client-c.
// clang-format off
// different clang-format versions have different opinions on this and prevent PRs to succeed
enum /* permitted values for its `has_arg' field... */
{ no_argument = 0, /* option never takes an argument */
Typo?
Ehmm, the PR failed because 'make lint' was not happy with it when it got linted on ubuntu 20.04. If fixed for the linter there, the 'make lint' for ubuntu 22.04 failed. So a deadlock for PRs (wondering how all the other PRs got through ;-)).
@jelmd About the http server:
Really? Is any software actually using it? Didn't find any when I decided to fork (but obviously it was not the only reason to fork). Really strange that this gets bundled. But package maintainers do not necessarily understand the software they package or have a closer look at it; sometimes they just volunteer to package it because someone asked for it ... ;-)
Thanx a lot :)
cluttered, sometimes not really smart/efficient, a nightmare to build (just look at the bloat coturn CI needs to get something usable out of it), buggy wrt. histograms and even tests, ...
Messy? Plz give a hint.
That's exactly the point. IMHO reinventing the wheel does not solve any problem, it is a problem. I wrote some metric exporters, enhanced apps with them, and always found it so boring and unproductive that one has to copy-and-paste all the boilerplate code like "loggers" and/or string buffers etc. Therefore I exposed them intentionally, so that one may focus on the real task (data collection and metric export) and does not need to waste a lot of time finding full-blown logger, string buffer, etc. libs, which are actually not needed, because they are already there. So the lib should not be understood as just a data collection/formatting thing, but as a utility that helps to get the more or less pretty simple job done. Finally, an interface is an interface, and the maintainer decides whether to expose it or to keep it hidden because it is unstable or buggy or just a bad hack, whatever ...
How? It is actually the purpose to have something which is safe and fast enough to do the job. If someone has to write a new string buffer implementation, it is not at all certain that it will be bug-free and as fast as the one in libprom ;-) Same thing wrt. a simple logger. Last but not least, as for any lib, one does not need to use all the features/functions it provides (e.g. libapr, libexpat, libssl, ...) - it just offers functions to make life easier, there is no "you have to use ...".
Sorry but nope, there was a reason to fork it, and if you have a look at the diffs, at its issues and PRs, it is pretty obvious that a) the project is dead, b) nobody would review a PR, and thus c) it would just be a huge waste of time (which I do not have, BTW) to create a PR which sits in the review queue forever. Actually some of the reasons why github encourages people to fork ;-)
What do you mean? Plz name what you mean! Do you mean promhttp_start_daemon()? Otherwise I do not know what the "built-in promhttp" should be.
Everything needs to be started somewhere, no change here BTW. It gets started at the same place as before, which is ok - all other threads get started there, too. And for completeness: no, it does not "work in the admin service thread" (wouldn't it block it, BTW, if it did ;-)); just the serving threads get started there, only. See also: docs/Metrics.md.
OMG. Why should one do that? I get goose bumps ;-). But of course, if someone's life is not hard enough yet, nothing prevents them from doing such strange things. Can you perhaps explain what the advantage would be and what it actually has to do with the implementation?
LOL - no, metrics are an integral part of the software. Not only for performance reasons does it need to collect the data by itself: only the app [authors] know where, when, why and what to collect and are able to do it at the right place. Using an external plugin for it does not make any sense at all. If you refer to the http part - of course there needs to be an interface where the collected data are provided to the "public". This part gets done using the OpenMetrics/Prometheus exposition format and providing the "snapshots" via HTTP. There is no need at all to clutter the app with "plugins" which are able to format data in several 1000 different formats or expose it via SNMP or InfluxDB or whatever format (and describe it via XML schema :)) - just paperweight, because the OpenMetrics format is so simple that it can easily be transformed into whatever is needed by the clients themselves. So IMHO, plugin === bad idea.
Can you plz be specific here? What do you mean by "simple and easy-to-use interface"? Somehow this sounds like the opposite of your plugin idea ;-)
About HTTP servers. There are several solutions:
We are currently talking about option 1. From the perspective of Prometheus design and stability, it is recommended to use option 2. In O&M, option 3 is also needed. Functional requirements for Prometheus:
Well, prometheus-client-c provides just that.
Implementation of option 1
@jelmd @eakraly
Ehmm, 1st thing to note: libmicrohttpd as used in libprom is an http server (it even supports https). It is not named nginx, apache httpd, or whatever, but it is an http server, pretty stable and used in several other projects (e.g. flutter). But if one doesn't trust it, one can try to provide one's own implementation, like coturn does for the web admin interface, or acme-redirect, etc. ...
and other projects and communities using it. BTW: I have used it for several years and never had an issue with it.
Please define "full functionality of an HTTP server". The metrics interface speaks HTTP as far as needed and that's it. Nobody claims that it needs to provide all the same features apache httpd, nginx, etc. do - so if you mean this, the answer has to be: plain wrong, it does not need to .... The protocol to get the data is simply HTTP, and here the interface just needs to "print out" all metrics if a correct request with the path
You mean a forked fat process?
What happens if a thread dies ;-) How should coturn control the forked process?
How do the metrics get transferred to the forked process/its threads? I would be happy to see a "scott me up beamy" implementation, but as long as this isn't usable on earth I doubt "doesn't affect" wrt. any problems and especially wrt. performance.
Again, please define "functionality of an HTTP server". Is it possible to make a list of what you think libmicrohttpd is missing? As already said, in general a metrics provider should be able to answer HTTP requests and provide the metrics via
Where is the difference? If you want to answer an HTTP request, you always need a request listener (no matter whether you hide it in a cascade of other servers spawned as threads or fat processes, or process it directly). And you certainly need a request processor which answers the request. No matter where it sits, or how one names its cat, the resources are always needed. However, one can do it in an efficient way (right now, similar to what coturn does with the relay-threads), or by introducing a lot of overhead and complexity (perhaps IPC, or file-based exchange ;-) ), which would simply be a nightmare. I prefer KISSes. BTW: Especially in the coturn context I always hear "performance, performance, performance", but actually read the opposite, incl. a lot of unfounded assumptions. Where does this come from?
Indeed. It does not make any sense in this context.
Dangerous misconception, flawed design. Again, just because a service provides data via the HTTP protocol, it does not need to provide all the features all other known HTTP servers provide. The purpose is key. If somebody wants to put a proxy in front of it, for whatever strange reason, he can do it. However, if one does this, the env certainly already has other severe problems regarding security, performance, stability - basically everything you claim it should actually solve ;-). At this stage it is better to do the homework first, and consider providing services when this has been done.
If you make such an assumption, please provide some proof of it. I think libmicrohttpd users/developers would answer back: "The opposite is the case". Just the much higher complexity (features they provide, third-party SW they use) is an indicator that they are affected by many more bugs, and thus "high stability" in this context is a really questionable term. One may also check the
Deduced from the apps/utilities I have had in use for several years and what I've read so far, including the mentioned web pages, I can say: libmicrohttpd and its use is at least as stable as all the other thingies you mentioned and is probably much more reliable.
Doesn't it already? What about
Nope, the opposite is the case.
Huhh? Please define "easy to use".
And why should the metrics provider in coturn implement such a feature? This is completely unrelated and a clear indicator that your idea about infrastructure setup/metric providers is severely flawed. See above.
Not at all. It is the opposite of what you claim it should be.
Never ever. This is not a coturn but an operator problem. E.g. consider the odometer of your car being a metrics provider. It provides the data to its operator aka driver, so that he might be able to avoid dangerous situations and getting a ticket for being too fast. No normal human being would mount a 65+" OLED display on the roof of their car and use it to show the public the speed it currently has. Total non-sense, hurts performance, and implies that affected living things are able to recognize and decrypt the shown stuff. Sound and experience are usually sufficient for Papa and Mama Tomato to know when they need to urge baby tomato to catch up ;-).
Ehmm, absolutely confusing. Are we talking about coturn or one of its potential consumers? Please do not mix everything up.
You're totally missing the point. Actually one does not need any of it, neither coturn nor any metrics consumer! What a metrics provider simply needs is a key:value store and a way to answer HTTP requests for
As libprom does. But as said, that's not really the point.
Technically - under the hood - this doesn't matter. Both are counters, but the API makes sure that one can only be incremented. Again, a simple key:value store accomplishes the same.
key:value pairs as well, just mimic a thing sliced into several pieces. So
Not at all. That's an implementation detail of the libraries. It's a little bit odd to say we need library XY and then enumerate all the functions it provides as requirements. I've pointed out several times what a metrics provider needs. If in doubt, read the [OpenMetrics](https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md) or [Prometheus Exposition format] docs. BTW: they do not say, for a reason, how the provider has to implement it, just that the metrics need to be exposed via HTTP and how to format them. And please try to understand this first, before making any further claims and misleading/wrong assumptions.
Plain wrong again.
Again, requiring the opposite of what you actually want does not help. Or do you ask for an additional thread to start the relay-threads as well? That makes simply no sense at all and would BTW be a very bad server design (it should fail early ...).
What about docs/Metrics.md - did you really read/review the whole PR?
What do you think src/apps/relay/prom_server.c:init_metrics() does?
LOL - isn't this obvious? If you do not collect data, what metrics do you want to get out of it ;-)
As said, the only option to discuss is IMHO to implement the metrics provider itself. libpromhttp is really simple and can be replaced without any effort by using libmicrohttpd directly; replacing libprom by a proper key:value store seems to require much more work and testing because it needs to be dynamic wrt. tid:sid. If one really does it, I guess it would probably end up with a hash-map-like thing again - and thus reinvent the wheel [which might be quadratic]. So there are other options, but not really good ones, especially if you ride the performance and stability horse again and again.
@@ -350,9 +350,9 @@ void str_buffer_append_sz(struct str_buffer *sb, size_t sz) {
   str_buffer_append(sb, ssz);
 }

-void str_buffer_append_sid(struct str_buffer *sb, turnsession_id sid) {
+void str_buffer_append_sid(struct str_buffer *sb, turnsession_id sid, uint32_t rsid) {
   char ssz[129];
is this enough space to hold the new data in all cases?
Yes. Actually 20 + 1 + 10 + 1 = 32 would be sufficient. Not sure why the original author chose such a big buffer (doesn't make any sense to me).
 static char Usage[] =
     "Usage: turnserver [options]\n"
     "Options:\n"
     " -d, --listening-device <device-name> Listener interface device (NOT RECOMMENDED. Optional, Linux only).\n"
-    " -p, --listening-port <port> TURN listener port (Default: 3478).\n"
+    " -p, --listening-port <port> TURN listener port. Default: " _STRVAL(DEFAULT_STUN_PORT) ".\n"
The changes stamping the port number into the help text would be lovely to have as a separate PR.
Same with disabling Prometheus at compile time, though that's a much bigger mess.
Hmmm, maybe. But hardcoding the port stuff and then rewriting it again takes too much time w/o any benefit. Not sure what is meant by "disabling Prometheus at compile time".
@@ -45,6 +45,7 @@ static pthread_barrier_t barrier;

////////////// Auth Server ////////////////

uint32_t relay_servers_in_use = 0;
should this be a static variable?
It is needed in src/apps/relay/ns_ioalib_engine_impl.c and is a "global var". So no, unless there is a special collection of global vars somewhere else...
#ifdef POOL_TRACE
#define _TRACE TURN_LOG_FUNC
#else
// dirty talk - is for developers, only
I'm not really understanding this comment
Would "log trashing" be better?
TURN_LOG_LEVEL_INFO, "session %018llu: realm <%s> user <%s>: incoming packet %s processed, success\n",
(unsigned long long)(ss->id), (const char *)(ss->realm_options.name), (const char *)(ss->username), method);
TURN_LOG_FUNC(TURN_LOG_LEVEL_INFO,
"session %012llu.%d: realm <%s> user <%s>: incoming packet %s processed, success\n",
You replace %018llu with %012llu in a bunch of different places. Maybe it would be better to make a #define for this string, so it can be reused globally?
I assume that the original author had a reason for not using a macro. Perhaps to avoid clashes when the format gets changed.
src/server/ns_turn_server.h
Outdated
@@ -40,7 +40,10 @@ extern "C" {

//////////// defines //////////////

#define TURN_SESSION_ID_FACTOR (1000000000000000LL)
// Even if a single threaded app handles on average about 1 session/s
I suspect 1 session per second is on the low side. My production instances handle many thousands of concurrent users per thread.
Not that your analysis is wrong in the general sense, just that maybe the explanation should be a bit more cautious.
Actually the unit (k) is missing (1k * 1G = 1T =~ 12 zeros, 1 Gs =~ 32 years). Even if a box handles 50k sessions/s, the session ID would wrap after ~231 days: unlikely that all sessions last that long ...
@jelmd OK! I select prometheus-client-c. Its interface is defined fine.
Force-pushed from d0acd7a to ed41ff2
Very interested in this, any updates?
Perfect! Nice job!
Would also love to see this merged in! Any updates/help needed?
Fix #1340 - this PR provides live monitoring of the turn server application by providing up-to-date metrics exposed via HTTP in OpenMetrics/Prometheus exposition format, at relay-thread level per default, or at relay-thread/session level if desired. For more details see docs/Metrics.md.