-
I think a big factor is that, as I understand it, you're running things within Docker Desktop, within WSL2, within a desktop version of Windows, on a machine that may have other things going on in the fore- and/or background. In that configuration, consistency is basically impossible to secure and variations between benchmark runs can be significant. We also know that in that configuration, the performance achieved is severely impacted. I happen to know that Dave's results are generated using Docker Engine, on top of an Ubuntu installation that runs on bare metal and does nothing else. That makes a big difference already. That said, the idea of running each solution multiple times and averaging those runs out came up quite some time ago; there have been earlier discussions on the topic in this repo. Up to this point, we haven't gotten round to implementing it. As the entire benchmark toolchain is part of the repo, PRs can be opened on it as well.
-
I did notice that there is variation in the measurements. In one of my runs, PrimeC/solution_2/sieve_1of2.c ranked extremely low. So I ran some experiments to get a feel for how large that variation is.
For this test I took the PrimeC solution_2 sieve_1of2.c and ran it in a loop 100 times on my old Athlon-powered laptop. No other applications were running, apart from what is present in a standard Lubuntu install.
I noticed that the standard deviation is about 5% of the median, so for a median of 1000 roughly two-thirds of the measurements fall between 950 and 1050 (within one standard deviation). However, in the worst case I saw a difference of 40%, i.e. a measurement of 600 in the example above. That is a big variation. I have to repeat this test a few more times, and on my other laptops, but it is interesting. The main conclusion is that you have to measure multiple runs and take the median to get a fair comparison. Which is no surprise; that is just a general rule in measurement.
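In case it helps, here is a minimal sketch of such a measurement loop in Python. It assumes a compiled binary (the `./sieve_1of2` path is just a placeholder) that prints the usual semicolon-separated drag-race result line with the pass count in the second field; adjust the command and the parsing to whatever your solution actually outputs.

```python
#!/usr/bin/env python3
# Sketch: repeat a single solution run N times and summarize the spread.
import statistics
import subprocess

RUNS = 100
CMD = ["./sieve_1of2"]  # placeholder path to the compiled solution binary

passes = []
for _ in range(RUNS):
    out = subprocess.run(CMD, capture_output=True, text=True, check=True).stdout
    # take the last semicolon-separated result line and read the passes field
    line = [l for l in out.splitlines() if ";" in l][-1]
    passes.append(int(line.split(";")[1]))

median = statistics.median(passes)
stdev = statistics.stdev(passes)
print(f"median:  {median}")
print(f"stdev:   {stdev:.1f} ({100 * stdev / median:.1f}% of median)")
print(f"min/max: {min(passes)} / {max(passes)} "
      f"(-{100 * (median - min(passes)) / median:.1f}% / "
      f"+{100 * (max(passes) - median) / median:.1f}% vs median)")
```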
I noticed that in some of the performance measurements people are not aware of these measurement principles: they tend to compare one-off runs, or use averages where median values would be more appropriate.
It would be great if we could extend the make toolchain with an option to fairly compare two solutions. This would run both solutions x times (with x > 100) and report the median, the standard deviation, the standard deviation as a percentage of the median, and the min/max values with their percentage deviation from the median. This would help in defining how solutions are compared. Note that this does not matter when the difference is an order of magnitude; it only matters when small improvements are in question.
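A rough sketch of what that report could look like, assuming the per-run pass counts for both solutions have already been collected (e.g. with a loop like the one above). The names `summarize` and `fair_compare` are made up for illustration and are not part of the existing toolchain.

```python
import statistics

def summarize(name, passes):
    # Print median, stdev (absolute and as % of median), and min/max spread.
    med = statistics.median(passes)
    sd = statistics.stdev(passes)
    print(f"{name:20s} median={med:.0f} stdev={sd:.1f} ({100 * sd / med:.1f}%) "
          f"min={min(passes)} max={max(passes)} "
          f"(-{100 * (med - min(passes)) / med:.1f}% / "
          f"+{100 * (max(passes) - med) / med:.1f}%)")

def fair_compare(name_a, passes_a, name_b, passes_b):
    summarize(name_a, passes_a)
    summarize(name_b, passes_b)
    ratio = statistics.median(passes_a) / statistics.median(passes_b)
    print(f"median ratio {name_a}/{name_b}: {ratio:.3f}")
```

Wiring something like this into the make targets would then mostly be a matter of having the runner collect the per-run pass counts.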
@davepl Do you take the median values in your comparison videos?