-
I think a big factor is that, as I understand it, you're running things within Docker Desktop, within WSL2, within a desktop version of Windows, on a machine that may have other things going on in the fore- and/or background. In that configuration, consistency is basically impossible to secure and variations between benchmark runs can be significant. We also know that in that configuration, the performance achieved is severely impacted. I happen to know that Dave's results are generated using Docker Engine, on top of an Ubuntu installation that runs on bare metal and does nothing else. That makes a big difference already. That said, the idea of running each solution multiple times and averaging those runs out came up quite some time ago; there have been earlier discussions on the topic in this repo. Up to this point, we haven't gotten round to implementing it. As the entire benchmark toolchain is part of the repo, PRs can be opened on it as well.
-
I did notice that there is variation in the measurements. In one of my runs, PrimeC/solution_2/sieve_1of2.c ranked extremely low. So I ran some experiments to get a feel for how large that variation is.
For this test I took the PrimeC solution_2 sieve_1of2.c and ran it in a loop 100 times on my old Athlon-powered laptop. No other applications were running, apart from what is present in a standard Lubuntu install.
I noticed that the standard deviation is about 5% of the median, so for a median of 1000 roughly two-thirds of the measurements fall between 950 and 1050 (within one standard deviation). However, in the worst case I saw a difference of 40%, i.e. a measurement of 600 in the example above. That is a big variation. I have to repeat this test a few more times, and on my other laptops, but it is interesting. The main conclusion is that you have to measure multiple runs and take the median to get a fair comparison. Which is no surprise; that is just a general rule in measurement.
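In case it helps, here is a minimal sketch of such a measurement loop in Python. It assumes a compiled binary (the `./sieve_1of2` path is just a placeholder) that prints the usual semicolon-separated drag-race result line with the pass count in the second field; adjust the command and the parsing to whatever your solution actually outputs.

```python
#!/usr/bin/env python3
# Sketch: repeat a single solution run N times and summarize the spread.
import statistics
import subprocess

RUNS = 100
CMD = ["./sieve_1of2"]  # placeholder path to the compiled solution binary

passes = []
for _ in range(RUNS):
    out = subprocess.run(CMD, capture_output=True, text=True, check=True).stdout
    # take the last semicolon-separated result line and read the passes field
    line = [l for l in out.splitlines() if ";" in l][-1]
    passes.append(int(line.split(";")[1]))

median = statistics.median(passes)
stdev = statistics.stdev(passes)
print(f"median:  {median}")
print(f"stdev:   {stdev:.1f} ({100 * stdev / median:.1f}% of median)")
print(f"min/max: {min(passes)} / {max(passes)} "
      f"(-{100 * (median - min(passes)) / median:.1f}% / "
      f"+{100 * (max(passes) - median) / median:.1f}% vs median)")
```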
I noticed that in some of the performance measurements people are not aware of these measurement principles: they tend to compare one-off runs, or use averages where median values would be more appropriate.
It would be great if we could extend the make toolchain with an option to fairly compare two solutions. This would run both solutions x times (with x > 100) and report the median, the standard deviation, the standard deviation as a percentage of the median, and the min/max values with their percentage deviation from the median. This would help in defining how solutions are compared. Note that this does not matter when the difference is an order of magnitude; it only matters when small improvements are in question.
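A rough sketch of what that report could look like, assuming the per-run pass counts for both solutions have already been collected (e.g. with a loop like the one above). The names `summarize` and `fair_compare` are made up for illustration and are not part of the existing toolchain.

```python
import statistics

def summarize(name, passes):
    # Print median, stdev (absolute and as % of median), and min/max spread.
    med = statistics.median(passes)
    sd = statistics.stdev(passes)
    print(f"{name:20s} median={med:.0f} stdev={sd:.1f} ({100 * sd / med:.1f}%) "
          f"min={min(passes)} max={max(passes)} "
          f"(-{100 * (med - min(passes)) / med:.1f}% / "
          f"+{100 * (max(passes) - med) / med:.1f}%)")

def fair_compare(name_a, passes_a, name_b, passes_b):
    summarize(name_a, passes_a)
    summarize(name_b, passes_b)
    ratio = statistics.median(passes_a) / statistics.median(passes_b)
    print(f"median ratio {name_a}/{name_b}: {ratio:.3f}")
```

Wiring something like this into the make targets would then mostly be a matter of having the runner collect the per-run pass counts.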
@davepl Do you take the median values in your comparison videos?