Benchmarking is Hard
2025-04-02 07:30:20
A masterpiece: Pro .NET Benchmarking | Andrey Akinshin
On Methodology
- Always Measure One Level Deeper: a great article covering many common mistakes and points worth watching out for
- Performance measurements should be considered guilty until proven innocent. Never trust a number generated by a computer.
- Measuring deeper is the best way to validate top-level measurements and uncover bugs. Measuring deeper will also indicate whether you are getting the best possible performance and, if not, how to improve it. Do not spend a lot of time agonizing over which deeper measurements to make.
- It can be as fancy as an interactive webpage or as simple as a text file, but a dashboard is essential for any nontrivial measurement effort.
- Minimum over mean: Minimum does the best at noise rejection, because we expect that any measurements higher than the minimum are due to noise.
- Lemire has noted that timing distributions are often closer to a log-normal distribution (Are your memory-bound benchmarking timings normally distributed?). Ideally, visualize the timing distribution first and only then decide which metric to use.
- Testing the Performance of ClickHouse
- But if we only look at the minimum, we are going to miss cases where some runs of the query are slow and some are not (e.g. boundary effects in some cache). So we compromise by measuring the median. It is a robust statistic that is reasonably sensitive to outliers and stable enough against noise.
- Use a non-parametric bootstrap method to build a randomization distribution for the observed difference of median query run times. This method is described in detail in “A Randomized Design Used in the Comparison of Standard and Modified Fertilizer Mixtures for Tomato Plants”. (A sketch of such a randomization test follows after this list.)
- The user must be able to investigate a problematic query post-mortem, without running it again locally.
- Do not believe everything you read in the papers
- Script everything, record everything, derive results from measurements
- Plan how to present results before starting work: make formats easy to explain, make numbers easy to read off
- Understand simple cases first: before paying any attention to actual results, try to identify simple test cases that should have known behavior
- Use “delta” measurement: the simple way to do this, if your test has a measurement loop and times the entire loop, is to run it for N iterations and for 2N, then use (run2 – run1)/N as the time (see the sketch after this list).
- This cancels out all the fixed overhead, such as the clock/rdpmc call, the call to your benchmark method, and any loop setup overhead. It doesn’t cancel out the actual loop overhead though (e.g., increment and end-of-loop check), even though that’s often small or zero. From the comments on Microbenchmarking calls for idealized conditions.
- Randomly shuffle: if you have a series of benchmarks over time, and then you randomly shuffle all of the benchmarks, does the randomly shuffled one look basically the same as the original? From Benchmarking correctly is hard.
- Look for performance regressions not in entire applications, but in individual subroutines.
- Applications have thousands of subroutines, so a small performance regression for an entire app is likely a large one for the affected subroutine. This way, you’re looking for ~5% regressions, not ~0.05% regressions, which makes filtering out noise easier. From FBDetect: Catching Tiny Performance Regressions at Hyperscale through In-Production Monitoring.
- abseil / Performance Tip of the Week #39: Beware microbenchmarks bearing gifts
- Understanding why a particular benchmark does not produce representative results is a critical step in improving benchmark fidelity, and can even produce insights into production behavior.
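Two sketches for points above. First, the randomization-distribution idea from the ClickHouse notes: pool the two samples, re-split them at random many times, and count how often the difference of medians is at least as extreme as the observed one. This is a generic permutation test, not ClickHouse's actual implementation; the sample data and names are made up.

```cpp
// Sketch of a randomization (permutation) test for the observed difference
// of median run times between two builds. Illustrative only.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// Fraction of random re-splits whose median difference is at least as
// large as the observed one; a small value suggests a real difference.
double p_value(std::vector<double> a, std::vector<double> b, int trials = 10000) {
    const double observed = std::fabs(median(a) - median(b));
    std::vector<double> pooled = a;
    pooled.insert(pooled.end(), b.begin(), b.end());
    std::mt19937 rng(12345);
    int extreme = 0;
    for (int t = 0; t < trials; ++t) {
        std::shuffle(pooled.begin(), pooled.end(), rng);
        std::vector<double> ra(pooled.begin(), pooled.begin() + a.size());
        std::vector<double> rb(pooled.begin() + a.size(), pooled.end());
        if (std::fabs(median(ra) - median(rb)) >= observed) ++extreme;
    }
    return double(extreme) / trials;
}

int main() {
    std::vector<double> old_ms = {10.1, 10.3, 10.2, 10.4, 10.2, 10.3};  // made-up data
    std::vector<double> new_ms = {10.6, 10.8, 10.7, 10.9, 10.7, 10.6};
    std::printf("p ~= %.4f\n", p_value(old_ms, new_ms));
}
```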
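Second, the delta-measurement trick: time the whole loop at N and at 2N iterations and difference the results, which cancels fixed overhead such as the clock calls. work(), the sink variable, and N are placeholders for whatever is actually under test.

```cpp
// Sketch of "delta" measurement: (t2N - tN) / N cancels fixed overhead
// (clock calls, call setup) but not per-iteration loop overhead.
#include <chrono>
#include <cstdio>

volatile unsigned sink = 0;  // volatile so the loop body is not optimized away

void work() { sink = sink + 1; }  // placeholder for the operation under test

// Time the whole loop, including fixed overhead.
double time_loop_ns(long iters) {
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count();
}

int main() {
    const long n = 100'000'000;
    double t1 = time_loop_ns(n);      // overhead + n * cost
    double t2 = time_loop_ns(2 * n);  // overhead + 2n * cost
    std::printf("per-iteration cost: %.3f ns\n", (t2 - t1) / n);  // overhead cancels
}
```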
On Measurement
Accurate, context-sensitive timing during code optimization: TechNotes/2018/在代码优化中做上下文敏感的计时.md at master · GHScan/TechNotes
Mind the differences between microbenchmarks and real-world execution
- Google codes typically have large instruction footprints. Benchmarks are often cache resident.
- Individual operations were benchmarked in two ways: always triggering a cache hit and always triggering a cache miss. Having explicit benchmarks for the boundary conditions (always-cache-hit and always-cache-miss) gives more insight into how changes to the code affect its operations under different conditions, and helps develop intuition about which operations are important to optimize toward the real goal: improving production performance. (See the sketch after this list.)
- From abseil / Performance Tip of the Week #39: Beware microbenchmarks bearing gifts
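A minimal sketch of the always-hit / always-miss idea, assuming x86, where the miss case is forced by flushing the touched cache line with _mm_clflush. The measured operation here is just a single load; the array, iteration counts, and core assumptions are all illustrative, and the miss figure includes the flush itself, so compare trends rather than absolutes.

```cpp
// Sketch: benchmark one operation (a load) under always-cache-hit and
// always-cache-miss conditions (x86-specific, uses clflush).
#include <immintrin.h>
#include <chrono>
#include <cstdio>

alignas(64) static int data[16];  // one cache line of interest
volatile int sink;                // keeps the load from being optimized away

template <bool kFlush>
double bench_ns_per_op(int iters) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        if (kFlush) { _mm_clflush(&data[0]); _mm_mfence(); }  // evict -> next load misses
        sink = data[0];  // the measured operation
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iters;
}

int main() {
    bench_ns_per_op<false>(1'000'000);  // warm-up
    std::printf("always-hit : %.1f ns/op\n", bench_ns_per_op<false>(1'000'000));
    std::printf("always-miss: %.1f ns/op\n", bench_ns_per_op<true>(1'000'000));
}
```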
Mind the overhead and resolution of timestamping itself
- The typical Stopwatch resolution on Windows is about 300–500ns.
- clock_gettime(CLOCK_MONOTONIC, …) is fast, roughly 80ns, about two orders of magnitude faster than an ordinary system call. (A sketch for checking this on your own machine follows after this list.)
- rdtsc() runs in about 32ns. rdtsc() may give different answers on different cores on the same machine. TSC can appear to run backwards, due to process migration. rdtsc() is not a serializing instruction. Not all cycles take the same amount of time (Turbo Boost; reduced clock frequency for AVX, AVX2, and AVX512 instructions).
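A quick way to check numbers like the ones above on your own machine: take back-to-back clock_gettime readings and look at the minimum delta. A Linux-only sketch; the iteration count is arbitrary.

```cpp
// Sketch: estimate the cost of the timestamping call itself. The minimum
// back-to-back delta approximates the call overhead; larger deltas are noise.
#include <cstdio>
#include <ctime>

int main() {
    long min_delta_ns = 1'000'000'000;
    timespec prev{}, cur{};
    clock_gettime(CLOCK_MONOTONIC, &prev);
    for (int i = 0; i < 1'000'000; ++i) {
        clock_gettime(CLOCK_MONOTONIC, &cur);
        long d = (cur.tv_sec - prev.tv_sec) * 1'000'000'000L
               + (cur.tv_nsec - prev.tv_nsec);
        if (d > 0 && d < min_delta_ns) min_delta_ns = d;  // min rejects interruptions
        prev = cur;
    }
    std::printf("clock_gettime back-to-back minimum: %ld ns\n", min_delta_ns);
}
```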
Producing Wrong Data Without Doing Anything Obviously Wrong: measurement bias is significant and commonplace, and measurement bias is unpredictable
- changing the UNIX environment size changes the location of the call stack, which in turn affects the alignment of local variables in various hardware structures.
- depending on link order, O3 either gives a speedup over O2 or a slowdown over O2; link order affects the alignment of code, causing conflicts within various hardware buffers (e.g. caches) and hardware heuristics (e.g. branch prediction)
CI for performance: Reliable benchmarking in noisy environments
- Measuring and Reducing CPU Usage in SQLite: SQLite also uses Cachegrind to measure micro-level performance, counting CPU cycles (which correlate well with real performance). The main advantage is high repeatability; the drawbacks are discussed as well.
Performance Matters by Emery Berger
- Memory layout affects performance, which makes performance evaluation difficult
- STABILIZER eliminates the effect of layout and enables sound performance evaluation. Stabilizer | Emery Berger
- For multithreaded programs, use a causal profiler (e.g., Coz)
For serial jobs, don’t run on core 0, where many interrupt handlers usually run. See /proc/interrupts. (From MIT 6.172.) A pinning sketch follows below.
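A minimal Linux sketch of that advice: pin the process off core 0 with sched_setaffinity (equivalent to launching under taskset -c 3). Core 3 is an arbitrary choice for illustration.

```cpp
// Sketch: pin the current process to a core other than core 0 before
// benchmarking. g++ on Linux defines _GNU_SOURCE by default, which
// exposes sched_setaffinity from <sched.h>.
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);  // stay off core 0, which services many interrupts
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // pid 0 = this process
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... run the benchmark here, confined to core 3 ...
    return 0;
}
```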
Advice on benchmarking
- https://llvm.org/docs/Benchmarking.html
- How to get consistent results when benchmarking on Linux? | Easyperf
- Microbenchmarking calls for idealized conditions – Daniel Lemire’s blog
- Getting Accurate Results - Algorithmica
- Things to avoid: Gernot’s List of Systems Benchmarking Crimes
- A collection of resources: ECS – Resources on Experimental Computer Science