CPU Microarchitecture
The descriptions below are based on Golden Cove, the performance-core microarchitecture of Intel’s 12th-generation (Alder Lake) processors.
CPU Front-End
Its main purpose is to feed prepared instructions to the CPU Back-End, which is responsible for the actual execution of instructions.
Once a program reaches a steady state, the branch predictor unit (BPU) steers the work of the CPU Front-End. The BPU predicts the target of all branch instructions and steers the next instruction fetch based on this prediction.
The Instruction Decode Queue (IDQ) provides the interface between the in-order frontend and the out-of-order backend.
CPU Back-End
The CPU Back-End employs an Out-Of-Order engine that executes instructions and stores results.
The heart of the OOO engine is the 512-entry ReOrder Buffer (ROB): it performs register renaming, allocates execution resources, and tracks speculative execution.
The “Scheduler / Reservation Station” (RS) is the structure that tracks the availability of all resources for a given µop and dispatches the µop to an execution port once it is ready.
The FP/VEC stack does floating-point scalar and all packed (SIMD) operations. Operations that move values from the INT stack to FP/VEC and vice-versa (e.g., convert, extract, or insert) incur additional penalties.
Load-Store Unit
The Load-Store Unit (LSU) is responsible for operations with memory.
AGU stands for Address Generation Unit, which is required to access a memory location. AGU is required for both load and store operations to perform dynamic address calculation.
Once a load or a store leaves the scheduler, the LSU is responsible for accessing the data. Load operations save the fetched value in a register. Store operations transfer a value from a register to a location in memory. The LSU has a Load Buffer (also known as Load Queue) and a Store Buffer (also known as Store Queue).
When a memory load request arrives, the LSU queries the L1 cache using a virtual address and looks up the physical address translation in the TLB. Those two operations are initiated simultaneously. In case of an L1 miss, the hardware initiates a query of the (private) L2 cache tags. While the L2 cache is being queried, a 64-byte wide fill buffer (FB) entry is allocated, which will keep the cache line once it arrives. As a way to lower the latency, a speculative query is sent to the L3 cache in parallel with the L2 cache lookup. Also, if two loads access the same cache line, they will hit the same FB. Such two loads will be “glued” together and only one memory request will be initiated.
For streaming stores, the processor tries to combine writes to fill an entire cache line. All these optimizations are done inside the Store Buffer. The LSU supports store-to-load forwarding when there is an older store containing all of the load’s bytes, and the store’s data has been produced and is available in the store queue.
Non-temporal memory accesses are special CPU instructions that do not keep the fetched line in the cache and drop it immediately after use.
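As a rough illustration, here is a minimal sketch of streaming (non-temporal) stores via compiler intrinsics, assuming AVX, a 32-byte-aligned destination, and a length that is a multiple of 8 floats; the written lines bypass the cache hierarchy, which helps for large buffers that will not be read again soon:

```cpp
#include <immintrin.h>
#include <cstddef>

// Fill a large buffer using non-temporal (streaming) stores so the written
// lines are not kept in the cache. `dst` is assumed to be 32-byte aligned
// and `count` a multiple of 8.
void fill_nt(float* dst, std::size_t count, float value) {
    __m256 v = _mm256_set1_ps(value);
    for (std::size_t i = 0; i < count; i += 8)
        _mm256_stream_ps(dst + i, v);   // bypasses the cache hierarchy
    _mm_sfence();                       // make the streaming stores globally visible
}
```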
In most high-performance processors, the order of load and store operations is not required to match the program order, which is known as a weakly ordered memory model.
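The classic store-buffer litmus test below is a sketch of what such reordering means in practice (an illustration, not part of the original text): with relaxed atomics, both threads can end up with r1 == 0 && r2 == 0, an outcome impossible under any interleaving of the program order.

```cpp
#include <atomic>
#include <thread>

// Store-buffer litmus test: each thread stores to one flag and then loads the
// other. Because stores may be buffered and loads may pass them, both loads
// can observe the initial value 0.
std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t1() { x.store(1, std::memory_order_relaxed); r1 = y.load(std::memory_order_relaxed); }
void t2() { y.store(1, std::memory_order_relaxed); r2 = x.load(std::memory_order_relaxed); }

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // r1 == 0 && r2 == 0 is a legal outcome here; sequentially consistent
    // operations (or full fences) would rule it out.
}
```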
TLB Hierarchy
In case a translation was not found in the TLB hierarchy, it has to be retrieved from the DRAM by “walking” the kernel page tables. The key element to speed up the page walk procedure is a set of Paging-Structure Caches that cache the hot entries in the page table structure.
The Golden Cove microarchitecture has four dedicated page walkers, which allows it to process 4 page walks simultaneously. In the event of a TLB miss, these hardware units will issue the required loads into the memory subsystem and populate the TLB hierarchy with new entries.
PMU
The hardware basis of performance monitoring: most modern PMUs have a set of Performance Monitoring Counters (PMCs) that can be used to collect various performance events that happen during the execution of a program.
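On Linux, PMCs are programmed through the perf_event_open system call. The sketch below counts retired instructions around a toy workload; it is a minimal illustration only and omits error handling:

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstdint>

// Open one hardware counter for the calling process on any CPU,
// initially disabled and counting user-space only.
static int perf_open(std::uint32_t type, std::uint64_t config) {
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return static_cast<int>(syscall(SYS_perf_event_open, &attr,
                                    0 /*this process*/, -1 /*any cpu*/,
                                    -1 /*no group*/, 0));
}

int main() {
    int fd = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long sum = 0;
    for (int i = 0; i < 1000000; ++i) sum += i;   // workload under measurement

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = 0;
    (void)read(fd, &count, sizeof(count));        // default format: one u64 value
    std::printf("instructions retired: %lld\n", count);
    close(fd);
}
```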
Methodology
TMAM
The Top-down Microarchitecture Analysis Method is used for identifying CPU bottlenecks in a program.
Pipeline slot:
The Front-end of the pipeline on recent Intel microarchitectures can allocate four uOps per cycle, while the Back-end can retire four uOps per cycle. From these capabilities the abstract concept of a pipeline slot can be derived. A pipeline slot represents the hardware resources needed to process one uOp. The Top-Down Characterization assumes that for each CPU core, on each clock cycle, there are four pipeline slots available. It then uses specially designed PMU events to measure how well those pipeline slots were utilized. The status of the pipeline slots is taken at the allocation point, where uOps leave the Front-end for the Back-end. Each pipeline slot available during an application’s runtime will be classified into one of four categories based on the simplified pipeline view described above.
At a conceptual level, TMA identifies what is stalling the execution of a program[1].
There are internal buffers in the CPU that keep track of information about µops in flight. If a µop was not allocated during a particular cycle, there are two possible reasons: either the Front-End was not able to fetch and decode it (Front End Bound), or the Back-End was overloaded with work and resources for the new µop could not be allocated (Back End Bound). A µop that was allocated and scheduled for execution but never retired came from a mispredicted path and falls into the Bad Speculation bucket.
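Assuming the classic four-wide level-1 formulas from Yasin’s paper[1], the four buckets can be computed from a handful of counters (exact event names differ between microarchitectures); a sketch:

```cpp
// Level-1 TMA metrics following the classic formulas: all four buckets are
// fractions of the total pipeline slots (4 per cycle per core).
struct TmaCounters {
    double cycles;              // CPU_CLK_UNHALTED.THREAD
    double uops_issued;         // UOPS_ISSUED.ANY
    double uops_retired;        // UOPS_RETIRED.RETIRE_SLOTS
    double uops_not_delivered;  // IDQ_UOPS_NOT_DELIVERED.CORE
    double recovery_cycles;     // INT_MISC.RECOVERY_CYCLES
};

struct TmaLevel1 {
    double frontend_bound, bad_speculation, retiring, backend_bound;
};

TmaLevel1 tma_level1(const TmaCounters& c) {
    const double slots = 4.0 * c.cycles;  // 4 pipeline slots per cycle
    TmaLevel1 r;
    r.frontend_bound  = c.uops_not_delivered / slots;
    r.bad_speculation = (c.uops_issued - c.uops_retired + 4.0 * c.recovery_cycles) / slots;
    r.retiring        = c.uops_retired / slots;
    r.backend_bound   = 1.0 - (r.frontend_bound + r.bad_speculation + r.retiring);
    return r;
}
```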
TMA observes the execution of a program by monitoring a specific set of performance events and then calculates metrics based on predefined formulas. The TMA hierarchy of performance bottlenecks:
We run the workload several times, each time focusing on specific metrics and drilling down until we get to a more detailed classification of the performance bottleneck.
Analysis tools such as Intel VTune Profiler, AMD uProf, and Linux perf can calculate all the metrics with a single run of the benchmark.
Roofline
Throughput-oriented performance model: the “roofline” in this model expresses the fact that the performance of an application cannot exceed the machine’s capabilities. Every function and every loop in a program is limited by either the compute or the memory capacity of the machine.
Figure from: Identify Performance Bottlenecks Using CPU Roofline
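The model boils down to a single formula: a kernel’s attainable performance is the minimum of the machine’s peak compute rate and its arithmetic intensity (FLOPs performed per byte moved) times peak memory bandwidth. A one-line sketch:

```cpp
#include <algorithm>

// Roofline model: performance is capped either by peak compute or by how fast
// memory bandwidth can feed the arithmetic.
double attainable_flops(double peak_flops, double peak_bytes_per_s,
                        double flops_per_byte /* arithmetic intensity */) {
    return std::min(peak_flops, peak_bytes_per_s * flops_per_byte);
}
```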
The Roofline Performance Model can be helpful to:
- Identify performance bottlenecks.
- Guide software optimizations.
- Determine when we’re done optimizing.
- Assess performance relative to machine capabilities.
The Roofline performance model is mainly applicable to HPC applications that have a few compute-intensive loops. It is not recommended for general-purpose applications, such as compilers, web browsers, or databases.
Top-Down microarchitecture analysis and Roofline performance analysis are usually a good way to start. Most of the time you’ll see a mix of problems, so you have to analyze hotspots case by case.
Mental Model: Four Cornerstones of CPU Performance
Main reference: Four Cornerstones of CPU Performance. | Easyperf
Predictability of Code
How well a CPU can predict the control flow of a program (Branch prediction).
- Ideal: Every branch outcome is predicted correctly in 100% of the cases.
- Worst: random control flow patterns.
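A sketch of the classic illustration: the same data-dependent branch is cheap on sorted data (the predictor learns the pattern) and expensive on random data (close to 50% mispredictions). Function name, threshold, and sizes are made up for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Sums only the elements above a threshold; the `if` is a data-dependent branch.
std::int64_t sum_above_threshold(const std::vector<int>& v) {
    std::int64_t sum = 0;
    for (int x : v)
        if (x >= 128)            // predictable on sorted data, random otherwise
            sum += x;
    return sum;
}

int main() {
    std::vector<int> v(1 << 20);
    std::mt19937 rng(42);
    for (int& x : v) x = rng() % 256;
    sum_above_threshold(v);      // random data: many branch mispredictions
    std::sort(v.begin(), v.end());
    sum_above_threshold(v);      // sorted data: near-perfect prediction
}
```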
Predictability of Data
How well a CPU can hide the latency of memory accesses by prefetching the data ahead of time.
- Ideal: Every memory access is served from the closest cache.
- Worst: random memory access patterns with large strides.
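A minimal sketch of the contrast: the same elements are summed twice, once sequentially (prefetcher-friendly) and once through a shuffled index array (random accesses that defeat the prefetcher). Sizes are arbitrary:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;
    std::vector<std::uint32_t> data(n, 1);
    std::vector<std::uint32_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0u);
    std::shuffle(idx.begin(), idx.end(), std::mt19937(42));

    std::uint64_t seq = 0, rnd = 0;
    for (std::size_t i = 0; i < n; ++i) seq += data[i];       // sequential: prefetcher hides latency
    for (std::size_t i = 0; i < n; ++i) rnd += data[idx[i]];  // random: mostly cache/TLB misses
    return static_cast<int>(seq + rnd);                       // keep results alive
}
```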
Execution Throughput
How well instructions progress through the CPU pipeline. This includes fetching, discovering independent instructions (aka “extracting parallelism”), and issuing and executing instructions in parallel.
- Ideal: No stalls in the execution pipeline, 100% of the CPU bandwidth is utilized.
- Worst: Instructions compete for a particular execution resource.
Execution Latency (Data Dependency Chains)
How well a CPU can process a long sequence of instructions, where each of them depends on a previous one.
- Ideal: a massively parallel application with few/short dependency chains.
- Worst: a long sequence of dependent instructions, e.g., pointer chasing.
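A small sketch of the difference: the first sum forms one long dependency chain (each addition waits for the previous one, so the loop is bounded by FP-add latency), while splitting the work across independent accumulators exposes parallelism:

```cpp
#include <cstddef>
#include <vector>

// Latency-bound: a single serial chain of floating-point additions.
double sum_single(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum;
}

// Throughput-friendly: four independent chains can execute in parallel.
double sum_unrolled(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];  // remainder
    return (s0 + s1) + (s2 + s3);
}
```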
Cases
Some simple optimization examples, collected over time.
Memory Bound
- Cache-Friendly Data Structures
- Access data sequentially
- Packing the data
- Aligning and padding
- mem_alignment_1 at master
- misaligned-access
- Techniques such as structure splitting and pointer inlining
- Tune the code for memory hierarchy: loop blocking (tiling), cache-oblivious algorithms…
- Explicit Memory Prefetching: __builtin_prefetch (see the sketch after this list)
- Optimizing for the DTLB: huge pages
- Memory order violation
- Dynamic memory allocation: jemalloc, arena…
- Just swap in mimalloc
- C++ PMR
- Be aware of some hardware effects
- 4K aliasing in store-to-load forwarding: 4k-aliasing
- Cache conflicts in an 8-way associative cache: cache-conflicts
- DRAM refresh: dram-refresh
- Write combining: write-combining
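As promised above, a minimal sketch of explicit prefetching with __builtin_prefetch (GCC/Clang) for an irregular, index-driven access pattern that the hardware prefetcher cannot follow; gather_sum and the prefetch distance of 16 are made-up illustration values that need tuning:

```cpp
#include <cstddef>

// Prefetch a future element of the gather while processing the current one,
// hiding part of the cache-miss latency.
long gather_sum(const long* data, const int* indices, std::size_t n) {
    constexpr std::size_t kPrefetchDistance = 16;  // hypothetical tuning knob
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kPrefetchDistance < n)
            __builtin_prefetch(&data[indices[i + kPrefetchDistance]], /*rw=*/0, /*locality=*/1);
        sum += data[indices[i]];
    }
    return sum;
}
```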
Keep Brendan Gregg’s diagram in mind:
Core Bound
- Inlining functions
- Loop Optimizations
- Loop Invariant Code Motion, Loop Unrolling, Loop Strength Reduction, and Loop Unswitching
- Loop Interchange, Loop Blocking (Tiling), and Loop Fusion and Distribution (Fission)
- compiler_intrinsics_1 at master
- compiler_intrinsics_2 at master
- Auto Vectorization
- vectorization_1 at master
- vectorization_2 at master
- clang:
-Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
Use these to see why a loop was not auto-vectorized (a vectorizable-loop sketch follows this list). Also see mwish’s series of articles: SIMD Extensions and AVX
- Vectorization part7. Tips for writing vectorizable code. | Easyperf
- Break dependency chains
- Be aware of some hardware effects
- Performance degradation from denormal floats: floating-point
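The sketch referenced above: a loop that auto-vectorizes cleanly because it has no loop-carried dependency, unit strides, and __restrict pointers that rule out aliasing. Compile with the clang remarks flags listed earlier to confirm whether vectorization happened:

```cpp
#include <cstddef>

// Classic saxpy kernel; `__restrict` (a GCC/Clang extension) tells the
// compiler the two arrays do not alias, which removes a vectorization blocker.
void saxpy(float* __restrict y, const float* __restrict x, float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```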
Bad Speculation
- lookup table
- branchless (see the sketch after this list)
- Other
- Remove virtual functions: virtual_call_mispredict at master
- Remove conditional stores: conditional_store_1 at master
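The branchless sketch referenced above: the data-dependent branch is replaced by arithmetic on the condition, so there is nothing to mispredict (compilers typically emit setcc/cmov). Function names are made up for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Branchy version: an unpredictable condition causes frequent mispredictions.
std::uint64_t count_negative_branchy(const std::int32_t* v, std::size_t n) {
    std::uint64_t cnt = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (v[i] < 0) cnt++;          // data-dependent branch
    return cnt;
}

// Branchless version: the condition becomes a 0/1 value added unconditionally.
std::uint64_t count_negative_branchless(const std::int32_t* v, std::size_t n) {
    std::uint64_t cnt = 0;
    for (std::size_t i = 0; i < n; ++i)
        cnt += (v[i] < 0);            // no branch to mispredict
    return cnt;
}
```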
Frontend Bound
Main reference: Machine code layout optimizations. | Easyperf
- Basic block placement: maintain fall-through between hot pieces of the code. Not-taken branches are fundamentally cheaper than taken ones. Additionally, the fall-through layout better utilizes the L1 I-cache and the µop cache (DSB).
- Basic block alignment: shift the hot code down using NOPs so that the whole loop will reside in one cache line.
- Function splitting: separate the hot path from rarely executed (cold) code so that the next hot instruction resides in the same cache line. This improves utilization of CPU Front-End data structures like the I-cache and DSB-cache (see the sketch after this list).
- Function grouping:place hot functions together such that they touch each other in the same cache line.
- With the gold linker this can be done using the --section-ordering-file option. hhvm/hphp/tools/hfsort at master · facebook/hhvm
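The sketch referenced in the function-splitting item, using GCC/Clang attributes; process_items and report_error are hypothetical names. The cold error path is moved into a separate non-inlined function so the hot loop stays dense in the I-cache and µop cache:

```cpp
// Rarely executed error path, explicitly marked cold and kept out of the hot body.
__attribute__((noinline, cold)) static void report_error(int item) {
    // slow path: logging, cleanup, etc.
    (void)item;
}

void process_items(const int* items, int n) {
    for (int i = 0; i < n; ++i) {
        if (__builtin_expect(items[i] < 0, 0)) {  // hint: error path is unlikely
            report_error(items[i]);
            continue;
        }
        // hot path work here
    }
}
```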
Other
- LTO
- PGO
- Hybrid big/little cores
More
- Denis Bakhvalov’s blog and his book perf-book
- Daniel Lemire’s blog
- Kobzol/hardware-effects: Demonstration of various hardware effects.
- Algorithms for Modern Hardware - Algorithmica
Feel free to open an issue to discuss: Issues · jsjtxietian/jsjtxietian.github.io
1. Ahmad Yasin. A Top-Down Method for Performance Analysis and Counters Architecture. ISPASS 2014, pages 35–44. ISBN 978-1-4799-3606-9. doi:10.1109/ISPASS.2014.6844459.