CPU Performance: Single Core
2025-04-02 07:30:20

CPU Microarchitecture

The implementation of Intel 12th-generation core, Golden Cove

[Figure: Golden Cove microarchitecture]

CPU Front-End

Its main purpose is to feed prepared instructions to the CPU Back-End, which is responsible for the actual execution of instructions.

Once a program reaches a steady state, the branch predictor unit (BPU) steers the work of the CPU Frontend. The BPU predicts the target of all branch instructions and steers the next instruction fetch based on this prediction.

The Instruction Decode Queue (IDQ) provides the interface between the in-order frontend and the out-of-order backend.

CPU Backend

The CPU Back-End employs an Out-Of-Order engine that executes instructions and stores results.

The heart of the OOO engine is the 512-entry ReOrder Buffer (ROB): it performs register renaming, allocates execution resources, and tracks speculative execution.

The “Scheduler / Reservation Station” (RS) is the structure that tracks the availability of all resources for a given µop and dispatches the µop to an execution port once it is ready.

The FP/VEC stack handles scalar floating-point and all packed (SIMD) operations. Operations that move values from the INT stack to FP/VEC and vice versa (e.g., convert, extract, or insert) incur additional penalties.
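As a rough illustration (x86 SSE intrinsics from <immintrin.h>; this sketch is mine, not from the source above): moving a value between a general-purpose register and a vector register crosses between the INT and FP/VEC stacks, which is where the extra penalty comes from.

```cpp
#include <immintrin.h>
#include <cstdint>

// Illustration only: each INT <-> FP/VEC crossing below can add a bypass-delay
// penalty on top of the instruction's own latency.
int32_t cross_domains(int32_t x) {
    __m128i v = _mm_cvtsi32_si128(x);   // INT -> VEC: move a GPR value into an XMM register
    v = _mm_add_epi32(v, v);            // stays within the VEC domain
    return _mm_cvtsi128_si32(v);        // VEC -> INT: move the low 32 bits back to a GPR
}
```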

Load-Store Unit

The Load-Store Unit (LSU) is responsible for operations with memory.

The AGU (Address Generation Unit) computes the address needed to access a memory location; both load and store operations require an AGU to perform dynamic address calculation.

Once a load or a store leaves the scheduler, the LSU is responsible for accessing the data. Load operations save the fetched value in a register. Store operations transfer value from a register to a location in memory. LSU has a Load Buffer (also known as Load Queue) and a Store Buffer (also known as Store Queue).

When a memory load request arrives, the LSU queries the L1 cache using the virtual address and looks up the physical address translation in the TLB. Those two operations are initiated simultaneously. In case of an L1 miss, the hardware initiates a query of the (private) L2 cache tags. While the L2 cache is being queried, a 64-byte wide fill buffer (FB) entry is allocated, which will keep the cache line once it arrives. To lower the latency, a speculative query is sent to the L3 cache in parallel with the L2 cache lookup. Also, if two loads access the same cache line, they will hit the same FB: the two loads are “glued” together and only one memory request is initiated.

Streaming stores: the processor tries to combine writes so that an entire cache line is filled before it is written out; these optimizations are done inside the Store Buffer. The LSU also supports store-to-load forwarding: when an older store contains all of the load’s bytes and the store’s data has already been produced and is available in the store queue, the load can be served directly from the store.
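A minimal sketch of the forwarding condition (an optimizing compiler may fold this example away entirely; it only illustrates the two access patterns, it is not a benchmark):

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the store-to-load forwarding condition described above.
uint64_t forwarding_examples() {
    uint64_t buf = 0;

    // Likely forwarded: the 8-byte store contains all bytes of the 8-byte load.
    buf = 0x1122334455667788ull;
    uint64_t a = buf;

    // Likely NOT forwarded: a 1-byte store followed by an 8-byte load of the same
    // location; the store does not contain all of the load's bytes, so the load
    // typically waits until the store data reaches the cache.
    uint8_t* p = reinterpret_cast<uint8_t*>(&buf);
    p[0] = 0xFF;
    uint64_t b;
    std::memcpy(&b, &buf, sizeof b);

    return a + b;
}
```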

Non-temporal memory accesses are special CPU instructions that do not keep the fetched line in the cache and drop it immediately after use.
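A hedged sketch using AVX streaming-store intrinsics (assumes an x86-64 target with AVX; the 32-byte alignment of dst and n being a multiple of 8 are assumptions of this example):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Fill a large buffer with non-temporal (streaming) stores so the written
// lines are not kept in the cache hierarchy.
void fill_streaming(int32_t* dst, std::size_t n, int32_t value) {
    const __m256i v = _mm256_set1_epi32(value);
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_stream_si256(reinterpret_cast<__m256i*>(dst + i), v);
    _mm_sfence();  // order the streaming stores before subsequent normal stores
}
```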

In most high-performance processors, the order of load and store operations is not necessarily required to be the same as the program order, which is known as a weakly ordered memory model.
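Regardless of the hardware ordering model, portable code expresses the ordering it needs explicitly; a minimal C++ message-passing sketch:

```cpp
#include <atomic>

// Without the release/acquire pair, a weakly ordered CPU (and the compiler)
// may reorder the two stores or the two loads.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                    // plain store
    ready.store(true, std::memory_order_release);    // publishes the payload
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) { }  // wait for the flag
    return payload;                                      // guaranteed to observe 42
}
```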

TLB Hierarchy

In case a translation was not found in the TLB hierarchy, it has to be retrieved from the DRAM by “walking” the kernel page tables. The key element to speed up the page walk procedure is a set of Paging-Structure Caches that cache the hot entries in the page table structure.

The Golden Cove microarchitecture has four dedicated page walkers, which allows it to process 4 page walks simultaneously. In the event of a TLB miss, these hardware units will issue the required loads into the memory subsystem and populate the TLB hierarchy with new entries.
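One common software-side way to reduce TLB misses and page walks (not discussed above) is to back large allocations with bigger pages, so a single TLB entry covers more memory; a minimal Linux-specific sketch using transparent huge pages:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Sketch: map an anonymous region and hint the kernel to use 2 MiB pages,
// so one TLB entry covers 2 MiB instead of 4 KiB.
void* alloc_huge(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);  // hint only; the kernel may not honor it
    return p;
}
```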

PMU

The hardware foundation of performance monitoring: most modern PMUs have a set of Performance Monitoring Counters (PMC) that can be used to collect various performance events that happen during the execution of a program.

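A minimal Linux sketch of programming one such counter through perf_event_open (counting retired instructions for the calling thread; error handling kept to a minimum):

```cpp
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // retired instructions
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    // pid = 0 (this thread), cpu = -1 (any), group_fd = -1, flags = 0
    int fd = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile uint64_t sum = 0;
    for (int i = 0; i < 1000000; ++i) sum += i;   // region of interest

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("retired instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```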

Methods

TMAM

Top-Down Microarchitecture Analysis Method (TMAM): a method for identifying CPU bottlenecks in a program.

Pipeline slot:

The Front-end of the pipeline on recent Intel microarchitectures can allocate four uOps per cycle, while the Back-end can retire four uOps per cycle. From these capabilities the abstract concept of a pipeline slot can be derived. A pipeline slot represents the hardware resources needed to process one uOp. The Top-Down Characterization assumes that for each CPU core, on each clock cycle, there are four pipeline slots available. It then uses specially designed PMU events to measure how well those pipeline slots were utilized. The status of the pipeline slots is taken at the allocation point, where uOps leave the Front-end for the Back-end. Each pipeline slot available during an application’s runtime is classified into one of four categories based on the simplified pipeline view described above.

At a conceptual level, TMA identifies what is stalling the execution of a program[1].

There are internal buffers in the CPU that keep track of information about µops that are in flight. If a µop was not allocated during a particular cycle, it could be for one of two reasons: the Front-End was not able to fetch and decode it (Front-End Bound), or the Back-End was overloaded with work and resources for the new µop could not be allocated (Back-End Bound). A µop that was allocated and scheduled for execution but never retired came from a mispredicted path and is counted in the Bad Speculation bucket; µops that do retire fall into the Retiring bucket.
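As a hedged sketch of how the Level-1 buckets can be computed from raw counts, following the formulas in Yasin’s paper[1] (the 4-slots-per-cycle assumption and the parameter names are simplifications; newer cores are wider and expose dedicated metrics instead):

```cpp
// Level-1 TMA breakdown from raw counter values. All "bubble" inputs are
// measured in slots, clocks in core cycles.
struct TmaLevel1 { double frontend_bound, bad_speculation, retiring, backend_bound; };

TmaLevel1 tma_level1(double clocks, double slots_issued, double slots_retired,
                     double fetch_bubbles, double recovery_bubbles) {
    const double total_slots = 4.0 * clocks;   // 4 pipeline slots per cycle assumed
    TmaLevel1 r;
    r.frontend_bound  = fetch_bubbles / total_slots;
    r.bad_speculation = (slots_issued - slots_retired + recovery_bubbles) / total_slots;
    r.retiring        = slots_retired / total_slots;
    r.backend_bound   = 1.0 - r.frontend_bound - r.bad_speculation - r.retiring;
    return r;
}
```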


TMA observes the execution of a program by monitoring a specific set of performance events and then calculates metrics based on predefined formulas. The TMA hierarchy of performance bottlenecks:

[Figure: TMA hierarchy]

We run the workload several times, each time focusing on specific metrics and drilling down until we arrive at a more detailed classification of the performance bottleneck.

Analysis tools such as Intel VTune Profiler, AMD uProf, and Linux perf can calculate all the metrics with a single run of the benchmark.

Roofline

Throughput-oriented performance model: the “roofline” in this model expresses the fact that the performance of an application cannot exceed the machine’s capabilities. Every function and every loop in a program is limited by either the compute or the memory capacity of the machine.
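The model reduces to one formula: attainable performance = min(peak compute, arithmetic intensity × peak memory bandwidth). A small sketch (the numbers in the example comment are placeholders, not measurements):

```cpp
#include <algorithm>

// Arithmetic intensity = FLOPs per byte moved from memory.
// Attainable GFLOP/s = min(peak compute, arithmetic intensity * peak bandwidth).
double roofline_gflops(double flops, double bytes,
                       double peak_gflops, double peak_gbps) {
    const double arithmetic_intensity = flops / bytes;   // FLOP / byte
    return std::min(peak_gflops, arithmetic_intensity * peak_gbps);
}

// Example: a kernel doing 2 FLOPs per 16 bytes (AI = 0.125) on a machine with
// 1000 GFLOP/s peak compute and 100 GB/s peak bandwidth is memory bound at
// roughly 12.5 GFLOP/s.
```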

[Figure: CPU Roofline chart; from: Identify Performance Bottlenecks Using CPU Roofline]

The Roofline Performance Model can be helpful to:

  • Identify performance bottlenecks.
  • Guide software optimizations.
  • Determine when we’re done optimizing.
  • Assess performance relative to machine capabilities.

The Roofline performance model is mainly applicable to HPC applications that have a few compute-intensive loops. It is not recommended for general-purpose applications such as compilers, web browsers, or databases.

Top-Down Microarchitecture Analysis and Roofline performance analysis are usually a good way to start. Most of the time you’ll see a mix of problems, so you have to analyze hotspots case by case.

Mental Model: Four Cornerstones of CPU Performance

Main reference: Four Cornerstones of CPU Performance. | Easyperf

Predictability of Code

How well a CPU can predict the control flow of a program (Branch prediction).

  • Ideal: Every branch outcome is predicted correctly in 100% of the cases.
  • Worst: random control flow patterns (sketched below).
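A small sketch of both extremes (whether the compiler actually emits a conditional move for the second version is not guaranteed):

```cpp
// With random input values the branch below is mispredicted roughly half of
// the time; the branchless form removes the data-dependent branch entirely.
long count_branchy(const int* v, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i)
        if (v[i] < 128) sum += v[i];       // hard-to-predict, data-dependent branch
    return sum;
}

long count_branchless(const int* v, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += (v[i] < 128) ? v[i] : 0;    // often compiled to a conditional move
    return sum;
}
```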

Predictability of Data

How well a CPU can hide the latency of memory accesses by prefetching the data ahead of time.

  • Ideal: Every memory access is served from the closest cache.
  • Worst: random memory access patterns with large strides (sketched below).
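A small sketch of both extremes (the Node type is hypothetical, just for illustration):

```cpp
#include <vector>

// Sequential traversal: the next addresses are trivially predictable, so the
// hardware prefetcher keeps the data flowing into the cache.
long sum_sequential(const std::vector<long>& v) {
    long s = 0;
    for (long x : v) s += x;
    return s;
}

// Pointer chasing through nodes scattered in memory: the next address is only
// known after the current load completes, so every access can miss the cache.
struct Node { long value; Node* next; };
long sum_list(const Node* n) {
    long s = 0;
    for (; n != nullptr; n = n->next) s += n->value;
    return s;
}
```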

Execution Throughput

How well instructions progress through the CPU pipeline. This includes fetching, discovering independent instructions (aka “extracting parallelism”), and issuing and executing instructions in parallel.

  • Ideal: No stalls in the execution pipeline, 100% of the CPU bandwidth is utilized.
  • Worst: Instructions compete for a particular execution resource (sketched below).
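A small sketch: even fully independent operations can be throughput-limited when they all need the same execution unit (the single, slow divider unit is typical of current x86 cores, but that is an assumption of this example):

```cpp
// Divisions typically map to one partially pipelined divider, so this loop is
// limited by that unit no matter how wide the core is (divisors assumed nonzero).
long many_divisions(const long* a, const long* b, int n) {
    long s = 0;
    for (int i = 0; i < n; ++i) s += a[i] / b[i];   // independent, but all need the divider
    return s;
}

// Additions can issue on several ports per cycle, so the same loop structure
// spends far less time per element.
long many_additions(const long* a, const long* b, int n) {
    long s = 0;
    for (int i = 0; i < n; ++i) s += a[i] + b[i];
    return s;
}
```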

Execution Latency (Data Dependency Chains)

How well a CPU can process a long sequence of instructions, where each of them depends on a previous one.

  • Ideal: a massively parallel application with few/short dependency chains.
  • Worst: a long sequence of dependent instructions, e.g., pointer chasing (sketched below).
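A small sketch of breaking a dependency chain with multiple accumulators (the four-accumulator version assumes the compiler is not already allowed to reassociate the floating-point sum):

```cpp
// One long dependency chain: every addition waits for the previous one, so the
// loop is limited by add latency rather than by the number of adders.
double sum_single_chain(const double* v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += v[i];
    return s;
}

// Several independent accumulators break the chain and let additions overlap.
double sum_four_chains(const double* v, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```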

Case

Some simple optimization examples, collected over time.

Memory Bound

Keep Brendan Gregg's latency diagram in mind:

[Figure: latency]

Core Bound

Bad Speculation

Frontend Bound

Main reference: Machine code layout optimizations. | Easyperf

Other

More

Feedback and discussion are welcome via issues: Issues · jsjtxietian/jsjtxietian.github.io


  1. Ahmad Yasin. A Top-Down Method for Performance Analysis and Counters Architecture. ISPASS 2014, pages 35–44, March 2014. ISBN 978-1-4799-3606-9. doi: 10.1109/ISPASS.2014.6844459.