CIS5650 Summary
2025-04-02 07:30:20

Finished the course. For the assignments I completed the CUDA-related basic requirements (GPU-Programming/CIS565 at main · jsjtxietian/GPU-Programming) and skipped the Vulkan parts.

Lecture1 Overview

image-20241129114855963

Lecture1-2 Cuda introduction

An introduction to basic CUDA concepts.

Scheduling Threads

Blocks execute independently: in any order, in parallel or in series; they can be scheduled in any order by any number of cores, which allows the code to scale with core count.

Chapter 2 - CUDA Programming Model

image-20241129115404986

What actually happens when you launch a kernel, say with 100 blocks of 64 threads each? (A minimal launch sketch follows the list below.)

  • Let’s say on a G80 (i.e. 16 SMs, 8 SPs each)
  • Max threads per block is 512, max threads per SM is 768
  • Chip level scheduling:
    • 100 blocks of 64 threads
    • 768/64 = max 12 blocks can be scheduled for every SM
  • SM level scheduling:
    • 12 blocks = 768 threads = 24 warps
    • Warp scheduler kicks in:
      • Each 32-thread warp issues over the 8 SPs in 4 groups: 8 threads -> 8 threads -> 8 threads -> 8 threads
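To make that concrete, a minimal sketch of such a launch (the kernel and buffer names here are made up for illustration):

```cuda
// 100 blocks of 64 threads; each thread handles one element.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard for the tail
        y[i] = a * x[i] + y[i];
}

// Host side: the <<<blocks, threadsPerBlock>>> configuration from the example.
// saxpy<<<100, 64>>>(100 * 64, 2.0f, d_x, d_y);
```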

CudaThreadingModel

image-20241129150941044

Warp

A warp is a group of threads from a block:

  • For modern GPUs – 32 threads
  • Run on the same SM
  • Unit of thread scheduling (i.e. warp scheduler)
  • Consecutive threadIdx values
  • An implementation detail – in theory
    • exposed as the built-in variable warpSize (see the sketch below)
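A minimal sketch of how a thread can recover its warp and lane from threadIdx using the built-in warpSize (32 on current NVIDIA GPUs):

```cuda
#include <cstdio>

__global__ void warp_info()
{
    int tid  = threadIdx.x;        // assumes a 1D block for simplicity
    int warp = tid / warpSize;     // which warp of the block this thread is in
    int lane = tid % warpSize;     // position within that warp

    if (lane == 0)                 // one printout per warp
        printf("block %d: warp %d starts at thread %d\n", blockIdx.x, warp, tid);
}
```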

2010 gpuArchShaderCores

image-20241129124647836

Memory

image-20241129125401723

Registers: per thread; fast, on-chip, read/write access. Increasing the number of registers used by a kernel lowers thread occupancy: it can decrease the number of threads (well, blocks) available to that SM for scheduling.

Slide 1 register_spilling

Local Memory: refers to memory where registers and other per-thread data are spilled, usually when one runs out of SM resources; “local” because each thread has its own private area; holds arrays declared inside kernels (if the compiler can’t resolve the indexing); not really a separate memory – the bytes are stored in global memory.

Differences from global memory: addressing is resolved by the compiler; stores are cached in L2 (SM 5.x and newer).

Shared Memory: per block, lifetime of the block; fast, on-chip, read/write access; full-speed random access; acts as a user-controlled cache accessible from all threads in the block; allocated statically or at kernel launch (see the sketch below); use it to cache data and reduce redundant global memory accesses, or to improve global memory access patterns; divided into 32 32-bit banks.
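A minimal sketch of the two allocation styles mentioned above (static vs. at kernel launch); the kernel itself is just a dummy copy through shared memory:

```cuda
__global__ void copy_static(const float* in, float* out)
{
    __shared__ float tile[256];            // static: size fixed at compile time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x];
}

__global__ void copy_dynamic(const float* in, float* out)
{
    extern __shared__ float tile[];        // dynamic: size given at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x];
}

// copy_static <<<grid, 256>>>(d_in, d_out);
// copy_dynamic<<<grid, 256, 256 * sizeof(float)>>>(d_in, d_out);  // 3rd arg = shared bytes
```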

Global Memory: long latency (100s of cycles); off-chip, read/write access; random access causes a performance hit; the host can read/write it; reads are cached (compute capability 2.x and newer), but writes invalidate the cache.

Constant Memory: short latency, high bandwidth, read-only access, fastest when all threads access the same location; stored in global memory but cached; the host can read/write it; up to 64 KB.

Lecture3 GPU Architecture

Mainly a comparison of CPU and GPU architectures.

CPU optimized for sequential programming

  • Pipelines, branch prediction, superscalar, OoO
  • Reduce execution time with high clock speeds and high utilization
  • Slow memory is a constant problem

A lot of effort goes into improving IPC and avoiding memory latencies. Attacking throughput is an entirely different game.

Why GPU

  • Graphics workloads are embarrassingly parallel: data-parallel, pipeline-parallel
  • CPU and GPU execute in parallel
  • Hardware: texture filtering, rasterization, etc.

Put simply, SIMD uses explicit vector instructions, e.g. a vectorized loop; SIMT instead just calls the kernel across many individual threads, so the kernel code is essentially scalar, because each thread is a scalar. SIMT also allows branch divergence, although divergent execution is still not efficient.

2010 how a gpu works

image-20241129210728099

AI and Memory Wall

AI and Memory Wall | by Amir Gholami | riselab | Medium

image-20241129210909563

Deep learning requires a large amount of memory: the whole model must be stored for training and inference. Keep lots of memory close to the compute units so that data transfer is as fast as possible.

Lecture4 Parallel Algorithms

See the assignments.

  • Parallel Reduction
  • Scan (Naive and Work Efficient)
  • Applications of Scan
    • Stream Compaction
    • Summed Area Tables (SAT)
    • Radix Sort

An understanding of parallel programming and GPU architecture yields efficient GPU implementations.
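As a flavor of these algorithms, a minimal single-block naive (Hillis-Steele) inclusive scan sketch; the assignment versions are more general (multi-block, work-efficient):

```cuda
// Naive inclusive scan of one block's data (n <= blockDim.x).
// Double-buffered in shared memory; O(n log n) adds, hence "naive".
__global__ void naive_scan(const int* in, int* out, int n)
{
    extern __shared__ int buf[];   // 2 * n ints, size passed at launch
    int* src = buf;
    int* dst = buf + n;
    int t = threadIdx.x;

    if (t < n) src[t] = in[t];
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        if (t < n)
            dst[t] = (t >= offset) ? src[t - offset] + src[t] : src[t];
        __syncthreads();
        int* tmp = src; src = dst; dst = tmp;   // ping-pong the buffers
    }
    if (t < n) out[t] = src[t];
}

// naive_scan<<<1, n, 2 * n * sizeof(int)>>>(d_in, d_out, n);
```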

Lecture5-7 Cuda Performance

Maximize Utilization

  • Warp Partitioning:
    • how threads from a block are divided into warps
    • Minimize divergent branches; good partitioning also lets warps retire early => better hardware utilization (see the divergence sketch below)
    • Make threads per block a multiple of the warp size (32): incomplete warps waste cores; 128-256 threads per block is a good starting point; try to have all threads in a warp execute in lock step
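A small sketch of the divergence point: a condition that varies within a warp forces the warp through both paths, while one that is uniform across each warp does not (illustrative kernel, not course code):

```cuda
__global__ void divergence_demo(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd lanes of the SAME warp take different paths,
    // so each warp executes both branches, one after the other.
    if (threadIdx.x % 2 == 0) data[i] *= 2.0f;
    else                      data[i] += 1.0f;

    // Warp-uniform: the condition is constant across every group of 32 threads,
    // so each warp takes exactly one path and nothing is serialized.
    if ((threadIdx.x / 32) % 2 == 0) data[i] *= 2.0f;
    else                             data[i] += 1.0f;
}
```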

image-20241202162307391

Maximize Memory Throughput

Memory Coalescing (Device Level):

  • Think in terms of parallel thread accesses: you want the threads within an iteration to access contiguous memory; achieve peak bandwidth by requesting large, consecutive locations from DRAM
  • The concurrent accesses of the threads of a warp coalesce into a number of transactions equal to the number of cache lines needed to service all of the threads. The GPU coalesces consecutive reads from a full warp into a single read => read global memory in a coalesce-able fashion into shared memory, then access shared memory randomly at maximum bandwidth
  • L1 is 128-byte aligned and read in multiples of 128 B; L2 is 32-byte aligned and read in multiples of 32 B; L1 is also accessed in 32-byte sectors: 1 sector = 32 bytes, 1 cache line = 4 sectors = 128 bytes (for both L1 and L2)
  • For modern GPUs, L1 is reserved for local memory; accesses to global memory are cached only in L2 (global memory -> L2 caching)
  • How to Access Global Memory Efficiently in CUDA C/C++ Kernels | NVIDIA Technical Blog; KirkHwuTextbook_Chapter5-CudaPerformance.pdf
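A minimal contrast between coalesced and strided access (illustrative kernels; on most hardware the strided version wastes most of each cache line):

```cuda
// Coalesced: consecutive threads touch consecutive addresses, so a warp's
// 32 accesses fall in as few 128-byte cache lines as possible.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart, so a
// warp's accesses spread over many cache lines and bandwidth is wasted.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```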

image-20241202162414573

Bank Conflicts (Warp Level)

  • Shared memory (sometimes called a parallel data cache): multiple threads can access shared memory at the same time; the memory is divided into banks
  • Each bank can service one address per cycle; per-bank bandwidth: 32 bits per clock cycle; successive 32-bit words are assigned to successive banks
  • Bank conflict: two simultaneous accesses to the same bank, but not to the same address => serialized!
  • Fast path: all threads in a warp access different banks; two or more threads in a warp access the same address => broadcast
  • Slow path: multiple threads in a warp access the same bank (see the padding sketch below)
  • shared-memory-and-memory-banks
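A common fix, and the one the transpose exercise points toward: pad the shared-memory tile by one column so a column walk hits 32 different banks (a sketch assuming a square matrix whose side is a multiple of 32; not the lab's exact code):

```cuda
#define TILE_DIM 32

__global__ void transpose_padded(const float* in, float* out, int width)
{
    // The +1 padding column shifts each row by one bank, so reading a column
    // of the tile no longer maps all 32 threads of a warp to the same bank.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;                // swap block coordinates
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];    // coalesced write
}
// Launched with dim3 block(32, 32) and dim3 grid(width / 32, width / 32).
```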

image-20241202163328603

Dynamic Partitioning of SM Resources

  • Performance cliff: increasing resource usage (for example, the number of registers) leads to a dramatic reduction in parallelism, unless doing so hides the latency of global memory accesses
  • Occupancy: the ratio of active warps to the maximum possible warps. Low occupancy = less latency hiding; high occupancy = better performance? (not always)
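The theoretical occupancy of a given kernel and block size can be queried at runtime; a minimal sketch (`my_kernel` is a placeholder):

```cuda
#include <cstdio>

__global__ void my_kernel(float* data) { data[threadIdx.x] *= 2.0f; }

int main()
{
    int blockSize   = 256;
    int blocksPerSM = 0;

    // Resident blocks per SM for this kernel, given its register/shared-memory use.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                                  blockSize, 0 /* dynamic smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = float(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("theoretical occupancy at blockSize %d: %.0f%%\n", blockSize, occupancy * 100.f);
    return 0;
}
```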

image-20241129215754263

Maximize Instruction Throughput

Data Prefetching

  • Independent instructions between a global memory read and its use can hide memory latency
  • Prefetching data from global memory can effectively increase the number of independent instructions between global memory read and use
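A register-level sketch of the pattern (illustrative kernel, not from the course): issue the load for the next iteration before doing the math on the current one, so the two overlap.

```cuda
// Each thread sums `count` elements spaced `stride` apart.
// The load for element t+1 is issued before the math on element t, giving the
// scheduler independent instructions to cover the global-memory latency.
__global__ void prefetch_sum(const float* in, float* out, int count, int stride)
{
    int base = blockIdx.x * blockDim.x + threadIdx.x;

    float cur = in[base];            // prefetch the first element
    float acc = 0.0f;
    for (int t = 0; t < count; ++t) {
        float next = (t + 1 < count) ? in[base + (t + 1) * stride] : 0.0f;  // load early
        acc += cur * cur;            // ...while computing on the current value
        cur = next;
    }
    out[base] = acc;
}
```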

image-20241202162629452

Instruction Mix

  • Special Function Units (SFUs): single-precision floating-point; used to compute __sinf(), __expf(), etc.; modern GPUs have either 16 or 32 SFUs, each running at 1 instruction per clock; use them when speed trumps precision
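A two-line contrast of the trade-off (the intrinsic trades precision for SFU speed):

```cuda
__global__ void sfu_demo(const float* x, float* fast, float* precise, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast[i]    = __sinf(x[i]);   // SFU intrinsic: fast, reduced precision
        precise[i] = sinf(x[i]);     // standard math: slower, full precision
    }
}
// nvcc -use_fast_math maps sinf()/expf()/etc. onto the intrinsic versions globally.
```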

Loop Unrolling

  • No more loop: no loop-count update; no branch; constant indices – no address arithmetic instructions
  • Need to know the trip count at compile time. Increased register usage: at the assembly level you are still doing A + B = C. Instruction cache misses: the code size grows and may use more of the instruction cache, so other instructions may no longer fit
  • Manual unrolling avoids the disadvantages of predication (which itself is only used for branches of a few instructions).
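In CUDA the compiler will do the unrolling when the trip count is a compile-time constant; a minimal sketch:

```cuda
#define ELEMS 8   // trip count must be known at compile time for full unrolling

__global__ void sum8(const float* in, float* out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * ELEMS;

    float acc = 0.0f;
    #pragma unroll                    // body replicated ELEMS times: no counter update,
    for (int k = 0; k < ELEMS; ++k)   // no branch, constant offsets into `in`
        acc += in[base + k];

    out[tid] = acc;
}
```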

Thread Granularity

Coarser threads (each thread doing more work) have more independent instructions -> can leverage ILP (instruction-level parallelism)

More registers and shared memory per thread and block -> lower occupancy
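A sketch of coarser granularity: each thread produces two outputs, so its two load/compute/store chains are independent instructions the scheduler can overlap, at the cost of more registers per thread.

```cuda
__global__ void scale_x2(const float* in, float* out, int n, float s)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);   // two elements per thread

    // The two chains below are independent of each other, exposing ILP.
    if (i     < n) out[i]     = s * in[i];
    if (i + 1 < n) out[i + 1] = s * in[i + 1];
}
// Launch with half as many threads as elements:
// scale_x2<<<((n + 1) / 2 + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
```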

Optimization case study

The Performance Lab walks you through Nsight Compute step by step (the data really is detailed), and through optimizing Transpose and Reduction:

  • Stage 0: Interleaved Addressing
  • Stage 1: Remove Modulo
  • Stage 2: Sequential Addressing & Non-divergent Warps
  • Stage 3: Reducing Warp Wastage, Add on Load
  • Stage 4: Last Warp Unroll
  • Stage 5: Complete Unroll: Use Templates
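A sketch in the spirit of Stages 3-5 (not the lab's exact code; the last warp uses warp shuffles, the modern replacement for the volatile-pointer trick, and it assumes blockSize >= 64):

```cuda
template <unsigned int blockSize>
__global__ void reduce_unrolled(const float* in, float* out, unsigned int n)
{
    __shared__ float sdata[blockSize];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockSize * 2) + tid;

    // Stage 3: add on load, so half as many blocks do the same work.
    float v = (i < n) ? in[i] : 0.0f;
    if (i + blockSize < n) v += in[i + blockSize];
    sdata[tid] = v;
    __syncthreads();

    // Stage 5: blockSize is a template parameter, so these branches are
    // resolved at compile time and the loop disappears entirely.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    // Stage 4: last warp, unrolled with shuffles instead of shared memory.
    if (tid < 32) {
        float val = sdata[tid] + sdata[tid + 32];
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        if (tid == 0) out[blockIdx.x] = val;   // one partial sum per block
    }
}

// Instantiate with the block size, e.g.:
// reduce_unrolled<256><<<grid, 256>>>(d_in, d_partials, n);  // then reduce d_partials
```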

Conclusion

  • Understand CUDA performance characteristics: memory coalescing; divergent branching; bank conflicts; latency hiding
  • Use peak performance metrics to guide optimization: know the peak GFLOPS and memory bandwidth of your GPU; know how to identify the type of bottleneck, e.g. memory, core computation, or instruction overhead
  • Optimize your algorithm first, then unroll loops
  • Use templates gracefully: CUDA supports C++ template parameters on device and host functions; specify the block size as a function template parameter

Lecture8 Advanced Topics in CUDA

Some newer topics:

  • CUDA Unified Memory
  • Faster Memory Transfers
  • Zero Copy
  • CUDA Streams
  • CUDA Graph
  • CUDA Dynamic Parallelism
  • Atomic Functions
  • Warp Functions
  • Cooperative Groups
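As a taste of one of these, a minimal sketch of overlapping transfers and compute with CUDA streams (it assumes h_in/h_out are pinned with cudaMallocHost and that n divides evenly into chunks):

```cuda
#include <cuda_runtime.h>

__global__ void process(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Pipeline H2D copy, kernel, and D2H copy across two streams so that the
// transfers of one chunk overlap with the computation of another.
void run_chunked(const float* h_in, float* h_out, float* d_in, float* d_out,
                 int n, int numChunks)
{
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    int chunk = n / numChunks;                    // assumes it divides evenly
    for (int c = 0; c < numChunks; ++c) {
        cudaStream_t st = streams[c % 2];
        int off = c * chunk;

        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
}
```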

Lecture9-11 Graphics

From the graphics pipeline to forward and deferred shading: a relatively concise introduction, with plenty of reference material.

Evolution: Forward -> Deferred -> Forward. Went from compute-bound to memory-bound; modern game engines combine both techniques.

Nsight Graphics is handy.

Lecture12-13 ML

Basic Linear Algebra Subroutine (BLAS)

  • Standard set of functions for fast matrix operations
  • BLAS implementations are often optimized by hardware manufacturers for speed on a particular machine, using things like vector registers and SIMD instructions
  • Many BLAS compatible libraries:MATLAB, numpy, R, GLM, etc.

cuBLAS is a BLAS implementation written by NVIDIA that uses CUDA

cuDNN is another closed-source library from NVIDIA, built on top of cuBLAS.

  • Provides functions (forward/backward) for: Convolution, Max Pooling, Softmax, Sigmoid/ReLU/Tanh
  • Does not provide a linear layer! (use cuBLAS for that, as sketched below)
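Since cuDNN leaves the linear (fully connected) layer to cuBLAS, a minimal SGEMM call looks roughly like this (cuBLAS expects column-major data, which is the usual gotcha; the function and parameter names here are illustrative):

```cuda
#include <cublas_v2.h>

// Computes C = alpha * A * B + beta * C for column-major M x K and K x N matrices
// already resident on the device; a bias/activation pass would follow separately.
void linear_forward(cublasHandle_t handle,
                    const float* d_A, const float* d_B, float* d_C,
                    int M, int N, int K)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,   // no transposes
                M, N, K,
                &alpha,
                d_A, M,                     // A: M x K, leading dimension M
                d_B, K,                     // B: K x N, leading dimension K
                &beta,
                d_C, M);                    // C: M x N, leading dimension M
}
```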

Computational graphs are a way of representing a neural network’s flow of information; they help us understand which derivatives get backpropagated where. PyTorch and TensorFlow both allow dynamic, just-in-time updates to the computational graph. Many optimizations can be done on computational graphs (layer fusion, quantization of weights, pruning, etc.), but because the graph can change at any time, these optimizations are structurally impossible in PyTorch and TensorFlow v2.0.

Recap:

  • C++ code that directly calls cuBLAS and cuDNN is the industry standard for ML performance on GPU
  • However, unless you’re trying to push the limits of scale, training time is generally much cheaper than inference time
  • For training small models, Pytorch and Tensorflow are great: Lots of resources, easy to prototype, flexible, clean syntax
  • For inference, Pytorch and Tensorflow are not so great: Dynamic updates to computational graph -> can’t optimize
  • TensorRT is a good middle ground: Train model w/ PT & TF, export graph to ONNX, run with TensorRT; Automatically does a lot of perf improvements! Good baseline!

Other

  • A 2023 introduction to Vulkan; I skipped the WebGPU one since I'm not interested
  • Recitations for each assignment, very practical

Ray tracing in 2023

Gray = fixed-function / hardware. Improves over time.
Diamond = some kind of scheduling happening
White = programmable

Ray generation and shading are the developer's responsibility. The workflow is often recursive: shaders can trace rays.

image-20241202164410929

BVHs are not a new idea; they have been around for decades. But the devil is in the details if you want them really fast. There is a lot of research around that, both construction and traversal, from NVIDIA and many others.

Introduction to DirectX Raytracing; IntroDXR_RaytracingShaders

Real-Time_Rendering_4th-Real-Time_Ray_Tracing.pdf

ReSTIR

RTXGI, etc.

Real-time Neural Radiance Caching for Path Tracing | Research

NeRF: NeRFs - YouTube, plus some other links.

LeFohn’s Law:The job of the renderer is not to make the picture; the job of the renderer is to collect enough samples that the AI can make the picture. So sample N+1 should tell the AI as much as possible that it didn’t already know from samples [1..N].

Some advice:

image-20241202171014258

image-20241202171130855

The book Four Thousand Weeks is a good read about the philosophy and goals of time management. If you don’t have time, just read the first chapter. See also: Moving Targets, and Why They’re Bad | Real-Time Rendering.

More books: Ray Tracing Resources Page, raytracinggems.com

Embedded graphics lecture

CIS 5650 - Guest Lecture - Gabe Dagani

image-20241129222932117

image-20241129223021375

image-20241129223126469

image-20241129223527229

image-20241129223707737

image-20241129223735452