A GPU (Graphics Processing Unit) is a massively parallel processor optimized for high‑throughput, data‑parallel workloads. Originally designed for rasterization and pixel shading, modern GPUs accelerate gaming graphics, real‑time ray tracing, video processing, scientific compute, and neural network training/inference.

Core hardware architecture elements
- Compute Blocks (SM / CU / Xe Core): NVIDIA’s Streaming Multiprocessor (SM), AMD’s Compute Unit (CU), Intel’s Xe Core—each contains many ALUs, special units, and scheduling logic.
- Scalar/Vector ALUs (CUDA cores / Stream processors): The basic arithmetic units for integer and floating‑point ops.
- SIMD/Warp/Wavefront Scheduling: Threads are grouped into warps (NVIDIA, 32 threads) or wavefronts (AMD, 32/64 threads) that execute in lockstep; when lanes within one group take different branches, the hardware serializes both paths, reducing utilization (a divergence sketch follows this list).
- Texture Units / Sampling Units: Hardware for texture fetch, filtering, and sampling in graphics pipelines.
- Ray Tracing Units (RT Cores / Ray Accelerators): Dedicated units for BVH traversal, ray/triangle intersection, and ray sorting.
- Matrix / Tensor Accelerators (Tensor Cores, Matrix Engines): Specialized ALUs for mixed‑precision matrix multiply‑accumulate (crucial for DL acceleration).
- Caches, Shared Memory, Register Files: L1 and L2 caches, on‑SM shared memory, and large register files keep working data on‑chip to reduce off‑chip memory traffic (a shared‑memory sketch follows this list).
- Memory Controller & Memory Types: GDDR6/6X for consumer GPUs, HBM2/3 for HPC/AI; memory bandwidth is often the limiting factor.
- Interconnects (NVLink / Infinity Fabric / CXL): High‑bandwidth links for multi‑GPU scaling and coherent memory models.
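The divergence point above is easiest to see in code. A minimal CUDA sketch, where the kernel names and the even/odd split are illustrative rather than taken from any real codebase:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Every 32 consecutive threads form a warp that executes in lockstep.
// Branching on threadIdx.x % 2 splits each warp down the middle: the
// hardware serializes the two paths, masking off half the lanes each
// time, so arithmetic utilization drops toward 50%.
__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)
        out[i] = in[i] * 2.0f;          // even lanes execute first...
    else
        out[i] = sqrtf(in[i]) + 1.0f;   // ...then odd lanes, serialized
}

// Branching on a warp-uniform value (here the block index) keeps all
// lanes of any given warp on the same path, so nothing is serialized.
__global__ void uniform(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (blockIdx.x % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = sqrtf(in[i]) + 1.0f;
}
```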
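Similarly for the shared-memory point: the sketch below stages a block-sized tile in on‑SM shared memory so a sum reduction touches DRAM only once per input element (a 256‑thread block is assumed for illustration):

```cuda
// Launch with 256 threads per block to match the tile size.
__global__ void block_sum(const float* in, float* block_out, int n) {
    __shared__ float tile[256];                  // on-SM scratchpad, one per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one global load per element
    __syncthreads();

    // Tree reduction entirely in shared memory: log2(256) = 8 steps
    // with no further off-chip traffic until the final store.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_out[blockIdx.x] = tile[0];         // one global store per block
}
```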
Software, ISA and programming models
- Vendor stacks: NVIDIA CUDA (PTX / SASS) — most mature GPU compute ecosystem; AMD ROCm and Intel oneAPI are alternative stacks.
- Cross‑platform APIs: OpenCL, Vulkan (compute + graphics), DirectX 12/12 Ultimate (graphics + DXR).
- ML frameworks & runtimes: PyTorch and TensorFlow (with backends such as cuDNN, TensorRT, and MIOpen); the ONNX interchange format and the XLA compiler further improve portability.
- Hardware instruction extensions: tensor/matrix instructions, mixed‑precision FP16/BF16/INT8 paths, structured‑sparsity support, and ray‑tracing opcodes (a tensor‑instruction sketch follows this list).
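To make the tensor-instruction bullet concrete, here is a minimal sketch using CUDA's public WMMA API (compute capability 7.0+; the 16x16x16 tile and row/column-major layouts are simply the most basic supported configuration). A single warp cooperatively computes a 16x16 FP16 product with FP32 accumulation on the Tensor Cores:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Launch with one warp (e.g. <<<1, 32>>>): all 32 threads jointly own
// pieces of each fragment and issue the matrix op together.
__global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);        // FP32 accumulator, zeroed
    wmma::load_matrix_sync(a_frag, a, 16);    // warp-cooperative tile loads
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

In practice this is rarely hand-written; cuBLAS, cuDNN, and CUTLASS emit these instruction sequences, which is much of why ecosystem maturity matters below.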
Leading vendors and representative architectures (2023–2025)
- NVIDIA: Ada Lovelace (GeForce RTX 40), Hopper (H100 for data‑center AI), Blackwell (B200/GB200 for large‑scale AI). Strengths: mature CUDA ecosystem, powerful Tensor Cores, NVLink, extensive tooling (TensorRT, cuDNN).
- AMD: RDNA (RX 6000/7000 gaming GPUs), CDNA (MI100/MI200/MI300 for compute/AI). Strengths: competitive raster/compute performance, ROCm ecosystem, Infinity Fabric multi‑die interconnect.
- Intel: Arc consumer GPUs and Xe‑HPC (Data Center GPU Max, "Ponte Vecchio") for the data center; emphasis on media engines and broad platform integration via oneAPI.
- Apple: Integrated GPUs in Apple Silicon (tight Metal integration, unified memory for energy efficiency).
- Specialized AI accelerators: Graphcore, Cerebras, Habana (now Intel's Gaudi line), and various startup chips focused on model parallelism and extreme memory bandwidth.
Process technology, transistors and packaging impact
- Process node: Moving from 7nm → 5nm → 3nm enables more SMs/CUs, larger caches, and higher clocks, but raises power density and intensifies thermal challenges.
- Transistor structures: FinFET remains dominant at current nodes; gate‑all‑around (GAA) designs (Intel's RibbonFET, nanosheet transistors elsewhere) are arriving for improved gate control and lower leakage.
- Packaging: 2.5D (interposer, CoWoS) and 3D stacking (TSV/HBM stacks, Intel Foveros) enable HBM integration and multi‑die GPUs.
- Memory choices: HBM provides extreme bandwidth and energy efficiency for HPC/AI; GDDR6/6X is cost‑effective for gaming.
- Interconnect fabrics & bridges: NVLink, proprietary bridges, and CXL enable large aggregated memory pools and low‑latency multi‑GPU scaling.
Performance metrics and selection guidance
- Compute throughput (TFLOPS): Peak FP32/FP16/BF16/INT8 figures are useful, but delivered performance is often capped by memory bandwidth or kernel efficiency (see the roofline bound after this list).
- Memory bandwidth & capacity: Critical for large models, high‑res textures, and dataset throughput; HBM solutions have much higher effective bandwidth per watt.
- Latency vs throughput: Real‑time rendering and gaming need low latency; batch training values sustained throughput.
- Energy efficiency (Perf/W): Key for laptops and data centers; heterogeneous designs (on‑die NPU + GPU) can improve perf/W.
- Ecosystem maturity: CUDA and its tooling still dominate for ML; ROCm and Vulkan compute are maturing alternatives.
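The first two bullets combine into the standard roofline bound, a quick sanity check on whether a kernel is compute‑ or memory‑bound (the figures below are round illustrative numbers, not any specific SKU):

```latex
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; B \times I\bigr),
\qquad I = \frac{\text{FLOPs performed}}{\text{bytes moved}}
```

For example, a part with a 100 TFLOP/s FP16 peak and 2 TB/s of bandwidth is memory‑bound for any kernel with arithmetic intensity I below 100,000 / 2,000 = 50 FLOP/byte. Element‑wise ops sit well under 1 FLOP/byte, while large dense matrix multiplies sit far above 50, which is why the headline TFLOPS figure alone rarely predicts delivered performance.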
Practical picks by use case:
- Gaming: high FP32 throughput, strong drivers—NVIDIA RTX 40 series or AMD RDNA 3/4 equivalents.
- Training large ML models: NVIDIA H100/Blackwell or AMD MI300 (CDNA 3), with HBM on package and NVLink/InfiniBand for multi‑GPU scaling.
- Inference / Edge AI: power‑efficient Ampere/Ada mobile GPUs with Tensor Cores, or specialized accelerators with INT8/BF16 support.
- Content creation: GPUs with solid media engines, large VRAM, and driver support across creative apps.
Modern trends and future directions
- AI‑first GPU designs with larger matrix engines and sparsity support.
- Tighter heterogeneous integration: CPU + GPU + NPU in single packages or coherent memory domains (CXL, unified memory).
- Wider adoption of HBM and chiplet multi‑die GPUs for scalability.
- Open software efforts (ROCm, Vulkan compute, oneAPI) to reduce vendor lock‑in.
- Hardware‑accelerated ray tracing and real‑time denoising becoming standard for mainstream GPUs.