LLM - Basic Knowledge List
This post summarizes basic knowledge about LLMs (~50 papers)
Training
- parallelisms, distributed training
- data parallelism
- all-reduce of gradients
- pipeline parallelism
- tensor parallelism
- matrix computations in self-attention (SA) & the feed-forward network (FFN) are composed of neuron (i.e. row/column vector) computations:
take a 2-layer FFN as an example (#neurons = d_ffn):
\(W_1 =[W_1^1, W_1^2, ..., W_1^{d_{ffn}}],\quad W_2 = \begin{bmatrix} W_2^1 \\ W_2^2 \\ \vdots \\ W_2^{d_{ffn}} \end{bmatrix}, \quad \sigma(xW_1) W_2 = \sum_{i=1}^{d_{ffn}} \sigma(xW_1^i) W_2^i\) where \(W_1^i\) is the i-th column of \(W_1\) and \(W_2^i\) is the i-th row of \(W_2\)
- we can gather the neurons into disjoint clusters and dispatch different clusters to different computational units (e.g. GPUs) to implement tensor parallelism (see the sketch below)
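A minimal NumPy sketch of this neuron-cluster decomposition (toy sizes, shards simulated in a loop, no real multi-GPU communication; the final sum stands in for an all-reduce):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn, n_shards = 8, 32, 4          # toy sizes; d_ffn divisible by n_shards

x  = rng.standard_normal((1, d_model))
W1 = rng.standard_normal((d_model, d_ffn))   # partitioned by columns (input side of neurons)
W2 = rng.standard_normal((d_ffn, d_model))   # partitioned by rows (output side of neurons)
sigma = lambda z: np.maximum(z, 0.0)         # elementwise activation (ReLU here)

# reference: the whole FFN on one device
y_ref = sigma(x @ W1) @ W2

# tensor parallelism: each "GPU" owns one cluster of neurons,
# i.e. a slice of W1's columns and the matching slice of W2's rows
partials = []
for s in range(n_shards):
    cluster = slice(s * d_ffn // n_shards, (s + 1) * d_ffn // n_shards)
    partials.append(sigma(x @ W1[:, cluster]) @ W2[cluster, :])

# the all-reduce: sum the partial outputs across shards
y_tp = sum(partials)
assert np.allclose(y_ref, y_tp)
```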
- sequence parallelism
- ring attention, computation-communication overlap
- expert parallelism
- deepspeed-MoE
- zero
- zero-infinity, an offloading system that makes use of CPU/NVMe memory
- training task property
- compute bound
- when the training scale is not extremely large, the computation cost is dominant, rather than the GPU memory access cost or the network communication cost
- the computation cost can be measured in floating-point operations (FLOPs), which are easy to estimate by the rule [backward flops details]:
\[forward\_FLOPs = 2 \times \#tokens \times \#activated\_parameters\] \[backward\_FLOPs = 2 \times forward\_FLOPs\] where "activated" refers specifically to Mixture-of-Experts (MoE) models; if the model is not MoE (i.e. it is a dense model), the activated parameters are simply the total parameters (see the sketch below)
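A tiny Python sketch of this rule (the token and parameter counts below are hypothetical examples):

```python
def train_flops(n_tokens: float, n_activated_params: float) -> float:
    """Rough training cost: forward = 2 * tokens * activated params,
    backward = 2 * forward, so total ~= 6 * tokens * activated params."""
    forward = 2 * n_tokens * n_activated_params
    backward = 2 * forward
    return forward + backward

# e.g. a dense 7B-parameter model trained on 1T tokens (hypothetical numbers)
print(f"{train_flops(1e12, 7e9):.1e} FLOPs")   # ~4.2e22
```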
- memory consuming
- during the training process, other than the parameters, we also need to store the gradients and optimizer states, whose sizes are proportional to the number of parameters
- the "224444" rule: 2+2+4+4+4+4 bytes per parameter for p(fp16)/g(fp16)/p(fp32)/g(fp32)/os1(fp32)/os2(fp32)
- so we need about 20 bytes per parameter, i.e. roughly (20 × #parameters in billions) GB of GPU memory, to store the whole model states (p, g, os1, os2) for fp16-fp32 mixed precision training (see the sketch below)
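A sketch of the same accounting in Python, assuming Adam-style optimizer states (os1 = momentum, os2 = variance) and ignoring activations, buffers, and fragmentation:

```python
BYTES_PER_PARAM = {                      # the 2/2/4/4/4/4 rule above
    "p_fp16": 2, "g_fp16": 2,
    "p_fp32": 4, "g_fp32": 4,
    "os1_fp32": 4, "os2_fp32": 4,
}

def model_state_memory_gb(n_params: float) -> float:
    """GPU memory (GB) needed just for model states in fp16-fp32 mixed precision training."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

print(f"{model_state_memory_gb(7e9):.0f} GB")   # a 7B model needs ~140 GB for model states alone
```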
Inference
- algorithm
- speculative inference
- sequoia
- sparsity
- dejavu
- powerinfer
- moefication
- prompting
- RAG
- reasoning
- CoT
- ToT
- RAP
- serving system
- vllm
- tensorRT
- disaggregation (on heterogeneous hardware settings)
- DistServe
- prefill-decoding disaggregation
- Attention Offloading
- SA-FFN disaggregation
- FlexGen
- offloading
- inference task property
- prefill-decoding 2 stages
- KV cache
- memory bound
- when the batch size is not extremely large (up to roughly 200), the bottleneck of inference speed is GPU memory access for the model weights & KV cache, rather than the computation cost or communication cost
- so it is meaningless to transform a memory-bound task into a GPU-CPU/NVMe IO-bound task, e.g. by designing an offloading system for inference (see the sketch below)
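A back-of-the-envelope sketch of why small-batch decoding is memory bound (the 7B-like shapes below are hypothetical and the byte accounting is rough):

```python
def decode_step_bytes(n_params, n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Approximate bytes read from HBM per decoding step: all weights + the KV cache."""
    weight_bytes = n_params * dtype_bytes
    kv_bytes = batch * seq_len * n_layers * 2 * n_kv_heads * head_dim * dtype_bytes
    return weight_bytes + kv_bytes

def decode_step_flops(n_params, batch):
    """Approximate FLOPs per decoding step: ~2 FLOPs per parameter per token."""
    return 2 * n_params * batch

# hypothetical 7B-like config: 32 layers, 32 KV heads, head_dim 128, 2k-token context, batch 1
flops = decode_step_flops(7e9, batch=1)
bytes_moved = decode_step_bytes(7e9, 32, 32, 128, 2048, batch=1)
print(f"arithmetic intensity ~ {flops / bytes_moved:.1f} FLOPs/byte")
# ~1 FLOP/byte, far below a modern GPU's ops:bytes ratio (on the order of 100+ for fp16)
```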
Fine-Tuning
- supervised fine-tuning (SFT)
- efficiency
- parameter-efficient fine-tuning (PEFT)
- LoRA (low-rank)
- DoRA
- LoRA+
- AdaLoRA
- memory-efficient (training)
- GaLore (low-rank, SVD)
- add noise to reduce overfitting
- NEFT
- FT task property
- compute bound (like training)
- not memory consuming (with parameter/memory-efficient methods)
RLHF PPO
- the goal: human preference alignment
- given a human preference (feedback) dataset
- a reward model r(prompt, generation) is trained in advance in a supervised & contrastive manner; it is used to score how well a generation for a prompt matches human preference during the subsequent RL fine-tuning stage (see the loss sketch below)
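A common form of this supervised & contrastive objective is the pairwise (Bradley-Terry style) loss; a sketch, where \(y_w\) / \(y_l\) denote the human-preferred / dispreferred generations for prompt \(x\):

\[\mathcal{L}_{RM} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)\right]\]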
- 3 stages in RLHF PPO
- reward model training
- LLM SFT
- LLM RL
- MDP problem setting
- state: the prompt plus the tokens generated so far
- action: the next token
- policy: the LLM
- models in the framework
- reward
- output of stage 1, frozen
- reference policy
- output of stage 2, frozen
- policy (actor)
- initialized from the reference policy model, to be optimized
- by an RL algorithm, e.g. PPO, DPO
- PPO
- TRPO objective
- clip / KL penalty regularization
- DPO
- going beyond PPO, the reward can be expressed analytically in terms of the policy via the closed-form optimum of the KL-penalized RLHF objective (see the derivation sketch below)
- so DPO does not require training a separate reward model
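A sketch of that derivation in standard DPO notation (\(\beta\) = KL penalty coefficient, \(Z(x)\) = partition function): the KL-penalized objective is maximized by a Boltzmann-like policy, which can be inverted to write the reward in terms of the policy; the intractable \(Z(x)\) then cancels in pairwise comparisons:

\[\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{ref}(y \mid x)\exp\!\Big(\tfrac{1}{\beta}\, r(x,y)\Big) \;\;\Longrightarrow\;\; r(x,y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)\]
\[\mathcal{L}_{DPO} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]\]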
value (critic)
- initialized from the reward model, to be optimized
- by a temporal difference (TD) MSE loss
Kernel
- flash attention
- 1, 2
- 3
- hopper gpu (overlap cuda core, tensor core, TMA)
- decoding
- kernel fusion
- triton
- unsloth (lora)
Model Architecture
- transformer
- SA
- GQA, MQA, MLA (deepseek)
- sliding window attention (mistral)
- FFN
- MoE
- MoE scaling law, #experts
- SwiGLU (llama)
- positional encoding
- RoPE, which rotates a d-dim vector two dimensions at a time (see the sketch below)
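A minimal NumPy sketch of this pairwise rotation (illustrative only, not an optimized implementation); the last line checks RoPE's relative-position property:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate a d-dim vector (d even) two dims at a time:
    pair (x[2i], x[2i+1]) is rotated by angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)      # one frequency per 2-dim pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# <rope(q, m), rope(k, n)> depends only on the relative position m - n
q, k = np.random.randn(8), np.random.randn(8)
print(np.allclose(rope(q, 5) @ rope(k, 3), rope(q, 7) @ rope(k, 5)))   # True
```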
- normalization
- RMSnorm
- residual link
- early exiting
- efficient transformer (attention)
- linformer
- J-L lemma
- performer
- matmul associativity
- gaussian integral (gaussian distribution pdf integral)
- random fourier/positive feature maps
- variance analysis
- linear attention (see the associativity sketch below)
- hyperattention
- Transformer-VQ
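A minimal NumPy sketch of the matmul-associativity trick behind performer-style / linear attention, with an arbitrary positive feature map \(\phi\) standing in for the random feature maps (causal masking omitted):

```python
import numpy as np

N, d, r = 6, 4, 8                         # sequence length, head dim, feature dim
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

W = rng.standard_normal((d, r))
phi = lambda x: np.exp(0.1 * x @ W)       # some positive feature map (placeholder choice)
Qf, Kf = phi(Q), phi(K)

# quadratic form: build the N x N matrix phi(Q) phi(K)^T, normalize rows, multiply V
A = Qf @ Kf.T
out_quadratic = (A / A.sum(axis=1, keepdims=True)) @ V

# linear form: reassociate (Qf Kf^T) V as Qf (Kf^T V) -> O(N) in sequence length
num = Qf @ (Kf.T @ V)                     # r x d summary instead of an N x N matrix
den = Qf @ Kf.sum(axis=0)                 # per-row normalizer
out_linear = num / den[:, None]

print(np.allclose(out_quadratic, out_linear))   # True
```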
- other structures (RNNs)
- RWKV
- mamba
- TTT
Compression
- kv cache, long context
- CaM
- StreamingLLM
- memory augmented
Quantization
- AWQ
Hardware
- gpu
- specifications
- tflops
- membdw
- ops:bytes (tflops/membdw) ratio of the hardware
- arithmetic intensity (AIT) of an operator
- compare ops:bytes vs AIT => compute bound (AIT > ops:bytes) or memory bound (AIT < ops:bytes); see the sketch below
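A small Python sketch of this comparison (the hardware numbers are hypothetical A100-like values; byte counts are rough):

```python
def is_compute_bound(op_flops: float, op_bytes: float,
                     hw_tflops: float, hw_membw_gbs: float) -> bool:
    """Compare an operator's arithmetic intensity (FLOPs per byte moved)
    with the hardware ops:bytes ratio."""
    ait = op_flops / op_bytes
    ops_to_bytes = (hw_tflops * 1e12) / (hw_membw_gbs * 1e9)
    return ait > ops_to_bytes

M = N = K = 4096
# large fp16 GEMM: 2*M*N*K FLOPs, three 2-byte matrices moved -> compute bound
print(is_compute_bound(2 * M * N * K, 2 * (M * K + K * N + M * N), 312, 2000))   # True
# single-token matvec against a 4096x4096 fp16 weight: AIT ~ 1 -> memory bound
print(is_compute_bound(2 * K * N, 2 * (K * N + K + N), 312, 2000))               # False
```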
- nvlink
- cap
- pcie
- ib
- hierarchy
- memory level - execution grain
- hbm - grid
- l2 cache - block cluster
- shared mem - block, warp
- register mem - thread
- architecture
- tensor core
- cuda core
- TMA
- SM
- warp
- register
- cpu
- thread
- disk
- IO bdw
- page cache