LLM - Basic Knowledge List

This post summarizes basic knowledge about LLMs (~50 papers).

Training

  • parallelisms, distributed training
    • data parallelism
      • allreduce grad
    • pipeline parallelism

    • tensor parallelism
      • the matrix computations in self-attention (SA) & the feed-forward network (FFN) decompose into per-neuron (i.e. row/column vector) computations:

        take a 2-layer FFN as an example (#neurons = d_ffn):

        \(W_1 =[W_1^1, W_1^2, ..., W_1^{d_{ffn}}],\quad W_2 = \begin{bmatrix} W_2^1 \\W_2^2 \\ \vdots \\W_2^{d_{ffn}} \\ \end{bmatrix} , \quad \sigma(xW_1) W_2 = \sum_{i=1}^{d_{ffn}} \sigma(xW_1^i) W_2^i\) where

      \[x \in R^{*\times d},\, W_1^i \in R^{d\times1},\, W_2^i \in R^{1\times d},\, \forall i\]
      • we can gather the neurons into (disjoint) clusters and dispatch different clusters to different computational units (e.g. GPUs) to implement tensor parallelism (see the sketch at the end of this section)
    • sequence parallelism
      • ring attention, computation-communication overlap
    • expert parallelism
      • deepspeed-MoE
    • zero
      • zero-infinity, an offloading system that makes use of CPU/NVMe memory
  • training task property

    • compute bound

      • when the training scale is not extremely large, the computation cost dominates, rather than the GPU memory-access cost or the network communication cost

      • the computation cost can be measured in terms of floating-point operations (FLOPs), which are easy to estimate with the rule [backward flops details]:

        \[forward\_FLOPs = 2 \times \#tokens \times \#activated\_parameters\]

        \[backward\_FLOPs = 2 \times forward\_FLOPs\]
      • where “activated” refers specifically to Mixture-of-Experts (MoE) models; if the model is not MoE (i.e. it is a dense model), the activated parameters are simply the total parameters

    • memory consuming

      • during training, besides the parameters themselves, we also need to store the gradients and the optimizer states, whose sizes are proportional to the number of parameters
      • 2+2+4+4+4+4 bytes per parameter: p(fp16) / g(fp16) / p(fp32) / g(fp32) / os1(fp32) / os2(fp32), where os1/os2 are the two optimizer states (e.g. Adam momentum/variance)
      • so we need about 20 bytes of GPU memory per parameter (i.e. 20 GB per billion parameters) to store the whole model states (p, g, os1, os2) for fp16-fp32 mixed precision training (a back-of-the-envelope sketch follows this list)
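
A minimal numpy sketch of the neuron-clustering view of tensor parallelism described above. The shard count and all tensor sizes are made-up examples: each "shard" owns a disjoint column slice of W1 and the matching row slice of W2, and summing the partial outputs (an all-reduce across GPUs in practice) recovers the full FFN output.

```python
import numpy as np

# Toy sizes, for illustration only.
d, d_ffn, n_tokens, n_shards = 8, 32, 4, 2

rng = np.random.default_rng(0)
x  = rng.standard_normal((n_tokens, d))
W1 = rng.standard_normal((d, d_ffn))   # columns of W1 = FFN neurons
W2 = rng.standard_normal((d_ffn, d))   # rows of W2    = FFN neurons
relu = lambda z: np.maximum(z, 0)      # stand-in for sigma

# Reference: the full (non-parallel) FFN forward pass.
ref = relu(x @ W1) @ W2

# Tensor parallelism: each shard holds a disjoint cluster of neurons,
# i.e. a column slice of W1 and the matching row slice of W2.
partials = []
for s in range(n_shards):
    neurons = slice(s * d_ffn // n_shards, (s + 1) * d_ffn // n_shards)
    partials.append(relu(x @ W1[:, neurons]) @ W2[neurons, :])

# Summing the partial outputs (an all-reduce in practice) recovers the result.
out = sum(partials)
assert np.allclose(out, ref)
```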
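And a back-of-the-envelope sketch of the two rules above (2 x tokens x activated params for forward FLOPs, and 20 bytes per parameter of model states for mixed-precision training). The 7B-parameter / 1T-token figures are hypothetical examples, not taken from any particular model.

```python
def train_flops(n_tokens: float, n_activated_params: float) -> float:
    """Forward + backward FLOPs using the 2x / 2x rule above."""
    forward = 2 * n_tokens * n_activated_params
    backward = 2 * forward
    return forward + backward  # = 6 * tokens * activated params

def model_state_bytes(n_params: float, bytes_per_param: int = 20) -> float:
    """fp16-fp32 mixed precision: 2+2+4+4+4+4 = 20 bytes per parameter."""
    return n_params * bytes_per_param

# Hypothetical dense 7B model trained on 1T tokens.
print(f"{train_flops(1e12, 7e9):.2e} FLOPs")        # ~4.2e22
print(f"{model_state_bytes(7e9) / 2**30:.0f} GiB")  # ~130 GiB of model states
```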

Inference

  • algorithm
    • speculative inference
      • sequoia
    • sparsity
      • dejavu
      • powerinfer
      • moefication
  • prompting
    • RAG
  • reasoning
    • CoT
    • ToT
    • RAP
  • serving system
    • vllm
    • tensorRT
    • disaggregation (on heterogeneous hardware settings)
      • DistServe
        • prefill-decoding disaggregation
      • Attention Offloading
        • SA-FFN disaggregation
    • FlexGen
      • offloading
  • inference task property
    • prefill-decoding 2 stages
      • KV cache
    • memory bound
      • when the batch size is not extremely large (roughly below ~200), the bottleneck of inference speed is GPU memory access to the model weights & KV cache, rather than the computation or communication cost (see the sketch after this list)
      • it is therefore meaningless to transform a memory-bound task into a GPU-CPU/NVMe IO-bound one (PCIe/NVMe bandwidth is far lower than HBM bandwidth), e.g. by designing an offloading system for inference
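
A rough sketch of the memory-bound claim above: for single-token decoding of a dense model, the arithmetic intensity is roughly the batch size (2·params FLOPs per sequence vs. one read of the weights), so small batches sit far below the hardware ops:bytes ratio. The hardware numbers are hypothetical (roughly A100-class), the model size is a made-up example, and KV-cache traffic is ignored for simplicity.

```python
# Rough roofline check for single-token decoding of a dense model.
params      = 7e9          # dense model, so all parameters are activated
bytes_per_w = 2            # fp16 weights
peak_flops  = 312e12       # tensor-core FLOP/s (hypothetical)
mem_bw      = 2.0e12       # HBM bytes/s (hypothetical)

for batch in (1, 8, 64, 256):
    flops = 2 * params * batch           # FLOPs to decode one token per sequence
    bytes_moved = params * bytes_per_w   # weights are read once per decode step (KV cache ignored)
    ait = flops / bytes_moved            # arithmetic intensity ~= batch size
    t_compute = flops / peak_flops
    t_memory = bytes_moved / mem_bw
    bound = "memory" if t_memory > t_compute else "compute"
    print(f"batch={batch:4d}  AIT={ait:6.1f} FLOP/B  -> {bound} bound")
```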

Fine-Tuning

  • supervised fine-tuning (SFT)
  • efficiency
    • parameter-efficient fine-tuning (PEFT)
      • LoRA (low-rank adapters; see the sketch at the end of this section)
        • DoRA
        • LoRA+
        • AdaLoRA
    • memory-efficient (training)
      • GaLore (low-rank, SVD)
    • add noise to reduce overfitting
      • NEFT
  • FT task property
    • compute bound (like training)
    • not memory consuming (with parameter/memory-efficient methods)
  • RLHF PPO

    • the goal: human preference alignment

      • given a human preference (feedback) dataset
      • a reward model r(prompt, generation) is trained in advance in a supervised & contrastive manner; it is then used to score how well a generation for a prompt matches human preference during the subsequent RL fine-tuning stage
    • 3 stages in RLHF PPO
      1. reward model training

      2. LLM SFT

      3. LLM RL

        • MDP problem setting

          • state: the prompt + the currently generated tokens

          • action: the next token

          • policy: llm

    • models in the framework

      • reward

        • output of stage 1, frozen
      • reference policy

        • output of stage 2, frozen
      • policy (actor)

        • initialized from the reference policy model, to be optimized

          • by an RL algorithm, e.g. PPO, DPO

            • PPO
              • TRPO objective
              • clip / KL penalty regularization
            • DPO
              • going beyond PPO, the analytical form of the reward can be derived from the optimal policy of the KL-penalized RLHF objective, so the preference loss can be written directly in terms of the policy
              • DPO therefore does not require training a separate reward model (see the loss sketch at the end of this section)
      • value (critic)

        • initialized from the reward model, to be optimized
          • by a temporal difference (TD) MSE loss
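
A minimal numpy sketch of the LoRA idea referenced above: keep the pretrained weight W frozen and learn a low-rank update scaled by alpha/r, with one factor Gaussian-initialized and the other zero-initialized so the update starts at zero. All shapes and the scaling convention are illustrative assumptions, not any library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8           # toy sizes; r << d is the point of LoRA

W = rng.standard_normal((d_in, d_out))         # frozen pretrained weight
A = rng.standard_normal((d_in, r)) * 0.01      # trainable low-rank factor (Gaussian init)
B = np.zeros((r, d_out))                       # trainable low-rank factor (zero init)

def lora_forward(x):
    # y = x W + (alpha / r) * x A B   -- only A and B receive gradients during fine-tuning
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.standard_normal((2, d_in))
assert np.allclose(lora_forward(x), x @ W)     # zero-init update: identical to the base model at start
```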
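And a sketch of the DPO loss mentioned above, written from the standard formulation -log sigmoid(beta * (chosen margin - rejected margin)) over policy/reference sequence log-probabilities. Plain numpy with made-up inputs; the function name and numbers are illustrative.

```python
import numpy as np

def dpo_loss(logp_pi_chosen, logp_pi_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * ((logpi_w - logref_w) - (logpi_l - logref_l)))."""
    margin = (logp_pi_chosen - logp_ref_chosen) - (logp_pi_rejected - logp_ref_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# Made-up sequence log-probs: the policy already slightly prefers the chosen answer.
print(dpo_loss(-42.0, -55.0, -45.0, -54.0))   # loss shrinks as the preference margin grows
```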

Kernel

  • flash attention
    • v1, v2
    • v3
      • hopper gpu (overlap cuda core, tensor core, TMA)
    • decoding
  • kernel fusion
    • triton
      • unsloth (lora)

Model Architecture

  • transformer
    • SA
      • GQA, MQA, MLA (deepseek)
      • sliding window attention (mixtral)
    • FFN
      • MoE
        • MoE scaling law, #experts
      • SwiGLU (llama)
    • positional encoding
      • RoPE, which rotates a d-dim vector 2 dims at a time by position-dependent angles (see the sketch at the end of this section)
    • normalization
      • RMSnorm
    • residual link
      • early exiting
  • efficient transformer (attention)
    • linformer
      • Johnson-Lindenstrauss (J-L) lemma
    • performer
      • matmul associativity
      • gaussian integral (gaussian distribution pdf integral)
      • random fourier/positive feature maps
      • variance analysis
    • linear attention (see the associativity sketch at the end of this section)
    • hyperattention
    • Transformer-VQ
  • other structures (RNNs)
    • RWKV
    • mamba
    • TTT
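
A minimal numpy sketch of the RoPE rotation mentioned above. The pairing of dimensions (2i with 2i+1) and the 10000^(-2i/d) frequency schedule are the common convention; this is an illustrative re-implementation, not any specific library's code. The final assert checks the key property that rotated query/key dot products depend only on the relative position.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate a (..., d)-dim vector two dims at a time by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # theta_i = base^(-2i/d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]         # the (2i, 2i+1) pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: <rope(q, m), rope(k, n)> depends only on m - n.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
assert np.isclose(rope(q, 5) @ rope(k, 3), rope(q, 12) @ rope(k, 10))
```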
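And a sketch of the matmul-associativity trick behind linear attention (and, with random feature maps, Performer): replacing softmax with a positive feature map phi lets us compute phi(Q) (phi(K)^T V) in O(n) sequence length instead of (phi(Q) phi(K)^T) V in O(n^2). The elu+1 feature map below is an assumed stand-in for Performer's random positive features, used only to show the associativity.

```python
import numpy as np

def phi(x):
    # A simple positive feature map (elu(x) + 1); Performer would use random
    # Fourier/positive features instead.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)                     # (n, d_k)
    kv = Kf.T @ V                               # (d_k, d_v): cost linear in n
    z = Qf @ Kf.sum(axis=0)                     # (n,): per-row normalizer
    return (Qf @ kv) / z[:, None]               # never forms the (n, n) attention matrix

def quadratic_attention(Q, K, V):
    A = phi(Q) @ phi(K).T                       # (n, n): the O(n^2) path
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 128, 16, 16
Q, K, V = (rng.standard_normal((n, s)) for s in (d_k, d_k, d_v))
assert np.allclose(linear_attention(Q, K, V), quadratic_attention(Q, K, V))
```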

Compression

  • kv cache, long context
    • CaM
    • StreamingLLM
    • memory augmented

Quantization

  • AWQ

Hardware

  • gpu

    • specifications
      • tflops
      • membdw
        • ops:bytes (tflops/membdw) ratio of the hardware
        • arithmetic intensity (AIT) of an operator
        • compare the operator's AIT against the hardware ops:bytes ratio => AIT above the ratio means compute bound, below means memory bound (see the sketch at the end of this section)
      • nvlink
      • cap (memory capacity)
      • pcie
      • ib
    • hierarchy
      • memory level - execution granularity it serves
        • hbm - grid
        • l2 cache - block cluster
        • shared mem - block, warp
        • register mem - thread
    • architecture
      • tensor core
      • cuda core
      • TMA
      • SM
      • warp
      • register
  • cpu
    • thread
  • disk
    • IO bdw
      • page cache
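
A small sketch of the ops:bytes vs. AIT rule above, using a GEMM as the operator. The hardware numbers are hypothetical (roughly A100-class) and fp16 operands are assumed; the byte count assumes each matrix is read or written once.

```python
# Compare an operator's arithmetic intensity (AIT) against the hardware's
# ops:bytes ratio to decide whether it is compute bound or memory bound.
peak_flops = 312e12                  # fp16 tensor-core FLOP/s (hypothetical)
mem_bw     = 2.0e12                  # HBM bandwidth, bytes/s (hypothetical)
hw_ratio   = peak_flops / mem_bw     # ~156 FLOP per byte

def gemm_ait(M, K, N, bytes_per_elem=2):
    flops = 2 * M * K * N                                    # multiply-adds
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A, B once; write C once
    return flops / bytes_moved

for M, K, N in [(1, 4096, 4096), (256, 4096, 4096), (4096, 4096, 4096)]:
    ait = gemm_ait(M, K, N)
    bound = "compute" if ait > hw_ratio else "memory"
    print(f"GEMM {M}x{K}x{N}: AIT={ait:7.1f} FLOP/B -> {bound} bound")
```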