DeepSeek Architecture

15 minute read

The Transformer Base and Evolving Attention Mechanisms

The Transformer architecture revolutionized sequence modeling with its self-attention mechanism. DeepSeek builds upon this, refining how models attend to information.

Multi-Head Attention (MHA): The Foundation

At the heart of the Transformer is Multi-Head Attention (MHA). Imagine MHA as a team of specialized “experts” each looking at different aspects of the input sequence simultaneously. Instead of performing one large attention calculation, MHA splits the input into multiple “heads.” Each head independently learns to attend to different parts of the sequence.

  • Overall Architecture: For each input token, MHA computes a “query” ($\mathbf{Q}$), “key” ($\mathbf{K}$), and “value” ($\mathbf{V}$) vector. These are then linearly projected into multiple sets. If there are $H$ heads, the original $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are transformed into $H$ sets of smaller-dimensional $\mathbf{Q}_i$, $\mathbf{K}_i$, and $\mathbf{V}_i$ for $i=1 \dots H$.
  • How it works: Each head then performs a scaled dot-product attention calculation on its own set of projections: \(\text{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i) = \text{softmax}\left(\frac{\mathbf{Q}_i \mathbf{K}_i^T}{\sqrt{d_k}}\right) \mathbf{V}_i\) where $d_k$ is the dimension of the key vectors. This allows each head to capture different types of relationships (e.g., one head might focus on grammatical dependencies, another on semantic similarity). The results from all heads are then concatenated and linearly transformed back to the desired output dimension. A minimal sketch of this computation appears after this list.
  • Significance: MHA allows the model to jointly attend to information from different representation subspaces at different positions, greatly enriching its understanding of context.
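
To make this concrete, here is a minimal PyTorch sketch of the MHA computation described above, under assumed dimensions (d_model = 512, 8 heads); it is an illustration, not DeepSeek's actual implementation, and causal masking is omitted for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model: int = 512, n_heads: int = 8):  # assumed sizes
            super().__init__()
            assert d_model % n_heads == 0
            self.n_heads = n_heads
            self.d_head = d_model // n_heads
            # One projection each for Q, K, V, plus the output projection.
            self.w_q = nn.Linear(d_model, d_model)
            self.w_k = nn.Linear(d_model, d_model)
            self.w_v = nn.Linear(d_model, d_model)
            self.w_o = nn.Linear(d_model, d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, t, _ = x.shape
            # Project, then split into heads: (batch, heads, seq, d_head).
            q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            k = self.w_k(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            v = self.w_v(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            # Scaled dot-product attention, computed independently per head.
            scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
            out = F.softmax(scores, dim=-1) @ v
            # Concatenate heads and project back to the model dimension.
            out = out.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
            return self.w_o(out)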

Grouped-Query Attention (GQA): Optimizing for Scale

As models grow larger, a significant bottleneck during inference is the Key-Value (KV) cache. This cache stores the keys and values computed for each token in the sequence, which can consume a lot of memory, especially for long contexts. Grouped-Query Attention (GQA) is an optimization designed to alleviate this.

  • Overall Architecture: In GQA, query heads are grouped together. Instead of each query head having its own distinct key and value heads (like in MHA), multiple query heads share the same set of key and value heads. For example, if you have $H_q$ query heads, they might be grouped into $G$ groups, with each group sharing one set of KV heads. This means there are $H_q$ query heads but only $G$ key heads and $G$ value heads, where $G < H_q$.
  • How it works: Each of the $H_q$ query heads still computes its own query vectors, but the $G$ key and value heads serve their respective groups of query heads: when calculating attention, each query head uses the shared keys and values of its group. The attention calculation itself is unchanged; sharing KV heads simply means far fewer unique key and value vectors need to be stored (see the sketch after this list).
  • Significance: This significantly reduces the size of the KV cache, making inference more efficient, particularly for very large models. DeepSeek LLM 67B and DeepSeek Coder 33B are examples where GQA is employed to optimize inference costs without a substantial drop in performance.
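
The sketch below illustrates the sharing pattern with assumed shapes (8 query heads, 2 KV heads, head dimension 64); it is illustrative only. The grouping of query heads over shared KV heads is the only change relative to standard MHA.

    import torch
    import torch.nn.functional as F

    def grouped_query_attention(q, k, v, n_kv_heads: int):
        """q: (batch, n_q_heads, seq, d_head); k, v: (batch, n_kv_heads, seq, d_head).
        Each group of n_q_heads // n_kv_heads query heads shares one KV head."""
        b, n_q_heads, t, d_head = q.shape
        group_size = n_q_heads // n_kv_heads
        # Broadcast each shared KV head to the query heads in its group.
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        scores = q @ k.transpose(-2, -1) / (d_head ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    # Example: 8 query heads sharing 2 KV heads, so the KV cache holds only 2 heads.
    q = torch.randn(1, 8, 16, 64)
    k = torch.randn(1, 2, 16, 64)
    v = torch.randn(1, 2, 16, 64)
    out = grouped_query_attention(q, k, v, n_kv_heads=2)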

Multi-head Latent Attention (MLA): Pushing Efficiency and Performance

Introduced in DeepSeek-V2 and also adopted by DeepSeek-V3, Multi-head Latent Attention (MLA) is a more advanced attention mechanism engineered to achieve superior performance compared to MHA while significantly reducing the KV cache during inference, thereby boosting efficiency even further than GQA.

  • Overall Architecture: MLA’s core innovation lies in its low-rank key-value joint compression. Instead of directly storing or computing full key and value matrices for each head, MLA projects keys and values into a much smaller “latent” dimension. Queries then attend to this compressed latent space.
  • How it works: Keys and values are first jointly compressed into a lower-dimensional latent vector, and it is this latent vector that gets stored in the KV cache, drastically reducing memory usage. At attention time, per-head keys and values are recovered from the latent through learned up-projection matrices; crucially, these up-projections can be absorbed into the query and output transformations, so the cached latent never has to be explicitly decompressed. In effect, the query adapts to the compressed space rather than the cache being expanded. Projecting back up to the full dimension helps retain expressiveness despite the compression. A simplified sketch (which leaves out these absorption tricks) follows this list.
  • Significance: By compressing the KV cache so efficiently (reportedly 12x less than GQA and 60x less than MHA for DeepSeek-V3), MLA enables DeepSeek models to handle much longer contexts with lower memory footprints, crucial for processing extensive documents or codebases. The “latent” aspect implies that the model learns to identify and prioritize the most salient information from the keys and values, leading to better performance despite the compression. This effectively trades some computational complexity during query/key/value transformations for massive memory savings during inference.
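
Below is a deliberately simplified sketch of the low-rank KV compression idea. Only the small latent vector is cached, and keys and values are reconstructed from it when attention is computed. The dimensions are assumptions, and the decoupled-RoPE path, matrix absorption, and causal masking of the real MLA are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentKVAttention(nn.Module):
        """Illustrative low-rank KV joint compression in the spirit of MLA."""
        def __init__(self, d_model: int = 512, n_heads: int = 8, d_latent: int = 64):
            super().__init__()
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.w_q = nn.Linear(d_model, d_model)
            self.w_down_kv = nn.Linear(d_model, d_latent)   # joint KV compression
            self.w_up_k = nn.Linear(d_latent, d_model)      # reconstruct keys
            self.w_up_v = nn.Linear(d_latent, d_model)      # reconstruct values
            self.w_o = nn.Linear(d_model, d_model)

        def forward(self, x: torch.Tensor, latent_cache=None):
            b, t, _ = x.shape
            c_kv = self.w_down_kv(x)                         # (b, t, d_latent)
            if latent_cache is not None:                     # only the latent is cached
                c_kv = torch.cat([latent_cache, c_kv], dim=1)
            q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            k = self.w_up_k(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
            v = self.w_up_v(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
            scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
            out = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
            return self.w_o(out), c_kv                       # c_kv is the updated cache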

Rotary Position Embedding (RoPE): Contextualizing Positions

Traditional positional encodings add fixed or learned vectors to input embeddings. Rotary Position Embedding (RoPE) offers a more elegant and effective way to inject positional information directly into the attention mechanism.

  • Overall Architecture: RoPE doesn’t add a separate positional embedding to the input. Instead, it applies a rotation transformation to the query ($\mathbf{Q}$) and key ($\mathbf{K}$) vectors at each attention layer, based on their absolute position in the sequence.
  • How it works: For each position $m$ in the sequence, a rotation matrix $\mathbf{R}_{\theta,m}$ is applied to the query $\mathbf{q}_m$ and key $\mathbf{k}_m$, typically as a pairwise rotation of the vector’s elements. The key insight is that the resulting attention score $(\mathbf{R}_{\theta,m}\mathbf{q}_m)^T (\mathbf{R}_{\theta,n}\mathbf{k}_n) = \mathbf{q}_m^T \mathbf{R}_{\theta,n-m}\,\mathbf{k}_n$ (where $n$ is another position) depends only on the relative position $n-m$. This means the attention score between two tokens inherently incorporates their relative distance: the farther apart two tokens are, the larger the relative rotation between their vectors, which subtly shapes their attention scores. A minimal sketch of the rotation appears after this list.
  • Significance: RoPE provides a natural way to encode relative positional information, which is often more crucial for language understanding than absolute positions. It also scales well to longer sequences and tends to generalize better to unseen sequence lengths than fixed positional embeddings. For MLA, a Decoupled RoPE strategy is used: RoPE is applied to a small set of additional query dimensions and a shared key that sit outside the compressed latent, and these rotary components are concatenated with the compressed ones. Keeping the rotation out of the compressed path preserves MLA’s KV compression (and its matrix-absorption trick) while still retaining the benefits of positional encoding.
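
Here is a minimal sketch of the rotation itself, using the common half-split pairing convention (channel $i$ is paired with channel $i + d/2$); the base frequency of 10000 is a conventional default, and this is illustrative rather than DeepSeek's exact implementation.

    import torch

    def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        """Rotate channel pairs of x (batch, heads, seq, d_head) by position-dependent
        angles; d_head must be even. Applying this to both Q and K makes their dot
        product depend only on relative position."""
        b, h, t, d = x.shape
        half = d // 2
        # One frequency per channel pair, one angle per (position, pair).
        freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
        angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]
        cos, sin = angles.cos(), angles.sin()                # (seq, half)
        x1, x2 = x[..., :half], x[..., half:]
        # Standard 2D rotation applied pairwise: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos).
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    # Usage: rotate queries and keys before the attention score computation.
    q = apply_rope(torch.randn(1, 8, 16, 64))
    k = apply_rope(torch.randn(1, 8, 16, 64))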

The DeepSeekMoE Architecture: Economical Powerhouses

The DeepSeekMoE architecture is a pivotal innovation for DeepSeek-V2, DeepSeek-V3, and the dedicated DeepSeekMoE models. It enables the training of exceptionally powerful models at more economical cost by leveraging sparse computation: instead of activating all parameters for every token, MoE models route each token (or a small group of tokens) to a select few “expert” sub-networks. This allows for models with a vast number of total parameters while keeping computation efficient, since only a small subset of those parameters is active for any given token, as in the routing sketch below.
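As a minimal sketch of that routing step (the shapes and the top-k value are arbitrary assumptions), a learned gate scores every expert for every token and only the top-k experts per token are actually run.

    import torch
    import torch.nn.functional as F

    def topk_route(hidden: torch.Tensor, gate_weight: torch.Tensor, k: int = 2):
        """hidden: (n_tokens, d_model); gate_weight: (n_experts, d_model).
        Returns, per token, the indices of the k selected experts and the
        normalized weights used to combine their outputs."""
        scores = F.softmax(hidden @ gate_weight.t(), dim=-1)   # affinity to each expert
        topk_scores, topk_idx = scores.topk(k, dim=-1)         # keep only k experts/token
        weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
        return topk_idx, weights

    # Example: 4 tokens, 8 experts, each token routed to 2 experts.
    tokens = torch.randn(4, 512)
    gate = torch.randn(8, 512)
    idx, w = topk_route(tokens, gate)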

Strategies for Expert Specialization: Making Each Expert Count

MoE models thrive when their experts specialize. DeepSeek employs sophisticated strategies to achieve this:

  • Fine-Grained Expert Segmentation: Unlike typical MoE setups where an expert might be a full Feed-Forward Network (FFN) block, DeepSeekMoE segments experts into finer grains by splitting the FFN’s intermediate hidden dimension. Imagine an FFN layer with a large intermediate dimension. Instead of having a few “big” experts, DeepSeek breaks this into many more, smaller “expert fragments.”
    • Significance: This allows for a more flexible combination of activated experts. For example, instead of choosing “Expert A” or “Expert B” entirely, the model might select a specific small part of “Expert A” and another small part of “Expert C.” This promotes higher specialization, as each smaller segment can learn a very specific pattern or function, and crucially, reduces redundancy between experts.
  • Shared Expert Isolation: DeepSeekMoE isolates certain experts as shared experts that are always activated for every token, regardless of the router’s decision.
    • Significance: These shared experts are designed to capture common, fundamental knowledge that is universally applicable across different inputs (e.g., basic grammar rules, common factual knowledge). This mitigates redundancy among the other “routed” experts (those selected by the router), as the routed experts can then focus on more specialized knowledge (e.g., coding syntax, specific scientific facts). This dual approach enhances overall parameter efficiency and model generalization.
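
The sketch below combines the two ideas just described: many small (“fine-grained”) routed experts, of which only a few fire per token, plus a handful of shared experts applied to every token. Expert counts and sizes here are arbitrary assumptions for illustration, not DeepSeek's actual configuration, and the loop-based dispatch is written for clarity rather than speed.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SmallExpert(nn.Module):
        """A 'fine-grained' expert: an FFN with a deliberately small hidden size."""
        def __init__(self, d_model: int = 512, d_hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                     nn.Linear(d_hidden, d_model))
        def forward(self, x):
            return self.net(x)

    class MoELayer(nn.Module):
        """Illustrative DeepSeekMoE-style layer: always-on shared experts plus
        many small routed experts, of which only top_k fire per token."""
        def __init__(self, d_model=512, n_routed=16, n_shared=2, top_k=4):
            super().__init__()
            self.routed = nn.ModuleList([SmallExpert(d_model) for _ in range(n_routed)])
            self.shared = nn.ModuleList([SmallExpert(d_model) for _ in range(n_shared)])
            self.gate = nn.Linear(d_model, n_routed, bias=False)
            self.top_k = top_k

        def forward(self, x):                                  # x: (n_tokens, d_model)
            shared_out = sum(e(x) for e in self.shared)        # shared experts: every token
            routed_out = torch.zeros_like(x)
            scores = F.softmax(self.gate(x), dim=-1)
            topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
            weights = topk_scores / topk_scores.sum(-1, keepdim=True)
            for slot in range(self.top_k):                     # routed experts: top-k only
                for e_id in topk_idx[:, slot].unique():
                    mask = topk_idx[:, slot] == e_id
                    routed_out[mask] += weights[mask, slot].unsqueeze(-1) * \
                        self.routed[e_id](x[mask])
            # The surrounding transformer block would add the residual connection.
            return shared_out + routed_out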

Load Balancing: Distributing the Work Evenly

A major challenge in MoE models is ensuring that the workload is evenly distributed among experts to avoid “expert collapse” (where only a few experts get all the traffic) and maximize training efficiency.

  • Auxiliary Losses (DeepSeek-V2): DeepSeek-V2 uses a set of auxiliary losses added to the main training objective to control load balancing at multiple levels:
    • Expert-level balance: This loss term encourages uniform token distribution across all experts, preventing some experts from being overloaded while others are underutilized.
    • Device-level balance: For distributed training, this ensures the computational load is evenly spread across different devices (GPUs or nodes).
    • Communication balance: This loss aims to minimize the data transfer overhead between devices, which can be a bottleneck in large-scale MoE systems.
    • Significance: These losses prevent experts from becoming “lazy” or “overworked” and ensure that all experts contribute effectively to the learning process, enhancing overall computational efficiency and preventing routing collapse. A sketch of the expert-level term appears after this list.
  • Auxiliary-Loss-Free Strategy (DeepSeek-V3): DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing. This is a significant advancement as it aims to minimize performance degradation that can sometimes be caused by the explicit, potentially conflicting, efforts of auxiliary balancing losses.
    • How it works: Rather than relying on explicit penalty terms, DeepSeek-V3 attaches a bias to each expert’s routing affinity score used for top-K selection and adjusts these biases dynamically during training: the bias of an overloaded expert is decreased and that of an underloaded expert is increased, so balance emerges from the routing process itself. An extremely small complementary sequence-wise balance loss is retained as a safeguard against extreme imbalance within individual sequences.
    • Significance: By simplifying the load balancing mechanism, DeepSeek-V3 can potentially achieve better training stability and final model performance, as the model isn’t being pulled by potentially conflicting auxiliary objectives. This approach may also reduce the hyperparameter tuning burden associated with auxiliary losses.
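
For intuition, here is a sketch of a Switch-style expert-level balance term of the general kind such auxiliary losses build on. The loss weight alpha is a hypothetical value, and the device-level and communication terms described above are omitted; this is not DeepSeek-V2's exact formulation.

    import torch
    import torch.nn.functional as F

    def expert_level_balance_loss(router_logits: torch.Tensor, top_k: int,
                                  alpha: float = 0.003):
        """router_logits: (n_tokens, n_experts). Penalizes the product of
        f_i (fraction of assignments landing on expert i) and P_i (mean routing
        probability of expert i), pushing both toward uniformity."""
        n_tokens, n_experts = router_logits.shape
        probs = F.softmax(router_logits, dim=-1)
        topk_idx = probs.topk(top_k, dim=-1).indices           # chosen experts per token
        # f_i: fraction of token-to-expert assignments that land on expert i.
        counts = torch.zeros(n_experts).scatter_add_(
            0, topk_idx.reshape(-1), torch.ones(n_tokens * top_k))
        f = counts / (n_tokens * top_k)
        # P_i: average routing probability given to expert i (gradient flows here).
        p = probs.mean(dim=0)
        return alpha * n_experts * torch.sum(f * p)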

Routing Mechanisms: Directing the Flow

The router is the component that decides which tokens go to which experts. Its design significantly impacts communication costs and efficiency in a distributed setting.

  • Device-Limited Routing (DeepSeek-V2): DeepSeek-V2 implements device-limited routing, meaning that when a token is routed, its target experts are chosen such that they reside on a fixed, limited number of devices (e.g., $M=3$ GPUs).
    • Significance: This strategy directly bounds communication costs by ensuring that each token’s experts reside on a constrained set of GPUs, which is crucial for keeping training and inference efficient on large clusters (see the sketch after this list).
  • Node-Limited Routing (DeepSeek-V3): DeepSeek-V3 refines this to node-limited routing, ensuring tokens are sent to at most a certain number of nodes (e.g., $M=4$ nodes).
    • Significance: This is an important distinction in very large-scale distributed training setups. Inter-node communication (between different physical servers) is typically much more expensive and bandwidth-limited than intra-node communication (between GPUs on the same physical server). By limiting routing at the node level, DeepSeek-V3 optimizes for the most expensive communication bottleneck, further enhancing scalability.
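
The sketch below captures the idea behind device-limited routing (node-limited routing is analogous, with devices replaced by nodes): first pick the M highest-affinity devices for each token, then run the usual top-k selection only among experts on those devices. Shapes, M, and k are assumptions for illustration, not DeepSeek's implementation.

    import torch

    def device_limited_topk(scores: torch.Tensor, expert_to_device: torch.Tensor,
                            n_devices: int, m_devices: int = 3, top_k: int = 6):
        """scores: (n_tokens, n_experts) routing affinities; expert_to_device maps
        each expert to its device id. For each token, keep only experts living on
        the m_devices highest-affinity devices, then take the top-k among them."""
        # Highest affinity any expert on each device achieves, per token.
        device_scores = torch.full((scores.size(0), n_devices), float("-inf"))
        for d in range(n_devices):
            device_scores[:, d] = scores[:, expert_to_device == d].max(dim=-1).values
        allowed_devices = device_scores.topk(m_devices, dim=-1).indices
        # Mask out experts that live on non-selected devices.
        allowed = (expert_to_device[None, :, None] == allowed_devices[:, None, :]).any(-1)
        masked = scores.masked_fill(~allowed, float("-inf"))
        return masked.topk(top_k, dim=-1).indices              # selected experts per token

    # Example: 64 experts spread over 8 devices; tokens routed within at most 3 devices.
    scores = torch.randn(4, 64)
    expert_to_device = torch.arange(64) // 8
    idx = device_limited_topk(scores, expert_to_device, n_devices=8)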

Token Dropping: Training Speed vs. Completeness

In some MoE training schemes, if an expert’s capacity is exceeded, some tokens might be “dropped” (not processed by any expert in that layer) to prevent bottlenecks.

  • DeepSeek-V2: Employs a token-dropping strategy for acceleration during training. While speeding up training by preventing experts from becoming overloaded, it might theoretically lead to a slight loss of information for the dropped tokens.
  • DeepSeek-V3: Due to its highly effective auxiliary-loss-free load balancing strategy, DeepSeek-V3 does not drop any tokens.
    • Significance: This ensures that every token consistently contributes to the training process, potentially leading to a more thoroughly trained and higher-quality model.
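
For intuition only, here is a sketch of generic capacity-based token dropping; the capacity rule and tie handling are assumptions rather than DeepSeek-V2's exact recipe.

    import torch

    def drop_overflow_tokens(topk_idx: torch.Tensor, topk_scores: torch.Tensor,
                             n_experts: int, capacity: int):
        """topk_idx/topk_scores: (n_tokens, top_k) expert assignments and affinities.
        Returns a boolean mask of the same shape; False marks assignments dropped
        because the target expert is already at capacity."""
        keep = torch.ones_like(topk_idx, dtype=torch.bool)
        for e in range(n_experts):
            mask = topk_idx == e                       # all assignments to expert e
            if mask.sum() <= capacity:
                continue
            # Keep only the `capacity` highest-affinity assignments for this expert
            # (ties at the threshold may keep slightly more; fine for a sketch).
            scores_e = topk_scores[mask]
            threshold = scores_e.topk(capacity).values.min()
            keep[mask] = topk_scores[mask] >= threshold
        return keep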

Training Objectives and Strategies: Beyond Next-Token Prediction

Beyond the standard Next-Token Prediction (NTP), where the model predicts the next word in a sequence given the preceding ones, DeepSeek models incorporate specialized training objectives to enhance their capabilities, particularly for code and multi-turn generation.

Fill-in-the-Middle (FIM): Mastering Code Completion

Fill-in-the-Middle (FIM) is a highly effective training objective for enhancing code completion and general understanding of interrupted sequences. It’s particularly crucial for models like DeepSeek-Coder, DeepSeek-Coder-V2, and DeepSeek-V3.

  • How it works: Instead of just predicting the next token in a purely left-to-right fashion, FIM trains the model to reconstruct obscured text spans within a sequence. During training, a portion of the input sequence is masked out, and the model is tasked with predicting the missing segment. This is typically done following the PSM (Prefix, Suffix, Middle) framework: a text sequence is split into a prefix, a suffix, and a middle part, and the model is given the prefix and suffix (concatenated with special tokens that denote their roles) and asked to generate the middle part. A small helper for constructing such examples appears after this list.
    • Example for code:
        def calculate_area(length, width):
            if length <= 0 or width <= 0:
                return 0
            else:
                return length * width
      
      • FIM input with PSM (conceptual):
          <prefix>def calculate_area(length, width):
              if length <= 0 or width <= 0:
                  return 0
          <mask_token>
          <suffix>return length * width
          <end_of_text>
        
      • The model predicts the masked middle segment, here the else: line (exact tokens depend on tokenization). In the PSM sentinel format, this middle is generated after both the prefix and the suffix have been seen.
  • Significance: FIM directly trains the model to understand context bidirectionally and fill in missing pieces, which is essential for powerful code completion, bug fixing, and general in-filling tasks. It teaches the model to reason about the relationships between code segments, not just left-to-right. DeepSeek-Coder applies FIM at a high rate (0.5 of training examples), while DeepSeek-V3 uses a lower rate (0.1) as one of several training objectives.
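
As a concrete illustration, the helper below packs a document into a PSM-style training example. The sentinel names (`<fim_begin>`, `<fim_hole>`, `<fim_end>`) are placeholders, not DeepSeek's actual special tokens, and real pipelines apply FIM only to a fraction of documents (the FIM rate discussed above).

    import random

    def make_psm_example(text: str, fim_begin="<fim_begin>", fim_hole="<fim_hole>",
                         fim_end="<fim_end>"):
        """Split `text` into prefix/middle/suffix at two random cut points and pack it
        in PSM order: the model sees the prefix and suffix first and learns to generate
        the middle after the final sentinel. Sentinel names are illustrative only."""
        i, j = sorted(random.sample(range(len(text) + 1), 2))
        prefix, middle, suffix = text[:i], text[i:j], text[j:]
        model_input = f"{fim_begin}{prefix}{fim_hole}{suffix}{fim_end}"
        target = middle                  # trained to be generated after the final sentinel
        return model_input, target

    code = (
        "def calculate_area(length, width):\n"
        "    if length <= 0 or width <= 0:\n"
        "        return 0\n"
        "    else:\n"
        "        return length * width\n"
    )
    inp, tgt = make_psm_example(code)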

Multi-Token Prediction (MTP): Denser Signals for Efficiency

Introduced in DeepSeek-V3, Multi-Token Prediction (MTP) extends the prediction scope beyond just the next single token, offering a more efficient way to learn from data.

  • How it works: At each position in the sequence, instead of predicting only the very next token ($x_{t+1}$), MTP trains the model to predict multiple future tokens ($x_{t+1}, x_{t+2}, \dots, x_{t+k}$). A simple way to do this is to add extra prediction heads on top of a shared backbone, one per future offset; DeepSeek-V3 instead uses lightweight sequential MTP modules that share the embedding layer and output head with the main model, preserving the complete causal chain at each prediction depth. A toy parallel-heads sketch appears after this list.
  • Significance:
    • Denser Training Signals: By predicting multiple tokens at once, MTP provides a “denser” training signal. Each prediction step extracts more information from the input, potentially improving data efficiency and accelerating convergence during pre-training. It encourages the model to learn richer contextual dependencies because it must optimize for multiple interrelated predictions.
    • Speculative Decoding: MTP can also be repurposed for speculative decoding during inference. In speculative decoding, a smaller, faster model (or in MTP’s case, the same model generating multiple tokens at once) proposes a draft of future tokens. The larger, more accurate model then verifies these proposed tokens in parallel. If they are correct, the generation speed is significantly boosted. MTP’s inherent ability to predict multiple future tokens makes it a natural fit for this technique, as the model can generate a short “burst” of tokens that can then be validated.
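
To make the idea concrete, here is a toy sketch of the generic parallel-heads variant of multi-token prediction, not DeepSeek-V3's sequential MTP modules: k output heads sit on a shared backbone, and head j is trained to predict the token j+1 steps ahead. The vocabulary size, dimensions, and the embedding-only "backbone" are stand-ins for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MTPHeads(nn.Module):
        """Shared backbone (toy stand-in for the transformer trunk) plus k output
        heads, where head j predicts token x_{t+1+j} from the state at position t."""
        def __init__(self, vocab_size: int = 32000, d_model: int = 512, k: int = 3):
            super().__init__()
            self.k = k
            self.backbone = nn.Embedding(vocab_size, d_model)  # toy stand-in for the trunk
            self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(k)])

        def loss(self, tokens: torch.Tensor) -> torch.Tensor:
            """tokens: (batch, seq). Sums each head's cross-entropy over the positions
            where its target exists, giving a denser signal than next-token-only NTP."""
            h = self.backbone(tokens)                          # (batch, seq, d_model)
            total = 0.0
            for j, head in enumerate(self.heads):
                offset = j + 1                                 # head j predicts x_{t+offset}
                logits = head(h[:, :-offset])                  # positions with a valid target
                targets = tokens[:, offset:]
                total = total + F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            return total / self.k

    # Toy usage: random token ids just to show the shapes.
    model = MTPHeads()
    loss = model.loss(torch.randint(0, 32000, (2, 16)))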