DeepSeek pretraining data

This series of blog posts introduces the techniques used in the DeepSeek team’s papers.

Let’s start with the data used in DeepSeek training.


The Core Data Processing Pipeline: Deduplicate, Filter, Remix

DeepSeek’s general data processing pipeline for their foundational models like DeepSeek-v2 and DeepSeek-v3 follows a robust three-stage approach: Deduplication, Filtering, and Remixing. This pipeline was first introduced in the DeepSeek-LLM paper and has been refined in subsequent models. The deduplication stage focuses on removing redundant content to enhance data diversity, the filtering stage improves information density by discarding low-quality content, and the remixing stage balances data distribution to ensure comprehensive coverage across various domains.

1. Deduplication: Eliminating Redundancy for Richer Data

The primary goal here is to remove redundant content, boost data diversity, and prevent the model from overfitting on repetitive information.

  • Innovation: DeepSeek deduplicates across the full set of Common Crawl dumps, rather than within a single crawl batch. Their experiments showed that this global approach removes about 4 times more duplicate documents than single-dump deduplication (deduplication rates climb sharply as more dumps are pooled).
  • Technical Implementation: They leverage approximate deduplication algorithms like MinHash or SimHash to efficiently handle vast datasets. Crucially, they perform chunking of texts to prevent overlooking local repetitions within long documents.
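
To make the chunk-level, cross-dump idea concrete, here is a minimal near-deduplication sketch using MinHash LSH. The datasketch library and the specific parameters (128 permutations, a 0.8 similarity threshold, 500-word chunks) are illustrative assumptions, not values reported in the DeepSeek papers.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128        # permutations per MinHash signature (assumed)
CHUNK_WORDS = 500     # chunk long documents to catch local repetition (assumed)

def chunk(text, size=CHUNK_WORDS):
    words = text.split()
    for i in range(0, len(words), size):
        yield " ".join(words[i:i + size])

def signature(text):
    m = MinHash(num_perm=NUM_PERM)
    for word in text.split():          # word shingles; n-gram shingles also work
        m.update(word.encode("utf8"))
    return m

def deduplicate(docs):
    """docs: iterable of (doc_id, text) streamed from *all* CC dumps at once."""
    lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs:
        sigs = [(f"{doc_id}#{i}", signature(c)) for i, c in enumerate(chunk(text))]
        # discard the document if any chunk is a near-duplicate of one already seen
        if any(lsh.query(m) for _, m in sigs):
            continue
        for key, m in sigs:
            lsh.insert(key, m)
        kept.append(doc_id)
    return kept
```

Because every dump is streamed through the same LSH index, duplicates are caught globally rather than per crawl batch.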

2. Filtering: Enhancing Information Density

This stage focuses on improving the information density of the data and discarding low-quality content.

  • Multi-dimensional Quality Assessment:
    • Linguistic Quality: Language identification and grammar checks (e.g., the langdetect library) filter out non-target languages and garbled text.
    • Semantic Quality: Pre-trained models (like BERT) are used to compute text coherence scores, removing logically incoherent paragraphs.
    • Domain Relevance: Keyword/topic models are built to retain high-information-entropy documents such as technical documentation and academic papers.
  • Dynamic Threshold Adjustment: Filtering standards are relaxed for scarce domains (e.g., less common languages) to avoid excessive removal.
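
A hedged sketch of how such multi-dimensional filtering with per-domain thresholds might be wired together is shown below. The langdetect calls are real; score_coherence is a hypothetical stand-in for whatever pretrained scorer (e.g., a BERT-based classifier) is actually used, and the threshold values are made up for illustration.

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

TARGET_LANGS = {"en", "zh-cn"}                     # assumed target languages
# relaxed threshold for scarce domains to avoid over-filtering (assumed values)
COHERENCE_THRESHOLD = {"default": 0.7, "low_resource": 0.5}

def score_coherence(text):
    """Hypothetical stand-in: in practice a pretrained model scores coherence."""
    return 1.0

def keep_document(text, domain="default"):
    # linguistic quality: drop non-target languages and garbled text
    try:
        if detect(text) not in TARGET_LANGS:
            return False
    except LangDetectException:
        return False
    # semantic quality, with a domain-dependent (dynamic) threshold
    threshold = COHERENCE_THRESHOLD.get(domain, COHERENCE_THRESHOLD["default"])
    return score_coherence(text) >= threshold
```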

3. Remixing: Balancing Data Distribution

The remixing phase addresses data distribution imbalances and enhances coverage for underrepresented domains (e.g., niche programming languages, marginal academic fields).

  • Strategies:
    • Stratified Sampling: Data is stratified by domain (e.g., code, medical, legal), and low-frequency domains are oversampled.
    • Synthetic Augmentation: Scarce data is augmented using back-translation or template generation (e.g., variations of code functions).
    • Proportional Control: They ensure that in the final dataset, dominant domains (e.g., English) account for $\le 60\%$, while tail domains (e.g., classical Chinese) make up $\ge 5\%$.
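
The sampling side of remixing can be sketched as stratified sampling toward target proportions, oversampling (sampling with replacement) when a domain is scarce. The domain names and ratios below are purely illustrative, not DeepSeek's actual mixture; they merely respect the proportional-control limits above.

```python
import random

# Illustrative targets only; they respect the proportional-control limits above.
TARGET_RATIO = {"english_web": 0.55, "code": 0.20, "math": 0.10,
                "medical": 0.05, "legal": 0.05, "classical_chinese": 0.05}

def remix(buckets, total_docs, seed=0):
    """buckets: dict mapping domain name -> list of documents."""
    rng = random.Random(seed)
    mixed = []
    for domain, docs in buckets.items():
        if not docs:
            continue
        n = int(TARGET_RATIO.get(domain, 0.0) * total_docs)
        if n <= len(docs):
            mixed.extend(rng.sample(docs, n))       # downsample dominant domains
        else:
            mixed.extend(rng.choices(docs, k=n))    # oversample scarce domains
    rng.shuffle(mixed)
    return mixed
```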

Tokenizer Optimization: A Linguistic Backbone

DeepSeek’s tokenizer is a critical component for efficient and accurate language representation.

  • Key Technology: They utilize Byte-level BPE (BBPE), implemented with the tokenizers library, supporting multilingual mixed training.
  • Pre-tokenization Rules:
    • Prohibition of Cross-Character-Category Merging: tokens are never merged across character categories (e.g., CJK characters are kept separate from punctuation).
    • Digit Splitting: Numbers are broken down into individual digits (e.g., 123 → 1, 2, 3).
  • Vocabulary Design:
    • Base Vocabulary: Consists of 100K regular tokens + 15 special tokens (e.g., <|endoftext|>).
    • Training Corpus: 24GB of multilingual data (covering Chinese, English, code, etc.).
    • Reserved Space: The model’s vocabulary is set to 102,400, allowing for future expansion.
  • Advantages:
    • More friendly to code and mathematical symbols (e.g., += is retained as a single token).
    • Avoids fragmentation issues in CJK languages.
  • DeepSeek-v3 Specifics: The vocabulary size was increased to 128K. They also introduced a subtle but impactful strategy of randomly splitting a proportion of tokens that combine punctuation and newlines during training. This mitigates the token-boundary bias caused by inconsistent formatting, ensuring the model sees both the joined and the separated versions. A tokenizer-training sketch follows below.
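
Below is a rough sketch of how a byte-level BPE tokenizer with digit splitting could be configured with the Hugging Face tokenizers library. The pre-tokenization rules, the training file name, and the omitted punctuation/newline splitting are assumptions for illustration, not DeepSeek's actual tokenizer definition.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),     # 123 -> 1, 2, 3
    pre_tokenizers.ByteLevel(add_prefix_space=False),   # byte-level fallback
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=102400,                                  # 100K regular tokens + reserved space
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
# tokenizer.train(files=["multilingual_corpus.txt"], trainer=trainer)  # ~24GB mixed corpus
```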

DeepSeek-Coder & DeepSeek-Coder-V2: Specializing for Code

DeepSeek-Coder and its successor, DeepSeek-Coder-V2, highlight specialized data collection and processing for code-centric LLMs.

Code Data Collection and Filtering (DeepSeek-Coder & DeepSeek-Coder-V2)

DeepSeek-Coder models adopt a stringent rule-based filtering approach for code data:

  • Data Collection Scope: Public GitHub repositories created before February 2023, limited to 87 specific programming languages.
  • Line Length Restrictions: Files with an average line length > 100 characters or a single line length > 1000 characters are discarded (to avoid machine-generated or compressed code).
  • Alphabet Ratio Restriction: Files with an alphabet ratio < 25% are removed (excluding non-code or highly encoded data).
  • XML/XSLT File Filtering: For non-XSLT language files, if <?xml version= appears within the first 100 characters, they are discarded (to avoid configuration files or template data).
  • HTML File Filtering: Files are retained only if the visible text (non-HTML tags) accounts for $\ge 20\%$ of the content and the visible text length is $\ge 100$ characters.
  • JSON/YAML File Filtering: Only files between 50 and 5000 characters are kept, discarding data-intensive content like large logs or API responses.
  • In-project Filtering: Filtering is performed within a project to preserve dependencies.
  • Syntax Error Removal: Files with syntax errors are discarded.
  • Heuristic Filtering: Heuristic rules are applied to filter out low-quality code.
  • Test Set Matching: 10-gram segments matching test sets are removed to prevent data leakage.
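
A few of these rules are simple enough to restate as code. The sketch below re-implements the line-length, alphabet-ratio, XML, and JSON/YAML filters; the exact rules and thresholds in DeepSeek-Coder may differ in detail.

```python
def passes_filters(path, content):
    lines = content.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    if avg_len > 100 or max_len > 1000:            # line-length rules
        return False
    alpha_ratio = sum(c.isalpha() for c in content) / max(len(content), 1)
    if alpha_ratio < 0.25:                         # alphabet-ratio rule
        return False
    if not path.endswith(".xslt") and "<?xml version=" in content[:100]:
        return False                               # XML configs / template data
    if path.endswith((".json", ".yaml", ".yml")) and not 50 <= len(content) <= 5000:
        return False                               # data-heavy JSON/YAML
    return True
```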

Dependency Analysis and File Ordering (DeepSeek-Coder)

A unique innovation for DeepSeek-Coder was to prioritize the relationships between files within the same repository.

  • Dependency Relationship Parsing: Regular expressions are used to extract import statements (Python), using directives (C#), and #include directives (C/C++), from which a Dependency Graph is built.
  • Improved Topological Sort: An enhanced topological sorting algorithm is used to arrange files according to dependencies, ensuring that the necessary context for each file appears earlier in the input sequence. This algorithm can even handle cyclical dependencies by selecting nodes with the minimum in-degree.
  • Training Sample Generation: The sorted results for each subgraph are concatenated into a single training sample, with path comments (e.g., # FILE: /src/utils.py) added to preserve original location information.
  • Benefits: This approach results in more realistic code structures, improves project-level understanding, and gracefully handles circular dependencies.
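
The following sketch illustrates the idea under simplifying assumptions: only Python-style imports are parsed, modules are resolved by file basename, and cycles are broken by repeatedly selecting the remaining node with minimum in-degree. It is a minimal sketch, not the exact algorithm from the paper.

```python
import re
from collections import defaultdict

IMPORT_RE = re.compile(r"^\s*(?:from\s+(\S+)\s+import|import\s+(\S+))", re.M)

def build_graph(files):
    """files: dict path -> source. Edge dep -> path means dep should come first."""
    modules = {path.rsplit("/", 1)[-1].removesuffix(".py"): path for path in files}
    edges, indeg = defaultdict(set), {path: 0 for path in files}
    for path, src in files.items():
        for m in IMPORT_RE.finditer(src):
            dep = modules.get((m.group(1) or m.group(2)).split(".")[0])
            if dep and dep != path and path not in edges[dep]:
                edges[dep].add(path)
                indeg[path] += 1
    return edges, indeg

def order_files(files):
    edges, indeg = build_graph(files)
    remaining, ordered = set(files), []
    while remaining:
        # minimum in-degree selection also handles cyclic dependencies gracefully
        node = min(remaining, key=lambda p: indeg[p])
        ordered.append(node)
        remaining.remove(node)
        for nxt in edges[node]:
            indeg[nxt] -= 1
    # concatenate into one training sample, keeping path comments for location info
    return "\n".join(f"# FILE: {p}\n{files[p]}" for p in ordered)
```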

DeepSeek-Coder-V2 Data Breakdown & Expansion

DeepSeek-Coder-V2 further refines its data mix:

  • Data Composition: 60% code, 10% mathematics, and 30% natural language.
  • Final Data Volume (after filtering): 821 billion code tokens (338 languages) and 185 billion code-related text tokens (Markdown, issue discussions).
  • Web Text Collection (also used in DeepSeek-Math):
    • Initial Seed Corpora: StackOverflow, PyTorch docs (code); StackExchange (math).
    • Expanded Recall: fastText models trained on the seed corpora are used to recall more relevant web pages, with DeepSeek-v2’s BPE tokenizer applied to non-space-separated languages to improve recall precision (a sketch of this recall loop follows the list).
    • Domain Classification: If over 10% of web pages under a domain are recalled, it’s marked as code/math related.
    • Through 3 iterations, they acquired 70 billion code-related tokens and 221 billion math-related tokens.
    • An additional 2 iterations of GitHub collection yielded 94 billion high-quality source code tokens.
  • Total New Code Corpus: 1.17 trillion tokens (GitHub + CommonCrawl).
  • Impact: Ablation experiments with a 1B parameter model demonstrated a significant improvement in HumanEval accuracy ($+5.5\%$) and MBPP accuracy ($+4.4\%$) after pre-training with 1T new corpus tokens, showcasing the effectiveness of this refined data.
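
The recall loop referenced above might look roughly like the sketch below. The training file name, the label scheme, and the 0.5 score threshold are assumptions; only the fastText classifier and the 10% domain rule come from the description.

```python
import fasttext
from collections import Counter
from urllib.parse import urlparse

# seed_pages.txt (hypothetical) holds lines like "__label__code <page text>"
# plus negative examples labelled "__label__other".
model = fasttext.train_supervised(input="seed_pages.txt", epoch=3, wordNgrams=2)

def recall_pages(pages, threshold=0.5):
    """pages: iterable of (url, pre-tokenized text); returns recalled urls and
    the domains that cross the 10% recall bar."""
    recalled = []
    per_domain, per_domain_hits = Counter(), Counter()
    for url, text in pages:
        domain = urlparse(url).netloc
        per_domain[domain] += 1
        labels, probs = model.predict(text.replace("\n", " "))  # fastText scores one line at a time
        if labels[0] == "__label__code" and probs[0] >= threshold:
            recalled.append(url)
            per_domain_hits[domain] += 1
    # a whole domain is marked code-related when >10% of its pages are recalled
    code_domains = {d for d, n in per_domain.items() if per_domain_hits[d] / n > 0.10}
    return recalled, code_domains
```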

Addressing Cultural Sensitivity and Evaluation Nuances

DeepSeek-v2 notably removed content related to culturally controversial topics. When evaluating on datasets like MMLU, they observed that their model lagged behind human performance. However, they also found significant disagreement both among human evaluators and with the ground-truth labels, leading them to conclude that MMLU can be a value-sensitive dataset. This highlights DeepSeek’s commitment to both performance and ethical considerations in their model development.

DeepSeek-v3: Doubling Down on Data Quality and Specialization

DeepSeek-v3 continues this trajectory, leveraging 14.8T tokens of data.

  • Increased Math and Programming Ratios: Reflecting a strategic focus on these critical domains.
  • Integration of FIM (Fill-in-the-Middle) Strategy: Applied at a rate of 0.1 during pre-training, following the PSM (Prefix-Suffix-Middle) framework. This strategy proved effective in the DeepSeek-Coder models, enabling middle-of-document prediction without degrading next-token prediction (a minimal sketch follows this list).
  • Tokenizer Enhancements: As mentioned earlier, the tokenizer vocabulary was expanded to 128K, and the nuanced handling of punctuation and newlines was introduced to robustly model code structures.
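
As a rough illustration of document-level FIM in the PSM arrangement, the sketch below rearranges a fraction of samples into prefix-suffix-middle order. The sentinel strings are placeholders (the real special tokens are defined by DeepSeek's tokenizer), and the uniformly random split points are chosen purely for illustration.

```python
import random

FIM_RATE = 0.1                      # fraction of samples rearranged (from the text above)
# Placeholder sentinels; the real special tokens come from DeepSeek's tokenizer.
FIM_BEGIN, FIM_HOLE, FIM_END = "<|fim_begin|>", "<|fim_hole|>", "<|fim_end|>"

def maybe_apply_fim(doc, rng=random):
    if rng.random() >= FIM_RATE or len(doc) < 3:
        return doc                  # most samples stay plain left-to-right
    # pick two cut points and split the document into prefix / middle / suffix
    i, j = sorted(rng.sample(range(1, len(doc)), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # PSM order: the model reads prefix and suffix first, then predicts the middle
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"
```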