Awesome LLM pre-training resources, including data, frameworks, and methods.

GitHub Stars: 382
Curated Resources: 128
Categories: 4
Last Refreshed: 6 hours ago

I. Technical Reports | II. Training Strategies | III. Open-source Datasets | IV. Data Methods

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me 2.3 interpretability resources from awesome-llm-pretraining"

What's inside

II. Training Strategies

  • blog: 2.3 Interpretability

  • blog: 2.3 Interpretability

  • code: 2.4 Model Architecture Improvements

  • code: 2.2 Training Strategies

  • code: 2.4 Model Architecture Improvements

  • code: 2.1 Training Frameworks

III. Open-source Datasets

  • code: 3.4 General-purpose (Books, Encyclopedias, Instructions, Long Contexts, etc.)

  • homepage: 3.4 General-purpose (Books, Encyclopedias, Instructions, Long Contexts, etc.)

  • paper: 3.1 Web Pages

  • paper: 3.1 Web Pages

  • paper: 3.1 Web Pages

  • resource: 3.3 Code

IV. Data Methods

  • code: 4.2 Data Mixing and Curriculum

  • code: 4.1 Tokenizers

  • code: 4.1 Tokenizers

  • code: 4.1 Tokenizers

  • code: 4.3 Data Synthesis

  • code: 4.2 Data Mixing and Curriculum

Resources

I. Technical Reports

Showing a sample of 128 resources. View the full list on GitHub →