awesome-llm-synthetic-data
github.com/wasiahmad/awesome-llm-synthetic-data
A reading list on LLM based Synthetic Data Generation 🔥
GitHub Stars: 1.5k
Curated Resources: 78
Categories: 5
Last Refreshed: 5 hours ago
2. Methods | 3. Application Areas | 4. Datasets | 5. Tools | 6. Blogs
Use this list with your AI agent
Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:
"Show me 3.1. mathematical reasoning resources from awesome-llm-synthetic-data"
Installation instructions →
What's inside
5. Tools
- AgentInstruct: Toward Generative Teaching with Agentic Flows
- DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
- Distilabel: An AI Feedback (AIF) Framework for Building Datasets with and for LLMs
- Fuxion: Synthetic Data Generation and Normalization Functions using Langchain + LLMs
3. Application Areas
- Augmenting Math Word Problems via Iterative Question Composing (3.1. Mathematical Reasoning)
- AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct (3.2. Code Generation)
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (3.2. Code Generation)
- Constitutional AI: Harmlessness from AI Feedback (3.4. Alignment)
- Distilling LLMs' Decomposition Abilities into Compact Language Models (3.1. Mathematical Reasoning)
- DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination (3.2. Code Generation)
2. Methods
- Automatic Instruction Evolving for Large Language Models (2.1. Techniques)
- CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society (2.1. Techniques)
- CodecLM: Aligning Language Models with Tailored Synthetic Data (2.2. Instruction Generation with High Quality/Complexity)
- Generating Training Data with Language Models: Towards Zero-Shot Language Understanding (2.1. Techniques)
- Instruction Pre-Training: Language Models are Supervised Multitask Learners (2.1. Techniques)
- Large Language Models Can Self-Improve (2.1. Techniques)
4. Datasets
- Code Alpaca: An Instruction-following LLaMA Model trained on code generation instructions
- Open Artificial Knowledge
- Synthetic-Text-To-SQL: A synthetic dataset for training language models to generate SQL queries from natural language prompts
- SynthPAI: A Synthetic Dataset for Personal Attribute Inference
Showing a sample of 78 resources. View the full list on GitHub →