
A curated collection of data quality resources, tools, papers, and projects spanning traditional data, LLM pretraining and fine-tuning data, multimodal data, and more. It is intended as a reference for researchers and practitioners in data-centric AI.

27 GitHub Stars · 87 Curated Resources · 7 Categories · Last refreshed 19 hours ago
Traditional Data · Large Language Model Data · Multimodal Data · Tabular Data · Time Series Data · Graph Data · Data-Centric AI

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me data selection resources from awesome-data-quality"

Installation instructions →

What's inside

Data-Centric AI

Traditional Data

Large Language Model Data

  • Argilla [Fine-tuning Data]

    An open-source data curation platform for LLMs. (2021)

  • Assessing the Role of Data Quality in Training Bilingual Language Models [Pretraining Data]

    A study revealing that unequal data quality is a major driver of performance degradation in bilingual settings, with a practical data filtering strategy for multilingual models. (2025)

  • A Survey of LLM × DATA [LLM Data Management]

    A comprehensive survey on data-centric methods for large language models covering data processing, storage, and serving. (2025)

  • awesome-data-llm [LLM Data Management]

    Official repository of "LLM × DATA" survey paper with curated resources. (2025)

  • CCNet [Pretraining Data]

    Tools for downloading and filtering CommonCrawl data. (2020)

  • CommonCrawl [LLM Data Management]

    A massive web crawl dataset covering diverse languages and domains. (2008)

Tabular Data

Graph Data

Time Series Data

Multimodal Data

Showing a sample of 87 resources. View the full list on GitHub →