awesome-data-quality
github.com/migoxlab/awesome-data-quality
A comprehensive collection of data quality resources, tools, papers, and projects across various data types including traditional data, LLM pretraining/fine-tuning data, multimodal data, and more. An essential reference for researchers and practitioners in data-centric AI.
What's inside
Data-Centric AI
- ADAM Optimization with Adaptive Batch Selection [Data Selection]
An ICLR paper on adaptive batch selection for ADAM optimization. (2024)
- Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws [Data Selection]
An ICLR paper on dynamic sample selection using scaling laws. (2024)
- Advances, challenges and opportunities in creating data for trustworthy AI [Surveys]
A Nature Machine Intelligence paper discussing the challenges and opportunities in creating high-quality data for AI. (2022)
- A Survey on Data Quality Dimensions and Tools for Machine Learning [Surveys]
A comprehensive survey reviewing 17 data quality tools for ML applications.
- A Survey on Data Selection for Language Models [Surveys]
A survey focusing on data selection techniques for language models. (2024)
- Data-centric Artificial Intelligence: A Survey [Surveys]
A comprehensive survey on data-centric AI approaches. (2023)
Traditional Data
- AI Data Quality Metrics [Data Readiness Assessment]
Standardized metrics for assessing data quality in AI contexts. (2024)
- Assessing Student Adoption of Generative Artificial Intelligence across Engineering Education [Data Readiness Assessment]
An empirical study on data quality considerations in educational AI applications. (2025)
- A Survey on Data Quality: Classifying Poor Data [Papers]
A survey on data quality issues and classification. (2016)
- Data Cleaning: Problems and Current Approaches [Papers]
A comprehensive overview of data cleaning approaches. (2000)
- Data Readiness Assessment Framework [Data Readiness Assessment]
A framework for evaluating data quality and readiness for AI applications. (2024)
- Data Readiness for AI: A 360-Degree Survey [Data Readiness Assessment]
A comprehensive survey examining metrics for evaluating data readiness for AI training across structured and unstructured datasets. (2024)
Large Language Model Data
- Argilla [Fine-tuning Data]
An open-source data curation platform for LLMs. (2021)
- Assessing the Role of Data Quality in Training Bilingual Language Models [Pretraining Data]
A study showing that unequal data quality is a major driver of performance degradation in bilingual settings, with a practical data filtering strategy for multilingual models. (2025)
- A Survey of LLM × DATA [LLM Data Management]
A comprehensive survey on data-centric methods for large language models covering data processing, storage, and serving. (2025)
- awesome-data-llm [LLM Data Management]
Official repository of the "LLM × DATA" survey paper, with curated resources. (2025)
- CCNet [Pretraining Data]
Tools for downloading and filtering CommonCrawl data. (2020)
- CommonCrawl [LLM Data Management]
A massive web crawl dataset covering diverse languages and domains. (2008)
Tabular Data
- A Survey on Data Quality for Machine Learning in Practice [Papers]
A survey on data quality issues in machine learning. (2021)
- Automating Data Quality Validation for Dynamic Data Ingestion [Papers]
A framework for automating data quality validation. (2019)
- DataProfiler [Tools & Projects]
A Python library for data profiling and data quality validation. (2021)
- Pandas Profiling [Tools & Projects]
A tool for generating profile reports from pandas DataFrames. (2016)
Graph Data
- A Survey on Graph Cleaning Methods for Noise and Errors in Graph Data [Papers]
A survey on graph cleaning methods. (2022)
- DGL [Tools & Projects]
A Python package for deep learning on graphs. (2018)
- Graph Data Quality: A Survey from the Database Perspective [Papers]
A survey on graph data quality from a database perspective. (2022)
- NetworkX [Tools & Projects]
A Python package for the creation, manipulation, and study of complex networks. (2008)
Time Series Data
- Cleaning Time Series Data: Current Status, Challenges, and Opportunities [Papers]
A survey on cleaning time series data. (2022)
- Darts [Tools & Projects]
A Python library for time series forecasting and anomaly detection. (2020)
- Time Series Data Augmentation for Deep Learning: A Survey [Papers]
A survey on time series data augmentation. (2020)
- tslearn [Tools & Projects]
A machine learning toolkit dedicated to time series data. (2017)
Multimodal Data
- CLIP-Benchmark [Tools & Projects]
A benchmark for evaluating CLIP models. (2021)
- DataComp: In search of the next generation of multimodal datasets [Papers]
A benchmark for evaluating data curation strategies. (2023)
- img2dataset [Tools & Projects]
A tool for efficiently downloading and processing image-text datasets. (2021)
- LAION-5B: An open large-scale dataset for training next generation image-text models [Papers]
A large-scale dataset of image-text pairs. (2022)
Showing a sample of 87 resources. The full list is available on GitHub.