Curated list of open source tooling for data-centric AI on unstructured data.

732 GitHub Stars · 11 Curated Resources · 6 Categories · Last refreshed 3 hours ago
Exploratory data analysis (EDA) · Cleaning · Annotation · Modeling · Validation · Monitoring

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me monitoring resources from awesome-open-data-centric-ai"

Installation instructions →

What's inside

Monitoring

Cleaning

  • Detect duplicates

    Use the Annoy library to find nearest neighbors in the embedding space and inspect data points that are duplicates or near-duplicates.

  • Detect image issues

    Use the Cleanvision library to extract typical image issues (brightness, blur, aspect ratio, SNR, and duplicates) and identify critical segments through manual inspection.

  • Detect outliers

    Use the Cleanlab library to compute outlier scores based on model output (embeddings, probabilities) and inspect outlier candidates.
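
The duplicate and outlier checks above both come down to nearest-neighbor distances in an embedding space. The sketch below illustrates that idea with brute-force cosine distances in plain Python. It is a stand-in under stated assumptions, not the Annoy or Cleanlab API: the function names, the toy embeddings, and the 0.01 threshold are all hypothetical, and a real dataset would need an approximate index rather than an all-pairs loop.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def near_duplicates(embeddings, threshold=0.01):
    """Return (i, j, distance) for every pair closer than the threshold.

    The threshold is an illustrative assumption; tune it per embedding model.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        distance = cosine_distance(embeddings[i], embeddings[j])
        if distance <= threshold:
            pairs.append((i, j, distance))
    # Closest pairs first, so manual inspection starts with the likeliest duplicates.
    return sorted(pairs, key=lambda pair: pair[2])


def knn_outlier_scores(embeddings, k=2):
    """Score each point by its mean distance to its k nearest neighbors."""
    scores = []
    for i, vector in enumerate(embeddings):
        distances = sorted(
            cosine_distance(vector, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores


# Toy embeddings (hypothetical): points 0 and 1 are near-duplicates,
# point 3 lies far from the rest.
embeddings = [
    [1.0, 0.0],
    [0.999, 0.001],
    [0.6, 0.8],
    [0.0, 1.0],
]

duplicate_pairs = near_duplicates(embeddings)
outlier_scores = knn_outlier_scores(embeddings)
most_outlying = max(range(len(embeddings)), key=outlier_scores.__getitem__)
```

The sorted pairs and the highest-scoring points are only candidates; as the list notes, the final call is made by inspecting them.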

Modeling

  • Detect leakage

    Use nearest neighbor distances to identify candidates for data leakage and manually inspect them.
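
One way to read the leakage check above is as a train/test overlap test: for each test embedding, find the closest training embedding and flag the pair when the distance is suspiciously small. The following is a minimal sketch of that reading, assuming Euclidean distance and a hypothetical threshold of 0.05. The function names and the toy data are illustrative and do not come from the listed resource.

```python
import math


def nearest_train_neighbor(test_vector, train_embeddings):
    """Return (index, distance) of the training point closest to test_vector."""
    best_index, best_distance = -1, math.inf
    for index, train_vector in enumerate(train_embeddings):
        distance = math.dist(test_vector, train_vector)  # Euclidean distance
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


def leakage_candidates(train_embeddings, test_embeddings, threshold=0.05):
    """Return (test_index, train_index, distance) for suspiciously close pairs.

    The threshold is an assumption for illustration; in practice it depends on
    the embedding model and on how near-duplicates look in the data.
    """
    candidates = []
    for test_index, test_vector in enumerate(test_embeddings):
        train_index, distance = nearest_train_neighbor(test_vector, train_embeddings)
        if distance <= threshold:
            candidates.append((test_index, train_index, distance))
    # Smallest distances first: exact copies surface at the top of the review list.
    return sorted(candidates, key=lambda candidate: candidate[2])


# Toy data (hypothetical): test point 2 is an exact copy of train point 0,
# test point 0 is a near copy of train point 1.
train = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
test = [[1.0, 1.01], [5.0, 5.0], [0.0, 0.0]]

candidates = leakage_candidates(train, test)
```

The result is a ranked review list, not a verdict; each flagged pair still needs a manual look to decide whether it is genuine leakage.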

Validation

Exploratory data analysis (EDA)

  • Understand distributions

    Use the Hugging Face transformers library to compute image embeddings and explore the dataset based on the similarity map and additional metadata.
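
As a rough illustration of the embedding step above, the sketch below computes one vector per image with a Vision Transformer and then ranks images by cosine similarity to a query image. The checkpoint name, the use of the [CLS] token as the image embedding, and the helper names are assumptions made for this example, not details from the listed resource. The ranking is a simple stand-in for exploration and does not reproduce a 2D similarity map.

```python
import math


def embed_images(image_paths, model_name="google/vit-base-patch16-224-in21k"):
    """Return one embedding (list of floats) per image path.

    Heavy dependencies are imported lazily so the pure-Python helpers below
    can be used without torch, Pillow, or transformers installed.
    """
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Assumption: take the first ([CLS]) token as the image-level embedding.
    return outputs.last_hidden_state[:, 0].tolist()


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(embeddings, query_index, top_k=3):
    """Return the top_k (index, similarity) pairs closest to the query image."""
    query = embeddings[query_index]
    ranked = sorted(
        (
            (index, cosine_similarity(query, vector))
            for index, vector in enumerate(embeddings)
            if index != query_index
        ),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top_k]
```

A typical use would be `most_similar(embed_images(paths), query_index=0)`, with the returned indices joined back to whatever metadata accompanies each image.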

Showing a sample of 11 resources. View the full list on GitHub →