awesome-open-data-centric-ai
github.com/renumics/awesome-open-data-centric-ai
Curated list of open source tooling for data-centric AI on unstructured data.
Use this list with your AI agent
Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:
"Show me monitoring resources from awesome-open-data-centric-ai"
Installation instructions →
What's inside
Monitoring
- awesome list
- Detect data drift
Compute the cosine distance to the k-th nearest neighbor in the embedding space as the drift distance, then inspect critical segments.
- MLOps awesome lists
- this list
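The drift detection entry in this section can be illustrated with a minimal sketch. It assumes a reference set of embeddings (for example, from training data) and a set of new embeddings to score. The function names, the toy vectors, and the 0.5 threshold are hypothetical and not taken from the linked playbook. Neighbors are found by brute force to keep the example dependency-free.

```python
import math

def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def drift_scores(reference, candidates, k=1):
    """For each candidate embedding, return the cosine distance to its
    k-th nearest neighbor in the reference set (brute-force search)."""
    scores = []
    for c in candidates:
        dists = sorted(cosine_distance(c, r) for r in reference)
        scores.append(dists[k - 1])
    return scores

# Toy 2-D embeddings, made up for illustration.
reference = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
candidates = [[0.95, 0.05], [0.0, 1.0]]

scores = drift_scores(reference, candidates, k=2)

# Assumed threshold: candidates above it are queued for manual inspection.
flagged = [i for i, s in enumerate(scores) if s > 0.5]
```

A larger k makes the score less sensitive to a single close reference point. On real data the threshold would typically be chosen from the score distribution of held-out reference samples rather than fixed by hand.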
Cleaning
- Detect duplicates
Use the Annoy library to detect nearest neighbors in the embedding space and inspect data points that are duplicates / near duplicates.
- Detect image issues
Use the Cleanvision library to extract typical image issues (brightness, blur, aspect ratio, SNR, and duplicates) and identify critical segments through manual inspection.
- Detect outliers
Use the Cleanlab library to compute outlier scores based on model output (embeddings, probabilities) and inspect outlier candidates.
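As a rough illustration of the duplicate detection entry in this section, the sketch below flags pairs of embeddings whose cosine distance falls under a threshold. It deliberately swaps the Annoy index for an exhaustive pairwise comparison so that it runs without extra dependencies; on real datasets an approximate nearest-neighbor index would be the practical choice. The function name, the toy data, and the 0.01 threshold are assumptions.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def near_duplicate_pairs(embeddings, threshold=0.01):
    """Return (i, j, distance) for every pair closer than the threshold,
    sorted so the most similar pairs come first."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        d = cosine_distance(embeddings[i], embeddings[j])
        if d <= threshold:
            pairs.append((i, j, d))
    return sorted(pairs, key=lambda p: p[2])

# Toy embeddings, made up for illustration.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.001, 0.0],  # almost identical to item 0
    [0.0, 0.0, 1.0],
]

pairs = near_duplicate_pairs(embeddings)
duplicate_ids = [(i, j) for i, j, _ in pairs]
```

The returned pairs are only candidates. The entry itself calls for inspecting them before removing anything.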
Modeling
- Detect leakage
Use nearest neighbor distances to identify candidates for data leakage and manually inspect them.
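One way to read the leakage entry is as a cross-split nearest-neighbor search: for each test item, find the closest training item and flag the pair if the distance is suspiciously small. The sketch below assumes Euclidean distance on embeddings and a hand-picked threshold of 0.05. Both are illustrative choices, as are the function name and the toy data.

```python
import math

def leakage_candidates(train, test, threshold=0.05):
    """Return (test_index, train_index, distance) for every test item
    whose nearest training neighbor is closer than the threshold."""
    candidates = []
    for ti, t in enumerate(test):
        # Brute-force nearest neighbor in the training split.
        dist, ri = min((math.dist(t, r), ri) for ri, r in enumerate(train))
        if dist < threshold:
            candidates.append((ti, ri, dist))
    return candidates

# Toy 2-D embeddings, made up for illustration.
train = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
test = [[1.0, 1.01], [5.0, 5.0]]

candidates = leakage_candidates(train, test)
```

Here the first test item sits almost on top of a training item and is returned as a candidate, while the second is far from every training item and is not.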
Validation
- Inspect decision boundaries
Compute a decision boundary score based on certainty ratios and inspect the results in a scatter plot.
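The entry above does not define the certainty ratio, so the following sketch picks one plausible definition as an assumption: the runner-up class probability divided by the top class probability. Values near 1 mean the model can barely separate its two best guesses, which suggests the sample lies close to a decision boundary. The function name and the probabilities are made up, and the scatter plot step is left out.

```python
def certainty_ratio(probs):
    """Assumed score: second-highest probability divided by the highest.
    Close to 1 means near a decision boundary, close to 0 means confident."""
    top, runner_up = sorted(probs, reverse=True)[:2]
    return runner_up / top

# Toy per-class probabilities for three samples, made up for illustration.
probabilities = [
    [0.98, 0.01, 0.01],
    [0.50, 0.45, 0.05],
    [0.10, 0.20, 0.70],
]

scores = [certainty_ratio(p) for p in probabilities]

# Sample indices ordered from most to least ambiguous.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The scores could then be used as the color or size channel in a scatter plot of a 2-D projection of the embeddings, so that ambiguous regions stand out.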
Exploratory data analysis (EDA)
- Understand distributions
Use the Huggingface transformers library to compute image embeddings and explore the dataset based on the similarity map and additional metadata.
Showing a sample of 11 resources. View the full list on GitHub →