awesome-dataset-creation
github.com/jon-chun/awesome-dataset-creation
Curated list of resources for creating original datasets for original Data Science, Machine Learning and AI research and projects
What's inside
Libraries
- AirSim (Simulation)
AirSim is a simulator for drones, cars and more, built on Unreal and Unity engines.
- Contrastive Unpaired Translation (Image)
Contrastive unpaired image-to-image translation, with faster and lighter training than CycleGAN.
- Denoising Diffusion Pytorch (Image)
PyTorch implementation of Denoising Diffusion Probabilistic Models (DDPM).
- gretel-synthetics (Text, Tabular and Time-Series)
Generative models for structured and unstructured text, tabular, and multivariate time-series data, featuring differentially private learning.
- Jukebox (Audio)
OpenAI's Jukebox, a generative model for music.
- Nvidia Dataset Synthesizer (Simulation)
NDDS is a UE4 plugin from NVIDIA to empower computer vision researchers to export high-quality synthetic images with metadata.
Tutorials
- Annotated Diffusion (Reading Content)
Tutorial on the original diffusion model paper, with code.
- Learning to Generate Data by Estimating Gradients of the Data Distribution (Diffusion Models)
Video by Yang Song from Stanford. Excellent theory and interesting applications.
- The Unreasonable Effectiveness of Recurrent Neural Networks (Reading Content)
Andrej Karpathy's intro to RNNs.
Datasets
- Awesome Public Datasets
Topic-centric, high-quality public data sources.
- Data.gov
U.S. Government's open data
- Google Cloud Public Datasets
Publicly available and free machine learning and analytics datasets.
- Google Research Dataset Search
Discover datasets hosted in thousands of repositories across the web
- HuggingFace Datasets
Library for easily accessing and sharing datasets and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.
- Kaggle Datasets
Data science and machine learning datasets.
Academic Papers
Services
- List of Synthetic Data Startups in 2021
Not all of these necessarily have APIs.