awesome-data-engineering
github.com/igorbarinov/awesome-data-engineering
A curated list of data engineering tools for software developers
What's inside
Databases
- Actionbase
A database for user interactions (likes, views, follows) represented as graphs, with precomputed reads served in real-time.
- Akumuli
A numeric time-series database. It can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
- Amazon RDS
Makes it easy to set up, operate, and scale a relational database in the cloud.
- Apache Geode
An open source, distributed, in-memory database for scale-out applications.
- ArangoDB
A distributed free and open-source database with a flexible data model for documents, graphs, and key-values.
- ArcadeDB
Open-source multi-model database with native graph, document, key-value, and vector support. SQL, Cypher, and Gremlin query languages. Apache 2.0 license.
Testing
- Aegis DQ
Open-source agentic data quality framework with LLM-powered diagnosis, root-cause analysis, SQL auto-fix proposals, and 31 rule types. Works with DuckDB, Postgres, BigQuery, Databricks, Athena, and Snowflake.
- daffy
Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.
- DataDriven
Interview practice with SQL query execution, Python, and data modeling exercises.
- DataKitchen
Open-source data observability covering end-to-end data journeys, data profiling, anomaly detection, and auto-created data quality validation tests.
- DataScreenIQ
Real-time data quality firewall for pipelines and APIs. Screens rows in milliseconds for schema drift, null spikes, type mismatches, and data anomalies with PASS / WARN / BLOCK decisions.
- DQOps
An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
Community
- AI Dev Jobs (Forums)
Job board focused on AI, ML, and data engineering roles with 7,400+ listings, salary data, and a free REST API.
- Architecting an Apache Iceberg Lakehouse (Books)
A guide to designing an Apache Iceberg lakehouse from scratch.
- Best Data Science Books (Books)
A blog post with a curated list of top data science books, categorized by topic and learning stage.
- Chain of Thought (Podcasts)
Interviews with AI and data infrastructure leaders on building production systems.
- Data Council (Conferences)
The first technical conference that bridges the gap between data scientists, data engineers and data analysts.
- Data Engineering Podcast (Podcasts)
The show about modern data infrastructure.
Charts and Dashboards
- AI for Database
Agentic AI platform to connect any database (PostgreSQL, MySQL, MongoDB, etc.) and query in plain English; includes self-refreshing intelligent dashboards and action workflows triggered by data changes.
- Apache Superset
A modern, enterprise-ready business intelligence web application.
- C3.js
D3-based reusable chart library.
- D3.js
A JavaScript library for manipulating documents based on data.
- D3Plus
D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.
- Highcharts
A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
Data Ingestion
- Airbyte
Open-source data integration for modern data teams.
- Apache Pulsar
An open-source distributed pub-sub messaging system.
- Apache Sqoop
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
- Arpe.io
High-speed CLI tools for database export, import, replication and migration with parallel streaming to CSV, Parquet, JSON and cloud storage, supporting PostgreSQL, MySQL, Oracle, SQL Server and 80+ sources.
- Artie
Real-time data ingestion tool leveraging change data capture.
- AWS Data Wrangler
Utility belt to handle data on AWS.
Workflow
- Airflow
A system to programmatically author, schedule, and monitor data pipelines.
- Azkaban
A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
- Bonnard
Agent-native semantic layer with governed metrics, React SDK, and multi-warehouse support. Connects AI agents and dashboards to a single source of truth.
- Bruin
End-to-end data pipeline tool that combines ingestion, transformation (SQL + Python), and data quality in a single CLI. Connects to BigQuery, Snowflake, PostgreSQL, Redshift, and more. Includes VS Code extension with live previews.
- Cascading
Java-based application development platform.
- Census
A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.
Serialization format
- AKF
The AI native file format. Trust scores, source provenance, and compliance metadata that embed into 20+ formats (DOCX, PDF, images, code). EXIF for AI.
- Apache Avro
Apache Avro™ is a data serialization system.
- Apache ORC
The smallest, fastest columnar storage for Hadoop workloads.
- Apache Parquet
A columnar storage format available to any project in the Hadoop ecosystem.
- Apache Thrift
The Apache Thrift software framework, for scalable cross-language services development.
- Kryo
A fast and efficient object graph serialization framework for Java.
File System
- Alluxio
A memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.
- AWS S3
Object storage built to store and retrieve any amount of data from anywhere.
- CEPH
A unified, distributed storage system designed for excellent performance, reliability, and scalability.
- GlusterFS
A scalable, distributed network file system.
- HDFS
The Hadoop Distributed File System, designed to store very large files across clusters of commodity hardware.
- JuiceFS
A high-performance Cloud-Native file system driven by object storage for large-scale data storage.
Showing a sample of 290 resources. The full list is on GitHub.