
A curated list of data engineering tools for software developers

GitHub Stars: 8.7k
Curated Resources: 290
Categories: 17
Last Refreshed: 17 hours ago
Databases | Data Comparison | Data Ingestion | File System | Serialization format | Stream Processing | Batch Processing | Charts and Dashboards | Workflow | Data Lake Management | ELK Elastic Logstash Kibana | Docker | Datasets | Monitoring | Profiling | Testing | Community

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me forums resources from awesome-data-engineering"

Installation instructions →

What's inside

Databases

  • Actionbase

    A database for user interactions (likes, views, follows) represented as graphs, with precomputed reads served in real-time.

  • Akumuli

    A numeric time-series database. It can be used to capture, store, and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".

  • Amazon RDS

    Makes it easy to set up, operate, and scale a relational database in the cloud.

  • Apache Geode

    An open source, distributed, in-memory database for scale-out applications.

  • ArangoDB

    A distributed, free and open-source database with a flexible data model for documents, graphs, and key-values.

  • ArcadeDB

    Open-source multi-model database with native graph, document, key-value, and vector support. SQL, Cypher, and Gremlin query languages. Apache 2.0 license.

Testing

  • Aegis DQ

    Open-source agentic data quality framework with LLM-powered diagnosis, root-cause analysis, SQL auto-fix proposals, and 31 rule types. Works with DuckDB, Postgres, BigQuery, Databricks, Athena, and Snowflake.

  • daffy

    Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.

  • DataDriven

    Interview practice with SQL query execution, Python, and data modeling exercises.

  • DataKitchen

    Open Source Data Observability for end-to-end Data Journey Observability, data profiling, anomaly detection, and auto-created data quality validation tests.

  • DataScreenIQ

    Real-time data quality firewall for pipelines and APIs. Screens rows in milliseconds for schema drift, null spikes, type mismatches, and data anomalies with PASS / WARN / BLOCK decisions.

  • DQOps

    An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
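
The "contracts at function boundaries" idea mentioned in the daffy entry can be illustrated with a minimal sketch. This is not daffy's API: the `require_columns` decorator and the dict-of-lists table are hypothetical stand-ins, chosen so the example runs with the standard library alone.

```python
from functools import wraps


def require_columns(*columns):
    """Hypothetical decorator: reject inputs that lack the named columns."""
    def decorator(func):
        @wraps(func)
        def wrapper(table, *args, **kwargs):
            # The table is assumed to be a dict mapping column name -> list of values.
            missing = [c for c in columns if c not in table]
            if missing:
                raise ValueError(f"{func.__name__}: missing columns {missing}")
            return func(table, *args, **kwargs)
        return wrapper
    return decorator


@require_columns("price", "quantity")
def add_revenue(table):
    """Return a copy of the table with a derived 'revenue' column."""
    revenue = [p * q for p, q in zip(table["price"], table["quantity"])]
    return {**table, "revenue": revenue}
```

Calling `add_revenue` with a table that lacks `quantity` raises a `ValueError` before the function body runs, which is the point of checking the contract at the boundary.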

Community

  • AI Dev Jobs (Forums)

    Job board focused on AI, ML, and data engineering roles with 7,400+ listings, salary data, and a free REST API.

  • Architecting an Apache Iceberg Lakehouse (Books)

    A guide to designing an Apache Iceberg lakehouse from scratch.

  • Best Data Science Books (Books)

    This blog offers a curated list of top data science books, categorized by topics and learning stages, to aid readers in building foundational knowledge and staying updated with industry trends.

  • Chain of Thought (Podcasts)

    Interviews with AI and data infrastructure leaders on building production systems.

  • Data Council (Conferences)

    The first technical conference that bridges the gap between data scientists, data engineers and data analysts.

  • Data Engineering Podcast (Podcasts)

    The show about modern data infrastructure.

Charts and Dashboards

  • AI for Database

    Agentic AI platform to connect any database (PostgreSQL, MySQL, MongoDB, etc.) and query in plain English; includes self-refreshing intelligent dashboards and action workflows triggered by data changes.

  • Apache Superset

    A modern, enterprise-ready business intelligence web application.

  • C3.js

    D3-based reusable chart library.

  • D3.js

    A JavaScript library for manipulating documents based on data.

  • D3Plus

    D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.

  • Highcharts

    A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.

Data Ingestion

  • Airbyte

    Open-source data integration for modern data teams.

  • Apache Pulsar

    An open-source distributed pub-sub messaging system.

  • Apache Sqoop

    A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

  • Arpe.io

    High-speed CLI tools for database export, import, replication and migration with parallel streaming to CSV, Parquet, JSON and cloud storage, supporting PostgreSQL, MySQL, Oracle, SQL Server and 80+ sources.

  • Artie

    Real-time data ingestion tool leveraging change data capture.

  • AWS Data Wrangler

    Utility belt to handle data on AWS.

Workflow

  • Airflow

    A system to programmatically author, schedule, and monitor data pipelines.

  • Azkaban

    A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.

  • Bonnard

    Agent-native semantic layer with governed metrics, React SDK, and multi-warehouse support. Connects AI agents and dashboards to a single source of truth.

  • Bruin

    End-to-end data pipeline tool that combines ingestion, transformation (SQL + Python), and data quality in a single CLI. Connects to BigQuery, Snowflake, PostgreSQL, Redshift, and more. Includes VS Code extension with live previews.

  • Cascading

    Java-based application development platform.

  • Census

    A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.
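
The Azkaban entry above says that ordering is resolved through job dependencies. A minimal sketch of that general idea follows, using Python's standard `graphlib` module. It is not Azkaban's implementation, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it depends on.
jobs = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def run_order(dependencies):
    """Return an execution order in which every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(jobs)
print(order)
```

A cycle in the graph makes `static_order()` raise `graphlib.CycleError`, which gives a scheduler a natural place to reject an invalid workflow definition.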

Serialization format

  • AKF

    The AI native file format. Trust scores, source provenance, and compliance metadata that embed into 20+ formats (DOCX, PDF, images, code). EXIF for AI.

  • Apache Avro

    Apache Avro™ is a data serialization system.

  • Apache ORC

    The smallest, fastest columnar storage for Hadoop workloads.

  • Apache Parquet

    A columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

  • Apache Thrift

    The Apache Thrift software framework, for scalable cross-language services development.

  • Kryo

    A fast and efficient object graph serialization framework for Java.

File System

  • Alluxio

    A memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.

  • AWS S3

    Object storage built to store and retrieve any amount of data from anywhere.

  • CEPH

    A unified, distributed storage system designed for excellent performance, reliability, and scalability.

  • GlusterFS

    Gluster Filesystem.

  • HDFS

    The Hadoop Distributed File System, the distributed storage layer of Apache Hadoop.

  • JuiceFS

    A high-performance Cloud-Native file system driven by object storage for large-scale data storage.

Showing a sample of 290 resources. View the full list on GitHub →