awesome-etl

A curated list of awesome ETL frameworks, libraries, and software.

3.6k GitHub Stars

Airbyte
"Airbyte is an open-source data integration engine that helps you consolidate your data in your data warehouses, lakes and databases."
Alteryx
"combines data preparation, data blending, and analytics — predictive, statistical, and spatial — in a visual workflow designer."
AWS Batch
"enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS."
AWS Glue
"a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources."
Cloud Data Fusion
"Fully managed, cloud-native data integration platform."
Fivetran
"automates data movement from disparate sources into your destination."

Airflow
"Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed."
Argo
"an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes."
Dagster
"Dagster is a data orchestrator for machine learning, analytics, and ETL. It lets you define pipelines in terms of the data flow between reusable, logical components, then test locally and run anywhere. With a unified view of pipelines and the assets they produce, Dagster can schedule and orchestrate Pandas, Spark, SQL, or anything else that Python can invoke."
Luigi
"a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in."
Prefect
"a workflow orchestration framework for building resilient data pipelines in Python."
Temporal
"a scalable and reliable runtime for durable function executions called Temporal Workflow Executions."
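
Several of the orchestrators above (Airflow, Luigi, Dagster) describe a workflow as tasks with dependencies between them. As a rough illustration of that idea only, and not of any listed tool's API, the sketch below uses Python's standard-library graphlib to run tasks in dependency order. The task names and the run_pipeline helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task after all of its upstream tasks; return the execution order."""
    order = []
    # static_order() yields every node only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


log = []
# Hypothetical tasks: each one just records that it ran.
tasks = {
    "extract": lambda: log.append("extracted"),
    "transform": lambda: log.append("transformed"),
    "load": lambda: log.append("loaded"),
}
# Map each task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = run_pipeline(tasks, dependencies)
print(order)
```

Real orchestrators add scheduling, retries, distributed workers, and monitoring on top of this ordering step; the sketch covers only dependency resolution.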

Apache Beam
"a unified programming model for Batch and Streaming data processing."
Apache Flink
"a framework and distributed processing engine for stateful computations over unbounded and bounded data streams."
Debezium
"Change data capture for a variety of databases."
Kafka Connect
"a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka."
Spark
"a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, GraphX for graph processing, and Spark Streaming."

Apache Camel
"an open source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data."
Spring Batch
"A lightweight, comprehensive batch framework designed to enable the development of robust batch applications that are vital for the daily operations of enterprise systems."

Apache NiFi
"a rich, web-based interface for designing, controlling, and monitoring a dataflow."
CDAP
"Use Cask Data Application Platform to visually build and manage data applications in hybrid and multi-cloud environments."
Informatica PowerCenter
An ETL tool for extracting data from source systems, transforming it, and loading it into target systems using a visual mapping and workflow designer.
Microsoft SSIS
"a component of the Microsoft SQL Server database software that can be used to perform a broad range of data migration tasks."
n8n
"Free and open fair-code licensed node based Workflow Automation Tool. Easily automate tasks across different services."
Pentaho Data Integration (PDI)
"a graphical ETL tool for designing data integration workflows using a drag-and-drop interface, also known as Kettle."

BeautifulSoup
"a Python library for pulling data out of HTML and XML files."
Celery
"an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well."
Dask
"a flexible parallel computing library for analytics."
dataset
A wrapper around SQLAlchemy that simplifies database operations (including upserting).
dbt-core
"enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications."
dlt
"an open-source Python library that loads data from various, often messy data sources into well-structured datasets."
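
The dataset entry above mentions upserting, meaning insert a row if its key is new and update it otherwise. A minimal sketch of that operation follows, using only Python's built-in sqlite3 module rather than dataset or SQLAlchemy. The users table and the upsert_user helper are hypothetical, and the ON CONFLICT clause assumes SQLite 3.24 or newer.

```python
import sqlite3


def upsert_user(conn, user_id, name):
    """Insert a user, or update the name if the id already exists."""
    conn.execute(
        "INSERT INTO users (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (user_id, name),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

upsert_user(conn, 1, "Ada")           # new key: inserts
upsert_user(conn, 1, "Ada Lovelace")  # existing key: updates in place
upsert_user(conn, 2, "Grace")         # new key: inserts

rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)
```

Upserts matter in load steps because they make re-running a job safe: replaying the same batch updates existing rows instead of duplicating them.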

CloudQuery
"a cloud asset inventory built for platform teams. Sync your cloud infrastructure metadata into your data warehouse, powering insights and automation."
Pachyderm
"provides parallelized processing of multi-stage, language-agnostic pipelines with data versioning and data lineage tracking."
Redpanda Connect
"a declarative data streaming and integration tool with 300+ pre-built connectors, configured via YAML."

Showing a sample of 66 resources. View the full list on GitHub →