awesome-dataops

A curated list of awesome DataOps tools

235 GitHub stars, 143 curated resources

Age (Graph Database)
A multi-model database that supports both graph and relational data models.
Akumuli (Time Series Database)
Can be used to capture, store and process time-series data in real-time.
Apache Accumulo (Key-Value Database)
A sorted, distributed key-value store that provides robust and scalable data storage.
Apache Cassandra (Columnar Database)
Open source, column-based DBMS designed to handle large amounts of data.
Apache CouchDB (Document-Oriented Database)
An open-source document-oriented NoSQL database, implemented in Erlang.
Apache Druid (Columnar Database)
Designed to quickly ingest massive quantities of event data, and provide low-latency queries.

Alluxio
A virtual distributed storage system.
Amazon Simple Storage Service (S3)
Object storage built to retrieve any amount of data from anywhere.
Apache Hadoop Distributed File System (HDFS)
A distributed file system.
GlusterFS
Software-defined distributed storage that can scale to several petabytes.
Google Cloud Storage (GCS)
Object storage for companies of all sizes, to store any amount of data.
LakeFS
Open source tool that transforms your object storage into a Git-like repository.

Amazon Kinesis
Easily collect, process, and analyze video and data streams in real time.
Apache Gobblin
A framework that simplifies common aspects of big data such as data ingestion.
Apache Kafka
Open-source distributed event streaming platform used by thousands of companies.
Apache Pulsar
Distributed pub-sub messaging platform with a flexible messaging model and intuitive API.
Embulk
A parallel bulk data loader that helps data transfer between various storages.
Fluentd
Collects events from various data sources and writes them to files.

Amazon Redshift
Accelerate your time to insights with fast, easy, and secure cloud data warehousing.
Apache Hive
Facilitates reading, writing, and managing large datasets residing in distributed storage.
Apache Kylin
An open source, distributed analytical data warehouse for big data.
Google BigQuery
Serverless, highly scalable, and cost-effective multicloud data warehouse.

Amundsen
Data discovery and metadata engine for improving productivity when interacting with data.
Apache Atlas
Provides open metadata management and governance capabilities to build a data catalog.
CKAN
Open-source DMS (data management system) for powering data hubs and data portals.
DataHub
LinkedIn's generalized metadata search & discovery tool.
Magda
A federated, open-source data catalog for all your big data and small data.
Marquez
Service for the collection, aggregation, and visualization of a data ecosystem's metadata.

Apache Airflow
A platform to programmatically author, schedule, and monitor workflows.
Apache Oozie
An extensible, scalable and reliable system to manage complex Hadoop workloads.
Azkaban
Batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
Dagster
An orchestration platform for the development, production, and observation of data assets.
Luigi
Python module that helps you build complex pipelines of batch jobs.
Prefect
A workflow management system, designed for modern infrastructure.

Apache Avro
A compact, fast data serialization system that provides rich data structures.
Apache Hudi (Data Table Format)
Manages the storage of large analytical datasets on DFS.
Apache Iceberg (Data Table Format)
Open table format for huge analytic datasets.
Apache ORC
A self-describing, type-aware columnar file format designed for Hadoop workloads.
Apache Parquet
A columnar storage format which provides efficient storage and encoding of data.
Delta Lake (Data Table Format)
An open source project that enables building a Lakehouse architecture on top of data lakes.

Apache Beam
A unified model for defining both batch and streaming data-parallel processing pipelines.
Apache Flink
An open source stream processing framework with powerful capabilities.
Apache Hadoop MapReduce
A framework for writing applications which process vast amounts of data.
Apache NiFi
An easy-to-use, powerful, and reliable system to process and distribute data.
Apache Samza
A distributed stream processing framework which uses Apache Kafka and Hadoop YARN.
Apache Spark
A unified analytics engine for large-scale data processing.

This page shows a sample of the 143 resources; the full list is on GitHub.