awesome-dataops
github.com/kelvins/awesome-dataops
A curated list of awesome DataOps tools
What's inside
Database
- Age (Graph Database)
A multi-model database that supports both graph and relational data models.
- Akumuli (Time Series Database)
Can be used to capture, store and process time-series data in real-time.
- Apache Accumulo (Key-Value Database)
A sorted, distributed key-value store that provides robust and scalable data storage.
- Apache Cassandra (Columnar Database)
An open-source, column-based DBMS designed to handle large amounts of data.
- Apache CouchDB (Document-Oriented Database)
An open-source document-oriented NoSQL database, implemented in Erlang.
- Apache Druid (Columnar Database)
Designed to quickly ingest massive quantities of event data, and provide low-latency queries.
File System
- Alluxio
A virtual distributed storage system.
- Amazon Simple Storage Service (S3)
Object storage built to retrieve any amount of data from anywhere.
- Apache Hadoop Distributed File System (HDFS)
A distributed file system.
- GlusterFS
Software-defined distributed storage that can scale to several petabytes.
- Google Cloud Storage (GCS)
Object storage for companies of all sizes, to store any amount of data.
- LakeFS
Open source tool that transforms your object storage into a Git-like repository.
Data Ingestion
- Amazon Kinesis
Easily collect, process, and analyze video and data streams in real time.
- Apache Gobblin
A framework that simplifies common aspects of big data such as data ingestion.
- Apache Kafka
Open-source distributed event streaming platform used by thousands of companies.
- Apache Pulsar
Distributed pub-sub messaging platform with a flexible messaging model and intuitive API.
- Embulk
A parallel bulk data loader that helps transfer data between various storage systems.
- Fluentd
Collects events from various data sources and writes them to files.
Data Warehouse
- Amazon Redshift
Accelerate your time to insights with fast, easy, and secure cloud data warehousing.
- Apache Hive
Facilitates reading, writing, and managing large datasets residing in distributed storage.
- Apache Kylin
An open source, distributed analytical data warehouse for big data.
- Google BigQuery
Serverless, highly scalable, and cost-effective multicloud data warehouse.
Data Catalog
- Amundsen
Data discovery and metadata engine for improving productivity when interacting with data.
- Apache Atlas
Provides open metadata management and governance capabilities to build a data catalog.
- CKAN
Open-source DMS (data management system) for powering data hubs and data portals.
- DataHub
LinkedIn's generalized metadata search & discovery tool.
- Magda
A federated, open-source data catalog for all your big data and small data.
- Marquez
Service for the collection, aggregation, and visualization of a data ecosystem's metadata.
Data Workflow
- Apache Airflow
A platform to programmatically author, schedule, and monitor workflows.
- Apache Oozie
An extensible, scalable and reliable system to manage complex Hadoop workloads.
- Azkaban
Batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
- Dagster
An orchestration platform for the development, production, and observation of data assets.
- Luigi
Python module that helps you build complex pipelines of batch jobs.
- Prefect
A workflow management system, designed for modern infrastructure.
Data Serialization
- Apache Avro
A compact, fast data serialization system that provides rich data structures.
- Apache Hudi (Data Table Format)
Manages the storage of large analytical datasets on a distributed file system (DFS).
- Apache Iceberg (Data Table Format)
Open table format for huge analytic datasets.
- Apache ORC
A self-describing type-aware columnar file format designed for Hadoop workloads.
- Apache Parquet
A columnar storage format which provides efficient storage and encoding of data.
- Delta Lake (Data Table Format)
An open source project that enables building a Lakehouse architecture on top of data lakes.
Data Processing
- Apache Beam
A unified model for defining both batch and streaming data-parallel processing pipelines.
- Apache Flink
An open source stream processing framework with powerful capabilities.
- Apache Hadoop MapReduce
A framework for writing applications which process vast amounts of data.
- Apache NiFi
An easy-to-use, powerful, and reliable system to process and distribute data.
- Apache Samza
A distributed stream processing framework which uses Apache Kafka and Hadoop YARN.
- Apache Spark
A unified analytics engine for large-scale data processing.
Showing a sample of 143 resources. View the full list on GitHub.