A curated list of awesome DataOps tools

229 GitHub Stars
143 Curated Resources
18 Categories
Last refreshed 22 hours ago
Data Catalog · Data Exploration · Data Ingestion · Data Workflow · Data Processing · Data Quality · Data Serialization · Data Visualization · Data Warehouse · Database · File System · Logging and Monitoring · Metadata Service · SQL Playground · SQL Query Engine · Books · Other Lists · Slack

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me graph database resources from awesome-dataops"

What's inside

Database

  • Age (Graph Database)

    A multi-model database that supports both graph and relational data models.

  • Akumuli (Time Series Database)

    Can be used to capture, store and process time-series data in real-time.

  • Apache Accumulo (Key-Value Database)

    A sorted, distributed key-value store that provides robust and scalable data storage.

  • Apache Cassandra (Columnar Database)

    Open-source, column-based DBMS designed to handle large amounts of data.

  • Apache CouchDB (Document-Oriented Database)

    An open-source document-oriented NoSQL database, implemented in Erlang.

  • Apache Druid (Columnar Database)

    Designed to quickly ingest massive quantities of event data, and provide low-latency queries.

File System

Data Ingestion

  • Amazon Kinesis

    Easily collect, process, and analyze video and data streams in real time.

  • Apache Gobblin

    A framework that simplifies common aspects of big data such as data ingestion.

  • Apache Kafka

    Open-source distributed event streaming platform used by thousands of companies.

  • Apache Pulsar

    Distributed pub-sub messaging platform with a flexible messaging model and intuitive API.

  • Embulk

    A parallel bulk data loader that helps transfer data between various storage systems.

  • Fluentd

    Collects events from various data sources and writes them to files.

Data Warehouse

  • Amazon Redshift

    Accelerate your time to insights with fast, easy, and secure cloud data warehousing.

  • Apache Hive

    Facilitates reading, writing, and managing large datasets residing in distributed storage.

  • Apache Kylin

    An open source, distributed analytical data warehouse for big data.

  • Google BigQuery

    Serverless, highly scalable, and cost-effective multicloud data warehouse.

Data Catalog

  • Amundsen

    Data discovery and metadata engine for improving productivity when interacting with data.

  • Apache Atlas

    Provides open metadata management and governance capabilities to build a data catalog.

  • CKAN

    Open-source DMS (data management system) for powering data hubs and data portals.

  • DataHub

    LinkedIn's generalized metadata search & discovery tool.

  • Magda

    A federated, open-source data catalog for all your big data and small data.

  • Marquez

    Service for the collection, aggregation, and visualization of a data ecosystem's metadata.

Data Workflow

  • Apache Airflow

    A platform to programmatically author, schedule, and monitor workflows.

  • Apache Oozie

    An extensible, scalable and reliable system to manage complex Hadoop workloads.

  • Azkaban

    Batch workflow job scheduler created at LinkedIn to run Hadoop jobs.

  • Dagster

    An orchestration platform for the development, production, and observation of data assets.

  • Luigi

    Python module that helps you build complex pipelines of batch jobs.

  • Prefect

    A workflow management system, designed for modern infrastructure.

Data Serialization

  • Apache Avro

    A data serialization system which is compact, fast and provides rich data structures.

  • Apache Hudi (Data Table Format)

    Manages the storage of large analytical datasets on DFS.

  • Apache Iceberg (Data Table Format)

    Open table format for huge analytic datasets.

  • Apache ORC

    A self-describing type-aware columnar file format designed for Hadoop workloads.

  • Apache Parquet

    A columnar storage format which provides efficient storage and encoding of data.

  • Delta Lake (Data Table Format)

    An open source project that enables building a Lakehouse architecture on top of data lakes.

Data Processing

  • Apache Beam

    A unified model for defining both batch and streaming data-parallel processing pipelines.

  • Apache Flink

    An open source stream processing framework with powerful capabilities.

  • Apache Hadoop MapReduce

    A framework for writing applications which process vast amounts of data.

  • Apache NiFi

    An easy-to-use, powerful, and reliable system to process and distribute data.

  • Apache Samza

    A distributed stream processing framework which uses Apache Kafka and Hadoop YARN.

  • Apache Spark

    A unified analytics engine for large-scale data processing.

Showing a sample of 143 resources. View the full list on GitHub →