🧊 A curated list of OLAP databases, data lake tools, columnar engines, and analytics frameworks for data engineers.

GitHub Stars: 127
Curated Resources: 219
Categories: 16
Last Refreshed: 21 hours ago
OLAP Databases · Storage engines · Data lake · Brokers and distributed messaging · Ingestion and querying · Scheduler · Durable execution · ETL, ELT and reverse ETL · BI & Visualization · Datasets · Benchmark · Readings · People to follow · Events · Communities · License

Use this list with your AI agent

Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:

"Show me open table formats resources from awesome-olap"

Installation instructions →

What's inside

Data lake

Readings

  • ACID properties · Transactions

    Atomicity, Consistency, Isolation, Durability: the four guarantees that define correct database transaction behavior.

  • ANN (approximate nearest neighbor) · Vector similarity search

    Family of algorithms that trade exact accuracy for speed when finding the closest vectors in high-dimensional space.

  • Antithesis · Blogs to follow

    Blog from the autonomous testing platform covering distributed systems correctness, fault injection, and database reliability.

  • Apache Arrow SIMD parallel processing · Vectorized query processing

    Single Instruction, Multiple Data (SIMD): CPU data-level parallelism that applies one instruction to multiple columnar values at once.

  • Apache Arrow vectorized execution · Vectorized query processing

    Talk on how Arrow's columnar memory layout enables SIMD-accelerated batch processing in query engines.

  • Apache Flink state management · Papers

    Carbone et al. (2017) on Flink's state backend, incremental checkpointing, and exactly-once fault tolerance.
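
    To make the ACID properties entry above more concrete, here is a minimal sketch of atomicity using Python's built-in sqlite3 module. The accounts table, the account names, and the transfer helper are hypothetical and exist only for illustration; they do not come from any resource in this list.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if the block raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return all balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

    A transfer that would overdraw the source account returns False and leaves both balances unchanged, because the partial debit is rolled back together with the credit.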

Events

  • ACM SIGMOD/PODS

    Leading international forum for database researchers and practitioners.

  • Community Over Code

    The Apache Software Foundation's official conference (formerly ApacheCon).

  • Confluent Current

    The Data Streaming Event focused on Apache Kafka and real-time data streaming.

  • Databricks Data+AI Summit

    The world's largest data, analytics, and AI conference.

  • Data Council

    Technical conference on data engineering, infrastructure, and analytics.

  • dbt Summit

    The world's largest gathering of dbt users and analytics engineering practitioners.

ETL, ELT and reverse ETL

  • Airbyte

    Open-source ELT platform with 300+ pre-built connectors for syncing data to your warehouse.

  • Census

    Reverse ETL platform for syncing data warehouse data to CRMs, ad tools, and other SaaS.

  • dbt

    SQL-based transformation framework that runs inside your warehouse; the standard tool for the T in ELT.

  • Debezium

    Open-source CDC (Change Data Capture) platform that streams row-level changes from databases like PostgreSQL, MySQL, and MongoDB into Kafka and downstream systems.

Ingestion and querying

  • Akka Streams · Stream processing

    Reactive stream processing library for JVM, built on the actor model.

  • Apache Arrow · In-memory processing

    Low-level in-memory columnar data format with zero-copy access across languages via gRPC/IPC interfaces.

  • Apache Arrow DataFusion · In-memory processing

    High-level SQL and DataFrame query engine built on Apache Arrow, written in Rust.

  • Apache Beam · Stream processing

    Unified SDK for cross-language stream and batch processing. Available in Go, Python, Java, Scala and TypeScript.

  • Apache Flink · Stream processing

    Stateful stream processing with exactly-once semantics, supporting event time and out-of-order data.

  • Apache Kafka Streams · Stream processing

    Lightweight stream processing library embedded in the Kafka client, no separate cluster required.

People to follow

Scheduler

  • Apache Airflow

    Platform for programmatically authoring, scheduling, and monitoring data pipelines as DAGs.

  • Dagster

    Data orchestration platform with an asset-centric approach, lineage tracking, and built-in observability.
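
    Both schedulers above model a pipeline as a DAG of tasks. The sketch below shows the underlying idea only, using Python's standard-library graphlib; it is not the Airflow or Dagster API, and the task names and the run_order helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return one valid execution order for a task graph.

    `dependencies` maps each task name to the set of tasks that must
    finish before it can start. A cyclic graph raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-step pipeline: extract, then transform, then load.
pipeline = {
    "transform": {"extract"},
    "load": {"transform"},
}
order = run_order(pipeline)
```

    A real scheduler adds retries, scheduling intervals, and state tracking on top of this ordering step.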

OLAP Databases

  • Apache Doris · Real-time analytics

    MPP analytical database with MySQL-compatible interface, optimized for high-concurrency queries and real-time data ingestion.

  • Apache Druid · Real-time analytics

    Real-time OLAP database optimized for streaming ingestion, time-series analytics, and sub-second queries on high-cardinality data.

  • Apache HBase · Real-time analytics

    Distributed, wide-column NoSQL database on top of HDFS, modeled after Google Bigtable.

  • Apache Pinot · Real-time analytics

    Distributed OLAP datastore for user-facing real-time analytics, designed for low-latency queries at high concurrency.

  • AWS Redshift · Managed cloud services

    Fully managed petabyte-scale data warehouse on AWS.

  • Azure Synapse Analytics · Managed cloud services

    Unified analytics service combining data integration, warehousing, and big data on Azure.

Showing a sample of 219 resources. View the full list on GitHub →