awesome-olap
github.com/samber/awesome-olap
A curated list of OLAP databases, data lake tools, columnar engines, and analytics frameworks for data engineers.
Use this list with your AI agent
Add the Context Awesome MCP server to Claude, Cursor, or any MCP client, then ask:
"Show me open table formats resources from awesome-olap"
Installation instructions
What's inside
Data lake
- (2022) Open Table Formats: Delta vs Iceberg vs Hudi (Open table formats)
- (2023) Choosing an open table format for your transactional data lake on AWS (Open table formats)
- (2024) Apache Iceberg vs Delta Lake vs Apache Hudi: Choosing the Right Table Format (Open table formats)
- Apache Arrow Columnar Format (File formats and serialization)
Columnar format for in-memory Apache Arrow processing.
- Apache Avro (File formats and serialization)
Row-oriented serialization format for data streaming.
- Apache HDFS (Object Storage)
Hadoop distributed file system, the original large-scale storage layer for the big data ecosystem.
Readings
- ACID properties (Transactions)
Atomicity, Consistency, Isolation, Durability: the four guarantees that define correct database transaction behavior.
- ANN (approximate nearest neighbor) (Vector similarity search)
Family of algorithms that trade exact accuracy for speed when finding the closest vectors in high-dimensional space.
- Antithesis (Blogs to follow)
Blog from the autonomous testing platform covering distributed systems correctness, fault injection, and database reliability.
- Apache Arrow SIMD parallel processing (Vectorized query processing)
Single Instruction, Multiple Data: CPU data-level parallelism that applies one instruction to multiple columnar values at once.
- Apache Arrow vectorized execution (Vectorized query processing)
Talk on how Arrow's columnar memory layout enables SIMD-accelerated batch processing in query engines.
- Apache Flink state management (Papers)
Carbone et al. (2017) on Flink's state backend, incremental checkpointing, and exactly-once fault tolerance.
Events
- ACM SIGMOD/PODS
Leading international forum for database researchers and practitioners.
- Community Over Code
The Apache Software Foundation's official conference (formerly ApacheCon).
- Confluent Current
Confluent's conference on Apache Kafka and real-time data streaming.
- Databricks Data+AI Summit
The world's largest data, analytics, and AI conference.
- Data Council
Technical conference on data engineering, infrastructure, and analytics.
- dbt Summit
The world's largest gathering of dbt users and analytics engineering practitioners.
ETL, ELT and reverse ETL
- Airbyte
Open-source ELT platform with 300+ pre-built connectors for syncing data to your warehouse.
- Census
Reverse ETL platform for syncing data warehouse data to CRMs, ad tools, and other SaaS.
- dbt
SQL-based transformation framework that runs inside your warehouse; the standard tool for the T in ELT.
- Debezium
Open-source CDC (Change Data Capture) platform that streams row-level changes from databases like PostgreSQL, MySQL, and MongoDB into Kafka and downstream systems.
Ingestion and querying
- Akka Streams (Stream processing)
Reactive stream processing library for JVM, built on the actor model.
- Apache Arrow (In-memory processing)
Low-level in-memory columnar data format with zero-copy access across languages via gRPC/IPC interfaces.
- Apache Arrow DataFusion (In-memory processing)
High-level SQL and DataFrame query engine built on Apache Arrow, written in Rust.
- Apache Beam (Stream processing)
Unified SDK for cross-language stream and batch processing. Available in Go, Python, Java, Scala and TypeScript.
- Apache Flink (Stream processing)
Stateful stream processing with exactly-once semantics, supporting event time and out-of-order data.
- Apache Kafka Streams (Stream processing)
Lightweight stream processing library embedded in the Kafka client, no separate cluster required.
People to follow
- Alexey Milovidov
Co-founder and CTO of ClickHouse
- Andrew Lamb
PMC member for Apache Arrow, DataFusion, and Parquet
- Andy Grove
PMC member of Apache Arrow and DataFusion. Author of "How Query Engines Work"
- Fokko Driesprong
PMC member on Apache Avro, Airflow, Druid, Iceberg, and Parquet
- Gian Merlino
Co-founder and CTO of Imply, co-creator of Apache Druid
- Hannes Mühleisen
Co-creator of DuckDB, CEO of DuckDB Labs
Scheduler
- Apache Airflow
Platform for programmatically authoring, scheduling, and monitoring data pipelines as DAGs.
- Dagster
Data orchestration platform with an asset-centric approach, lineage tracking, and built-in observability.
OLAP Databases
- Apache Doris (Real-time analytics)
MPP analytical database with MySQL-compatible interface, optimized for high-concurrency queries and real-time data ingestion.
- Apache Druid (Real-time analytics)
Real-time OLAP database optimized for streaming ingestion, time-series analytics, and sub-second queries on high-cardinality data.
- Apache HBase (Real-time analytics)
Distributed, wide-column NoSQL database on top of HDFS, modeled after Google Bigtable.
- Apache Pinot (Real-time analytics)
Distributed OLAP datastore for user-facing real-time analytics, designed for low-latency queries at high concurrency.
- AWS Redshift (Managed cloud services)
Fully managed petabyte-scale data warehouse on AWS.
- Azure Synapse Analytics (Managed cloud services)
Unified analytics service combining data integration, warehousing, and big data on Azure.
Showing a sample of 219 resources. View the full list on GitHub.