awesome-datalake

📚 Awesome DataLake

Alluxio
Data orchestration for analytics and machine learning in the cloud.
DVC
ML Experiments and Data Management with Git
HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
lakeFS
lakeFS is an open-source tool that transforms your object storage into a Git-like repository. It enables you to manage your data lake the way you manage your code.
MinIO
MinIO is a high-performance object store released under the GNU Affero General Public License v3.0. It is API-compatible with the Amazon S3 cloud storage service.
Nessie
Project Nessie is a Transactional Catalog for Data Lakes with Git-like semantics.

Apache Amoro
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
Geolake
A universal solution for geospatial data, described by its authors as the first in the industry tailored to data lakehouse systems.
LakeSoul
LakeSoul is an end-to-end, real-time, cloud-native Lakehouse framework with fast data ingestion, concurrent updates, and incremental data analytics on cloud storage for both BI and AI applications.
LHBench
Lakehouse storage system benchmark.
OpenHouse
Open Control Plane for Tables in Data Lakehouse.

Apache Avro
Apache Avro is a data serialization system.
Apache ORC
ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads.
Apache Parquet
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high-performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.

Apache Flink
Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities.
Apache Hive
The Apache Hive (TM) data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
Apache Sedona
A cluster computing framework for processing large-scale geospatial data.
Apache Spark
Spark is a unified analytics engine for large-scale data processing.
Doris
Apache Doris is an easy-to-use, high-performance, unified analytics database. It can access databases and data lakes including Apache Hive, Apache Iceberg, Apache Hudi, Apache Paimon, LakeSoul, Elasticsearch, MySQL, Oracle, and SQL Server.
Dremio
Dremio is a next-generation data lake engine that liberates your data with live, interactive queries directly on cloud data lake storage, including S3 and lakeFS.

Apache Gravitino
Apache Gravitino is a high-performance, geo-distributed, and federated metadata lake. It manages the metadata directly in different sources, types, and regions. It also provides users with unified metadata access for data and AI assets.
Metacat
Metacat is a unified metadata exploration API service. It lets you explore metadata from Hive, RDS, Teradata, Redshift, S3, and Cassandra.
Polaris Catalog
Polaris Catalog is an open source catalog for Apache Iceberg. Polaris Catalog implements Iceberg’s open REST API for multi-engine interoperability with Apache Doris, Apache Flink, Apache Spark, PyIceberg, StarRocks and Trino.
Unity Catalog
Open, Multi-modal Catalog for Data & AI.

Apache Hudi
Upserts, deletes, and incremental processing on big data.
Apache Iceberg
Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
Apache Paimon
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Apache XTable
Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
Delta Lake

Apache Ranger
A framework to enable, monitor, and manage comprehensive data security across the Hadoop platform and beyond.
Kerberos
The Network Authentication Protocol.

Cuelake
Use SQL to build ELT pipelines on a data lakehouse.
Kylo
Kylo is a data lake management software platform and framework for enabling scalable, enterprise-class data lakes on big data technologies such as Teradata, Apache Spark, and Hadoop. Kylo is licensed under Apache 2.0 and was contributed by Teradata Inc.
Smart Data Lake
Smart automation tool for building modern data lakes and data pipelines.

Showing a sample of 42 resources.