awesome-hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources

1.1k GitHub Stars

163 Curated Resources

akela
Mozilla's utility library for Hadoop, HBase, Pig, etc.
Apache DataFu
A collection of libraries for working with large-scale data in Hadoop
Apache Pig
Apache Pig
Lipstick
Pig workflow visualization tool.
packetpig
Open Source Big Data Security Analytics
PigPen
PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

ankush
A big data cluster management tool that creates and manages clusters of different technologies.
Apache Ambari
Apache Ambari
Apache Bigtop
Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
Apache Curator
ZooKeeper client wrapper and rich ZooKeeper framework
Apache ZooKeeper
Apache ZooKeeper
Ganglia Monitoring System

Apache Accumulo
The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
Apache Cassandra
Apache HBase
Apache HBase
Apache Phoenix
A SQL skin over HBase supporting secondary indices
Haeinsa
Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase.
Hannibal
Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

Apache Airflow
Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
Apache Falcon
Data management and processing platform
Apache NiFi
A dataflow system
Apache Oozie
Apache Oozie
Azkaban
Luigi
Python package that helps you build complex pipelines of batch jobs

Apache Apex (incubating)
Enterprise-grade unified stream and batch processing engine.
Apache Crunch
Apache Flink
Apache Flink is a platform for efficient, distributed, general-purpose data processing.
Apache Livy (incubating)
Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications that require fine-grained interaction with many Spark contexts can be built on top of Apache Spark.
Apache Spark
Cascading
Cascading is the proven application development platform for building data applications on Hadoop.

Apache Atlas
Metadata tagging and lineage capture supporting complex business data taxonomies
Apache Calcite
A Dynamic Data Management Framework
Apache Kudu
Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.
Confluent Schema Registry for Kafka
Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.
Hortonworks Schema Registry
Schema Registry is a framework to build metadata repositories.

Apache Avro
Apache Avro is a data serialization system.
Apache Parquet
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
Apache Superset (incubating)
Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application
Apache Thrift
Apache Zeppelin
A web-based notebook that enables interactive data analytics
gohadoop
Native Go clients for Apache Hadoop YARN.

Showing a sample of 163 resources. The full list is on GitHub.