awesome-hadoop
github.com/youngwookim/awesome-hadoop
A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources
What's inside
DSL
- akela
Mozilla's utility library for Hadoop, HBase, Pig, etc.
- Apache DataFu
A collection of libraries for working with large-scale data in Hadoop
- Apache Pig
Apache Pig
- Lipstick
Pig workflow visualization tool.
- packetpig
Open Source Big Data Security Analytics
- PigPen
PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
Packaging, Provisioning and Monitoring
- ankush
A big data cluster management tool that creates and manages clusters of different technologies.
- Apache Ambari
Apache Ambari
- Apache Bigtop
Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
- Apache Curator
ZooKeeper client wrapper and rich ZooKeeper framework
- Apache ZooKeeper
Apache ZooKeeper
- Ganglia Monitoring System
NoSQL
- Apache Accumulo
The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
- Apache Cassandra
- Apache HBase
Apache HBase
- Apache Phoenix
A SQL skin over HBase supporting secondary indices
- Haeinsa
Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
- Hannibal
Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
Workflow, Lifecycle and Governance
- Apache Airflow
Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
- Apache Falcon
Data management and processing platform
- Apache NiFi
A dataflow system
- Apache Oozie
Apache Oozie
- Azkaban
- Luigi
Python package that helps you build complex pipelines of batch jobs
Distributed Computing and Programming
- Apache Apex (incubating)
Enterprise-grade unified stream and batch processing engine.
- Apache Crunch
- Apache Flink
Apache Flink is a platform for efficient, distributed, general-purpose data processing.
- Apache Livy (incubating)
Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.
- Apache Spark
- Cascading
Cascading is the proven application development platform for building data applications on Hadoop.
Data Management
- Apache Atlas
Metadata tagging & lineage capture supporting complex business data taxonomies
- Apache Calcite
A Dynamic Data Management Framework
- Apache Kudu
Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.
- Confluent Schema Registry for Kafka
Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.
- Hortonworks Schema Registry
Schema Registry is a framework to build metadata repositories.
Libraries and Tools
- Apache Avro
Apache Avro is a data serialization system.
- Apache Parquet
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
- Apache Superset (incubating)
Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application
- Apache Thrift
- Apache Zeppelin
A web-based notebook that enables interactive data analytics
- gohadoop
Native Go clients for Apache Hadoop YARN.
Hadoop and Big Data Events
Showing a sample of 163 resources. The full list is on GitHub.