A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources

1.1k GitHub Stars | 163 Curated Resources | 22 Categories | Last Refreshed 5 hours ago
Hadoop | YARN | NoSQL | SQL on Hadoop | Data Management | Workflow, Lifecycle and Governance | Data Ingestion and Integration | DSL | Libraries and Tools | Realtime Data Processing | Distributed Computing and Programming | Packaging, Provisioning and Monitoring | Search | Search Engine Framework | Security | Benchmark | Machine learning and Big Data analytics | Misc. | Websites | Presentations | Books | Hadoop and Big Data Events

What's inside

DSL

  • akela

    Mozilla's utility library for Hadoop, HBase, Pig, etc.

  • Apache DataFu

    A collection of libraries for working with large-scale data in Hadoop

  • Apache Pig

  • Lipstick

    Pig workflow visualization tool.

  • packetpig

    Open Source Big Data Security Analytics

  • PigPen

    PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Packaging, Provisioning and Monitoring

NoSQL

  • Apache Accumulo

    The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.

  • Apache Cassandra

  • Apache HBase

  • Apache Phoenix

    A SQL skin over HBase supporting secondary indices

  • Haeinsa

    Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase.

  • Hannibal

    Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

Workflow, Lifecycle and Governance

Distributed Computing and Programming

  • Apache Apex (incubating)

    Enterprise-grade unified stream and batch processing engine.

  • Apache Crunch

  • Apache Flink

    Apache Flink is a platform for efficient, distributed, general-purpose data processing.

  • Apache Livy (incubating)

    Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications that require fine-grained interaction with many Spark contexts can be built on top of Apache Spark.

  • Apache Spark

  • Cascading

    Cascading is the proven application development platform for building data applications on Hadoop.

Data Management

  • Apache Atlas

    Metadata tagging & lineage capture supporting complex business data taxonomies

  • Apache Calcite

    A Dynamic Data Management Framework

  • Apache Kudu

    Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.

  • Confluent Schema registry for Kafka

    Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.

  • Hortonworks Schema Registry

    Schema Registry is a framework to build metadata repositories.

Libraries and Tools

  • Apache Avro

    Apache Avro is a data serialization system.

  • Apache Parquet

    Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

  • Apache Superset (incubating)

    Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

  • Apache Thrift

  • Apache Zeppelin

    A web-based notebook that enables interactive data analytics

  • gohadoop

    Native Go clients for Apache Hadoop YARN.

Showing a sample of the 163 resources. The full list is on GitHub.