data-engineering-collection
github.com/exajobs/data-engineering-collection
A collection of awesome software, libraries, learning tutorials, documents, books, resources, and interesting stuff about Big Data Science & Engineering.
What's inside
Interesting Papers
- 2003 (2001 - 2010)
The Google File System.
- 2004 (2001 - 2010)
MapReduce: Simplified Data Processing on Large Clusters.
- 2006 (2001 - 2010)
Bigtable: A Distributed Storage System for Structured Data.
- 2006 (2001 - 2010)
The Chubby lock service for loosely-coupled distributed systems.
- 2007 (2001 - 2010)
Dynamo: Amazon’s Highly Available Key-value Store.
- 2008 (2001 - 2010)
Chukwa: A large-scale monitoring system.
Internet of things and sensor data
- 2lemetry
Platform for the Internet of Things.
- Ably
Pub/sub messaging platform for IoT.
- Apache Edgent (Incubating)
a programming model and micro-kernel style runtime that can be embedded in gateways and small-footprint edge devices, enabling local, real-time analytics on those devices.
- Azure IoT Hub
Cloud-based bi-directional monitoring and messaging hub.
Applications
- 411
a web application for managing alerts generated by scheduled searches in Elasticsearch.
- Adobe spindle
Next-generation web analytics processing with Scala, Spark, and Parquet.
- Apache Metron
a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
- Apache Nutch
open source web crawler.
- Apache OODT
capturing, processing and sharing of data for NASA's scientific archives.
- Apache Tika
content analysis toolkit.
Embedded Databases
- Actian PSQL
ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
- BerkeleyDB
a software library that provides a high-performance embedded database for key/value data.
SQL-like processing
- Actian SQL for Hadoop
high performance interactive SQL access to all Hadoop data.
- Apache Calcite
framework that allows efficient translation of queries involving heterogeneous and federated data.
- Apache Drill
framework for interactive analysis, inspired by Dremel.
- Apache HCatalog
table and storage management layer for Hadoop.
- Apache Hive
SQL-like data warehouse system for Hadoop.
- Apache Phoenix
SQL skin over HBase.
Resources
- Actian Vector
column-oriented analytic database.
- Actian Versant
commercial object-oriented database management system.
- ActorDB
a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
- AddThis Hydra
distributed data processing and storage system originally developed at AddThis.
- Alluxio
reliable file sharing at memory speed across cluster frameworks.
- Amazon Redshift
Amazon's cloud data warehouse offering, based on a columnar datastore backend.
Key-value Data Model
- Aerospike
flash-optimized, in-memory NoSQL database. Open source, with "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
- Bolt
an embedded key-value database for Go.
- BTDB
key-value database in .NET with an object DB layer, RPC, dynamic IL, and much more.
- BuntDB
a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
Graph Data Model
- AgensGraph
a new generation multi-model graph database for the modern complex data environment.
- Apache Giraph
implementation of Pregel, based on Hadoop.
- Apache Spark Bagel
implementation of Pregel, part of Spark.
Showing a sample of 637 resources.