└── README.md

/README.md:
--------------------------------------------------------------------------------
# Awesome Open Source Data Engineering [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of open source tools used in analytics platforms and the data engineering ecosystem
![Open Source Data Engineering Landscape 2025](https://github.com/user-attachments/assets/fe9e97a8-abd8-47a9-8429-15130055785c)

For more information about the 2025 landscape compiled above, please refer to the blog post published on [Pracdata.io](https://www.pracdata.io/p/open-source-data-engineering-landscape-2025)

## Table of contents
- [Storage Systems](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#storage-systems)
- [Data Lake Platform](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#data-lake-platform)
- [Data Integration](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#data-integration)
- [Data Processing & Computation](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#data-processing-and-computation)
- [Workflow Management & DataOps](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#workflow-management--dataops)
- [Data Infrastructure](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#data-infrastructure)
- [Metadata Management](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#metadata-management)
- [Analytics & Visualisation](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#analytics--visualisation)
- [ML/AI Platform](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#mlai-platform)

## STORAGE SYSTEMS

### Relational DBMS
- [PostgreSQL](https://github.com/postgres/postgres) - Advanced object-relational database management system
- [MySQL](https://github.com/mysql/mysql-server) - One of the most popular open source databases
- [MariaDB](https://github.com/MariaDB/server) - A popular MySQL server fork
- [Supabase](https://github.com/supabase/supabase) - An open source Firebase alternative
- [SQLite](https://github.com/sqlite/sqlite) - The most popular embedded database engine

### Distributed SQL DBMS
- [Citus](https://github.com/citusdata/citus) - A popular distributed PostgreSQL as an extension
- [CockroachDB](https://github.com/cockroachdb/cockroach) - A cloud-native distributed SQL database
- [YugabyteDB](https://github.com/yugabyte/yugabyte-db) - A cloud-native distributed SQL database
- [TiDB](https://github.com/pingcap/tidb) - A cloud-native, distributed, MySQL-compatible database
- [OceanBase](https://github.com/oceanbase/oceanbase) - A scalable distributed relational database
- [ShardingSphere](https://github.com/apache/shardingsphere) - A distributed SQL transaction & query engine
- [Neon](https://github.com/neondatabase/neon) - A serverless open-source alternative to AWS Aurora Postgres
- [CrateDB](https://github.com/crate/crate) - A distributed and scalable PostgreSQL-compatible SQL database

### Cache Store
- [Redis](https://github.com/redis/redis) - A popular key-value based cache store
- [Memcached](https://github.com/memcached/memcached) - A high-performance multithreaded key-value cache store
- [Dragonfly](https://github.com/dragonflydb/dragonfly) - A modern cache store compatible with Redis and Memcached APIs

### In-memory SQL Database
- [Apache Ignite](https://github.com/apache/ignite) - A distributed, ACID-compliant in-memory DBMS
- [ReadySet](https://github.com/readysettech/readyset) - A MySQL and
Postgres wire-compatible caching layer
- [VoltDB](https://github.com/voltdb/) - A distributed, horizontally-scalable, ACID-compliant database

### Document Store
- [MongoDB](https://github.com/mongodb/mongo) - A cross-platform, document-oriented NoSQL database
- [RavenDB](https://github.com/ravendb/ravendb) - An ACID NoSQL document database
- [RethinkDB](https://github.com/rethinkdb/rethinkdb) | ⚠️ Inactive | - A distributed document-oriented database for real-time applications
- [CouchDB](https://github.com/apache/couchdb) - A scalable document-oriented NoSQL database
- [Couchbase](https://github.com/couchbase) - A modern cloud-native NoSQL distributed database
- [FerretDB](https://github.com/FerretDB/FerretDB) - A truly open source MongoDB alternative
- [LowDB](https://github.com/typicode/lowdb) | ⚠️ Inactive | - A simple and fast JSON database

### NoSQL Multi-model
- [OrientDB](https://github.com/orientechnologies/orientdb) - A multi-model DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models
- [ArangoDB](https://github.com/arangodb/arangodb) - A multi-model database with flexible data models for documents, graphs, and key-values
- [SurrealDB](https://github.com/surrealdb/surrealdb) - A scalable, distributed, collaborative, document-graph database
- [EdgeDB](https://github.com/edgedb/edgedb) - A graph-relational database with declarative schema

### Graph Database
- [Neo4j](https://github.com/neo4j/neo4j) - A leading high-performance graph database
- [JanusGraph](https://github.com/JanusGraph/janusgraph) - A highly scalable distributed graph database
- [HugeGraph](https://github.com/apache/incubator-hugegraph) - A fast and highly scalable graph database
- [NebulaGraph](https://github.com/vesoft-inc/nebula) - A distributed, horizontally scalable, fast open-source graph database
- 
[Cayley](https://github.com/cayleygraph/cayley) | ⚠️ Inactive | - Inspired by the graph database behind Google's Knowledge Graph
- [Dgraph](https://github.com/dgraph-io/dgraph) - A horizontally scalable and distributed GraphQL database with a graph backend
- [Apache Age](https://github.com/apache/age) - A graph database as an extension to PostgreSQL
- [FalkorDB](https://github.com/FalkorDB/falkordb) - A graph database that uses GraphBLAS under the hood, tailored for LLMs

### Distributed Key-value Store
- [Riak](https://github.com/basho/riak) | ⚠️ Inactive | - A decentralized key-value datastore from Basho Technologies
- [FoundationDB](https://github.com/apple/foundationdb) - A distributed, transactional key-value store from Apple
- [etcd](https://github.com/etcd-io/etcd) - A distributed reliable key-value store written in Go
- [TiKV](https://github.com/tikv/tikv) - A distributed transactional key-value database, originally created to complement TiDB
- [Immudb](https://github.com/codenotary/immudb) - A database with built-in cryptographic proof and verification
- [Valkey](https://github.com/valkey-io/valkey) - A distributed key-value datastore forked from Redis
- [Apache Kvrocks](https://github.com/apache/kvrocks) - A distributed key-value database that uses RocksDB as storage engine

### Wide-column Key-value Store
- [Apache Cassandra](https://github.com/apache/cassandra) - A highly-scalable LSM-Tree based partitioned row store
- [Apache HBase](https://github.com/apache/hbase) - A distributed wide column-oriented store modeled after Google's Bigtable
- [Scylla](https://github.com/scylladb/scylladb) - An LSM-Tree based wide-column store, API-compatible with Apache Cassandra and Amazon DynamoDB
- [Apache Accumulo](https://github.com/apache/accumulo) - A distributed key-value store with scalable data storage and retrieval, on top of Hadoop

### Embedded Key-value Store
- 
[LevelDB](https://github.com/google/leveldb) | ⚠️ Inactive | - A fast key-value storage library written at Google
- [RocksDB](https://github.com/facebook/rocksdb) - An embeddable, persistent key-value store developed by Meta (Facebook)
- [MyRocks](https://github.com/facebook/mysql-5.6) - A RocksDB storage engine for MySQL
- [BadgerDB](https://github.com/dgraph-io/badger) - An embeddable, fast key-value database written in pure Go

### Search Engine
- [Apache Solr](https://github.com/apache/solr) - A fast distributed search database built on Apache Lucene
- [Elasticsearch](https://github.com/elastic/elasticsearch) - A distributed, RESTful search engine optimized for speed
- [Sphinx](https://github.com/sphinxsearch/sphinx) | ⚠️ Inactive | - A full-text search engine with fast indexing
- [Meilisearch](https://github.com/meilisearch/meilisearch) - A fast search API with great integration support
- [OpenSearch](https://github.com/opensearch-project/OpenSearch) - A community-driven, open source fork of Elasticsearch and Kibana
- [Quickwit](https://github.com/quickwit-oss/quickwit) - A fast cloud-native search engine for observability data
- [ParadeDB](https://github.com/paradedb/paradedb) - A search engine built on Postgres

### Streaming Database
- [RisingWave](https://github.com/risingwavelabs/risingwave) - A scalable Postgres for stream processing, analytics, and management
- [Materialize](https://github.com/MaterializeInc/materialize) - A real-time data warehouse purpose-built for operational workloads
- [EventStoreDB](https://github.com/EventStore/EventStore) - An event-native database designed for event sourcing and event-driven architectures
- [KsqlDB](https://github.com/confluentinc/ksql) - A database for building stream processing applications on top of Apache Kafka
- [Timeplus Proton](https://github.com/timeplus-io/proton) - A streaming SQL engine,
fast and lightweight, powered by ClickHouse
- [Fluss](https://github.com/alibaba/fluss) - A streaming storage serving as the real-time data layer for Lakehouse architectures

### Time-Series Database
- [InfluxDB](https://github.com/influxdata/influxdb) - A scalable datastore for metrics, events, and real-time analytics
- [TimescaleDB](https://github.com/timescale/timescaledb) - A fast ingest time-series SQL database packaged as a PostgreSQL extension
- [Apache IoTDB](https://github.com/apache/iotdb) - An Internet of Things database with seamless integration with the Hadoop and Spark ecosystems
- [Netflix Atlas](https://github.com/Netflix/atlas) - An in-memory dimensional time series database developed and open sourced by Netflix
- [QuestDB](https://github.com/questdb/questdb) - A time-series database for fast ingest and SQL queries
- [TDengine](https://github.com/taosdata/TDengine) - A high-performance, cloud native time-series database optimized for Internet of Things (IoT)
- [KairosDB](https://github.com/kairosdb/kairosdb) | ⚠️ Inactive | - A scalable time series database written in Java
- [GreptimeDB](https://github.com/GreptimeTeam/greptimedb) - A cloud-native, unified time series database for metrics, logs and events
- [HoraeDB](https://github.com/apache/horaedb) - A distributed, cloud native time-series database

### Columnar OLAP Database
- [Apache Kudu](https://github.com/apache/kudu) - A column-oriented data store for the Apache Hadoop ecosystem
- [Greenplum](https://github.com/greenplum-db/gpdb-archive) | ⛔️ Archived | - A column-oriented massively parallel PostgreSQL for analytics
- [MonetDB](https://github.com/MonetDB/MonetDB) - A high-performance columnar database originally developed by the CWI database research group
- [Databend](https://github.com/datafuselabs/databend) - An elastic, workload-aware cloud-native data warehouse built in Rust
- 
[ByConity](https://github.com/ByConity/ByConity) - A cloud-native data warehouse forked from ClickHouse
- [Hydra](https://github.com/hydradatabase/hydra) | ⚠️ Inactive | - A column-oriented Postgres extension

### Real-time OLAP Engine
- [ClickHouse](https://github.com/ClickHouse/ClickHouse) - A real-time column-oriented database originally developed at Yandex
- [Apache Pinot](https://github.com/apache/pinot) - A real-time distributed OLAP datastore open sourced by LinkedIn
- [Apache Druid](https://github.com/apache/druid) - A high performance real-time OLAP engine developed and open sourced by Metamarkets
- [Apache Kylin](https://github.com/apache/kylin) - A distributed OLAP engine designed to provide multi-dimensional analysis on Hadoop
- [Apache Doris](https://github.com/apache/doris) - A high-performance and real-time analytical database based on MPP architecture
- [StarRocks](https://github.com/StarRocks/StarRocks) - A sub-second OLAP database supporting multi-dimensional analytics (Linux Foundation project)

### In-process OLAP Engine
- [DuckDB](https://github.com/duckdb/duckdb) - An in-process SQL OLAP Database Management System
- [GlareDB](https://github.com/GlareDB/glaredb) - A SQL database for running analytics across distributed data
- [Apache DataFusion](https://github.com/apache/datafusion) - An extensible query engine with SQL and DataFrame APIs
- [chdb](https://github.com/chdb-io/chdb) - An in-process OLAP SQL Engine powered by ClickHouse
- [SlateDB](https://github.com/slatedb/slatedb) - A cloud-native embedded storage engine built on object storage

### OLAP Extensions
- [pg_duckdb](https://github.com/duckdb/pg_duckdb) - A Postgres extension that embeds DuckDB's analytics engine
- [pg_analytics](https://github.com/paradedb/pg_analytics) - A DuckDB-powered analytics extension for Postgres
- 
[pg_mooncake](https://github.com/Mooncake-Labs/pg_mooncake) - A columnar storage extension for Postgres based on DuckDB
- [pg_parquet](https://github.com/CrunchyData/pg_parquet) - A Postgres extension for reading and writing data lake Parquet files

## DATA LAKE PLATFORM

### Distributed File System
- [Apache Hadoop HDFS](https://github.com/apache/hadoop) - A highly scalable distributed block-based file system
- [GlusterFS](https://github.com/gluster/glusterfs) | ⚠️ Inactive | - A scalable distributed storage that can scale to several petabytes
- [JuiceFS](https://github.com/juicedata/juicefs) - A distributed POSIX file system built on top of Redis and S3
- [Lustre](https://github.com/lustre) - A distributed parallel file system purpose-built to provide a global POSIX-compliant namespace

### Distributed Object Store
- [Apache Ozone](https://github.com/apache/ozone) - A scalable, redundant, and distributed object store for Apache Hadoop
- [Ceph](https://github.com/ceph/ceph) - A distributed object, block, and file storage platform
- [MinIO](https://github.com/minio/minio) - A high performance object storage that is API-compatible with Amazon S3
- [Garage](https://git.deuxfleurs.fr/Deuxfleurs/garage) - An S3-compatible distributed object storage designed for self-hosting at a small-to-medium scale

### Serialisation Framework
- [Apache Parquet](https://github.com/apache/parquet-format) - An efficient columnar binary storage format that supports nested data
- [Apache Avro](https://github.com/apache/avro) - An efficient and fast row-based binary serialisation framework
- [Apache ORC](https://github.com/apache/orc) - A self-describing type-aware columnar file format designed for Hadoop
- [Lance](https://github.com/lancedb/lance) - A modern columnar data format for ML and LLMs implemented in Rust
- [Vortex](https://github.com/spiraldb/vortex) - A
highly extensible and fast columnar file format
- [Arrow Feather](https://github.com/apache/arrow) - A portable file format for storing Arrow tables or data frames

### Open Table Format
- [Apache Hudi](https://github.com/apache/hudi) - An open table format designed to support incremental data ingestion on cloud and Hadoop
- [Apache Iceberg](https://github.com/apache/iceberg) - A high-performance table format for large analytic tables developed at Netflix
- [Delta Lake](https://github.com/delta-io/delta) - A storage framework for building Lakehouse architecture developed by Databricks
- [Apache Paimon](https://github.com/apache/incubator-paimon) - An Apache incubating project to support high-speed streaming data ingestion
- [OpenHouse](https://github.com/linkedin/openhouse) - A declarative catalog with data services for open Data Lakehouse formats

### Native Open Table Format Library
- [Delta-rs](https://github.com/delta-io/delta-rs) - A native Rust library for Delta Lake, with bindings into Python
- [PyIceberg](https://github.com/apache/iceberg-python) - A native Python library for interacting with the Iceberg table format
- [Hudi-rs](https://github.com/apache/hudi-rs) - A native Rust library for Apache Hudi, with bindings into Python

### Universal Lakehouse
- [Apache XTable](https://github.com/apache/incubator-xtable) - A unified framework supporting interoperability across multiple open-source table formats
- [Apache Amoro](https://github.com/apache/amoro) - A Lakehouse management system built on open data lake formats

## DATA INTEGRATION

### Data Integration Platform
- [Airbyte](https://github.com/airbytehq/airbyte) - A data integration platform for ETL / ELT data pipelines with a wide range of connectors
- [Apache Nifi](https://github.com/apache/nifi) - A reliable, scalable low-code data integration platform with good enterprise
support
- [Apache Camel](https://github.com/apache/camel) - An embeddable integration framework supporting many enterprise integration patterns
- [Apache Gobblin](https://github.com/apache/gobblin) - A distributed data integration framework built by LinkedIn supporting both streaming and batch data
- [Apache InLong](https://github.com/apache/Inlong) - An integration framework for supporting massive data, originally built at Tencent
- [Meltano](https://github.com/meltano/meltano) - A declarative code-first data integration engine
- [Apache SeaTunnel](https://github.com/apache/seatunnel) - A high-performance, distributed data integration tool supporting various ingestion patterns
- [Estuary Flow](https://github.com/estuary/flow) - A real-time ETL and data pipeline platform for quick data integration
- [dlt](https://github.com/dlt-hub/dlt) - A lightweight data integration library for Python-first data platforms

### CDC Tool
- [Debezium](https://github.com/debezium/debezium) - A change data capture framework supporting a variety of databases
- [Kafka Connect](https://github.com/apache/kafka) - A streaming data integration framework and runtime on top of Apache Kafka supporting CDC
- [Redpanda Connect](https://github.com/redpanda-data/connect) - A data streaming and integration framework on top of Redpanda
- [Flink CDC](https://github.com/apache/flink-cdc) - CDC connectors for the Apache Flink engine supporting different databases
- [Brooklin](https://github.com/linkedin/brooklin) | ⚠️ Inactive | - A distributed platform for streaming data between various heterogeneous source and destination systems
- [RudderStack](https://github.com/rudderlabs/rudder-server) - A headless Customer Data Platform to build data pipelines, an open alternative to Segment
- [Artie Transfer](https://github.com/artie-labs/transfer) - A real-time CDC replication solution between OLTP and OLAP databases
- 
[Dozer](https://github.com/getdozer/dozer) - A real-time CDC based data integration tool between various sources and sinks
- [PeerDB](https://github.com/PeerDB-io/peerdb) - A CDC tool to replicate data from Postgres to data warehouses, queues and other storage

### Data Migration
- [DBmate](https://github.com/amacneil/dbmate) - A lightweight, framework-agnostic database migration tool
- [Ingestr](https://github.com/bruin-data/ingestr) - A CLI tool to copy data between any databases with a single command
- [Sling](https://github.com/slingdata-io/sling-cli) - A CLI tool to transfer data from a source to target storage/database

### Log & Event Collection
- [CloudQuery](https://github.com/cloudquery/cloudquery) - An ETL tool for syncing data from cloud APIs to a variety of supported destinations
- [Snowplow](https://github.com/snowplow/snowplow) | ⚠️ Inactive | - A cloud-native engine for collecting behavioral data and loading it into various cloud storage systems
- [EventMesh](https://github.com/apache/eventmesh) - A serverless event middleware for collecting and loading event data into various targets
- [Apache Flume](https://github.com/apache/flume) | ⚠️ Inactive | - A scalable distributed log aggregation service
- [Steampipe](https://github.com/turbot/steampipe) - A zero-ETL solution for getting data directly from APIs and services
- [Jitsu](https://github.com/jitsucom/jitsu) - A fully-scriptable data ingestion engine for collecting event data

### Event Hub
- [Apache Kafka](https://github.com/apache/kafka) - A highly scalable distributed event store and streaming platform
- [NSQ](https://github.com/nsqio/nsq) - A realtime distributed messaging platform designed to operate at scale
- [Apache Pulsar](https://github.com/apache/pulsar) - A scalable distributed pub-sub messaging system
- [Apache RocketMQ](https://github.com/apache/rocketmq) - A cloud native
messaging and streaming platform
- [Redpanda](https://github.com/redpanda-data/redpanda) - A high performance Kafka API compatible streaming data platform
- [Memphis](https://github.com/memphisdev/memphis) | ⚠️ Inactive | - A scalable data streaming platform for building event-driven applications
- [AutoMQ](https://github.com/AutoMQ/automq) - A cloud-first alternative to Kafka using S3 as the main storage layer

### Reverse ETL
- [Multiwoven](https://github.com/Multiwoven/multiwoven) - A Reverse ETL open source alternative to Hightouch and RudderStack

## DATA PROCESSING AND COMPUTATION

### Unified Processing
- [Apache Beam](https://github.com/apache/beam) - A unified programming model supporting execution on popular distributed processing backends
- [Apache Spark](https://github.com/apache/spark) - A unified analytics engine for large-scale data processing
- [Dinky](https://github.com/DataLinkDC/dinky) - A unified streaming & batch computation platform based on Apache Flink
- [Feldera](https://github.com/feldera/feldera) - A unified incremental computation engine

### Batch Processing
- [Hadoop MapReduce](https://github.com/apache/hadoop) - A highly scalable distributed batch processing framework from the Apache Hadoop project
- [Apache Tez](https://github.com/apache/tez) - A distributed data processing pipeline built for Apache Hive and Hadoop

### Stream Processing
- [Apache Flink](https://github.com/apache/flink) - A scalable high throughput stream processing framework
- [Apache Samza](https://github.com/apache/samza) - A distributed stream processing framework which uses Kafka and Hadoop, originally developed by LinkedIn
- [Apache Storm](https://github.com/apache/storm) - A distributed realtime computation system based on Actor Model framework
- [Akka](https://github.com/akka/akka) - A highly concurrent, distributed,
message-driven processing system based on the Actor Model
- [Bytewax](https://github.com/bytewax/bytewax) - A Python stream processing framework with a Rust distributed processing engine
- [Timeplus Proton](https://github.com/timeplus-io/proton) - A streaming SQL engine, fast and lightweight, powered by ClickHouse
- [FastStream](https://github.com/airtai/faststream) - A Python framework for interacting with event streams such as Apache Kafka
- [Bento](https://github.com/warpstreamlabs/bento) - A stream processing engine from WarpStream Labs (forked from Benthos)
- [Fluvio](https://github.com/infinyon/fluvio) - A lean distributed stream processing system written in Rust and WebAssembly
- [Arroyo](https://github.com/ArroyoSystems/arroyo) - A distributed stream processing engine written in Rust

### Python Processing Framework
- [Polars](https://github.com/pola-rs/polars) - A multithreaded DataFrame library with a vectorized query engine, written in Rust
- [PySpark](https://github.com/apache/spark) - An interface for Apache Spark in Python
- [Vaex](https://github.com/vaexio/vaex) - A high performance Python library for big tabular datasets.
- [Apache Arrow](https://github.com/apache/arrow) - An efficient in-memory data format
- [Ibis](https://github.com/ibis-project/ibis) - A portable Python dataframe library supporting many engine backends
- [SQLFrame](https://github.com/eakmanrq/sqlframe) - A Spark DataFrame API compatible library for data transformation
- [Daft](https://github.com/Eventual-Inc/Daft) - A distributed query engine for large-scale data processing using Python or SQL
- [cuDF](https://github.com/rapidsai/cudf) - A GPU-accelerated DataFrame library with a pandas-like API

### Python Workflow Scaling
- [Dask](https://github.com/dask/dask) - A flexible parallel computing library with task scheduling
- [Ray](https://github.com/ray-project/ray) - A unified framework with distributed runtime for scaling Python applications
- [Modin](https://github.com/modin-project/modin) - A library for scaling Pandas workflows to multi-threaded execution
- [Pandaral·lel](https://github.com/nalepae/pandarallel) | ⚠️ Inactive | - A library to parallelize Pandas operations on all available CPUs

### SQL Toolkit
- [SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy) - A Python SQL toolkit and Object Relational Mapper
- [SQLGlot](https://github.com/tobymao/sqlglot) - A Python SQL parser and transpiler

## WORKFLOW MANAGEMENT & DATAOPS

### Workflow Orchestration
- [Apache Airflow](https://github.com/apache/airflow) - A platform for creating and scheduling workflows as directed acyclic graphs (DAGs) of tasks
- [Prefect](https://github.com/PrefectHQ/prefect) - A Python-based workflow orchestration tool
- [Argo](https://github.com/argoproj/argo-workflows) - A container-native workflow engine for orchestrating parallel jobs on Kubernetes
- [Azkaban](https://github.com/azkaban/azkaban) | ⚠️ Inactive | - A batch workflow job scheduler created at LinkedIn to run Hadoop jobs
- 
[Cadence](https://github.com/uber/cadence) - A distributed, scalable, highly available orchestration engine supporting client libraries in different languages
- [Dagster](https://github.com/dagster-io/dagster) - A cloud-native data pipeline orchestrator written in Python
- [Apache DolphinScheduler](https://github.com/apache/dolphinscheduler) - A low-code high performance workflow orchestration platform
- [Luigi](https://github.com/spotify/luigi) - A Python library for building complex pipelines of batch jobs
- [Flyte](https://github.com/flyteorg/flyte) - A scalable and flexible workflow orchestration platform for both data and ML workloads
- [Kestra](https://github.com/kestra-io/kestra) - A declarative language-agnostic workflow orchestration and scheduling platform
- [Mage.ai](https://github.com/mage-ai/mage-ai) - A platform for integrating, scheduling and managing data pipelines
- [Temporal](https://github.com/temporalio/temporal) - A resilient workflow management system, originated as a fork of Uber's Cadence
- [Windmill](https://github.com/windmill-labs/windmill) - A fast workflow engine, and open-source alternative to Airplane and Retool
- [Maestro](https://github.com/Netflix/maestro) - A general-purpose workflow orchestrator developed by Netflix

### Job Scheduling
- [Celery](https://github.com/celery/celery) - A distributed task queue system for Python
- [DKron](https://github.com/distribworks/dkron) - A distributed, fault tolerant job scheduling system
- [APScheduler](https://github.com/agronholm/apscheduler/) - An advanced task scheduler and task queue system for Python

### Data Quality
- [Data-diff](https://github.com/datafold/data-diff) | ⛔️ Archived | - A tool for comparing tables within or across databases
- [Great Expectations](https://github.com/great-expectations/great_expectations) - A data validation and profiling tool written in Python
- 
[Deequ](https://github.com/awslabs/deequ) - A library based on Apache Spark for measuring data quality in large datasets
- [Pandera](https://github.com/unionai-oss/pandera) - A light-weight, flexible, and expressive statistical data testing library
- [Soda](https://github.com/sodadata/soda-core) - A CLI tool and Python library for data quality testing
- [Pydantic](https://github.com/pydantic/pydantic) - A data validation library using Python type hints

### Data Versioning
- [LakeFS](https://github.com/treeverse/lakeFS) - A version control system for data stored in data lakes
- [Project Nessie](https://github.com/projectnessie/nessie) - A transactional catalog for data lakes with Git-like semantics
- [DVC](https://github.com/iterative/dvc) - A data version control tool for data and ML experiments
- [Dolt](https://github.com/dolthub/dolt) - A Git for data tool
- [Git-lfs](https://github.com/git-lfs/git-lfs) - A Git extension for versioning large files
- [Datachain](https://github.com/iterative/datachain) - A Python-based framework for versioning unstructured data

### Data Modeling
- [dbt](https://github.com/dbt-labs/dbt-core) - A data modeling and transformation tool for data pipelines
- [SQLMesh](https://github.com/TobikoData/sqlmesh) - A data transformation and modeling framework that is backwards compatible with dbt

### Pipeline Observability
- [Elementary](https://github.com/elementary-data/elementary) - A dbt-native data observability solution to monitor data pipelines

## DATA INFRASTRUCTURE

### Resource Scheduling
- [Apache YARN](https://github.com/apache/hadoop) - The default resource scheduler for Apache Hadoop clusters
- [Apache Mesos](https://github.com/apache/mesos) - A resource scheduling and cluster resource abstraction framework developed by Ph.D.
students at UC Berkeley
- [Kubernetes](https://github.com/kubernetes/kubernetes) - A production-grade container scheduling and management tool
- [Apache YuniKorn](https://github.com/apache/yunikorn-core) - A light-weight, universal resource scheduler for container orchestrator systems
- [Docker](https://github.com/docker) - The popular OS-level virtualization and containerization software

### Cluster Administration
- [Apache Ambari](https://github.com/apache/ambari) - A tool for provisioning, managing, and monitoring of Apache Hadoop clusters
- [Apache Helix](https://github.com/apache/helix) - A generic cluster management framework developed at LinkedIn

### Security
- [Apache Knox](https://github.com/apache/knox) - A gateway and SSO service for managing access to Hadoop clusters
- [Apache Ranger](https://github.com/apache/ranger) - A security and governance platform for Hadoop and other popular services
- [Kerberos](https://github.com/krb5/krb5) - A popular enterprise network authentication protocol

### Metrics Store
- [InfluxDB](https://github.com/influxdata/influxdb) - A scalable datastore for metrics and events
- [Mimir](https://github.com/grafana/mimir) - A scalable long-term metrics storage for Prometheus, developed by Grafana Labs
- [OpenTSDB](https://github.com/OpenTSDB/opentsdb) - A distributed, scalable time series database written on top of Apache HBase
- [M3](https://github.com/m3db/m3) - A distributed TSDB and metrics storage and aggregator

### Observability Framework
- [Prometheus](https://github.com/prometheus/prometheus) - A popular metric collection and management tool
- [ELK](https://www.elastic.co/elastic-stack) - A popular observability stack comprising Elasticsearch, Kibana, Beats, and Logstash
- [Graphite](https://github.com/graphite-project) - An established infrastructure monitoring and observability
system 358 | - [OpenTelemetry](https://github.com/open-telemetry) - A collection of APIs, SDKs, and tools for collecting and exporting telemetry data (metrics, logs, and traces) 359 | - [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics/) - A scalable monitoring solution with a time series database 360 | - [Zabbix](https://github.com/zabbix/zabbix) - A real-time infrastructure and application monitoring service 361 | 362 | ### Monitoring Dashboard 363 | - [Grafana](https://github.com/grafana/grafana) - A popular open and composable observability and data visualization platform 364 | - [Kibana](https://github.com/elastic/kibana) - The visualisation and search dashboard for Elasticsearch 365 | - [Redpanda Console](https://github.com/redpanda-data/console) - A UI for monitoring and managing Apache Kafka and Redpanda workloads 366 | 367 | ### Log & Metrics Pipeline 368 | - [Fluentd](https://github.com/fluent/fluentd) - A log and metric collection, buffering, and routing service 369 | - [Fluent Bit](https://github.com/fluent/fluent-bit) - A fast log processor and forwarder, and part of the Fluentd ecosystem 370 | - [Logstash](https://github.com/elastic/logstash) - A server-side log and metric transport and processor, as part of the ELK stack 371 | - [Telegraf](https://github.com/influxdata/telegraf) - A plugin-driven server agent for collecting & reporting metrics developed by InfluxData 372 | - [Vector](https://github.com/vectordotdev/vector) - A high-performance, end-to-end (agent & aggregator) observability data pipeline 373 | - [StatsD](https://github.com/statsd/statsd) | ⚠️ Inactive | - A network daemon for collection, aggregation and routing of metrics 374 | 375 | ### Cost Management 376 | - [OpenCost](https://github.com/opencost/opencost) - Cost monitoring for Kubernetes workloads and cloud costs 377 | 378 | ## METADATA MANAGEMENT 379 | 380 | ### Metadata Platform 381 | - [Amundsen](https://github.com/amundsen-io/amundsen) - A data discovery and metadata engine developed by Lyft engineers
382 | - [Apache Atlas](https://github.com/apache/atlas) - A data governance and metadata management platform for the Apache Hadoop ecosystem 383 | - [DataHub](https://github.com/datahub-project/datahub) - A metadata platform for the modern data stack, originally developed at LinkedIn 384 | - [Marquez](https://github.com/MarquezProject/marquez) - A metadata service for the collection, aggregation, and visualization of metadata 385 | - [ckan](https://github.com/ckan/ckan) - A data management system for cataloging, managing and accessing data 386 | - [Open Metadata](https://github.com/open-metadata/OpenMetadata) - A unified platform for discovery and governance, using a central metadata repository 387 | - [ODD Platform](https://github.com/opendatadiscovery/odd-platform) - A data discovery and observability platform 388 | 389 | ### Open Standards 390 | - [Open Lineage](https://github.com/OpenLineage/OpenLineage) - An open standard for lineage metadata collection 391 | - [Open Metadata](https://github.com/open-metadata/OpenMetadata) - A unified metadata platform providing open standards for managing metadata 392 | - [Egeria](https://github.com/odpi/egeria) - Open metadata and governance standards to facilitate metadata exchange 393 | 394 | ### Schema & Catalog Service 395 | - [Hive Metastore](https://github.com/apache/hive) - A popular schema management and metastore service as part of the Apache Hive project 396 | - [Confluent Schema Registry](https://github.com/confluentinc/schema-registry) - A schema registry for Kafka, developed by Confluent 397 | - [Apache Polaris](https://github.com/apache/polaris) - An interoperable, open source catalog for Apache Iceberg 398 | - [Unity Catalog](https://github.com/unitycatalog/unitycatalog) - A universal catalog for Data Lakehouse formats and other data/AI assets 399 | - [Lakekeeper](https://github.com/lakekeeper/lakekeeper) - A Rust-native Apache Iceberg REST Catalog 400 | - [Apache Gravitino](https://github.com/apache/gravitino) - A geo-distributed and federated
open data catalog 401 | 402 | 403 | ## ANALYTICS & VISUALISATION 404 | 405 | ### BI & Dashboard 406 | - [Apache Superset](https://github.com/apache/superset) - A popular open source data visualization and data exploration platform 407 | - [Metabase](https://github.com/metabase/metabase) - A simple data visualisation and exploration dashboard 408 | - [Redash](https://github.com/getredash/redash) - A tool to explore, query, visualize, and share data with many data source connectors 409 | - [Lightdash](https://github.com/lightdash/lightdash) - A self-service BI tool that turns a dbt project into a full-stack BI platform 410 | 411 | ### BI as Code (Web App) 412 | - [Streamlit](https://github.com/streamlit/streamlit) - A Python tool to package and share data as web apps 413 | - [Evidence](https://github.com/evidence-dev/evidence) - A tool to build interactive data visualizations in pure SQL and markdown 414 | - [dash](https://github.com/plotly/dash) - A Python framework for building ML & data science web apps 415 | - [Vizro](https://github.com/mckinsey/vizro) - A toolkit for creating modular data visualization applications 416 | - [Mercury](https://github.com/mljar/mercury) - A tool to convert Jupyter Notebooks to web apps 417 | - [Quary](https://github.com/quarylabs/quary) - A code-based BI solution 418 | 419 | ### Query & Collaboration 420 | - [Hue](https://github.com/cloudera/hue) - A query and data exploration tool with Hadoop ecosystem support, developed by Cloudera 421 | - [Apache Zeppelin](https://github.com/apache/zeppelin) - A web-based notebook for interactive data analytics and collaboration for Hadoop 422 | - [Querybook](https://github.com/pinterest/querybook) - A simple query and notebook UI developed by Pinterest 423 | - [Jupyter](https://github.com/jupyter/notebook) - A popular interactive web-based notebook application 424 | - [IPython](https://github.com/ipython/ipython) - An enhanced interactive Python shell for data analysis 425 | -
[Datasette](https://github.com/simonw/datasette) - A tool for exploring and publishing data 426 | 427 | ### MPP Query Engine 428 | - [Apache Hive](https://github.com/apache/hive) - A data warehousing and MPP engine on top of Hadoop 429 | - [Apache Impala](https://github.com/apache/impala) - An MPP engine mainly for Hadoop clusters, developed by Cloudera 430 | - [Presto](https://github.com/prestodb/presto) - A distributed SQL query engine for big data 431 | - [Trino](https://github.com/trinodb/trino) - The former PrestoSQL distributed SQL query engine 432 | - [Apache Drill](https://github.com/apache/drill) - A distributed MPP query engine against NoSQL and Hadoop data storage systems 433 | - [DataFusion Ballista](https://github.com/apache/datafusion-ballista) - A distributed query execution engine based on Apache DataFusion 434 | 435 | ### Semantic & Middleware Layer 436 | - [Alluxio](https://github.com/Alluxio/alluxio) - A data orchestration and virtual distributed storage system 437 | - [Cube](https://github.com/cube-js/cube) - A semantic layer for building data applications supporting popular database engines 438 | - [Apache Linkis](https://github.com/apache/linkis) - A computation middleware to facilitate connection and orchestration between applications and data engines 439 | - [Apache Gluten](https://github.com/apache/incubator-gluten) - A middle layer for offloading JVM-based SQL engines execution to native engines 440 | - [Apache OpenDAL](https://github.com/apache/opendal) - An open data access layer that enables seamless interaction with diverse storage services 441 | 442 | ### Data Sharing 443 | - [delta-sharing](https://github.com/delta-io/delta-sharing) - An open protocol for secure real-time exchange of large datasets 444 | 445 | 446 | ## ML/AI PLATFORM 447 | 448 | ### Vector Storage 449 | - [milvus](https://github.com/milvus-io/milvus) - A cloud-native vector database and storage for AI applications 450 | - [qdrant](https://github.com/qdrant/qdrant) - A
high-performance, scalable vector database for AI 451 | - [chroma](https://github.com/chroma-core/chroma) - An AI-native embedding database for building LLM apps 452 | - [marqo](https://github.com/marqo-ai/marqo) - An end-to-end vector search engine for both text and images 453 | - [LanceDB](https://github.com/lancedb/lancedb) - A serverless vector database for AI applications written in Rust 454 | - [weaviate](https://github.com/weaviate/weaviate) - A scalable, cloud-native vector database supporting storage of both objects and vectors 455 | - [deeplake](https://github.com/activeloopai/deeplake) - An AI database with a storage format optimized for deep-learning applications 456 | - [Vespa](https://github.com/vespa-engine/vespa) - A storage to organize vectors, tensors, text and structured data 457 | - [vald](https://github.com/vdaas/vald) - A scalable distributed approximate nearest neighbor (ANN) dense vector search engine 458 | - [pgvector](https://github.com/pgvector/pgvector) - Vector similarity search as a Postgres extension 459 | 460 | ### MLOps 461 | - [mlflow](https://github.com/mlflow/mlflow) - A platform to streamline machine learning development and lifecycle management 462 | - [Metaflow](https://github.com/Netflix/metaflow) - A tool to build and manage ML/AI and data science projects, developed at Netflix 463 | - [SkyPilot](https://github.com/skypilot-org/skypilot) - A framework for running LLMs, AI, and batch jobs on any cloud 464 | - [Jina](https://github.com/jina-ai/jina) - A tool to build multimodal AI applications with cloud-native stack 465 | - [NNI](https://github.com/microsoft/nni) | ⛔️ Archived | - An AutoML toolkit for automating the machine learning lifecycle, from Microsoft 466 | - [BentoML](https://github.com/bentoml/BentoML) - A framework for building reliable and scalable AI applications 467 | - [Determined AI](https://github.com/determined-ai/determined) - An ML platform that simplifies distributed training, tuning and experiment tracking 468 | -
[Ray](https://github.com/ray-project/ray) - A unified framework for scaling AI and Python applications 469 | - [kubeflow](https://github.com/kubeflow/kubeflow) - A cloud-native platform for ML operations - pipelines, training and deployment 470 | - [Kedro](https://github.com/kedro-org/kedro) - A toolbox and framework for building production-ready data science and ML workflows 471 | - [Pachyderm](https://github.com/pachyderm/pachyderm) - A scalable ML and Data Science data processing workflow management platform 472 | 473 | ### LLMOps 474 | - [Dify](https://github.com/langgenius/dify) - LLM development platform with AI workflow, RAG pipeline and model management 475 | - [Haystack](https://github.com/deepset-ai/haystack) - AI orchestration framework to build customizable, production-ready LLM applications 476 | - [Superduper](https://github.com/superduper-io/superduper) - A Python-based framework for building AI-data workflows and applications 477 | - [Cognee](https://github.com/topoteretes/cognee) - LLM Memory Engine for implementing LLM Workflows 478 | - [vLLM](https://github.com/vllm-project/vllm) - A high-throughput and memory-efficient inference and serving engine for LLMs 479 | 480 | --------------------------------------------------------------------------------