└── README.md


/README.md:
--------------------------------------------------------------------------------
  1 | # Awesome Open Source Data Engineering [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
  2 | A curated list of open source tools used in analytics platforms and the data engineering ecosystem
  3 | ![Open Source Data Engineering Landscape 2025](https://github.com/user-attachments/assets/fe9e97a8-abd8-47a9-8429-15130055785c)
  4 | 
  5 | For more information about the above compiled landscape for 2025, please refer to the published blog post on [Pracdata.io](https://www.pracdata.io/p/open-source-data-engineering-landscape-2025)
  6 | 
  7 | ## Table of contents
  8 | - [Storage Systems](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#storage-systems)
  9 | - [Data Lake Platform](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#data-lake-platform)
 10 | - [Data Integration](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#data-integration)
 11 | - [Data Processing & Computation](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#data-processing-and-computation)
 12 | - [Workflow Management & DataOps](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#workflow-management--dataops)
 13 | - [Data Infrastructure](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#data-infrastructure)
 14 | - [Metadata Management](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#metadata-management)
 15 | - [Analytics & Visualisation](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#analytics--visualisation)
 16 | - [ML/AI Platform](https://github.com/pracdata/awesome-open-source-data-engineering?tab=readme-ov-file#mlai-platform)
 17 | 
 18 | ## STORAGE SYSTEMS
 19 | 
 20 | ### Relational DBMS
 21 | - [PostgreSQL](https://github.com/postgres/postgres) - Advanced object-relational database management system
 22 | - [MySQL](https://github.com/mysql/mysql-server) - One of the most popular open source databases
 23 | - [MariaDB](https://github.com/MariaDB/server) - A popular MySQL server fork
 24 | - [Supabase](https://github.com/supabase/supabase) - An open source Firebase alternative
 25 | - [SQLite](https://github.com/sqlite/sqlite) - The most popular embedded database engine
 26 | 
 27 | ### Distributed SQL DBMS
 28 | - [Citus](https://github.com/citusdata/citus) - A popular distributed PostgreSQL as an extension
 29 | - [CockroachDB](https://github.com/cockroachdb/cockroach) - A cloud-native distributed SQL database
 30 | - [YugabyteDB](https://github.com/yugabyte/yugabyte-db) - A cloud-native distributed SQL database
 31 | - [TiDB](https://github.com/pingcap/tidb) - A cloud-native, distributed, MySQL-Compatible database
 32 | - [OceanBase](https://github.com/oceanbase/oceanbase) - A scalable distributed relational database
 33 | - [ShardingSphere](https://github.com/apache/shardingsphere) - A Distributed SQL transaction & query engine
 34 | - [Neon](https://github.com/neondatabase/neon) - A serverless open-source alternative to AWS Aurora Postgres
 35 | - [CrateDB](https://github.com/crate/crate) - A distributed and scalable PostgreSQL-compatible SQL database
 36 | 
 37 | ### Cache Store
 38 | - [Redis](https://github.com/redis/redis) - A popular key-value based cache store
 39 | - [Memcached](https://github.com/memcached/memcached) - A high performance multithreaded key-value cache store
 40 | - [Dragonfly](https://github.com/dragonflydb/dragonfly) - A modern cache store compatible with Redis and Memcached APIs
 41 | 
 42 | ### In-memory SQL Database
 43 | - [Apache Ignite](https://github.com/apache/ignite) - A distributed, ACID-compliant in-memory DBMS 
 44 | - [ReadySet](https://github.com/readysettech/readyset) - A MySQL and Postgres wire-compatible caching layer
 45 | - [VoltDB](https://github.com/voltdb/) - A distributed, horizontally-scalable, ACID-compliant database 
 46 | 
 47 | ### Document Store
 48 | - [MongoDB](https://github.com/mongodb/mongo) - A cross-platform, document-oriented NoSQL database
 49 | - [RavenDB](https://github.com/ravendb/ravendb) - An ACID NoSQL document database
 50 | - [RethinkDB](https://github.com/rethinkdb/rethinkdb) | ⚠️ Inactive | - A distributed document-oriented database for real-time applications
 51 | - [CouchDB](https://github.com/apache/couchdb) - A Scalable document-oriented NoSQL database
 52 | - [Couchbase](https://github.com/couchbase) - A modern cloud-native NoSQL distributed database
 53 | - [FerretDB](https://github.com/FerretDB/FerretDB) - A truly Open Source MongoDB alternative!
 54 | - [LowDB](https://github.com/typicode/lowdb) | ⚠️ Inactive | - A simple and fast JSON database 
 55 | 
 56 | ### NoSQL Multi-model
 57 | - [OrientDB](https://github.com/orientechnologies/orientdb) - A Multi-model DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models
 58 | - [ArangoDB](https://github.com/arangodb/arangodb) - A Multi-model database with flexible data models for documents, graphs, and key-values
 59 | - [SurrealDB](https://github.com/surrealdb/surrealdb) - A scalable, distributed, collaborative, document-graph database
 60 | - [EdgeDB](https://github.com/edgedb/edgedb) - A graph-relational database with declarative schema
 61 | 
 62 | ### Graph Database
 63 | - [Neo4j](https://github.com/neo4j/neo4j) - A high performance leading graph database
 64 | - [JanusGraph](https://github.com/JanusGraph/janusgraph) - A highly scalable distributed graph database
 65 | - [HugeGraph](https://github.com/apache/incubator-hugegraph) - A fast-speed and highly-scalable graph database
 66 | - [NebulaGraph](https://github.com/vesoft-inc/nebula) - A distributed, horizontally scalable, fast open-source graph database
 67 | - [Cayley](https://github.com/cayleygraph/cayley) | ⚠️ Inactive | - Inspired by the graph database behind Google's Knowledge Graph
 68 | - [Dgraph](https://github.com/dgraph-io/dgraph) - A horizontally scalable and distributed GraphQL database with a graph backend
 69 | - [Apache AGE](https://github.com/apache/age) - A graph database as an extension to PostgreSQL
 70 | - [FalkorDB](https://github.com/FalkorDB/falkordb) - A graph database that uses GraphBLAS under the hood, tailored for LLMs
 71 | 
 72 | ### Distributed Key-value Store
 73 | - [Riak](https://github.com/basho/riak) | ⚠️ Inactive | - A decentralized key-value datastore from Basho Technologies
 74 | - [FoundationDB](https://github.com/apple/foundationdb) - A distributed, transactional key-value store from Apple
 75 | - [etcd](https://github.com/etcd-io/etcd) - A distributed reliable key-value store written in Go
 76 | - [TiKV](https://github.com/tikv/tikv) - A distributed transactional key-value database, originally created to complement TiDB
 77 | - [Immudb](https://github.com/codenotary/immudb) - A database with built-in cryptographic proof and verification
 78 | - [Valkey](https://github.com/valkey-io/valkey) - A distributed key-value datastore forked from Redis
 79 | - [Apache Kvrocks](https://github.com/apache/kvrocks) - A distributed key-value database that uses RocksDB as storage engine 
 80 | 
 81 | ### Wide-column Key-value Store
 82 | - [Apache Cassandra](https://github.com/apache/cassandra) - A highly-scalable LSM-Tree based partitioned row store
 83 | - [Apache HBase](https://github.com/apache/hbase) - A distributed wide column-oriented store modeled after Google's Bigtable
 84 | - [Scylla](https://github.com/scylladb/scylladb) - LSM-Tree based wide-column API-compatible with Apache Cassandra and Amazon DynamoDB
 85 | - [Apache Accumulo](https://github.com/apache/accumulo) - A distributed key-value store with scalable data storage and retrieval, on top of Hadoop
 86 | 
 87 | ### Embedded Key-value Store
 88 | - [LevelDB](https://github.com/google/leveldb) | ⚠️ Inactive | - A fast key-value storage library written at Google
 89 | - [RocksDB](https://github.com/facebook/rocksdb) - An embeddable, persistent key-value store developed by Meta (Facebook)
 90 | - [MyRocks](https://github.com/facebook/mysql-5.6) - A RocksDB storage engine for MySQL
 91 | - [BadgerDB](https://github.com/dgraph-io/badger) - An embeddable, fast key-value database written in pure Go
 92 | 
 93 | ### Search Engine
 94 | - [Apache Solr](https://github.com/apache/solr) - A fast distributed search database built on Apache Lucene
 95 | - [Elasticsearch](https://github.com/elastic/elasticsearch) - A distributed, RESTful search engine optimized for speed
 96 | - [Sphinx](https://github.com/sphinxsearch/sphinx) | ⚠️ Inactive | - A full-text search engine with high indexing speed
 97 | - [Meilisearch](https://github.com/meilisearch/meilisearch) - A fast search API with great integration support
 98 | - [OpenSearch](https://github.com/opensearch-project/OpenSearch) - A community-driven, open source fork of Elasticsearch and Kibana
 99 | - [Quickwit](https://github.com/quickwit-oss/quickwit) - A fast cloud-native search engine for observability data
100 | - [ParadeDB](https://github.com/paradedb/paradedb) - A search engine built on Postgres
101 | 
102 | ### Streaming Database
103 | - [RisingWave](https://github.com/risingwavelabs/risingwave) - A scalable Postgres for stream processing, analytics, and management
104 | - [Materialize](https://github.com/MaterializeInc/materialize) - A real-time data warehouse purpose-built for operational workloads
105 | - [EventStoreDB](https://github.com/EventStore/EventStore) - An event-native database designed for event sourcing and event-driven architectures
106 | - [KsqlDB](https://github.com/confluentinc/ksql) - A database for building stream processing applications on top of Apache Kafka
107 | - [Timeplus Proton](https://github.com/timeplus-io/proton) - A streaming SQL engine, fast and lightweight, powered by ClickHouse
108 | - [Fluss](https://github.com/alibaba/fluss) - A streaming storage serving as the real-time data layer for Lakehouse architectures
109 | 
110 | ### Time-Series Database
111 | - [InfluxDB](https://github.com/influxdata/influxdb) - A scalable datastore for metrics, events, and real-time analytics
112 | - [TimescaleDB](https://github.com/timescale/timescaledb) - A fast ingest time-series SQL database packaged as a PostgreSQL extension
113 | - [Apache IoTDB](https://github.com/apache/iotdb) - An Internet of Things database with seamless integration with the Hadoop and Spark ecosystems
114 | - [Netflix Atlas](https://github.com/Netflix/atlas) - An in-memory dimensional time series database developed and open sourced by Netflix
115 | - [QuestDB](https://github.com/questdb/questdb) - A time-series database for fast ingest and SQL queries
116 | - [TDengine](https://github.com/taosdata/TDengine) - A high-performance, cloud native time-series database optimized for Internet of Things (IoT)
117 | - [KairosDB](https://github.com/kairosdb/kairosdb) | ⚠️ Inactive | - A scalable time series database written in Java
118 | - [GreptimeDB](https://github.com/GreptimeTeam/greptimedb) - A cloud-native, unified time series database for metrics, logs and events
119 | - [HoraeDB](https://github.com/apache/horaedb) - A distributed, cloud native time-series database
120 | 
121 | ### Columnar OLAP Database
122 | - [Apache Kudu](https://github.com/apache/kudu) - A column-oriented data store for the Apache Hadoop ecosystem
123 | - [Greenplum](https://github.com/greenplum-db/gpdb-archive) | ⛔️ Archived | - A column-oriented massively parallel PostgreSQL for analytics
124 | - [MonetDB](https://github.com/MonetDB/MonetDB) - A high-performance columnar database originally developed by the CWI database research group
125 | - [Databend](https://github.com/datafuselabs/databend) - An elastic, workload-aware cloud-native data warehouse built in Rust
126 | - [ByConity](https://github.com/ByConity/ByConity) - A cloud-native data warehouse forked from ClickHouse
127 | - [Hydra](https://github.com/hydradatabase/hydra) | ⚠️ Inactive | - A column-oriented Postgres extension
128 | 
129 | ### Real-time OLAP Engine
130 | - [ClickHouse](https://github.com/ClickHouse/ClickHouse) - A real-time column-oriented database originally developed at Yandex
131 | - [Apache Pinot](https://github.com/apache/pinot) - A real-time distributed OLAP datastore open sourced by LinkedIn
132 | - [Apache Druid](https://github.com/apache/druid) - A high performance real-time OLAP engine developed and open sourced by Metamarkets
133 | - [Apache Kylin](https://github.com/apache/kylin) - A distributed OLAP engine designed to provide multi-dimensional analysis on Hadoop
134 | - [Apache Doris](https://github.com/apache/doris) - A high-performance and real-time analytical database based on MPP architecture
135 | - [StarRocks](https://github.com/StarRocks/StarRocks) - A sub-second OLAP database supporting multi-dimensional analytics (Linux Foundation project)
136 | 
137 | ### In-process OLAP Engine
138 | - [DuckDB](https://github.com/duckdb/duckdb) - An in-process SQL OLAP Database Management System
139 | - [GlareDB](https://github.com/GlareDB/glaredb) - A SQL database for running analytics across distributed data
140 | - [Apache DataFusion](https://github.com/apache/datafusion) - An extensible query engine with SQL and Dataframe APIs
141 | - [chdb](https://github.com/chdb-io/chdb) - An in-process OLAP SQL Engine powered by ClickHouse
142 | - [SlateDB](https://github.com/slatedb/slatedb) - A cloud-native embedded storage engine built on object storage
143 | 
144 | ### OLAP Extensions
145 | - [pg_duckdb](https://github.com/duckdb/pg_duckdb) - A Postgres extension that embeds DuckDB's analytics engine
146 | - [pg_analytics](https://github.com/paradedb/pg_analytics) - A DuckDB-powered analytics extension for Postgres
147 | - [pg_mooncake](https://github.com/Mooncake-Labs/pg_mooncake) - A columnar storage extension for Postgres based on DuckDB
148 | - [pg_parquet](https://github.com/CrunchyData/pg_parquet) - A Postgres extension for reading and writing data lake Parquet files
149 | 
150 | ## DATA LAKE PLATFORM
151 | 
152 | ### Distributed File System
153 | - [Apache Hadoop HDFS](https://github.com/apache/hadoop) - A highly scalable distributed block-based file system 
154 | - [GlusterFS](https://github.com/gluster/glusterfs) | ⚠️ Inactive | - A scalable distributed storage that can scale to several petabytes
155 | - [JuiceFS](https://github.com/juicedata/juicefs) - A distributed POSIX file system built on top of Redis and S3
156 | - [Lustre](https://github.com/lustre) - A distributed parallel file system purpose-built to provide global POSIX-compliant namespace
157 | 
158 | ### Distributed Object Store
159 | - [Apache Ozone](https://github.com/apache/ozone) - A scalable, redundant, and distributed object store for Apache Hadoop 
160 | - [Ceph](https://github.com/ceph/ceph) - A distributed object, block, and file storage platform
161 | - [MinIO](https://github.com/minio/minio) - A high performance object store that is API-compatible with Amazon S3
162 | - [Garage](https://git.deuxfleurs.fr/Deuxfleurs/garage) - An S3-compatible distributed object storage designed for self-hosting at a small-to-medium scale
163 | 
164 | ### Serialisation Framework
165 | - [Apache Parquet](https://github.com/apache/parquet-format) - An efficient columnar binary storage format that supports nested data
166 | - [Apache Avro](https://github.com/apache/avro) - An efficient and fast row-based binary serialisation framework
167 | - [Apache ORC](https://github.com/apache/orc) - A self-describing type-aware columnar file format designed for Hadoop
168 | - [Lance](https://github.com/lancedb/lance) - A modern columnar data format for ML and LLMs implemented in Rust
169 | - [Vortex](https://github.com/spiraldb/vortex) - A highly extensible and fast columnar file format
170 | - [Arrow Feather](https://github.com/apache/arrow) - A portable file format for storing Arrow tables or data frames
171 | 
172 | ### Open Table Format
173 | - [Apache Hudi](https://github.com/apache/hudi) - An open table format designed to support incremental data ingestion on cloud and Hadoop
174 | - [Apache Iceberg](https://github.com/apache/iceberg) - A high-performance table format for large analytic tables developed at Netflix
175 | - [Delta Lake](https://github.com/delta-io/delta) - A storage framework for building Lakehouse architecture developed by Databricks
176 | - [Apache Paimon](https://github.com/apache/incubator-paimon) - An Apache incubating project to support high-speed streaming data ingestion
177 | - [OpenHouse](https://github.com/linkedin/openhouse) - A declarative catalog with data services for open Data Lakehouse formats
178 | 
179 | ### Native Open Table Format Library
180 | - [Delta-rs](https://github.com/delta-io/delta-rs) - A native Rust library for Delta Lake, with bindings into Python
181 | - [PyIceberg](https://github.com/apache/iceberg-python) - A native Python library for interacting with Iceberg table format
182 | - [Hudi-rs](https://github.com/apache/hudi-rs) - A native Rust library for Apache Hudi, with bindings into Python
183 | 
184 | ### Universal Lakehouse
185 | - [Apache XTable](https://github.com/apache/incubator-xtable) - A unified framework supporting interoperability across multiple open-source table formats
186 | - [Apache Amoro](https://github.com/apache/amoro) - A Lakehouse management system built on open data lake formats
187 | 
188 | ## DATA INTEGRATION
189 | 
190 | ### Data Integration Platform
191 | - [Airbyte](https://github.com/airbytehq/airbyte) - A data integration platform for ETL / ELT data pipelines with wide range of connectors 
192 | - [Apache NiFi](https://github.com/apache/nifi) - A reliable, scalable low-code data integration platform with good enterprise support
193 | - [Apache Camel](https://github.com/apache/camel) - An embeddable integration framework supporting many enterprise integration patterns
194 | - [Apache Gobblin](https://github.com/apache/gobblin) - A distributed data integration framework built by LinkedIn supporting both streaming and batch data
195 | - [Apache InLong](https://github.com/apache/Inlong) - An integration framework for supporting massive data, originally built at Tencent
196 | - [Meltano](https://github.com/meltano/meltano) - A declarative code-first data integration engine 
197 | - [Apache SeaTunnel](https://github.com/apache/seatunnel) - A high-performance, distributed data integration tool supporting various ingestion patterns
198 | - [Estuary Flow](https://github.com/estuary/flow) - A real-time ETL and data pipeline platform for quick data integration
199 | - [dlt](https://github.com/dlt-hub/dlt) - A lightweight data integration library for Python-first data platforms
200 | 
201 | ### CDC Tool
202 | - [Debezium](https://github.com/debezium/debezium) - A change data capture framework supporting a variety of databases
203 | - [Kafka Connect](https://github.com/apache/kafka) - A streaming data integration framework and runtime on top of Apache Kafka supporting CDC
204 | - [Redpanda Connect](https://github.com/redpanda-data/connect) - A data streaming and integration framework on top of Redpanda
205 | - [Flink CDC](https://github.com/apache/flink-cdc) - CDC Connectors for Apache Flink engine supporting different databases
206 | - [Brooklin](https://github.com/linkedin/brooklin) | ⚠️ Inactive | - A distributed platform for streaming data between various heterogeneous source and destination systems
207 | - [RudderStack](https://github.com/rudderlabs/rudder-server) - A headless Customer Data Platform to build data pipelines, open alternative to Segment
208 | - [Artie Transfer](https://github.com/artie-labs/transfer) - A real-time CDC replication solution between OLTP and OLAP databases
209 | - [Dozer](https://github.com/getdozer/dozer) - A real-time CDC based data integration tool between various sources and sinks
210 | - [PeerDB](https://github.com/PeerDB-io/peerdb) - A CDC tool to replicate data from Postgres to data warehouses, queues and other storage
211 | 
212 | ### Data Migration
213 | - [DBmate](https://github.com/amacneil/dbmate) - A lightweight, framework-agnostic database migration tool
214 | - [Ingestr](https://github.com/bruin-data/ingestr) - A CLI tool to copy data between any databases with a single command
215 | - [Sling](https://github.com/slingdata-io/sling-cli) - A CLI tool to transfer data from a source to target storage/database
216 | 
217 | ### Log & Event Collection
218 | - [CloudQuery](https://github.com/cloudquery/cloudquery) - An ETL tool for syncing data from cloud APIs to a variety of supported destinations
219 | - [Snowplow](https://github.com/snowplow/snowplow) | ⚠️ Inactive | - A cloud-native engine for collecting behavioral data and loading it into various cloud storage systems
220 | - [EventMesh](https://github.com/apache/eventmesh) - A serverless event middleware for collecting and loading event data into various targets
221 | - [Apache Flume](https://github.com/apache/flume) | ⚠️ Inactive | - A scalable distributed log aggregation service
222 | - [Steampipe](https://github.com/turbot/steampipe) - A zero-ETL solution for getting data directly from APIs and services
223 | - [Jitsu](https://github.com/jitsucom/jitsu) - A fully-scriptable data ingestion engine for collecting event data
224 | 
225 | ### Event Hub
226 | - [Apache Kafka](https://github.com/apache/kafka) - A highly scalable distributed event store and streaming platform
227 | - [NSQ](https://github.com/nsqio/nsq) - A realtime distributed messaging platform designed to operate at scale
228 | - [Apache Pulsar](https://github.com/apache/pulsar) - A scalable distributed pub-sub messaging system
229 | - [Apache RocketMQ](https://github.com/apache/rocketmq) - A cloud native messaging and streaming platform
230 | - [Redpanda](https://github.com/redpanda-data/redpanda) - A high performance Kafka API compatible streaming data platform 
231 | - [Memphis](https://github.com/memphisdev/memphis) | ⚠️ Inactive | - A scalable data streaming platform for building event-driven applications
232 | - [AutoMQ](https://github.com/AutoMQ/automq) - A cloud-first alternative to Kafka using S3 as the main storage layer
233 | 
234 | ### Reverse ETL
235 | - [Multiwoven](https://github.com/Multiwoven/multiwoven) - A Reverse ETL open source alternative to Hightouch and RudderStack
236 | 
237 | 
238 | ## DATA PROCESSING AND COMPUTATION
239 | 
240 | ### Unified Processing
241 | - [Apache Beam](https://github.com/apache/beam) - A unified programming model supporting execution on popular distributed processing backends 
242 | - [Apache Spark](https://github.com/apache/spark) - A unified analytics engine for large-scale data processing 
243 | - [Dinky](https://github.com/DataLinkDC/dinky) - A unified streaming & batch computation platform based on Apache Flink
244 | - [Feldera](https://github.com/feldera/feldera) - A unified incremental computation engine
245 | 
246 | ### Batch Processing
247 | - [Hadoop MapReduce](https://github.com/apache/hadoop) - A highly scalable distributed batch processing framework from Apache Hadoop project
248 | - [Apache Tez](https://github.com/apache/tez) - A distributed data processing pipeline built for Apache Hive and Hadoop
249 | 
250 | ### Stream Processing
251 | - [Apache Flink](https://github.com/apache/flink) - A scalable high throughput stream processing framework 
252 | - [Apache Samza](https://github.com/apache/samza) - A distributed stream processing framework which uses Kafka and Hadoop, originally developed by LinkedIn
253 | - [Apache Storm](https://github.com/apache/storm) - A distributed realtime computation system based on Actor Model framework
254 | - [Akka](https://github.com/akka/akka) - A highly concurrent, distributed, message-driven processing system based on Actor Model 
255 | - [Bytewax](https://github.com/bytewax/bytewax) - A Python stream processing framework with a Rust distributed processing engine
256 | - [Timeplus Proton](https://github.com/timeplus-io/proton) - A streaming SQL engine, fast and lightweight, powered by ClickHouse
257 | - [FastStream](https://github.com/airtai/faststream) - A Python framework for interacting with event streams such as Apache Kafka
258 | - [Bento](https://github.com/warpstreamlabs/bento) - A stream processing engine from WarpStream Labs (forked from Benthos)
259 | - [Fluvio](https://github.com/infinyon/fluvio) - A lean distributed stream processing system written in Rust and WebAssembly
260 | - [Arroyo](https://github.com/ArroyoSystems/arroyo) - A distributed stream processing engine written in Rust
261 | 
262 | ### Python Processing Framework
263 | - [Polars](https://github.com/pola-rs/polars) - A multithreaded Dataframe with vectorized query engine, written in Rust
264 | - [PySpark](https://github.com/apache/spark) - An interface for Apache Spark in Python
265 | - [Vaex](https://github.com/vaexio/vaex) - A high performance Python library for big tabular datasets
266 | - [Apache Arrow](https://github.com/apache/arrow) - An efficient in-memory data format
267 | - [Ibis](https://github.com/ibis-project/ibis) - A portable Python dataframe library supporting many engine backends
268 | - [SQLFrame](https://github.com/eakmanrq/sqlframe) - A Spark DataFrame API compatible library for data transformation
269 | - [Daft](https://github.com/Eventual-Inc/Daft) - A distributed query engine for large-scale data processing using Python or SQL
270 | - [cuDF](https://github.com/rapidsai/cudf) - A GPU-accelerated DataFrame library with a pandas-like API
271 | 
272 | ### Python Workflow Scaling
273 | - [Dask](https://github.com/dask/dask) - A flexible parallel computing library with task scheduling
274 | - [Ray](https://github.com/ray-project/ray) - A unified framework with distributed runtime for scaling Python applications
275 | - [Modin](https://github.com/modin-project/modin) - A library for scaling Pandas workflows to multi-threaded execution
276 | - [Pandaral·lel](https://github.com/nalepae/pandarallel) | ⚠️ Inactive | - A library to parallelize Pandas operations on all available CPUs
277 | 
278 | ### SQL Toolkit
279 | - [SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy) - A Python SQL toolkit and Object Relational Mapper
280 | - [SQLGlot](https://github.com/tobymao/sqlglot) - A Python SQL parser and transpiler
281 | 
282 | 
283 | ## WORKFLOW MANAGEMENT & DATAOPS
284 | 
285 | ### Workflow Orchestration
286 | - [Apache Airflow](https://github.com/apache/airflow) - A platform for creating and scheduling workflows as directed acyclic graphs (DAGs) of tasks
287 | - [Prefect](https://github.com/PrefectHQ/prefect) - A Python based workflow orchestration tool 
288 | - [Argo](https://github.com/argoproj/argo-workflows) - A container-native workflow engine for orchestrating parallel jobs on Kubernetes 
289 | - [Azkaban](https://github.com/azkaban/azkaban) | ⚠️ Inactive | - A batch workflow job scheduler created at LinkedIn to run Hadoop jobs
290 | - [Cadence](https://github.com/uber/cadence) - A distributed, scalable, highly available orchestration engine supporting different language client libraries
291 | - [Dagster](https://github.com/dagster-io/dagster) - A cloud-native data pipeline orchestrator written in Python
292 | - [Apache DolphinScheduler](https://github.com/apache/dolphinscheduler) - A low-code high performance workflow orchestration platform
293 | - [Luigi](https://github.com/spotify/luigi) - A Python library for building complex pipelines of batch jobs
294 | - [Flyte](https://github.com/flyteorg/flyte) - A scalable and flexible workflow orchestration platform for both data and ML workloads
295 | - [Kestra](https://github.com/kestra-io/kestra) - A declarative language-agnostic workflow orchestration and scheduling platform
296 | - [Mage.ai](https://github.com/mage-ai/mage-ai) - A platform for integrating, scheduling and managing data pipelines
297 | - [Temporal](https://github.com/temporalio/temporal) - A resilient workflow management system, originated as a fork of Uber's Cadence
298 | - [Windmill](https://github.com/windmill-labs/windmill) - A fast workflow engine, and open-source alternative to Airplane and Retool
299 | - [Maestro](https://github.com/Netflix/maestro) - A general-purpose workflow orchestrator developed by Netflix
300 | 
301 | ### Job Scheduling
302 | - [Celery](https://github.com/celery/celery) - A distributed Task Queue system for Python
303 | - [DKron](https://github.com/distribworks/dkron) - A distributed, fault tolerant job scheduling system
304 | - [APScheduler](https://github.com/agronholm/apscheduler/) - An advanced task scheduler and task queue system for Python
305 | 
306 | ### Data Quality
307 | - [Data-diff](https://github.com/datafold/data-diff) | ⛔️ Archived | - A tool for comparing tables within or across databases 
308 | - [Great Expectations](https://github.com/great-expectations/great_expectations) - A data validation and profiling tool written in Python
309 | - [Deequ](https://github.com/awslabs/deequ) - A library based on Apache Spark for measuring data quality in large datasets
310 | - [Pandera](https://github.com/unionai-oss/pandera) - A light-weight, flexible, and expressive statistical data testing library
311 | - [Soda](https://github.com/sodadata/soda-core) - A CLI tool and Python library for data quality testing
312 | - [Pydantic](https://github.com/pydantic/pydantic) - A data validation library using Python type hints 
313 | 
314 | ### Data Versioning
315 | - [LakeFS](https://github.com/treeverse/lakeFS) - A data version control for data stored in data lakes
316 | - [Project Nessie](https://github.com/projectnessie/nessie) - A transactional Catalog for Data Lakes with Git-like semantics
317 | - [DVC](https://github.com/iterative/dvc) - A data version control tool for data and ML experiments
318 | - [Dolt](https://github.com/dolthub/dolt) - A Git for data tool
319 | - [Git-lfs](https://github.com/git-lfs/git-lfs) - A Git extension for versioning large files
320 | - [Datachain](https://github.com/iterative/datachain) - A Python-based framework for versioning for unstructured Data
321 | 
322 | ### Data Modeling
323 | - [dbt](https://github.com/dbt-labs/dbt-core) - A data modeling and transformation tool for data pipelines
324 | - [SQLMesh](https://github.com/TobikoData/sqlmesh) - A data transformation and modeling framework that is backwards compatible with dbt
325 | 
326 | ### Pipeline Observability
327 | - [Elementary](https://github.com/elementary-data/elementary) - A dbt-native data observability solution to monitor data pipelines
328 | 
329 | 
330 | ## DATA INFRASTRUCTURE
331 | 
332 | ### Resource Scheduling
333 | - [Apache YARN](https://github.com/apache/hadoop) - The default Resource Scheduler for Apache Hadoop clusters
334 | - [Apache Mesos](https://github.com/apache/mesos) - A resource scheduling and cluster resource abstraction framework developed by Ph.D. students at UC Berkeley
335 | - [Kubernetes](https://github.com/kubernetes/kubernetes) - A production-grade container scheduling and management tool
336 | - [Apache YuniKorn](https://github.com/apache/yunikorn-core) - A light-weight, universal resource scheduler for container orchestrator systems
337 | - [Docker](https://github.com/docker) - The popular OS-level virtualization and containerization software
338 | 
339 | ### Cluster Administration
340 | - [Apache Ambari](https://github.com/apache/ambari) - A tool for provisioning, managing, and monitoring of Apache Hadoop clusters 
341 | - [Apache Helix](https://github.com/apache/helix) - A generic cluster management framework developed at LinkedIn
342 | 
343 | ### Security
344 | - [Apache Knox](https://github.com/apache/knox) - A gateway and SSO service for managing access to Hadoop clusters
345 | - [Apache Ranger](https://github.com/apache/ranger) - A security and governance platform for Hadoop and other popular services
346 | - [Kerberos](https://github.com/krb5/krb5) - A popular enterprise network authentication protocol
347 | 
348 | ### Metrics Store
349 | - [InfluxDB](https://github.com/influxdata/influxdb) - A scalable datastore for metrics and events
350 | - [Mimir](https://github.com/grafana/mimir) - A scalable long-term metrics storage for Prometheus, developed by Grafana Labs
351 | - [OpenTSDB](https://github.com/OpenTSDB/opentsdb) - A distributed, scalable time series database written on top of Apache HBase
352 | - [M3](https://github.com/m3db/m3) - A distributed TSDB and metrics storage and aggregator
353 | 
354 | ### Observability Framework
355 | - [Prometheus](https://github.com/prometheus/prometheus) - A popular metric collection and management tool
356 | - [ELK](https://www.elastic.co/elastic-stack) - A popular observability stack comprising Elasticsearch, Kibana, Beats, and Logstash
357 | - [Graphite](https://github.com/graphite-project) - An established infrastructure monitoring and observability system
358 | - [OpenTelemetry](https://github.com/open-telemetry) - A collection of APIs, SDKs, and tools for generating, collecting, and exporting telemetry data (metrics, logs, and traces)
359 | - [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics/) - A scalable monitoring solution with a time series database
360 | - [Zabbix](https://github.com/zabbix/zabbix) - A real-time infrastructure and application monitoring service
361 | 
362 | ### Monitoring Dashboard
363 | - [Grafana](https://github.com/grafana/grafana) - A popular open and composable observability and data visualization platform
364 | - [Kibana](https://github.com/elastic/kibana) - The visualisation and search dashboard for Elasticsearch
365 | - [Redpanda Console](https://github.com/redpanda-data/console) - A UI for monitoring and managing Apache Kafka and Redpanda workloads
366 | 
367 | ### Log & Metrics Pipeline
368 | - [Fluentd](https://github.com/fluent/fluentd) - A log and event collection, buffering and routing service
369 | - [Fluent Bit](https://github.com/fluent/fluent-bit) - A fast log processor and forwarder, and part of the Fluentd ecosystem
370 | - [Logstash](https://github.com/elastic/logstash) - A server-side log and metric transport and processor, as part of the ELK stack
371 | - [Telegraf](https://github.com/influxdata/telegraf) - A plugin-driven server agent for collecting & reporting metrics, developed by InfluxData
372 | - [Vector](https://github.com/vectordotdev/vector) - A high-performance, end-to-end (agent & aggregator) observability data pipeline
373 | - [StatsD](https://github.com/statsd/statsd) | ⚠️ Inactive | - A network daemon for collection, aggregation and routing of metrics
374 | 
375 | ### Cost Management
376 | - [OpenCost](https://github.com/opencost/opencost) - Cost monitoring for Kubernetes workloads and cloud costs
377 | 
378 | ## METADATA MANAGEMENT
379 | 
380 | ### Metadata Platform
381 | - [Amundsen](https://github.com/amundsen-io/amundsen) - A data discovery and metadata engine developed by Lyft engineers
382 | - [Apache Atlas](https://github.com/apache/atlas) - A data governance and metadata framework for the Apache Hadoop ecosystem
383 | - [DataHub](https://github.com/datahub-project/datahub) - A metadata platform for the modern data stack developed at LinkedIn
384 | - [Marquez](https://github.com/MarquezProject/marquez) - A metadata service for the collection, aggregation, and visualization of metadata
385 | - [ckan](https://github.com/ckan/ckan) - A data management system for cataloging, managing and accessing data
386 | - [Open Metadata](https://github.com/open-metadata/OpenMetadata) - A unified platform for discovery and governance, using a central metadata repository
387 | - [ODD Platform](https://github.com/opendatadiscovery/odd-platform) - A data discovery and observability platform
388 | 
389 | ### Open Standards
390 | - [Open Lineage](https://github.com/OpenLineage/OpenLineage) - An open standard for lineage metadata collection 
391 | - [Open Metadata](https://github.com/open-metadata/OpenMetadata) - A unified metadata platform providing open standards for managing metadata
392 | - [Egeria](https://github.com/odpi/egeria) - Open metadata and governance standards to facilitate metadata exchange
393 | 
394 | ### Schema & Catalog Service
395 | - [Hive Metastore](https://github.com/apache/hive) - A popular schema management and metastore service as part of the Apache Hive project
396 | - [Confluent Schema Registry](https://github.com/confluentinc/schema-registry) - A schema registry for Kafka, developed by Confluent
397 | - [Apache Polaris](https://github.com/apache/polaris) - An interoperable, open source catalog for Apache Iceberg
398 | - [Unity Catalog](https://github.com/unitycatalog/unitycatalog) - A universal catalog for data lakehouse formats and other data/AI assets
399 | - [Lakekeeper](https://github.com/lakekeeper/lakekeeper) - A Rust-native Apache Iceberg REST catalog
400 | - [Apache Gravitino](https://github.com/apache/gravitino) - A geo-distributed and federated open data catalog
401 | 
402 | 
403 | ## ANALYTICS & VISUALISATION
404 | 
405 | ### BI & Dashboard
406 | - [Apache Superset](https://github.com/apache/superset) - A popular open source data visualization and data exploration platform
407 | - [Metabase](https://github.com/metabase/metabase) - A simple data visualisation and exploration dashboard
408 | - [Redash](https://github.com/getredash/redash) - A tool to explore, query, visualize, and share data with many data source connectors
409 | - [Lightdash](https://github.com/lightdash/lightdash) - A self-service BI tool that turns a dbt project into a full-stack BI platform
410 | 
411 | ### BI as Code (Web App)
412 | - [Streamlit](https://github.com/streamlit/streamlit) - A Python tool to package and share data as web apps
413 | - [Evidence](https://github.com/evidence-dev/evidence) - A tool to build interactive data visualizations in pure SQL and markdown
414 | - [dash](https://github.com/plotly/dash) - A Python framework for building ML & data science web apps
415 | - [Vizro](https://github.com/mckinsey/vizro) - A toolkit for creating modular data visualization applications
416 | - [Mercury](https://github.com/mljar/mercury) - A tool to convert Jupyter Notebooks to web apps
417 | - [Quary](https://github.com/quarylabs/quary) - A code-based BI solution
418 | 
419 | ### Query & Collaboration
420 | - [Hue](https://github.com/cloudera/hue) - A query and data exploration tool with Hadoop ecosystem support, developed by Cloudera
421 | - [Apache Zeppelin](https://github.com/apache/zeppelin) - A web-based notebook for interactive data analytics and collaboration for Hadoop
422 | - [Querybook](https://github.com/pinterest/querybook) - A simple query and notebook UI developed by Pinterest
423 | - [Jupyter](https://github.com/jupyter/notebook) - A popular interactive web-based notebook application
424 | - [IPython](https://github.com/ipython/ipython) - An enhanced interactive Python shell for data analysis
425 | - [Datasette](https://github.com/simonw/datasette) - A tool for exploring and publishing data 
426 | 
427 | ### MPP Query Engine
428 | - [Apache Hive](https://github.com/apache/hive) - A data warehousing and MPP engine on top of Hadoop
429 | - [Apache Impala](https://github.com/apache/impala) - An MPP engine mainly for Hadoop clusters, developed by Cloudera
430 | - [Presto](https://github.com/prestodb/presto) - A distributed SQL query engine for big data
431 | - [Trino](https://github.com/trinodb/trino) - The former PrestoSQL distributed SQL query engine
432 | - [Apache Drill](https://github.com/apache/drill) - A distributed MPP query engine against NoSQL and Hadoop data storage systems
433 | - [DataFusion Ballista](https://github.com/apache/datafusion-ballista) - A distributed query execution engine based on Apache DataFusion
434 | 
435 | ### Semantic & Middleware Layer
436 | - [Alluxio](https://github.com/Alluxio/alluxio) - A data orchestration and virtual distributed storage system
437 | - [Cube](https://github.com/cube-js/cube) - A semantic layer for building data applications, supporting popular database engines
438 | - [Apache Linkis](https://github.com/apache/linkis) - A computation middleware to facilitate connection and orchestration between applications and data engines
439 | - [Apache Gluten](https://github.com/apache/incubator-gluten) - A middle layer for offloading JVM-based SQL engines execution to native engines
440 | - [Apache OpenDAL](https://github.com/apache/opendal) - An open data access layer that enables seamless interaction with diverse storage services
441 | 
442 | ### Data Sharing
443 | - [delta-sharing](https://github.com/delta-io/delta-sharing) - An open protocol for secure real-time exchange of large datasets
444 | 
445 | 
446 | ## ML/AI PLATFORM
447 | 
448 | ### Vector Storage
449 | - [milvus](https://github.com/milvus-io/milvus) - A cloud-native vector database and storage for AI applications
450 | - [qdrant](https://github.com/qdrant/qdrant) - A high-performance, scalable vector database for AI
451 | - [chroma](https://github.com/chroma-core/chroma) - An AI-native embedding database for building LLM apps
452 | - [marqo](https://github.com/marqo-ai/marqo) - An end-to-end vector search engine for both text and images
453 | - [LanceDB](https://github.com/lancedb/lancedb) - A serverless vector database for AI applications written in Rust
454 | - [weaviate](https://github.com/weaviate/weaviate) - A scalable, cloud-native vector database supporting storage of both objects and vectors
455 | - [deeplake](https://github.com/activeloopai/deeplake) - An AI database with a storage format optimized for deep-learning applications
456 | - [Vespa](https://github.com/vespa-engine/vespa) - A storage to organize vectors, tensors, text and structured data
457 | - [vald](https://github.com/vdaas/vald) - A scalable distributed approximate nearest neighbor (ANN) dense vector search engine
458 | - [pgvector](https://github.com/pgvector/pgvector) - Vector similarity search as a Postgres extension
459 | 
460 | ### MLOps
461 | - [mlflow](https://github.com/mlflow/mlflow) - A platform to streamline machine learning development and lifecycle management
462 | - [Metaflow](https://github.com/Netflix/metaflow) - A tool to build and manage ML/AI, and data science projects, developed at Netflix
463 | - [SkyPilot](https://github.com/skypilot-org/skypilot) - A framework for running LLMs, AI, and batch jobs on any cloud
464 | - [Jina](https://github.com/jina-ai/jina) - A tool to build multimodal AI applications with cloud-native stack
465 | - [NNI](https://github.com/microsoft/nni) | ⛔️ Archived | - An AutoML toolkit for automating the machine learning lifecycle, from Microsoft
466 | - [BentoML](https://github.com/bentoml/BentoML) - A framework for building reliable and scalable AI applications
467 | - [Determined AI](https://github.com/determined-ai/determined) - An ML platform that simplifies distributed training, tuning and experiment tracking
468 | - [RAY](https://github.com/ray-project/ray) - A unified framework for scaling AI and Python applications
469 | - [kubeflow](https://github.com/kubeflow/kubeflow) - A cloud-native platform for ML operations - pipelines, training and deployment
470 | - [Kedro](https://github.com/kedro-org/kedro) - A toolbox and framework for building production-ready data science and ML workflows
471 | - [Pachyderm](https://github.com/pachyderm/pachyderm) - A scalable ML and data science data processing and workflow management platform
472 | 
473 | ### LLMOps
474 | - [Dify](https://github.com/langgenius/dify) - An LLM development platform with AI workflow, RAG pipeline and model management
475 | - [Haystack](https://github.com/deepset-ai/haystack) - AI orchestration framework to build customizable, production-ready LLM applications
476 | - [Superduper](https://github.com/superduper-io/superduper) - A Python-based framework for building AI-data workflows and applications
477 | - [Cognee](https://github.com/topoteretes/cognee) - An LLM memory engine for implementing LLM workflows
478 | - [vLLM](https://github.com/vllm-project/vllm) - A high-throughput and memory-efficient inference and serving engine for LLMs
479 | 
480 | 


--------------------------------------------------------------------------------