# DuckStreams

### Description
DuckStreams acts as a virtual database on top of your Kafka and Redpanda
clusters, effectively providing an ephemeral SQL interface for querying
streaming data.

## Scope
DuckStreams turns Kafka and Redpanda topics into SQL-accessible virtual tables,
allowing users to query streaming data in real time. The project interfaces with
the schema registry to map topics to tables and supports deserializing data in
formats like JSON, Protobuf, and Thrift. Using DuckDB, DuckStreams creates
ephemeral tables, runs queries on them, and returns results without persisting
any data. It’s designed to be lightweight, fully in-memory, and ideal for
querying dynamic stream data without caching or long-term storage.

## How it works

First, the service is exposed as a Python DBAPI-compatible driver. When asked for its list of
tables (queried through `INFORMATION_SCHEMA.TABLES`), it interfaces with your streaming cluster's
schema registry to get a list of topics. This table is made ephemeral through DuckDB so that
you can apply predicates and grouping, and run any SQL against it.

Similarly, when running any SQL statement against the database, we parse out the virtual table
name, which should match an existing topic (or `INFORMATION_SCHEMA.TABLES`), and then simply:

1. figure out the topic
1. fire up a client + consumer, and apply the time and partition predicates
1. deserialize the data into memory and load it into an ephemeral, in-memory DuckDB table
1. run your SQL against this ephemeral table in DuckDB and retrieve the result set
1. return it through a DBAPI-compatible interface

(Sketches of both the DBAPI surface and this internal flow appear at the end of this README.)

## Configuration

* clusters: define your clusters in a YAML file (a sketch appears at the end of this README)

* policies:
  * inheritance levels: top-level, cluster-level, or table-level
  * parameters
    * row_limit: limits the number of rows the consumer will read; it simply stops once the limit is reached
    * time_range_limit: defines the maximum time range that can be queried, anywhere from seconds to years
    * bytes (?)
    * cells (?)

## Thoughts & questions

* nesting: it's pretty common to have deeply/oddly nested schemas on the transport layer.
  How good is DuckDB's support for complex schemas? Arbitrary JSON? Should we auto-columnize
  things as we deserialize? Automagically? Based on configs?
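
## Sketches

These sketches make the sections above more concrete. They are illustrative only: none of
the module names, function signatures, config keys, topic names, or host addresses below are
part of a published API.

From the user's perspective, the driver would look like any other DBAPI connection. The
`duckstreams` module name, the `connect()` signature, and the `clickstream` topic are
hypothetical assumptions.

```python
# Hypothetical usage sketch -- module name, connect() signature, and the
# "clickstream" topic/table are illustrative assumptions, not a real API.
import duckstreams

# Point the driver at a clusters/policies YAML file (see the Configuration section).
conn = duckstreams.connect(config="clusters.yaml", cluster="prod")
cur = conn.cursor()

# Discover topics exposed as virtual tables.
cur.execute("SELECT table_name FROM INFORMATION_SCHEMA.TABLES")
print(cur.fetchall())

# Query a topic as if it were a table; time/partition predicates get pushed
# down to the consumer, everything else runs in an ephemeral DuckDB table.
cur.execute("""
    SELECT user_id, COUNT(*) AS events
    FROM clickstream
    WHERE event_time > NOW() - INTERVAL 1 HOUR
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)

conn.close()
```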
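
One way `INFORMATION_SCHEMA.TABLES` could be backed by the schema registry, assuming the
`confluent-kafka` schema-registry client plus `pandas` and `duckdb`. Mapping subjects to table
names by stripping the conventional `-value` suffix, and registering the view under a
simplified name, are both assumptions made for the sketch.

```python
# Sketch: list schema-registry subjects and expose them as an ephemeral
# DuckDB table, so predicates and grouping can be applied to the listing.
import duckdb
import pandas as pd
from confluent_kafka.schema_registry import SchemaRegistryClient

# Registry URL is illustrative.
registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
subjects = registry.get_subjects()                     # e.g. ["clickstream-value", ...]
tables = [s.removesuffix("-value") for s in subjects]  # assumed subject-naming convention

con = duckdb.connect(database=":memory:")
con.register("information_schema_tables", pd.DataFrame({"table_name": tables}))
print(con.execute(
    "SELECT table_name FROM information_schema_tables ORDER BY 1"
).fetchall())
```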
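
A minimal sketch of the internal query flow from "How it works", assuming JSON-encoded
messages and the `confluent-kafka`, `pandas`, and `duckdb` packages. The topic, broker
address, row limit, and SQL are illustrative.

```python
# Sketch of steps 1-5: consume, deserialize, load into an ephemeral DuckDB
# table, run the user's SQL, return a result set.
import json

import duckdb
import pandas as pd
from confluent_kafka import Consumer

TOPIC = "clickstream"   # the virtual table name parsed out of the user's SQL
ROW_LIMIT = 10_000      # a row_limit policy: stop consuming once reached
USER_SQL = 'SELECT user_id, COUNT(*) AS events FROM "clickstream" GROUP BY 1'

# 1-2. fire up a consumer for the topic (time/partition predicates would be
# translated into partition assignments / offsets_for_times() lookups here).
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "duckstreams-ephemeral",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])

# 3. deserialize messages into memory, stopping at the row limit.
rows = []
while len(rows) < ROW_LIMIT:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        break  # nothing left within the timeout
    if msg.error():
        continue
    rows.append(json.loads(msg.value()))
consumer.close()

# 3-4. load into an ephemeral, in-memory DuckDB table and run the user's SQL.
con = duckdb.connect(database=":memory:")
con.register(TOPIC, pd.DataFrame(rows))
result = con.execute(USER_SQL)

# 5. hand the cursor-like result back through the DBAPI layer.
print(result.fetchall())
```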
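
Finally, a sketch of what the clusters/policies YAML described in the Configuration section
might look like. The key names and the inheritance layout are assumptions based on the notes
above, not a finalized schema.

```yaml
# Illustrative only -- key names, values, and structure are assumptions.
policies:                      # top-level defaults, inherited by clusters and tables
  row_limit: 100000
  time_range_limit: 7d

clusters:
  prod:
    bootstrap_servers: broker-1:9092,broker-2:9092
    schema_registry_url: http://schema-registry:8081
    policies:                  # cluster-level override
      time_range_limit: 24h
    tables:
      clickstream:
        policies:              # table-level override
          row_limit: 10000
```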