# DuckStreams

### Description
DuckStreams acts as a virtual database on top of your Kafka and Redpanda
clusters, effectively providing an ephemeral SQL interface for querying
streaming data.

## Scope
DuckStreams turns Kafka and Redpanda topics into SQL-accessible virtual tables,
allowing users to query streaming data in real time. The project interfaces with
the schema registry to map topics to tables and supports deserializing data in
formats like JSON, Protobuf, and Thrift. Using DuckDB, DuckStreams creates
ephemeral tables, runs queries on them, and returns results without persisting
any data. It’s designed to be lightweight, fully in-memory, and ideal for
querying dynamic stream data without caching or long-term storage.

## How it works

First, the service is exposed as a Python DBAPI-compatible driver. When asked for its list of
tables (queried through `INFORMATION_SCHEMA.TABLES`), it interfaces with your streaming cluster's
schema registry to get a list of topics. This table is made ephemeral through DuckDB so that
you can apply predicates and grouping, and run any SQL against it.

Similarly, when running any SQL statement against the database, we parse out the virtual table
name, which should match an existing topic (or `INFORMATION_SCHEMA.TABLES`), and then simply:

1. figure out the topic
1. fire up a client + consumer, and apply the time and partition predicates
1. deserialize the data into memory and load it into an ephemeral, in-memory DuckDB table
1. run your SQL against this ephemeral table in DuckDB and retrieve the result set
1. return it through a DBAPI-compatible interface

(Sketches of both the DBAPI surface and this internal flow appear at the end of this README.)

## Configuration

* clusters: define your clusters in a YAML file (a sketch appears at the end of this README)

* policies:
  * inheritance levels: top-level, cluster-level, or table-level
  * parameters
    * row_limit: limits the number of rows the consumer will read; it simply stops once the limit is reached
    * time_range_limit: defines the maximum time range that can be queried, anywhere from seconds to years
    * bytes (?)
    * cells (?)

## Thoughts & questions

* nesting: it's pretty common to have deeply/oddly nested schemas on the transport layer.
  How good is DuckDB's support for complex schemas? Arbitrary JSON? Should we auto-columnize
  things as we deserialize? Automagically? Based on configs?
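
## Sketches

These sketches make the sections above more concrete. They are illustrative only: none of
the module names, function signatures, config keys, topic names, or host addresses below are
part of a published API.

From the user's perspective, the driver would look like any other DBAPI connection. The
`duckstreams` module name, the `connect()` signature, and the `clickstream` topic are
hypothetical assumptions.

```python
# Hypothetical usage sketch -- module name, connect() signature, and the
# "clickstream" topic/table are illustrative assumptions, not a real API.
import duckstreams

# Point the driver at a clusters/policies YAML file (see the Configuration section).
conn = duckstreams.connect(config="clusters.yaml", cluster="prod")
cur = conn.cursor()

# Discover topics exposed as virtual tables.
cur.execute("SELECT table_name FROM INFORMATION_SCHEMA.TABLES")
print(cur.fetchall())

# Query a topic as if it were a table; time/partition predicates get pushed
# down to the consumer, everything else runs in an ephemeral DuckDB table.
cur.execute("""
    SELECT user_id, COUNT(*) AS events
    FROM clickstream
    WHERE event_time > NOW() - INTERVAL 1 HOUR
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)

conn.close()
```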
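
One way `INFORMATION_SCHEMA.TABLES` could be backed by the schema registry, assuming the
`confluent-kafka` schema-registry client plus `pandas` and `duckdb`. Mapping subjects to table
names by stripping the conventional `-value` suffix, and registering the view under a
simplified name, are both assumptions made for the sketch.

```python
# Sketch: list schema-registry subjects and expose them as an ephemeral
# DuckDB table, so predicates and grouping can be applied to the listing.
import duckdb
import pandas as pd
from confluent_kafka.schema_registry import SchemaRegistryClient

# Registry URL is illustrative.
registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
subjects = registry.get_subjects()                     # e.g. ["clickstream-value", ...]
tables = [s.removesuffix("-value") for s in subjects]  # assumed subject-naming convention

con = duckdb.connect(database=":memory:")
con.register("information_schema_tables", pd.DataFrame({"table_name": tables}))
print(con.execute(
    "SELECT table_name FROM information_schema_tables ORDER BY 1"
).fetchall())
```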
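
A minimal sketch of the internal query flow from "How it works", assuming JSON-encoded
messages and the `confluent-kafka`, `pandas`, and `duckdb` packages. The topic, broker
address, row limit, and SQL are illustrative.

```python
# Sketch of steps 1-5: consume, deserialize, load into an ephemeral DuckDB
# table, run the user's SQL, return a result set.
import json

import duckdb
import pandas as pd
from confluent_kafka import Consumer

TOPIC = "clickstream"   # the virtual table name parsed out of the user's SQL
ROW_LIMIT = 10_000      # a row_limit policy: stop consuming once reached
USER_SQL = 'SELECT user_id, COUNT(*) AS events FROM "clickstream" GROUP BY 1'

# 1-2. fire up a consumer for the topic (time/partition predicates would be
# translated into partition assignments / offsets_for_times() lookups here).
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "duckstreams-ephemeral",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])

# 3. deserialize messages into memory, stopping at the row limit.
rows = []
while len(rows) < ROW_LIMIT:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        break  # nothing left within the timeout
    if msg.error():
        continue
    rows.append(json.loads(msg.value()))
consumer.close()

# 3-4. load into an ephemeral, in-memory DuckDB table and run the user's SQL.
con = duckdb.connect(database=":memory:")
con.register(TOPIC, pd.DataFrame(rows))
result = con.execute(USER_SQL)

# 5. hand the cursor-like result back through the DBAPI layer.
print(result.fetchall())
```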
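
Finally, a sketch of what the clusters/policies YAML described in the Configuration section
might look like. The key names and the inheritance layout are assumptions based on the notes
above, not a finalized schema.

```yaml
# Illustrative only -- key names, values, and structure are assumptions.
policies:                      # top-level defaults, inherited by clusters and tables
  row_limit: 100000
  time_range_limit: 7d

clusters:
  prod:
    bootstrap_servers: broker-1:9092,broker-2:9092
    schema_registry_url: http://schema-registry:8081
    policies:                  # cluster-level override
      time_range_limit: 24h
    tables:
      clickstream:
        policies:              # table-level override
          row_limit: 10000
```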