├── .gitignore
├── .python-version
├── README.md
├── docs
│   └── boringdata.png
├── pyproject.toml
├── src
│   └── boringcatalog
│       ├── __init__.py
│       ├── catalog.py
│       ├── cli.py
│       └── duckdb_init.sql
├── tests
│   └── test_catalog.py
└── uv.lock
/.gitignore:
--------------------------------------------------------------------------------
1 | # Python-generated files
2 | __pycache__/
3 | *.py[oc]
4 | build/
5 | dist/
6 | wheels/
7 | *.egg-info
8 | .env
9 | # Virtual environments
10 | .venv
11 |
--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------
1 | 3.10
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | **[boringdata.io](https://boringdata.io) — Kickstart your Iceberg journey with our data stack templates.**
2 |
3 |
4 |
5 | ----
6 | # Boring Catalog
7 |
8 | A lightweight, file-based Iceberg catalog implementation using a single JSON file (e.g., on S3, local disk, or any fsspec-compatible storage).
9 |
10 | ## Why Boring Catalog?
11 | - No need to host or maintain a dedicated catalog service
12 | - Easy to use, easy to understand, perfect to get started with Iceberg
13 | - A DuckDB CLI interface to easily explore your Iceberg tables and metadata
14 |
15 | ## How It Works
16 | Boring Catalog stores all Iceberg catalog state in a single JSON file:
17 | - Namespaces and tables are tracked in this file
18 | - S3 conditional writes prevent conflicting concurrent modifications when the catalog is stored on S3
19 | - The `.ice/index` file in your project directory stores the configuration for your catalog, including:
20 | - `catalog_uri`: the path to your catalog JSON file
21 | - `catalog_name`: the logical name of your catalog
22 | - `properties`: additional properties (e.g., warehouse location)
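A freshly initialized `.ice/index` looks roughly like this (paths shown are the local defaults):

```json
{
  "catalog_uri": "warehouse/catalog/catalog_boring.json",
  "catalog_name": "boring",
  "properties": {
    "warehouse": "warehouse"
  }
}
```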
23 |
24 | ## Installation
25 | ```bash
26 | pip install boringcatalog
27 | ```
28 |
29 | ## Quickstart
30 |
31 | ### Initialize a Catalog
32 | ```bash
33 | ice init
34 | ```
35 |
36 | That's it! Your catalog is now ready to use.
37 | 
38 | Two files are created:
39 | - `warehouse/catalog/catalog_boring.json` = catalog file
40 | - `.ice/index` = points to the catalog location (similar to a git index file, but for Iceberg)
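The catalog file itself starts out as an empty structure that Boring Catalog fills in as you create namespaces and tables; a newly created `warehouse/catalog/catalog_boring.json` contains:

```json
{
  "catalog_name": "boring",
  "namespaces": {},
  "tables": {}
}
```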
41 |
42 |
43 | *Note: You can also specify a remote location for your Iceberg data and catalog file:*
44 | ```bash
45 | ice init -p warehouse=s3://mybucket/mywarehouse
46 | ```
47 | More details in the [Custom Init and Catalog Location](#custom-init-and-catalog-location) section.
48 |
49 | *Note: If you are using an S3 path (e.g., `s3://...`) for your catalog file or warehouse, make sure your CLI environment is authenticated with AWS. For example, you can set your AWS profile with:*
50 |
51 | ```bash
52 | export AWS_PROFILE=your-profile
53 | ```
54 |
55 | *You must have valid AWS credentials configured for the CLI to access S3 resources.*
56 |
57 | You can then start using the catalog:
58 |
59 | ### Commit a table
60 | ```bash
61 | # Get some data
62 | curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
63 |
64 | # Commit the table
65 | ice commit my_table --source /tmp/yellow_tripdata_2023-01.parquet
66 | ```
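After the first commit, the catalog JSON gains an entry for the table. The field layout below matches what Boring Catalog writes; the namespace and metadata file path are illustrative:

```json
{
  "catalog_name": "boring",
  "namespaces": {
    "ice_default": { "properties": { "exists": "true" } }
  },
  "tables": {
    "ice_default.my_table": {
      "namespace": "ice_default",
      "name": "my_table",
      "previous_metadata_location": null,
      "metadata_location": "warehouse/ice_default.db/my_table/metadata/00000-....metadata.json"
    }
  }
}
```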
67 |
68 | ### Check the commit history
69 |
70 | ```bash
71 | ice log
72 | ```
73 |
74 | ### Explore your Iceberg tables (data and metadata) with DuckDB
75 | ```bash
76 | ice duck
77 | ```
78 | This opens an interactive DuckDB session with pointers to all your tables and namespaces.
79 |
80 | Example DuckDB queries:
81 | ```sql
82 | show;                                -- show all tables
83 | select * from catalog.namespaces;    -- list namespaces
84 | select * from catalog.tables;        -- list tables
85 | select * from ice_default.my_table;  -- query an Iceberg table (namespace.table)
86 | ```
87 |
88 | ## Python Usage
89 |
90 | ```python
91 | from boringcatalog import BoringCatalog
92 | import pyarrow.parquet as pq
93 | 
94 | # Auto-detects .ice/index in the current working directory
95 | catalog = BoringCatalog()
96 | 
97 | # Or point at a specific catalog file
98 | catalog = BoringCatalog(name="mycat", uri="path/to/catalog.json")
99 | 
100 | # Create a namespace, then a table from a Parquet file's schema
101 | catalog.create_namespace("my_namespace")
102 | df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
103 | table = catalog.create_table("my_namespace.my_table", schema=df.schema)
104 | table.append(df)
105 | 
106 | # Load it back later
107 | table = catalog.load_table("my_namespace.my_table")
108 | ```
109 |
110 |
111 | ## Custom Init and Catalog Location
112 |
113 | You can configure your Iceberg catalog in several ways, depending on where you want to store your catalog metadata (the JSON file) and your Iceberg data (the warehouse):
114 | - The `warehouse` property determines where your Iceberg tables' data will be stored.
115 | - The `--catalog` option lets you specify the exact path for your catalog JSON file.
116 | - If you use both, the catalog file will be created at the path you specify, and the warehouse will be used for table data.
117 |
118 | ### Examples
119 | | Command Example | Catalog File Location | Warehouse/Data Location | Use Case |
120 | |-----------------|----------------------|------------------------|----------|
121 | | `ice init` | `warehouse/catalog/catalog_boring.json` | `warehouse/` | Local, simple |
122 | | `ice init -p warehouse=<path>` | `<path>/catalog/catalog_boring.json` | `<path>/` | Custom warehouse |
123 | | `ice init --catalog <file>.json` | `<file>.json` | (to define when creating a table) | Custom catalog file |
124 | | `ice init --catalog <file>.json -p warehouse=<path>` | `<file>.json` | `<path>/` | Full control |
125 | | `ice init --catalog <file>.json --catalog-name <name>` | `<file>.json` | (to define when creating a table) | Custom name & file |
126 |
127 | ### Edge Cases & Manual Editing
128 | - **Custom Catalog Name:** By default, the catalog is named `"boring"`, but you can set a custom name with `--catalog-name`. This name is used in the catalog JSON and for file naming if you don't specify a custom path.
129 | - **Re-initialization:** If you run `ice init` multiple times in the same directory, the `.ice/index` file will be overwritten with the new configuration. This is useful if you want to re-point your project to a different catalog, but be aware that it will not migrate or merge any existing data.
130 | - **Manual Editing:** Advanced users can manually edit `.ice/index` to point to a different catalog file or change the catalog name. If you do this, make sure the `catalog_uri` and `catalog_name` fields are consistent with your actual catalog JSON file. If you set a `warehouse` property but do not update `catalog_uri`, Boring Catalog will always use the `catalog_uri` from the index file.
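If you do edit `.ice/index` by hand, a quick sanity check can catch a mismatch before the catalog does. The helper below is a hypothetical, stdlib-only sketch (not part of `boringcatalog`); it assumes local file paths, so for an S3-hosted catalog you would read the JSON via `fsspec` instead:

```python
import json

def check_index(index_path):
    """Return True if .ice/index and its catalog JSON agree on the catalog name."""
    with open(index_path) as f:
        index = json.load(f)
    # The index must point at a readable catalog file...
    with open(index["catalog_uri"]) as f:
        catalog = json.load(f)
    # ...whose embedded catalog_name matches the index entry.
    return index["catalog_name"] == catalog["catalog_name"]
```

Running `check_index(".ice/index")` returns `False` when the names diverge, which is exactly the inconsistency described above.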
131 |
132 | ## Roadmap
133 | - [ ] Improve CLI to allow MERGE operation, partition spec, etc.
134 | - [ ] Improve CLI to get info about table schema / partition spec / etc.
135 | - [ ] Expose REST API for integration with AWS, Snowflake, etc.
136 |
--------------------------------------------------------------------------------
/docs/boringdata.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/boringdata/boring-catalog/4c85dbddb9039f1d03a941da39f952366fe5050a/docs/boringdata.png
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [project]
2 | name = "boringcatalog"
3 | version = "0.4.0"
4 | description = "A lightweight, file-based Iceberg catalog implementation"
5 | readme = "README.md"
6 | authors = [
7 | { name = "huraultj", email = "julien.hurault@sumeo.io" }
8 | ]
9 | requires-python = ">=3.10"
10 | dependencies = [
11 | "s3fs>=2023.12.0",
12 |     "pyiceberg[pyarrow]>=0.9.0",
13 |     "click>=8.0.0",
14 |     "duckdb>=0.9.0"
15 | 
16 | ]
17 | urls = {Homepage = "https://github.com/boringdata/boring-catalog"}
18 |
19 | [project.scripts]
20 | ice = "boringcatalog.cli:cli"
21 |
22 | [project.optional-dependencies]
23 | test = [
24 | "pytest>=7.0.0",
25 | "pandas>=2.0.0",
26 | "pyarrow>=14.0.0"
27 | ]
28 |
29 | [build-system]
30 | requires = ["hatchling"]
31 | build-backend = "hatchling.build"
32 |
33 | [tool.pytest.ini_options]
34 | testpaths = ["tests"]
35 | python_files = ["test_*.py"]
36 | addopts = "-v --tb=short"
37 |
--------------------------------------------------------------------------------
/src/boringcatalog/__init__.py:
--------------------------------------------------------------------------------
1 | """A DuckDB-based Iceberg catalog implementation."""
2 |
3 | from .catalog import BoringCatalog
4 |
5 | __all__ = ["BoringCatalog"]
6 |
7 |
--------------------------------------------------------------------------------
/src/boringcatalog/catalog.py:
--------------------------------------------------------------------------------
1 | from typing import Dict, List, Optional, Set, Tuple, Union, Any
2 | import uuid
3 | import json
4 | import os
5 | import tempfile
6 | import fsspec
7 | import logging
8 | from pyiceberg.io import load_file_io
9 | from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
10 | from pyiceberg.schema import Schema
11 | from pyiceberg.serializers import FromInputFile
12 | from pyiceberg.table import CommitTableResponse, Table
13 | from pyiceberg.table.locations import load_location_provider
14 | from pyiceberg.table.metadata import new_table_metadata
15 | from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
16 | from pyiceberg.table.update import TableRequirement, TableUpdate
17 | from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
18 | from pyiceberg.types import strtobool
19 | from pyiceberg.catalog import (
20 | Catalog,
21 | MetastoreCatalog,
22 | METADATA_LOCATION,
23 | PREVIOUS_METADATA_LOCATION,
24 | TABLE_TYPE,
25 | ICEBERG,
26 | PropertiesUpdateSummary,
27 | )
28 | from pyiceberg.exceptions import (
29 | NamespaceAlreadyExistsError,
30 | NamespaceNotEmptyError,
31 | NoSuchNamespaceError,
32 | NoSuchTableError,
33 | TableAlreadyExistsError,
34 | NoSuchPropertyException,
35 | NoSuchIcebergTableError,
36 | CommitFailedException,
37 | )
38 |
39 |
40 | 
41 | # Set up logging
42 | logger = logging.getLogger(__name__)
43 |
44 | DEFAULT_INIT_CATALOG_TABLES = "true"
45 | DEFAULT_CATALOG_NAME = "boring"
46 | class ConcurrentModificationError(CommitFailedException):
47 | """Raised when a concurrent modification is detected."""
48 | pass
49 |
50 | class BoringCatalog(MetastoreCatalog):
51 | """A simple file-based Iceberg catalog implementation."""
52 |
53 | def __init__(self, name: str = None, **properties: str):
54 | # If name or properties are not provided, try to read them from .ice/index once
55 | index_path = os.path.join(os.getcwd(), ".ice/index")
56 | index = None
57 | if (name is None or not properties) and os.path.exists(index_path):
58 | with open(index_path, 'r') as f:
59 | index = json.load(f)
60 | if name is None:
61 | name = index.get("catalog_name", DEFAULT_CATALOG_NAME)
62 | if not properties:
63 | properties = index.get("properties", {})
64 | if name is None:
65 | name = DEFAULT_CATALOG_NAME
66 | super().__init__(name, **properties)
67 |
68 | if index is not None and "catalog_uri" in index:
69 | self.uri = index["catalog_uri"]
70 | self.properties = index["properties"]
71 | elif self.properties.get("uri"):
72 | self.uri = self.properties.get("uri")
73 | elif self.properties.get("warehouse"):
74 |             self.uri = os.path.join(self.properties.get("warehouse"), "catalog", f"catalog_{name}.json")
75 | else:
76 |             raise ValueError("Provide either a 'uri' or a 'warehouse' property to initialize BoringCatalog")
77 |
78 | # Always infer warehouse if missing and uri is set
79 | if self.uri and not self.properties.get("warehouse"):
80 | warehouse_path = os.path.dirname(self.uri)
81 | self.properties["warehouse"] = warehouse_path
82 | logging.info(f"No --warehouse specified for the catalog. Using catalog folder to store iceberg data: {warehouse_path}")
83 |
84 | init_catalog_tables = strtobool(self.properties.get("init_catalog_tables", DEFAULT_INIT_CATALOG_TABLES))
85 |
86 | if init_catalog_tables:
87 | self._ensure_tables_exist()
88 |
89 | @property
90 | def catalog(self):
91 | catalog, _ = self._read_catalog_json()
92 | return catalog
93 |
94 |     def latest_snapshot(self, table_identifier: str):
95 |         """Return the table's current snapshot, or None if it has no commits yet."""
96 |         table = self.load_table(table_identifier)
97 |         # Table.current_snapshot() resolves the latest snapshot from the table metadata
98 |         return table.current_snapshot()
99 | 
100 |
101 | def _ensure_tables_exist(self):
102 | """Ensure catalog directory and catalog.json exist."""
103 | try:
104 |
105 | io = load_file_io(properties=self.properties, location=self.uri)
106 |
107 | # Check if catalog file exists
108 | input_file = io.new_input(self.uri)
109 | if not input_file.exists():
110 | # Create initial catalog structure
111 | initial_catalog = {
112 | "catalog_name": self.name,
113 | "namespaces": {},
114 | "tables": {}
115 | }
116 |
117 | # Write the initial catalog file
118 | with io.new_output(self.uri).create(overwrite=True) as f:
119 | f.write(json.dumps(initial_catalog, indent=2).encode('utf-8'))
120 |
121 | except Exception as e:
122 | raise ValueError(f"Failed to initialize catalog at {self.uri}: {str(e)}")
123 |
124 | def _read_catalog_json(self):
125 | """Read catalog.json using FileIO, returning (data, etag)."""
126 | try:
127 | io = load_file_io(properties=self.properties, location=self.uri)
128 | input_file = io.new_input(self.uri)
129 |
130 | if not input_file.exists():
131 | return {"catalog_name": self.name, "namespaces": {}, "tables": {}}, None
132 |
133 | with input_file.open() as f:
134 | data = json.loads(f.read().decode('utf-8'))
135 |
136 | # Get metadata for ETag
137 | metadata = input_file.metadata() if hasattr(input_file, 'metadata') else {}
138 | etag = metadata.get("ETag")
139 | return data, etag
140 |
141 | except Exception as e:
142 | if 'No such file' in str(e) or 'not found' in str(e) or '404' in str(e):
143 | return {"catalog_name": self.name, "namespaces": {}, "tables": {}}, None
144 | raise
145 |
146 | def _write_catalog_json(self, data, etag=None):
147 | """Write catalog.json using FileIO, using ETag for concurrency if provided."""
148 | try:
149 | io = load_file_io(properties=self.properties, location=self.uri)
150 |
151 | # Create output file with ETag check if provided
152 | output_file = io.new_output(self.uri)
153 | if etag is not None and hasattr(output_file, 'set_metadata'):
154 | output_file.set_metadata({"if_match": etag})
155 |
156 | with output_file.create(overwrite=True) as f:
157 | f.write(json.dumps(data, indent=2).encode('utf-8'))
158 |
159 | except Exception as e:
160 | if 'PreconditionFailed' in str(e) or '412' in str(e):
161 | raise ConcurrentModificationError("catalog.json was modified concurrently")
162 | raise
163 |
164 | def _table_key(self, namespace: str, table_name: str) -> str:
165 | return f"{namespace}.{table_name}"
166 |
167 | def create_table(
168 | self,
169 | identifier: Union[str, Identifier],
170 | schema: Union[Schema, "pa.Schema"],
171 | location: Optional[str] = None,
172 | partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC,
173 | sort_order: SortOrder = UNSORTED_SORT_ORDER,
174 | properties: Properties = EMPTY_DICT,
175 | ) -> Table:
176 | """Create an Iceberg table."""
177 | schema: Schema = self._convert_schema_if_needed(schema) # type: ignore
178 | namespace_tuple = Catalog.namespace_from(identifier)
179 | namespace = Catalog.namespace_to_string(namespace_tuple)
180 | table_name = Catalog.table_name_from(identifier)
181 | table_key = self._table_key(namespace, table_name)
182 |
183 | data, etag = self._read_catalog_json()
184 | if namespace not in data["namespaces"]:
185 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace}")
186 | if table_key in data.get("tables", {}):
187 | raise TableAlreadyExistsError(f"Table {namespace}.{table_name} already exists")
188 |
189 | location = self._resolve_table_location(location, namespace, table_name)
190 | location_provider = load_location_provider(table_location=location, table_properties=properties)
191 | metadata_location = location_provider.new_table_metadata_file_location()
192 |
193 | metadata = new_table_metadata(
194 | location=location, schema=schema, partition_spec=partition_spec, sort_order=sort_order, properties=properties
195 | )
196 | io = load_file_io(properties=self.properties, location=metadata_location)
197 | self._write_metadata(metadata, io, metadata_location)
198 |
199 | # Add table entry to catalog.json
200 | if "tables" not in data:
201 | data["tables"] = {}
202 | data["tables"][table_key] = {
203 | "namespace": namespace,
204 | "name": table_name,
205 | "metadata_location": metadata_location
206 | }
207 |
208 | self._write_catalog_json(data, etag)
209 |
210 | return self.load_table(identifier)
211 |
212 | def load_table(self, identifier: Union[str, Identifier], catalog_name: str = None) -> Table:
213 | """Load the table's metadata and return the table instance using catalog.json."""
214 | namespace_tuple = Catalog.namespace_from(identifier)
215 | namespace = Catalog.namespace_to_string(namespace_tuple)
216 | table_name = Catalog.table_name_from(identifier)
217 | table_key = self._table_key(namespace, table_name)
218 | data, _ = self._read_catalog_json()
219 | table_entry = data.get("tables", {}).get(table_key)
220 | if not table_entry:
221 | raise NoSuchTableError(f"Table does not exist: {namespace}.{table_name}")
222 | metadata_location = table_entry["metadata_location"]
223 | io = load_file_io(properties=self.properties, location=metadata_location)
224 | file = io.new_input(metadata_location)
225 | metadata = FromInputFile.table_metadata(file)
226 | return Table(
227 | identifier=Catalog.identifier_to_tuple(namespace) + (table_name,),
228 | metadata=metadata,
229 | metadata_location=metadata_location,
230 | io=self._load_file_io(metadata.properties, metadata_location),
231 | catalog=self
232 | )
233 |
234 | def drop_table(self, identifier: Union[str, Identifier]) -> None:
235 | """Drop a table."""
236 | namespace_tuple = Catalog.namespace_from(identifier)
237 | namespace = Catalog.namespace_to_string(namespace_tuple)
238 | table_name = Catalog.table_name_from(identifier)
239 | table_key = self._table_key(namespace, table_name)
240 | data, etag = self._read_catalog_json()
241 | if table_key not in data.get("tables", {}):
242 | raise NoSuchTableError(f"Table does not exist: {namespace}.{table_name}")
243 | del data["tables"][table_key]
244 | self._write_catalog_json(data, etag)
245 |
246 | def rename_table(self, from_identifier: Union[str, Identifier], to_identifier: Union[str, Identifier]) -> Table:
247 | """Rename a table."""
248 | from_namespace_tuple = Catalog.namespace_from(from_identifier)
249 | from_namespace = Catalog.namespace_to_string(from_namespace_tuple)
250 | from_table_name = Catalog.table_name_from(from_identifier)
251 | from_table_key = self._table_key(from_namespace, from_table_name)
252 |
253 | to_namespace_tuple = Catalog.namespace_from(to_identifier)
254 | to_namespace = Catalog.namespace_to_string(to_namespace_tuple)
255 | to_table_name = Catalog.table_name_from(to_identifier)
256 | to_table_key = self._table_key(to_namespace, to_table_name)
257 |
258 | data, etag = self._read_catalog_json()
259 | if not self._namespace_exists(to_namespace):
260 | raise NoSuchNamespaceError(f"Namespace does not exist: {to_namespace}")
261 |
262 | if from_table_key not in data.get("tables", {}):
263 | raise NoSuchTableError(f"Table does not exist: {from_namespace}.{from_table_name}")
264 |
265 | if to_table_key in data.get("tables", {}):
266 | raise TableAlreadyExistsError(f"Table {to_namespace}.{to_table_name} already exists")
267 |
268 | table_entry = data["tables"][from_table_key]
269 | table_entry["namespace"] = to_namespace
270 | table_entry["name"] = to_table_name
271 | data["tables"][to_table_key] = table_entry
272 | del data["tables"][from_table_key]
273 |
274 | self._write_catalog_json(data, etag)
275 | return self.load_table(to_identifier)
276 |
277 | def create_namespace(self, namespace: Union[str, Identifier], properties: Properties = EMPTY_DICT) -> None:
278 | """Create a namespace in the catalog.json file."""
279 | namespace_str = Catalog.namespace_to_string(namespace)
280 | data, etag = self._read_catalog_json()
281 | if namespace_str in data["namespaces"]:
282 | raise NamespaceAlreadyExistsError(f"Namespace already exists: {namespace_str}")
283 | data["namespaces"][namespace_str] = {"properties": properties or {"exists": "true"}}
284 | self._write_catalog_json(data, etag)
285 |
286 | def drop_namespace(self, namespace: Union[str, Identifier]) -> None:
287 | """Drop a namespace from catalog.json."""
288 | namespace_str = Catalog.namespace_to_string(namespace)
289 | data, etag = self._read_catalog_json()
290 | if namespace_str not in data["namespaces"]:
291 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace_str}")
292 |         if any(tbl["namespace"] == namespace_str for tbl in data.get("tables", {}).values()):
293 | raise NamespaceNotEmptyError(f"Namespace {namespace_str} is not empty.")
294 | del data["namespaces"][namespace_str]
295 | self._write_catalog_json(data, etag)
296 |
297 | def list_tables(self, namespace: Union[str, Identifier]) -> List[Identifier]:
298 | """List tables under the given namespace from catalog.json."""
299 | namespace_str = Catalog.namespace_to_string(namespace)
300 | data, _ = self._read_catalog_json()
301 | if namespace_str and namespace_str not in data["namespaces"]:
302 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace_str}")
303 | return [
304 | Catalog.identifier_to_tuple(tbl["namespace"]) + (tbl["name"],)
305 | for tbl in data.get("tables", {}).values()
306 | if tbl["namespace"] == namespace_str
307 | ]
308 |
309 | def list_namespaces(self, namespace: Union[str, Identifier] = ()) -> List[Identifier]:
310 | """List namespaces from catalog.json."""
311 | data, _ = self._read_catalog_json()
312 | all_namespaces = list(data["namespaces"].keys())
313 | if not namespace:
314 | return [Catalog.identifier_to_tuple(ns) for ns in all_namespaces]
315 | ns_tuple = Catalog.identifier_to_tuple(namespace)
316 | ns_prefix = Catalog.namespace_to_string(namespace)
317 | # Only return direct children
318 | result = []
319 | for ns in all_namespaces:
320 | ns_parts = Catalog.identifier_to_tuple(ns)
321 | if ns_parts[:len(ns_tuple)] == ns_tuple and len(ns_parts) == len(ns_tuple) + 1:
322 | result.append(ns_parts)
323 | return result
324 |
325 | def load_namespace_properties(self, namespace: Union[str, Identifier]) -> Properties:
326 | """Get properties for a namespace from catalog.json."""
327 | namespace_str = Catalog.namespace_to_string(namespace)
328 | data, _ = self._read_catalog_json()
329 | if namespace_str not in data["namespaces"]:
330 | raise NoSuchNamespaceError(f"Namespace {namespace_str} does not exist")
331 | return data["namespaces"][namespace_str].get("properties", {})
332 |
333 | def _namespace_exists(self, namespace: Union[str, Identifier]) -> bool:
334 | """Check if a namespace exists in catalog.json."""
335 | namespace_str = Catalog.namespace_to_string(namespace)
336 | data, _ = self._read_catalog_json()
337 | return namespace_str in data["namespaces"]
338 |
339 | def _table_exists(self, identifier: Union[str, Identifier]) -> bool:
340 | """Check if a table exists in catalog.json."""
341 | namespace_tuple = Catalog.namespace_from(identifier)
342 | namespace = Catalog.namespace_to_string(namespace_tuple)
343 | table_name = Catalog.table_name_from(identifier)
344 | table_key = self._table_key(namespace, table_name)
345 | data, _ = self._read_catalog_json()
346 | return table_key in data.get("tables", {})
347 |
348 | def list_views(self, namespace: Union[str, Identifier]) -> List[Identifier]:
349 | return []
350 |
351 | def drop_view(self, identifier: Union[str, Identifier]) -> None:
352 | raise NotImplementedError("Views are not supported")
353 |
354 | def view_exists(self, identifier: Union[str, Identifier]) -> bool:
355 | return False
356 |
357 | def commit_table(
358 | self, table: Table, requirements: Tuple[TableRequirement, ...], updates: Tuple[TableUpdate, ...]
359 | ) -> CommitTableResponse:
360 | """Commit updates to a table."""
361 | table_identifier = table.name()
362 | namespace_tuple = Catalog.namespace_from(table_identifier)
363 | namespace = Catalog.namespace_to_string(namespace_tuple)
364 | table_name = Catalog.table_name_from(table_identifier)
365 |
366 | current_table: Optional[Table]
367 | try:
368 | current_table = self.load_table(table_identifier)
369 | except NoSuchTableError:
370 | current_table = None
371 |
372 | updated_staged_table = self._update_and_stage_table(current_table, table.name(), requirements, updates)
373 | if current_table and updated_staged_table.metadata == current_table.metadata:
374 | return CommitTableResponse(metadata=current_table.metadata, metadata_location=current_table.metadata_location)
375 |
376 | self._write_metadata(
377 | metadata=updated_staged_table.metadata,
378 | io=updated_staged_table.io,
379 | metadata_path=updated_staged_table.metadata_location,
380 | )
381 |
382 | try:
383 | data, etag = self._read_catalog_json()
384 | table_key = self._table_key(namespace, table_name)
385 |
386 | if current_table:
387 | if data["tables"][table_key]["metadata_location"] != current_table.metadata_location:
388 | raise CommitFailedException(f"Table has been updated by another process: {namespace}.{table_name}")
389 | data["tables"][table_key]["previous_metadata_location"] = current_table.metadata_location
390 | else:
391 | if table_key in data["tables"]:
392 | raise TableAlreadyExistsError(f"Table {namespace}.{table_name} already exists")
393 | data["tables"][table_key] = {
394 | "namespace": namespace,
395 | "name": table_name,
396 | "previous_metadata_location": None
397 | }
398 |
399 | data["tables"][table_key]["metadata_location"] = updated_staged_table.metadata_location
400 | self._write_catalog_json(data, etag)
401 |
402 | except Exception as e:
403 | try:
404 | updated_staged_table.io.delete(updated_staged_table.metadata_location)
405 | except Exception:
406 | pass
407 | raise e
408 |
409 | return CommitTableResponse(
410 | metadata=updated_staged_table.metadata,
411 | metadata_location=updated_staged_table.metadata_location
412 | )
413 |
414 | def register_table(self, identifier: Union[str, Identifier], metadata_location: str) -> Table:
415 | """Register a new table using existing metadata."""
416 | namespace_tuple = Catalog.namespace_from(identifier)
417 | namespace = Catalog.namespace_to_string(namespace_tuple)
418 | table_name = Catalog.table_name_from(identifier)
419 | table_key = self._table_key(namespace, table_name)
420 |
421 | if not self._namespace_exists(namespace):
422 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace}")
423 |
424 | data, etag = self._read_catalog_json()
425 | if table_key in data.get("tables", {}):
426 | raise TableAlreadyExistsError(f"Table {namespace}.{table_name} already exists")
427 |
428 | data["tables"][table_key] = {
429 | "namespace": namespace,
430 | "name": table_name,
431 | "metadata_location": metadata_location,
432 | "previous_metadata_location": None
433 | }
434 | self._write_catalog_json(data, etag)
435 |
436 | return self.load_table(identifier)
437 |
438 | def update_namespace_properties(
439 | self, namespace: Union[str, Identifier], removals: Optional[Set[str]] = None, updates: Properties = EMPTY_DICT
440 | ) -> PropertiesUpdateSummary:
441 | """Remove provided property keys and update properties for a namespace in catalog.json."""
442 | namespace_str = Catalog.namespace_to_string(namespace)
443 | data, etag = self._read_catalog_json()
444 | if namespace_str not in data["namespaces"]:
445 | raise NoSuchNamespaceError(f"Namespace {namespace_str} does not exist")
446 |         current_properties = data["namespaces"][namespace_str].get("properties", {})
447 |         removed = [key for key in (removals or set()) if key in current_properties]
448 |         missing = [key for key in (removals or set()) if key not in current_properties]
449 |         for key in removed:
450 |             current_properties.pop(key)
451 |         if updates:
452 |             current_properties.update(updates)
453 |         data["namespaces"][namespace_str]["properties"] = current_properties
454 |         self._write_catalog_json(data, etag)
455 |         return PropertiesUpdateSummary(removed=removed, updated=list(updates.keys()), missing=missing)
456 |
--------------------------------------------------------------------------------
/src/boringcatalog/cli.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import click
4 | import duckdb
5 | import subprocess
6 | import tempfile
7 | import logging
8 | from string import Template
9 | from .catalog import BoringCatalog
10 | import pyarrow.parquet as pq
11 | import datetime
12 | # Configure logging to display in CLI
13 | logging.basicConfig(
14 | format='%(message)s',
15 | level=logging.INFO
16 | )
17 |
18 | DEFAULT_NAMESPACE = "ice_default"
19 | DEFAULT_CATALOG_NAME = "boring"
20 | # Silence pyiceberg logs
21 | logging.getLogger('pyiceberg').setLevel(logging.WARNING)
22 |
23 | def ensure_ice_dir():
24 | """Ensure .ice directory exists and return its path."""
25 | ice_dir = os.path.abspath('.ice')
26 | os.makedirs(ice_dir, exist_ok=True)
27 | return ice_dir
28 |
29 | def load_index():
30 | """Load configuration from .ice/index if it exists."""
31 | index_path = os.path.join(ensure_ice_dir(), 'index')
32 | try:
33 | with open(index_path, 'r') as f:
34 | return json.load(f)
35 | except (FileNotFoundError, json.JSONDecodeError):
36 | return None
37 |
38 | def save_index(properties, catalog_uri=None, catalog_name=None):
39 | """Save configuration to .ice/index with separate catalog_uri, catalog_name, and properties sections."""
40 | config = {
41 | "catalog_uri": catalog_uri,
42 | "catalog_name": catalog_name,
43 | "properties": properties
44 | }
45 | index_path = os.path.join(ensure_ice_dir(), 'index')
46 | with open(index_path, 'w') as f:
47 | json.dump(config, f, indent=2)
48 |
49 | def get_catalog():
50 | """Get catalog instance from stored configuration."""
51 | config = load_index()
52 | if not config:
53 | raise click.ClickException(
54 | "No catalog configuration found. Run 'ice init' first."
55 | )
56 |
57 | properties = config.get("properties", {})
58 | if config.get("catalog_uri"):
59 | properties["uri"] = config["catalog_uri"]
60 | # Use catalog_name from top-level field if present, else default
61 |     catalog_name = config.get("catalog_name") or DEFAULT_CATALOG_NAME
62 | return BoringCatalog(catalog_name, **properties)
63 |
64 | def print_version(ctx, param, value):
65 | if not value or ctx.resilient_parsing:
66 | return
67 |     click.echo('Boring Catalog version 0.4.0')
68 | ctx.exit()
69 |
70 | def get_sql_template():
71 | """Read the SQL template file."""
72 | template_path = os.path.join(os.path.dirname(__file__), 'duckdb_init.sql')
73 | with open(template_path, 'r') as f:
74 | return Template(f.read())
75 |
76 | def print_table_log(catalog, table_identifier, label=None):
77 | """Print the log (snapshots) for a given table identifier."""
78 | if not catalog._table_exists(table_identifier):
79 | return False
80 | if label:
81 | click.echo(label)
82 | table = catalog.load_table(table_identifier)
83 | snapshots = sorted(table.snapshots(), key=lambda x: x.timestamp_ms, reverse=True)
84 | if not snapshots:
85 | click.echo(f"No snapshots found for table {table_identifier}.")
86 | return False
87 | for snap in snapshots:
88 | click.echo(f"commit {snap.snapshot_id:<20}")
89 |         ts = datetime.datetime.fromtimestamp(int(snap.timestamp_ms) / 1000, tz=datetime.timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC')
90 | click.echo(f" Table: {table_identifier:<25}")
91 | click.echo(f" Date: {ts:<25}")
92 | click.echo(f" Operation: {str(snap.summary.operation):<15}")
93 |             click.echo("  Summary:")
94 | summary = snap.summary.additional_properties
95 | max_key_len = max(len(str(k)) for k in summary.keys()) if summary else 0
96 | for k, v in summary.items():
97 | click.echo(f" {k.ljust(max_key_len)} : {v}")
98 |         click.echo(" ")
99 | return True
100 |
101 | @click.group(invoke_without_command=True)
102 | @click.option('--version', is_flag=True, callback=print_version, expose_value=False, is_eager=True, help='Show version and exit')
103 | @click.pass_context
104 | def cli(ctx):
105 | """Boring Catalog CLI tool.
106 |
107 | Run 'ice COMMAND --help' for more information on a command.
108 | """
109 | # Show help if no command is provided
110 | if ctx.invoked_subcommand is None:
111 | click.echo(ctx.get_help())
112 | ctx.exit()
113 |
114 | @cli.command()
115 | @click.option('--catalog', help='Custom location for catalog.json (default: warehouse/catalog/catalog_<catalog-name>.json)')
116 | @click.option('--property', '-p', multiple=True, help='Properties in the format key=value')
117 | @click.option('--catalog-name', default=DEFAULT_CATALOG_NAME, show_default=True, help='Name of the catalog (used in file naming and metadata)')
118 | def init(catalog, property, catalog_name):
119 | """Initialize a new Boring Catalog."""
120 |
121 | try:
122 | properties = {}
123 | for prop in property:
124 | try:
125 | key, value = prop.split('=', 1)
126 | properties[key.strip()] = value.strip()
127 | except ValueError:
128 | raise click.ClickException(f"Invalid property format: {prop}. Use key=value format")
129 |
130 |         if not catalog and "warehouse" not in properties:
131 | catalog = f"warehouse/catalog/catalog_{catalog_name}.json"
132 | properties["warehouse"] = "warehouse"
133 |
134 | elif not catalog and "warehouse" in properties:
135 | catalog = f"{properties['warehouse']}/catalog/catalog_{catalog_name}.json"
136 |
137 | # Do NOT save catalog_name in properties anymore
138 | save_index(properties, catalog, catalog_name)
139 |
140 | properties["uri"] = catalog
141 | catalog_instance = BoringCatalog(catalog_name, **properties)
142 |
143 | # Display information in specific order
144 | click.echo(f"Initialized Boring Catalog in {os.path.join('.ice', 'index')}")
145 | click.echo(f"Catalog location: {catalog}")
146 | if "warehouse" in properties:
147 | click.echo(f"Warehouse location: {properties['warehouse']}")
148 | click.echo(f"Catalog name: {catalog_name}")
149 |
150 | except Exception as e:
151 | click.echo(f"Error initializing catalog: {str(e)}", err=True)
152 | raise click.Abort()
153 |
154 | @cli.command(name='list-namespaces')
155 | @click.argument('parent', required=False)
156 | def list_namespaces(parent):
157 | """List all namespaces or child namespaces of PARENT."""
158 | try:
159 | catalog = get_catalog()
160 | namespaces = catalog.list_namespaces(parent if parent else ())
161 |
162 | if not namespaces:
163 | click.echo("No namespaces found.")
164 | return
165 |
166 | click.echo("Namespaces:")
167 | for ns in namespaces:
168 | click.echo(f" {'.'.join(ns)}")
169 | except Exception as e:
170 | click.echo(f"Error listing namespaces: {str(e)}", err=True)
171 | raise click.Abort()
172 |
173 | @cli.command(name='list-tables')
174 | @click.argument('namespace', required=False)
175 | def list_tables(namespace):
176 | """List all tables in the specified NAMESPACE, or all tables in all namespaces if not specified."""
177 | try:
178 | catalog = get_catalog()
179 |
180 | if namespace:
181 | tables = catalog.list_tables(namespace)
182 | if not tables:
183 | click.echo(f"No tables found in namespace '{namespace}'.")
184 | return
185 | click.echo(f"Tables in namespace '{namespace}':")
186 | for table in tables:
187 | table_name = table[-1]
188 | click.echo(f" {table_name}")
189 | else:
190 | namespaces = catalog.list_namespaces()
191 | found_any = False
192 | for ns_tuple in namespaces:
193 | ns = ".".join(ns_tuple)
194 | tables = catalog.list_tables(ns)
195 | if tables:
196 | found_any = True
197 | click.echo(f"Tables in namespace '{ns}':")
198 | for table in tables:
199 | table_name = table[-1]
200 | click.echo(f" {table_name}")
201 | if not found_any:
202 | click.echo("No tables found in any namespace.")
203 | except Exception as e:
204 | click.echo(f"Error listing tables: {str(e)}", err=True)
205 | raise click.Abort()
206 |
207 | @cli.command(context_settings=dict(
208 | ignore_unknown_options=True,
209 | allow_extra_args=True,
210 | ))
211 | @click.option('--catalog-path', help='Optional path to a catalog.json')
212 | @click.argument('duckdb_args', nargs=-1)
213 | def duck(catalog_path=None, duckdb_args=()):
214 | """Open DuckDB CLI with catalog configuration. Optionally provide a path to a catalog.json. Extra arguments are passed to DuckDB CLI."""
215 | try:
216 | if catalog_path:
217 | properties = {"uri": os.path.abspath(catalog_path)}
218 | catalog = BoringCatalog(DEFAULT_CATALOG_NAME, **properties)
219 | else:
220 | config = load_index()
221 | if not config:
222 | raise click.ClickException(
223 | "No catalog configuration found. Run 'ice init' first."
224 | )
225 | catalog = get_catalog()
226 |
227 | if len(catalog.list_namespaces()) == 0:
228 | raise click.ClickException("No namespaces found in catalog. Run 'ice create-namespace' to create a namespace.")
229 |
230 | if len(catalog.catalog.get("tables", {}).keys()) == 0:
231 | raise click.ClickException("No tables found in catalog. Run 'ice commit' to create a table.")
232 |
233 | # Get SQL template and substitute variables
234 | template_str = get_sql_template().template
235 | # Add S3 configuration at the beginning of the script
236 |         if catalog.uri.startswith("s3"):
237 | s3_config = (
238 | ".mode list\n"
239 | ".header off\n"
240 | "SELECT 'boring-catalog: Loading s3 secrets...' ;\n"
241 | ".mode line\n"
242 | "CREATE OR REPLACE SECRET secret (TYPE s3, PROVIDER credential_chain);\n"
243 | )
244 | # Insert the S3 configuration right after the first comment line
245 | lines = template_str.split('\n')
246 | template_str = lines[0] + '\n' + s3_config + '\n'.join(lines[1:])
247 | template = Template(template_str)
248 |
249 | sql = template.substitute(CATALOG_JSON=catalog.uri)
250 |
251 |         # Write the SQL to a temporary file
252 |         with tempfile.NamedTemporaryFile(mode='w', suffix='.sql', delete=False) as f:
253 |             f.write(sql)
254 |
255 |         try:
256 |             # Start DuckDB CLI with the initialization script and extra args
257 |             cmd = ['duckdb', '--init', f.name] + list(duckdb_args)
258 |             subprocess.run(cmd)
259 |         finally:
260 |             # Clean up the temporary init script even if DuckDB exits with an error
261 |             os.unlink(f.name)
262 |
263 | except Exception as e:
264 | click.echo(f"Error starting DuckDB CLI: {str(e)}", err=True)
265 | raise click.Abort()
266 |
267 | @cli.command(name='create-namespace')
268 | @click.argument('namespace', required=True)
269 | @click.option('--property', '-p', multiple=True, help='Properties in the format key=value')
270 | def create_namespace(namespace, property):
271 | """Create a new namespace in the catalog.
272 |
273 | NAMESPACE is the name of the namespace to create (e.g. 'my_namespace' or 'parent.child')
274 | """
275 | try:
276 | catalog = get_catalog()
277 |
278 | # Parse properties if provided
279 | properties = {}
280 | for prop in property:
281 | try:
282 | key, value = prop.split('=', 1)
283 | properties[key.strip()] = value.strip()
284 | except ValueError:
285 | raise click.ClickException(f"Invalid property format: {prop}. Use key=value format")
286 |
287 | # Create the namespace
288 | catalog.create_namespace(namespace, properties)
289 | click.echo(f"Created namespace: {namespace}")
290 | if properties:
291 | click.echo("Properties:")
292 | for key, value in properties.items():
293 | click.echo(f" {key}: {value}")
294 |
295 | except Exception as e:
296 | click.echo(f"Error creating namespace: {str(e)}", err=True)
297 | raise click.Abort()
298 |
299 | # Add a utility function to resolve table identifier with namespace logic
300 |
301 | def resolve_table_identifier_with_namespace(catalog, table_identifier):
302 | """Resolve table identifier to include namespace, creating default if needed (like in commit)."""
303 | if len(table_identifier.split(".")) == 1:
304 | namespaces = catalog.list_namespaces()
305 | if len(namespaces) == 0:
306 | click.echo(f"No namespace found, creating and using default namespace: {DEFAULT_NAMESPACE}")
307 | namespace = DEFAULT_NAMESPACE
308 | catalog.create_namespace(namespace)
309 | elif len(namespaces) == 1:
310 | namespace = namespaces[0][0]
311 | else:
312 | raise click.ClickException("No namespace specified. Please specify a namespace for the table.")
313 | table_identifier = f"{namespace}.{table_identifier}"
314 | return table_identifier
315 |
316 | @cli.command(name='commit')
317 | @click.argument('table_identifier', required=True)
318 | @click.option('--source', required=True, help='Parquet file URI to commit as a new snapshot')
319 | @click.option('--mode', default='append', help='Write mode for the commit', type=click.Choice(['append', 'overwrite']))
320 | def commit(table_identifier, source, mode):
321 | """Commit a new snapshot to a table from a Parquet file."""
322 | try:
323 | catalog = get_catalog()
324 | table_identifier = resolve_table_identifier_with_namespace(catalog, table_identifier)
325 | df = pq.read_table(source)
326 | if not catalog.table_exists(table_identifier):
327 | click.echo(f"Table {table_identifier} does not exist in the catalog. Creating it now...")
328 | catalog.create_table(table_identifier, schema=df.schema)
329 | table = catalog.load_table(table_identifier)
330 | if mode == "append":
331 | table.append(df)
332 | elif mode == "overwrite":
333 | table.overwrite(df)
334 | else:
335 | raise click.ClickException(f"Invalid mode: {mode}. Use 'append' or 'overwrite'.")
336 | click.echo(f"Committed {source} to table {table_identifier}")
337 | except Exception as e:
338 | click.echo(f"Error committing file to table: {str(e)}", err=True)
339 | raise click.Abort()
340 |
341 | @cli.command(name='log')
342 | @click.argument('table_identifier', required=False)
343 | def log_snapshots(table_identifier):
344 |     """Print the snapshot log for a table, or for every table in the default namespace if no table is given."""
345 | try:
346 | catalog = get_catalog()
347 | if not table_identifier:
348 | # Default to the default namespace, create if needed
349 | namespaces = catalog.list_namespaces()
350 | if not any(ns[0] == DEFAULT_NAMESPACE for ns in namespaces):
351 | click.echo(f"No namespace found, creating and using default namespace: {DEFAULT_NAMESPACE}")
352 | catalog.create_namespace(DEFAULT_NAMESPACE)
353 | tables = catalog.list_tables(DEFAULT_NAMESPACE)
354 | if not tables:
355 | click.echo(f"No tables found in default namespace '{DEFAULT_NAMESPACE}'.")
356 | return
357 | found_any = False
358 | for table in tables:
359 | table_identifier_full = f"{DEFAULT_NAMESPACE}.{table[-1]}"
360 | if print_table_log(catalog, table_identifier_full, label=f"=== Log for table: {table_identifier_full} ==="):
361 | found_any = True
362 | if not found_any:
363 | click.echo("No snapshots found for any table in the default namespace.")
364 | return
365 | # If a table_identifier is provided, resolve it as in commit
366 | table_identifier = resolve_table_identifier_with_namespace(catalog, table_identifier)
367 | if not catalog._table_exists(table_identifier):
368 | raise click.ClickException(f"Table {table_identifier} does not exist in the catalog.")
369 | print_table_log(catalog, table_identifier)
370 | except Exception as e:
371 | click.echo(f"Error loading catalog or snapshots: {str(e)}", err=True)
372 | raise click.Abort()
373 |
374 | @cli.command(name='catalog')
375 | def print_catalog():
376 | """Print the current catalog.json as JSON."""
377 | try:
378 | catalog = get_catalog()
379 | catalog_json = catalog.catalog
380 | click.echo(json.dumps(catalog_json, indent=2))
381 | except Exception as e:
382 | click.echo(f"Error printing catalog: {str(e)}", err=True)
383 | raise click.Abort()
384 |
385 | if __name__ == '__main__':
386 | cli()
--------------------------------------------------------------------------------
/src/boringcatalog/duckdb_init.sql:
--------------------------------------------------------------------------------
1 | -- Initialization script variables
2 | SET VARIABLE catalog_json = '${CATALOG_JSON}';
3 | SET VARIABLE tmp_file = '/tmp/iceberg_init.sql';
4 |
5 | .mode list
6 | .header off
7 | SELECT 'boring-catalog: Loading extensions...';
8 | INSTALL iceberg;
9 | LOAD iceberg;
10 |
11 | SELECT 'boring-catalog: Init schemas and tables...' ;
12 | -- Create schemas
13 | CREATE SCHEMA IF NOT EXISTS catalog;
25 | CREATE OR REPLACE TABLE catalog.namespaces AS
26 | SELECT
27 | namespace,
28 | unnest(properties.properties)
29 | FROM (
30 | UNPIVOT (
31 | SELECT
32 | unnest(namespaces)
33 | FROM read_json(getvariable('catalog_json'))
34 | ) ON COLUMNS(*) INTO name namespace value properties
35 | );
36 | CREATE OR REPLACE TABLE catalog.tables AS
37 | SELECT
38 | properties.namespace as namespace,
39 | table_name as table_name,
40 | unnest(properties)
41 | FROM (
42 | UNPIVOT (
43 | SELECT unnest(tables)
44 | FROM read_json(getvariable('catalog_json'))
45 | ) ON COLUMNS(*) INTO name table_name value properties
46 | );
47 |
48 |
49 | .mode list
50 | .header off
51 | .once getvariable("tmp_file")
52 | select 'CREATE SCHEMA IF NOT EXISTS ' || i || ';'
53 | from (select namespace from catalog.namespaces) x(i);
54 | .read getvariable("tmp_file")
55 |
56 |
57 | .mode list
58 | .header off
59 | .once getvariable("tmp_file")
60 | select 'CREATE OR REPLACE VIEW ' || j || ' AS SELECT * FROM iceberg_scan(''' || k || ''');'
61 | from (select table_name, metadata_location from catalog.tables) x(j,k);
62 | .read getvariable("tmp_file")
63 |
64 | .mode list
65 | .header off
66 | .once getvariable("tmp_file")
67 | select 'CREATE TABLE ' || j || '_metadata AS SELECT * FROM iceberg_metadata(''' || k || ''');'
68 | from (select table_name, metadata_location from catalog.tables) x(j,k);
69 | .read getvariable("tmp_file")
70 |
71 | .mode list
72 | .header off
73 | .once getvariable("tmp_file")
74 | select 'CREATE TABLE ' || j || '_snapshots AS SELECT unnest(snapshots, recursive:=true) from read_json(''' || k || ''');'
75 | from (select table_name, metadata_location from catalog.tables) x(j,k);
76 |
77 | .read getvariable("tmp_file")
78 |
79 | SELECT '' ;
80 | SELECT 'Everything is ready! ' ;
81 | SELECT '' ;
82 | SELECT 'Here are some commands to help you get started:' ;
83 | SELECT ' > show; -- show all tables' ;
84 | SELECT ' > select * from catalog.namespaces; -- list namespaces' ;
85 | SELECT ' > select * from catalog.tables; -- list tables' ;
86 | SELECT '   > select * from <namespace>.<table>;   -- query iceberg table' ;
87 |
88 | SELECT '' ;
89 |
90 | .mode duckbox
91 | .prompt 'ice ➜ '
--------------------------------------------------------------------------------
/tests/test_catalog.py:
--------------------------------------------------------------------------------
1 | # (Move file from src/boringcatalog/test_catalog.py to tests/test_catalog.py)
2 | import os
3 | import subprocess
4 | import sys
5 | import pytest
6 | import pyarrow as pa
7 | import pyarrow.parquet as pq
8 | import pandas as pd
9 | import json
10 | from boringcatalog import BoringCatalog
11 | import shutil
12 | import logging
13 |
14 | @pytest.fixture(scope="function")
15 | def tmp_catalog_dir(tmp_path):
16 | return tmp_path
17 |
18 | @pytest.fixture(scope="function")
19 | def dummy_parquet(tmp_path):
20 | # Create a small dummy parquet file
21 | df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
22 | table = pa.Table.from_pandas(df)
23 | parquet_path = tmp_path / "dummy.parquet"
24 | pq.write_table(table, parquet_path)
25 | return parquet_path
26 |
27 | def run_cli(args, cwd):
28 | cmd = [sys.executable, "-m", "boringcatalog.cli"] + args
29 | result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
30 | print("STDOUT:\n", result.stdout)
31 | print("STDERR:\n", result.stderr)
32 | return result
33 |
34 | @pytest.mark.parametrize("args,expected_index,do_workflow", [
35 | # ice init (no args)
36 | ([], {
37 | "catalog_uri": "warehouse/catalog/catalog_boring.json",
38 | "properties": {"warehouse": "warehouse"}
39 | }, True),
40 |     # ice init -p warehouse=warehouse3
41 | (["-p", "warehouse=warehouse3"], {
42 | "catalog_uri": "warehouse3/catalog/catalog_boring.json",
43 | "properties": {"warehouse": "warehouse3"}
44 | }, True),
45 | # ice init --catalog warehouse2/catalog_boring.json
46 | (["--catalog", "warehouse2/catalog_boring.json"], {
47 | "catalog_uri": "warehouse2/catalog_boring.json",
48 | "properties": {}
49 | }, False),
50 | # ice init --catalog tt/catalog.json -p warehouse=warehouse4
51 | (["--catalog", "tt/catalog.json", "-p", "warehouse=warehouse4"], {
52 | "catalog_uri": "tt/catalog.json",
53 | "properties": {"warehouse": "warehouse4"}
54 | }, False),
55 | # ice init -p warehouse=tttrr
56 | (["-p", "warehouse=tttrr"], {
57 | "catalog_uri": "tttrr/catalog/catalog_boring.json",
58 | "properties": {"warehouse": "tttrr"}
59 | }, False),
60 | ])
61 | def test_ice_init_variants(tmp_path, args, expected_index, do_workflow, caplog):
62 | # Clean up .ice if it exists
63 | ice_dir = tmp_path / ".ice"
64 | if ice_dir.exists():
65 | shutil.rmtree(ice_dir)
66 | # If warehouse is needed, create it
67 | warehouse = expected_index["properties"].get("warehouse")
68 | if warehouse:
69 | warehouse_dir = tmp_path / warehouse
70 | warehouse_dir.mkdir(parents=True, exist_ok=True)
71 | # Run CLI
72 | result = run_cli(["init"] + args, cwd=tmp_path)
73 | assert result.returncode == 0
74 | index_path = tmp_path / ".ice" / "index"
75 | assert index_path.exists(), f".ice/index not created for args {args}"
76 | # Check content
77 | with open(index_path) as f:
78 | index = json.load(f)
79 | assert index["catalog_uri"] == expected_index["catalog_uri"], f"catalog_uri mismatch for args {args}"
80 | # Only check properties equality if warehouse is specified in expected_index
81 | if expected_index["properties"]:
82 | assert index["properties"] == expected_index["properties"], f"properties mismatch for args {args}"
83 | # Check BoringCatalog usage
84 | os.chdir(tmp_path)
85 | caplog.set_level(logging.INFO)
86 | catalog = BoringCatalog()
87 | # If warehouse is not specified, it should default to the catalog folder
88 | if not expected_index["properties"].get("warehouse"):
89 | expected_warehouse = str(os.path.dirname(index["catalog_uri"]))
90 | assert catalog.properties["warehouse"] == expected_warehouse
91 | assert f"Using catalog folder to store iceberg data: {expected_warehouse}" in caplog.text
92 | namespaces = catalog.list_namespaces()
93 | assert isinstance(namespaces, list)
94 | # If do_workflow, run commit, log, catalog commands
95 | if do_workflow:
96 | # Create dummy parquet
97 | df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
98 | table = pa.Table.from_pandas(df)
99 | parquet_path = tmp_path / "dummy.parquet"
100 | pq.write_table(table, parquet_path)
101 | # ice commit my_table --source dummy.parquet
102 | result = run_cli([
103 | "commit", "my_table", "--source", str(parquet_path)
104 | ], cwd=tmp_path)
105 | assert result.returncode == 0
106 | assert "Committed" in result.stdout
107 | # ice log my_table
108 | result = run_cli(["log", "my_table"], cwd=tmp_path)
109 | assert result.returncode == 0
110 | assert "commit" in result.stdout
111 | # ice catalog
112 | result = run_cli(["catalog"], cwd=tmp_path)
113 | assert result.returncode == 0
114 | assert '"tables"' in result.stdout
115 |
116 |     # ice duck (just check it starts; don't wait for interactive input)
117 |     # We skip actually running duckdb interactively in CI
118 |     # result = run_cli(["duck"], cwd=tmp_path)
119 |     # assert result.returncode == 0
120 |
121 | def test_custom_catalog_name(tmp_path):
122 | # Clean up .ice if it exists
123 | ice_dir = tmp_path / ".ice"
124 | if ice_dir.exists():
125 | shutil.rmtree(ice_dir)
126 | warehouse_dir = tmp_path / "customwarehouse"
127 | warehouse_dir.mkdir(parents=True, exist_ok=True)
128 | custom_name = "mycat"
129 | # Run CLI with custom catalog name
130 | result = run_cli([
131 | "init", "-p", "warehouse=customwarehouse", "--catalog-name", custom_name
132 | ], cwd=tmp_path)
133 | assert result.returncode == 0
134 | index_path = tmp_path / ".ice" / "index"
135 | assert index_path.exists(), ".ice/index not created for custom catalog name"
136 | with open(index_path) as f:
137 | index = json.load(f)
138 | assert index["catalog_uri"].endswith(f"catalog_{custom_name}.json")
139 | assert index["catalog_name"] == custom_name
140 | # Check BoringCatalog instance uses the custom name
141 | os.chdir(tmp_path)
142 | catalog = BoringCatalog()
143 | assert catalog.name == custom_name
144 | # Check catalog.json content
145 | with open(index["catalog_uri"]) as f:
146 | catalog_json = json.load(f)
147 | assert catalog_json["catalog_name"] == custom_name
148 |
149 | def test_custom_catalog_file_path(tmp_path):
150 | # Test initializing with a fully custom catalog file path
151 | ice_dir = tmp_path / ".ice"
152 | if ice_dir.exists():
153 | shutil.rmtree(ice_dir)
154 | custom_catalog_path = tmp_path / "mydir" / "mycustom.json"
155 | custom_catalog_path.parent.mkdir(parents=True, exist_ok=True)
156 | result = run_cli([
157 | "init", "--catalog", str(custom_catalog_path), "--catalog-name", "specialcat"
158 | ], cwd=tmp_path)
159 | assert result.returncode == 0
160 | index_path = tmp_path / ".ice" / "index"
161 | assert index_path.exists(), ".ice/index not created for custom catalog path"
162 | with open(index_path) as f:
163 | index = json.load(f)
164 | assert index["catalog_uri"] == str(custom_catalog_path)
165 | assert index["catalog_name"] == "specialcat"
166 | assert custom_catalog_path.exists(), "Custom catalog file was not created"
167 | # Check BoringCatalog loads from this path
168 | os.chdir(tmp_path)
169 | catalog = BoringCatalog()
170 | assert catalog.name == "specialcat"
171 | assert catalog.uri == str(custom_catalog_path)
172 |
173 |
174 | def test_reinit_overwrite_behavior(tmp_path):
175 | # Test running ice init twice in the same directory
176 | ice_dir = tmp_path / ".ice"
177 | if ice_dir.exists():
178 | shutil.rmtree(ice_dir)
179 | warehouse_dir = tmp_path / "warehouse"
180 | warehouse_dir.mkdir(parents=True, exist_ok=True)
181 | # First init
182 | result1 = run_cli(["init", "-p", "warehouse=warehouse"], cwd=tmp_path)
183 | assert result1.returncode == 0
184 | index_path = tmp_path / ".ice" / "index"
185 | assert index_path.exists()
186 | with open(index_path) as f:
187 | index1 = json.load(f)
188 | # Second init (should overwrite or succeed)
189 | result2 = run_cli(["init", "-p", "warehouse=warehouse"], cwd=tmp_path)
190 | # Accept both overwrite and success (should not crash)
191 | assert result2.returncode == 0
192 | with open(index_path) as f:
193 | index2 = json.load(f)
194 | # The index file should still be valid and point to the same warehouse
195 | assert index2["properties"]["warehouse"] == "warehouse"
196 |
197 |
198 | def test_manual_index_loading(tmp_path):
199 | # Test loading a catalog from a manually created .ice/index file
200 | ice_dir = tmp_path / ".ice"
201 | ice_dir.mkdir(exist_ok=True)
202 | custom_catalog_path = tmp_path / "manualcat.json"
203 | # Write a minimal catalog file
204 | with open(custom_catalog_path, "w") as f:
205 | json.dump({"catalog_name": "manualcat", "namespaces": {}, "tables": {}}, f)
206 | # Write a custom .ice/index
207 | index = {
208 | "catalog_uri": str(custom_catalog_path),
209 | "catalog_name": "manualcat",
210 | "properties": {"warehouse": "manualwarehouse"}
211 | }
212 | with open(ice_dir / "index", "w") as f:
213 | json.dump(index, f)
214 | os.chdir(tmp_path)
215 | catalog = BoringCatalog()
216 | assert catalog.name == "manualcat"
217 | assert catalog.uri == str(custom_catalog_path)
218 | assert catalog.properties["warehouse"] == "manualwarehouse"
219 | # Now test missing catalog_name (should default to 'boring')
220 | index2 = {
221 | "catalog_uri": str(custom_catalog_path),
222 | "properties": {"warehouse": "manualwarehouse"}
223 | }
224 | with open(ice_dir / "index", "w") as f:
225 | json.dump(index2, f)
226 | catalog2 = BoringCatalog()
227 | assert catalog2.name == "boring"
228 | assert catalog2.uri == str(custom_catalog_path)
229 | assert catalog2.properties["warehouse"] == "manualwarehouse"
230 |
--------------------------------------------------------------------------------