├── .gitignore
├── .python-version
├── README.md
├── docs
│   └── boringdata.png
├── pyproject.toml
├── src
│   └── boringcatalog
│       ├── __init__.py
│       ├── catalog.py
│       ├── cli.py
│       └── duckdb_init.sql
├── tests
│   └── test_catalog.py
└── uv.lock
/.gitignore:
--------------------------------------------------------------------------------
1 | # Python-generated files
2 | __pycache__/
3 | *.py[oc]
4 | build/
5 | dist/
6 | wheels/
7 | *.egg-info
8 | .env
9 | # Virtual environments
10 | .venv
11 |
--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------
1 | 3.10
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | **[boringdata.io](https://boringdata.io) — Kickstart your Iceberg journey with our data stack templates.**
2 |
3 |
4 |
5 | ----
6 | # Boring Catalog
7 |
8 | A lightweight, file-based Iceberg catalog implementation using a single JSON file (e.g., on S3, local disk, or any fsspec-compatible storage).
9 |
10 | ## Why Boring Catalog?
11 | - No need to host or maintain a dedicated catalog service
12 | - Easy to use, easy to understand, perfect to get started with Iceberg
13 | - A DuckDB CLI interface to easily explore your Iceberg tables and metadata
14 |
15 | ## How It Works
16 | Boring Catalog stores all Iceberg catalog state in a single JSON file:
17 | - Namespaces and tables are tracked in this file
18 | - S3 conditional writes prevent conflicting concurrent modifications when the catalog is stored on S3
19 | - The `.ice/index` file in your project directory stores the configuration for your catalog, including:
20 | - `catalog_uri`: the path to your catalog JSON file
21 | - `catalog_name`: the logical name of your catalog
22 | - `properties`: additional properties (e.g., warehouse location)
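A freshly initialized `.ice/index` looks roughly like this (paths shown are the local defaults):

```json
{
  "catalog_uri": "warehouse/catalog/catalog_boring.json",
  "catalog_name": "boring",
  "properties": {
    "warehouse": "warehouse"
  }
}
```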
23 |
24 | ## Installation
25 | ```bash
26 | pip install boringcatalog
27 | ```
28 |
29 | ## Quickstart
30 |
31 | ### Initialize a Catalog
32 | ```bash
33 | ice init
34 | ```
35 |
36 | That's it! Your catalog is now ready to use.
37 | 
38 | Two files are created:
39 | - `warehouse/catalog/catalog_boring.json` = catalog file
40 | - `.ice/index` = points to the catalog location (similar to a git index file, but for Iceberg)
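The catalog file itself starts out as an empty structure that Boring Catalog fills in as you create namespaces and tables; a newly created `warehouse/catalog/catalog_boring.json` contains:

```json
{
  "catalog_name": "boring",
  "namespaces": {},
  "tables": {}
}
```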
41 |
42 |
43 | *Note: You can also specify a remote location for your Iceberg data and catalog file:*
44 | ```bash
45 | ice init -p warehouse=s3://mybucket/mywarehouse
46 | ```
47 | More details in the [Custom Init and Catalog Location](#custom-init-and-catalog-location) section.
48 |
49 | *Note: If you are using an S3 path (e.g., `s3://...`) for your catalog file or warehouse, make sure your CLI environment is authenticated with AWS. For example, you can set your AWS profile with:*
50 |
51 | ```bash
52 | export AWS_PROFILE=your-profile
53 | ```
54 |
55 | *You must have valid AWS credentials configured for the CLI to access S3 resources.*
56 |
57 | You can then start using the catalog:
58 |
59 | ### Commit a table
60 | ```bash
61 | # Get some data
62 | curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
63 |
64 | # Commit the table
65 | ice commit my_table --source /tmp/yellow_tripdata_2023-01.parquet
66 | ```
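After the first commit, the catalog JSON gains an entry for the table. The field layout below matches what Boring Catalog writes; the namespace and metadata file path are illustrative:

```json
{
  "catalog_name": "boring",
  "namespaces": {
    "ice_default": { "properties": { "exists": "true" } }
  },
  "tables": {
    "ice_default.my_table": {
      "namespace": "ice_default",
      "name": "my_table",
      "previous_metadata_location": null,
      "metadata_location": "warehouse/ice_default.db/my_table/metadata/00000-....metadata.json"
    }
  }
}
```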
67 |
68 | ### Check the commit history
69 |
70 | ```bash
71 | ice log
72 | ```
73 |
74 | ### Explore your Iceberg tables (data and metadata) with DuckDB
75 | ```bash
76 | ice duck
77 | ```
78 | This opens an interactive DuckDB session with pointers to all your tables and namespaces.
79 |
80 | Example DuckDB queries:
81 | ```sql
82 | show;                                -- show all tables
83 | select * from catalog.namespaces;    -- list namespaces
84 | select * from catalog.tables;        -- list tables
85 | select * from ice_default.my_table;  -- query an Iceberg table (namespace.table)
86 | ```
87 |
88 | ## Python Usage
89 |
90 | ```python
91 | from boringcatalog import BoringCatalog
92 | import pyarrow.parquet as pq
93 | 
94 | # Auto-detects .ice/index in the current working directory
95 | catalog = BoringCatalog()
96 | 
97 | # Or point at a specific catalog file
98 | catalog = BoringCatalog(name="mycat", uri="path/to/catalog.json")
99 | 
100 | # Create a namespace, then a table from a Parquet file's schema
101 | catalog.create_namespace("my_namespace")
102 | df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
103 | table = catalog.create_table("my_namespace.my_table", schema=df.schema)
104 | table.append(df)
105 | 
106 | # Load it back later
107 | table = catalog.load_table("my_namespace.my_table")
108 | ```
109 |
110 |
111 | ## Custom Init and Catalog Location
112 |
113 | You can configure your Iceberg catalog in several ways, depending on where you want to store your catalog metadata (the JSON file) and your Iceberg data (the warehouse):
114 | - The `warehouse` property determines where your Iceberg tables' data will be stored.
115 | - The `--catalog` option lets you specify the exact path for your catalog JSON file.
116 | - If you use both, the catalog file will be created at the path you specify, and the warehouse will be used for table data.
117 |
118 | ### Examples
119 | | Command Example | Catalog File Location | Warehouse/Data Location | Use Case |
120 | |-----------------|----------------------|------------------------|----------|
121 | | `ice init` | `warehouse/catalog/catalog_boring.json` | `warehouse/` | Local, simple |
122 | | `ice init -p warehouse=<path>` | `<path>/catalog/catalog_boring.json` | `<path>/` | Custom warehouse |
123 | | `ice init --catalog <file>.json` | `<file>.json` | (to define when creating a table) | Custom catalog file |
124 | | `ice init --catalog <file>.json -p warehouse=<path>` | `<file>.json` | `<path>/` | Full control |
125 | | `ice init --catalog <file>.json --catalog-name <name>` | `<file>.json` | (to define when creating a table) | Custom name & file |
126 |
127 | ### Edge Cases & Manual Editing
128 | - **Custom Catalog Name:** By default, the catalog is named `"boring"`, but you can set a custom name with `--catalog-name`. This name is used in the catalog JSON and for file naming if you don't specify a custom path.
129 | - **Re-initialization:** If you run `ice init` multiple times in the same directory, the `.ice/index` file will be overwritten with the new configuration. This is useful if you want to re-point your project to a different catalog, but be aware that it will not migrate or merge any existing data.
130 | - **Manual Editing:** Advanced users can manually edit `.ice/index` to point to a different catalog file or change the catalog name. If you do this, make sure the `catalog_uri` and `catalog_name` fields are consistent with your actual catalog JSON file. If you set a `warehouse` property but do not update `catalog_uri`, Boring Catalog will always use the `catalog_uri` from the index file.
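If you do edit `.ice/index` by hand, a quick sanity check can catch a mismatch before the catalog does. The helper below is a hypothetical, stdlib-only sketch (not part of `boringcatalog`); it assumes local file paths, so for an S3-hosted catalog you would read the JSON via `fsspec` instead:

```python
import json

def check_index(index_path):
    """Return True if .ice/index and its catalog JSON agree on the catalog name."""
    with open(index_path) as f:
        index = json.load(f)
    # The index must point at a readable catalog file...
    with open(index["catalog_uri"]) as f:
        catalog = json.load(f)
    # ...whose embedded catalog_name matches the index entry.
    return index["catalog_name"] == catalog["catalog_name"]
```

Running `check_index(".ice/index")` returns `False` when the names diverge, which is exactly the inconsistency described above.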
131 |
132 | ## Roadmap
133 | - [ ] Improve CLI to allow MERGE operation, partition spec, etc.
134 | - [ ] Improve CLI to get info about table schema / partition spec / etc.
135 | - [ ] Expose REST API for integration with AWS, Snowflake, etc.
136 |
--------------------------------------------------------------------------------
/docs/boringdata.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/boringdata/boring-catalog/4c85dbddb9039f1d03a941da39f952366fe5050a/docs/boringdata.png
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [project]
2 | name = "boringcatalog"
3 | version = "0.4.0"
4 | description = "A lightweight, file-based Iceberg catalog implementation"
5 | readme = "README.md"
6 | authors = [
7 | { name = "huraultj", email = "julien.hurault@sumeo.io" }
8 | ]
9 | requires-python = ">=3.10"
10 | dependencies = [
11 | "s3fs>=2023.12.0",
12 |     "pyiceberg[pyarrow]>=0.9.0",
13 |     "click>=8.0.0",
14 |     "duckdb>=0.9.0"
15 | 
16 | ]
17 | urls = {Homepage = "https://github.com/boringdata/boring-catalog"}
18 |
19 | [project.scripts]
20 | ice = "boringcatalog.cli:cli"
21 |
22 | [project.optional-dependencies]
23 | test = [
24 | "pytest>=7.0.0",
25 | "pandas>=2.0.0",
26 | "pyarrow>=14.0.0"
27 | ]
28 |
29 | [build-system]
30 | requires = ["hatchling"]
31 | build-backend = "hatchling.build"
32 |
33 | [tool.pytest.ini_options]
34 | testpaths = ["tests"]
35 | python_files = ["test_*.py"]
36 | addopts = "-v --tb=short"
37 |
--------------------------------------------------------------------------------
/src/boringcatalog/__init__.py:
--------------------------------------------------------------------------------
1 | """A DuckDB-based Iceberg catalog implementation."""
2 |
3 | from .catalog import BoringCatalog
4 |
5 | __all__ = ["BoringCatalog"]
6 |
7 |
--------------------------------------------------------------------------------
/src/boringcatalog/catalog.py:
--------------------------------------------------------------------------------
1 | from typing import Dict, List, Optional, Set, Tuple, Union, Any
2 | import uuid
3 | import json
4 | import os
5 | import tempfile
6 | import fsspec
7 | import logging
8 | from pyiceberg.io import load_file_io
9 | from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
10 | from pyiceberg.schema import Schema
11 | from pyiceberg.serializers import FromInputFile
12 | from pyiceberg.table import CommitTableResponse, Table
13 | from pyiceberg.table.locations import load_location_provider
14 | from pyiceberg.table.metadata import new_table_metadata
15 | from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
16 | from pyiceberg.table.update import TableRequirement, TableUpdate
17 | from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
18 | from pyiceberg.types import strtobool
19 | from pyiceberg.catalog import (
20 | Catalog,
21 | MetastoreCatalog,
22 | METADATA_LOCATION,
23 | PREVIOUS_METADATA_LOCATION,
24 | TABLE_TYPE,
25 | ICEBERG,
26 | PropertiesUpdateSummary,
27 | )
28 | from pyiceberg.exceptions import (
29 | NamespaceAlreadyExistsError,
30 | NamespaceNotEmptyError,
31 | NoSuchNamespaceError,
32 | NoSuchTableError,
33 | TableAlreadyExistsError,
34 | NoSuchPropertyException,
35 | NoSuchIcebergTableError,
36 | CommitFailedException,
37 | )
38 |
39 |
40 | 
41 | # Set up logging
42 | logger = logging.getLogger(__name__)
43 |
44 | DEFAULT_INIT_CATALOG_TABLES = "true"
45 | DEFAULT_CATALOG_NAME = "boring"
46 | class ConcurrentModificationError(CommitFailedException):
47 | """Raised when a concurrent modification is detected."""
48 | pass
49 |
50 | class BoringCatalog(MetastoreCatalog):
51 | """A simple file-based Iceberg catalog implementation."""
52 |
53 | def __init__(self, name: str = None, **properties: str):
54 | # If name or properties are not provided, try to read them from .ice/index once
55 | index_path = os.path.join(os.getcwd(), ".ice/index")
56 | index = None
57 | if (name is None or not properties) and os.path.exists(index_path):
58 | with open(index_path, 'r') as f:
59 | index = json.load(f)
60 | if name is None:
61 | name = index.get("catalog_name", DEFAULT_CATALOG_NAME)
62 | if not properties:
63 | properties = index.get("properties", {})
64 | if name is None:
65 | name = DEFAULT_CATALOG_NAME
66 | super().__init__(name, **properties)
67 |
68 | if index is not None and "catalog_uri" in index:
69 | self.uri = index["catalog_uri"]
70 | self.properties = index["properties"]
71 | elif self.properties.get("uri"):
72 | self.uri = self.properties.get("uri")
73 | elif self.properties.get("warehouse"):
74 |             self.uri = os.path.join(self.properties.get("warehouse"), "catalog", f"catalog_{name}.json")
75 | else:
76 |             raise ValueError("Provide either a 'uri' or a 'warehouse' property to initialize BoringCatalog")
77 |
78 | # Always infer warehouse if missing and uri is set
79 | if self.uri and not self.properties.get("warehouse"):
80 | warehouse_path = os.path.dirname(self.uri)
81 | self.properties["warehouse"] = warehouse_path
82 | logging.info(f"No --warehouse specified for the catalog. Using catalog folder to store iceberg data: {warehouse_path}")
83 |
84 | init_catalog_tables = strtobool(self.properties.get("init_catalog_tables", DEFAULT_INIT_CATALOG_TABLES))
85 |
86 | if init_catalog_tables:
87 | self._ensure_tables_exist()
88 |
89 | @property
90 | def catalog(self):
91 | catalog, _ = self._read_catalog_json()
92 | return catalog
93 |
94 |     def latest_snapshot(self, table_identifier: str):
95 |         """Return the table's current snapshot, or None if it has no commits yet."""
96 |         table = self.load_table(table_identifier)
97 |         # Table.current_snapshot() resolves the latest snapshot from the table metadata
98 |         return table.current_snapshot()
99 | 
100 |
101 | def _ensure_tables_exist(self):
102 | """Ensure catalog directory and catalog.json exist."""
103 | try:
104 |
105 | io = load_file_io(properties=self.properties, location=self.uri)
106 |
107 | # Check if catalog file exists
108 | input_file = io.new_input(self.uri)
109 | if not input_file.exists():
110 | # Create initial catalog structure
111 | initial_catalog = {
112 | "catalog_name": self.name,
113 | "namespaces": {},
114 | "tables": {}
115 | }
116 |
117 | # Write the initial catalog file
118 | with io.new_output(self.uri).create(overwrite=True) as f:
119 | f.write(json.dumps(initial_catalog, indent=2).encode('utf-8'))
120 |
121 | except Exception as e:
122 | raise ValueError(f"Failed to initialize catalog at {self.uri}: {str(e)}")
123 |
124 | def _read_catalog_json(self):
125 | """Read catalog.json using FileIO, returning (data, etag)."""
126 | try:
127 | io = load_file_io(properties=self.properties, location=self.uri)
128 | input_file = io.new_input(self.uri)
129 |
130 | if not input_file.exists():
131 | return {"catalog_name": self.name, "namespaces": {}, "tables": {}}, None
132 |
133 | with input_file.open() as f:
134 | data = json.loads(f.read().decode('utf-8'))
135 |
136 | # Get metadata for ETag
137 | metadata = input_file.metadata() if hasattr(input_file, 'metadata') else {}
138 | etag = metadata.get("ETag")
139 | return data, etag
140 |
141 | except Exception as e:
142 | if 'No such file' in str(e) or 'not found' in str(e) or '404' in str(e):
143 | return {"catalog_name": self.name, "namespaces": {}, "tables": {}}, None
144 | raise
145 |
146 | def _write_catalog_json(self, data, etag=None):
147 | """Write catalog.json using FileIO, using ETag for concurrency if provided."""
148 | try:
149 | io = load_file_io(properties=self.properties, location=self.uri)
150 |
151 | # Create output file with ETag check if provided
152 | output_file = io.new_output(self.uri)
153 | if etag is not None and hasattr(output_file, 'set_metadata'):
154 | output_file.set_metadata({"if_match": etag})
155 |
156 | with output_file.create(overwrite=True) as f:
157 | f.write(json.dumps(data, indent=2).encode('utf-8'))
158 |
159 | except Exception as e:
160 | if 'PreconditionFailed' in str(e) or '412' in str(e):
161 | raise ConcurrentModificationError("catalog.json was modified concurrently")
162 | raise
163 |
164 | def _table_key(self, namespace: str, table_name: str) -> str:
165 | return f"{namespace}.{table_name}"
166 |
167 | def create_table(
168 | self,
169 | identifier: Union[str, Identifier],
170 | schema: Union[Schema, "pa.Schema"],
171 | location: Optional[str] = None,
172 | partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC,
173 | sort_order: SortOrder = UNSORTED_SORT_ORDER,
174 | properties: Properties = EMPTY_DICT,
175 | ) -> Table:
176 | """Create an Iceberg table."""
177 | schema: Schema = self._convert_schema_if_needed(schema) # type: ignore
178 | namespace_tuple = Catalog.namespace_from(identifier)
179 | namespace = Catalog.namespace_to_string(namespace_tuple)
180 | table_name = Catalog.table_name_from(identifier)
181 | table_key = self._table_key(namespace, table_name)
182 |
183 | data, etag = self._read_catalog_json()
184 | if namespace not in data["namespaces"]:
185 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace}")
186 | if table_key in data.get("tables", {}):
187 | raise TableAlreadyExistsError(f"Table {namespace}.{table_name} already exists")
188 |
189 | location = self._resolve_table_location(location, namespace, table_name)
190 | location_provider = load_location_provider(table_location=location, table_properties=properties)
191 | metadata_location = location_provider.new_table_metadata_file_location()
192 |
193 | metadata = new_table_metadata(
194 | location=location, schema=schema, partition_spec=partition_spec, sort_order=sort_order, properties=properties
195 | )
196 | io = load_file_io(properties=self.properties, location=metadata_location)
197 | self._write_metadata(metadata, io, metadata_location)
198 |
199 | # Add table entry to catalog.json
200 | if "tables" not in data:
201 | data["tables"] = {}
202 | data["tables"][table_key] = {
203 | "namespace": namespace,
204 | "name": table_name,
205 | "metadata_location": metadata_location
206 | }
207 |
208 | self._write_catalog_json(data, etag)
209 |
210 | return self.load_table(identifier)
211 |
212 | def load_table(self, identifier: Union[str, Identifier], catalog_name: str = None) -> Table:
213 | """Load the table's metadata and return the table instance using catalog.json."""
214 | namespace_tuple = Catalog.namespace_from(identifier)
215 | namespace = Catalog.namespace_to_string(namespace_tuple)
216 | table_name = Catalog.table_name_from(identifier)
217 | table_key = self._table_key(namespace, table_name)
218 | data, _ = self._read_catalog_json()
219 | table_entry = data.get("tables", {}).get(table_key)
220 | if not table_entry:
221 | raise NoSuchTableError(f"Table does not exist: {namespace}.{table_name}")
222 | metadata_location = table_entry["metadata_location"]
223 | io = load_file_io(properties=self.properties, location=metadata_location)
224 | file = io.new_input(metadata_location)
225 | metadata = FromInputFile.table_metadata(file)
226 | return Table(
227 | identifier=Catalog.identifier_to_tuple(namespace) + (table_name,),
228 | metadata=metadata,
229 | metadata_location=metadata_location,
230 | io=self._load_file_io(metadata.properties, metadata_location),
231 | catalog=self
232 | )
233 |
234 | def drop_table(self, identifier: Union[str, Identifier]) -> None:
235 | """Drop a table."""
236 | namespace_tuple = Catalog.namespace_from(identifier)
237 | namespace = Catalog.namespace_to_string(namespace_tuple)
238 | table_name = Catalog.table_name_from(identifier)
239 | table_key = self._table_key(namespace, table_name)
240 | data, etag = self._read_catalog_json()
241 | if table_key not in data.get("tables", {}):
242 | raise NoSuchTableError(f"Table does not exist: {namespace}.{table_name}")
243 | del data["tables"][table_key]
244 | self._write_catalog_json(data, etag)
245 |
246 | def rename_table(self, from_identifier: Union[str, Identifier], to_identifier: Union[str, Identifier]) -> Table:
247 | """Rename a table."""
248 | from_namespace_tuple = Catalog.namespace_from(from_identifier)
249 | from_namespace = Catalog.namespace_to_string(from_namespace_tuple)
250 | from_table_name = Catalog.table_name_from(from_identifier)
251 | from_table_key = self._table_key(from_namespace, from_table_name)
252 |
253 | to_namespace_tuple = Catalog.namespace_from(to_identifier)
254 | to_namespace = Catalog.namespace_to_string(to_namespace_tuple)
255 | to_table_name = Catalog.table_name_from(to_identifier)
256 | to_table_key = self._table_key(to_namespace, to_table_name)
257 |
258 | data, etag = self._read_catalog_json()
259 | if not self._namespace_exists(to_namespace):
260 | raise NoSuchNamespaceError(f"Namespace does not exist: {to_namespace}")
261 |
262 | if from_table_key not in data.get("tables", {}):
263 | raise NoSuchTableError(f"Table does not exist: {from_namespace}.{from_table_name}")
264 |
265 | if to_table_key in data.get("tables", {}):
266 | raise TableAlreadyExistsError(f"Table {to_namespace}.{to_table_name} already exists")
267 |
268 | table_entry = data["tables"][from_table_key]
269 | table_entry["namespace"] = to_namespace
270 | table_entry["name"] = to_table_name
271 | data["tables"][to_table_key] = table_entry
272 | del data["tables"][from_table_key]
273 |
274 | self._write_catalog_json(data, etag)
275 | return self.load_table(to_identifier)
276 |
277 | def create_namespace(self, namespace: Union[str, Identifier], properties: Properties = EMPTY_DICT) -> None:
278 | """Create a namespace in the catalog.json file."""
279 | namespace_str = Catalog.namespace_to_string(namespace)
280 | data, etag = self._read_catalog_json()
281 | if namespace_str in data["namespaces"]:
282 | raise NamespaceAlreadyExistsError(f"Namespace already exists: {namespace_str}")
283 | data["namespaces"][namespace_str] = {"properties": properties or {"exists": "true"}}
284 | self._write_catalog_json(data, etag)
285 |
286 | def drop_namespace(self, namespace: Union[str, Identifier]) -> None:
287 | """Drop a namespace from catalog.json."""
288 | namespace_str = Catalog.namespace_to_string(namespace)
289 | data, etag = self._read_catalog_json()
290 | if namespace_str not in data["namespaces"]:
291 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace_str}")
292 |         if any(tbl["namespace"] == namespace_str for tbl in data.get("tables", {}).values()):
293 | raise NamespaceNotEmptyError(f"Namespace {namespace_str} is not empty.")
294 | del data["namespaces"][namespace_str]
295 | self._write_catalog_json(data, etag)
296 |
297 | def list_tables(self, namespace: Union[str, Identifier]) -> List[Identifier]:
298 | """List tables under the given namespace from catalog.json."""
299 | namespace_str = Catalog.namespace_to_string(namespace)
300 | data, _ = self._read_catalog_json()
301 | if namespace_str and namespace_str not in data["namespaces"]:
302 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace_str}")
303 | return [
304 | Catalog.identifier_to_tuple(tbl["namespace"]) + (tbl["name"],)
305 | for tbl in data.get("tables", {}).values()
306 | if tbl["namespace"] == namespace_str
307 | ]
308 |
309 | def list_namespaces(self, namespace: Union[str, Identifier] = ()) -> List[Identifier]:
310 | """List namespaces from catalog.json."""
311 | data, _ = self._read_catalog_json()
312 | all_namespaces = list(data["namespaces"].keys())
313 | if not namespace:
314 | return [Catalog.identifier_to_tuple(ns) for ns in all_namespaces]
315 | ns_tuple = Catalog.identifier_to_tuple(namespace)
316 | ns_prefix = Catalog.namespace_to_string(namespace)
317 | # Only return direct children
318 | result = []
319 | for ns in all_namespaces:
320 | ns_parts = Catalog.identifier_to_tuple(ns)
321 | if ns_parts[:len(ns_tuple)] == ns_tuple and len(ns_parts) == len(ns_tuple) + 1:
322 | result.append(ns_parts)
323 | return result
324 |
325 | def load_namespace_properties(self, namespace: Union[str, Identifier]) -> Properties:
326 | """Get properties for a namespace from catalog.json."""
327 | namespace_str = Catalog.namespace_to_string(namespace)
328 | data, _ = self._read_catalog_json()
329 | if namespace_str not in data["namespaces"]:
330 | raise NoSuchNamespaceError(f"Namespace {namespace_str} does not exist")
331 | return data["namespaces"][namespace_str].get("properties", {})
332 |
333 | def _namespace_exists(self, namespace: Union[str, Identifier]) -> bool:
334 | """Check if a namespace exists in catalog.json."""
335 | namespace_str = Catalog.namespace_to_string(namespace)
336 | data, _ = self._read_catalog_json()
337 | return namespace_str in data["namespaces"]
338 |
339 | def _table_exists(self, identifier: Union[str, Identifier]) -> bool:
340 | """Check if a table exists in catalog.json."""
341 | namespace_tuple = Catalog.namespace_from(identifier)
342 | namespace = Catalog.namespace_to_string(namespace_tuple)
343 | table_name = Catalog.table_name_from(identifier)
344 | table_key = self._table_key(namespace, table_name)
345 | data, _ = self._read_catalog_json()
346 | return table_key in data.get("tables", {})
347 |
348 | def list_views(self, namespace: Union[str, Identifier]) -> List[Identifier]:
349 | return []
350 |
351 | def drop_view(self, identifier: Union[str, Identifier]) -> None:
352 | raise NotImplementedError("Views are not supported")
353 |
354 | def view_exists(self, identifier: Union[str, Identifier]) -> bool:
355 | return False
356 |
357 | def commit_table(
358 | self, table: Table, requirements: Tuple[TableRequirement, ...], updates: Tuple[TableUpdate, ...]
359 | ) -> CommitTableResponse:
360 | """Commit updates to a table."""
361 | table_identifier = table.name()
362 | namespace_tuple = Catalog.namespace_from(table_identifier)
363 | namespace = Catalog.namespace_to_string(namespace_tuple)
364 | table_name = Catalog.table_name_from(table_identifier)
365 |
366 | current_table: Optional[Table]
367 | try:
368 | current_table = self.load_table(table_identifier)
369 | except NoSuchTableError:
370 | current_table = None
371 |
372 | updated_staged_table = self._update_and_stage_table(current_table, table.name(), requirements, updates)
373 | if current_table and updated_staged_table.metadata == current_table.metadata:
374 | return CommitTableResponse(metadata=current_table.metadata, metadata_location=current_table.metadata_location)
375 |
376 | self._write_metadata(
377 | metadata=updated_staged_table.metadata,
378 | io=updated_staged_table.io,
379 | metadata_path=updated_staged_table.metadata_location,
380 | )
381 |
382 | try:
383 | data, etag = self._read_catalog_json()
384 | table_key = self._table_key(namespace, table_name)
385 |
386 | if current_table:
387 | if data["tables"][table_key]["metadata_location"] != current_table.metadata_location:
388 | raise CommitFailedException(f"Table has been updated by another process: {namespace}.{table_name}")
389 | data["tables"][table_key]["previous_metadata_location"] = current_table.metadata_location
390 | else:
391 | if table_key in data["tables"]:
392 | raise TableAlreadyExistsError(f"Table {namespace}.{table_name} already exists")
393 | data["tables"][table_key] = {
394 | "namespace": namespace,
395 | "name": table_name,
396 | "previous_metadata_location": None
397 | }
398 |
399 | data["tables"][table_key]["metadata_location"] = updated_staged_table.metadata_location
400 | self._write_catalog_json(data, etag)
401 |
402 | except Exception as e:
403 | try:
404 | updated_staged_table.io.delete(updated_staged_table.metadata_location)
405 | except Exception:
406 | pass
407 | raise e
408 |
409 | return CommitTableResponse(
410 | metadata=updated_staged_table.metadata,
411 | metadata_location=updated_staged_table.metadata_location
412 | )
413 |
414 | def register_table(self, identifier: Union[str, Identifier], metadata_location: str) -> Table:
415 | """Register a new table using existing metadata."""
416 | namespace_tuple = Catalog.namespace_from(identifier)
417 | namespace = Catalog.namespace_to_string(namespace_tuple)
418 | table_name = Catalog.table_name_from(identifier)
419 | table_key = self._table_key(namespace, table_name)
420 |
421 | if not self._namespace_exists(namespace):
422 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace}")
423 |
424 | data, etag = self._read_catalog_json()
425 | if table_key in data.get("tables", {}):
426 | raise TableAlreadyExistsError(f"Table {namespace}.{table_name} already exists")
427 |
428 | data["tables"][table_key] = {
429 | "namespace": namespace,
430 | "name": table_name,
431 | "metadata_location": metadata_location,
432 | "previous_metadata_location": None
433 | }
434 | self._write_catalog_json(data, etag)
435 |
436 | return self.load_table(identifier)
437 |
438 | def update_namespace_properties(
439 | self, namespace: Union[str, Identifier], removals: Optional[Set[str]] = None, updates: Properties = EMPTY_DICT
440 | ) -> PropertiesUpdateSummary:
441 | """Remove provided property keys and update properties for a namespace in catalog.json."""
442 | namespace_str = Catalog.namespace_to_string(namespace)
443 | data, etag = self._read_catalog_json()
444 | if namespace_str not in data["namespaces"]:
445 | raise NoSuchNamespaceError(f"Namespace {namespace_str} does not exist")
446 |         current_properties = data["namespaces"][namespace_str].get("properties", {})
447 |         removed = [key for key in (removals or set()) if key in current_properties]
448 |         missing = [key for key in (removals or set()) if key not in current_properties]
449 |         for key in removed:
450 |             current_properties.pop(key)
451 |         if updates:
452 |             current_properties.update(updates)
453 |         data["namespaces"][namespace_str]["properties"] = current_properties
454 |         self._write_catalog_json(data, etag)
455 |         return PropertiesUpdateSummary(removed=removed, updated=list(updates.keys()), missing=missing)
456 |
--------------------------------------------------------------------------------
/src/boringcatalog/cli.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import click
4 | import duckdb
5 | import subprocess
6 | import tempfile
7 | import logging
8 | from string import Template
9 | from .catalog import BoringCatalog
10 | import pyarrow.parquet as pq
11 | import datetime
12 | # Configure logging to display in CLI
13 | logging.basicConfig(
14 | format='%(message)s',
15 | level=logging.INFO
16 | )
17 |
18 | DEFAULT_NAMESPACE = "ice_default"
19 | DEFAULT_CATALOG_NAME = "boring"
20 | # Silence pyiceberg logs
21 | logging.getLogger('pyiceberg').setLevel(logging.WARNING)
22 |
23 | def ensure_ice_dir():
24 | """Ensure .ice directory exists and return its path."""
25 | ice_dir = os.path.abspath('.ice')
26 | os.makedirs(ice_dir, exist_ok=True)
27 | return ice_dir
28 |
29 | def load_index():
30 | """Load configuration from .ice/index if it exists."""
31 | index_path = os.path.join(ensure_ice_dir(), 'index')
32 | try:
33 | with open(index_path, 'r') as f:
34 | return json.load(f)
35 | except (FileNotFoundError, json.JSONDecodeError):
36 | return None
37 |
38 | def save_index(properties, catalog_uri=None, catalog_name=None):
39 | """Save configuration to .ice/index with separate catalog_uri, catalog_name, and properties sections."""
40 | config = {
41 | "catalog_uri": catalog_uri,
42 | "catalog_name": catalog_name,
43 | "properties": properties
44 | }
45 | index_path = os.path.join(ensure_ice_dir(), 'index')
46 | with open(index_path, 'w') as f:
47 | json.dump(config, f, indent=2)
48 |
49 | def get_catalog():
50 | """Get catalog instance from stored configuration."""
51 | config = load_index()
52 | if not config:
53 | raise click.ClickException(
54 | "No catalog configuration found. Run 'ice init' first."
55 | )
56 |
57 | properties = config.get("properties", {})
58 | if config.get("catalog_uri"):
59 | properties["uri"] = config["catalog_uri"]
60 | # Use catalog_name from top-level field if present, else default
61 |     catalog_name = config.get("catalog_name") or DEFAULT_CATALOG_NAME
62 | return BoringCatalog(catalog_name, **properties)
63 |
64 | def print_version(ctx, param, value):
65 | if not value or ctx.resilient_parsing:
66 | return
67 |     click.echo('Boring Catalog version 0.4.0')
68 | ctx.exit()
69 |
70 | def get_sql_template():
71 | """Read the SQL template file."""
72 | template_path = os.path.join(os.path.dirname(__file__), 'duckdb_init.sql')
73 | with open(template_path, 'r') as f:
74 | return Template(f.read())
75 |
76 | def print_table_log(catalog, table_identifier, label=None):
77 | """Print the log (snapshots) for a given table identifier."""
78 | if not catalog._table_exists(table_identifier):
79 | return False
80 | if label:
81 | click.echo(label)
82 | table = catalog.load_table(table_identifier)
83 | snapshots = sorted(table.snapshots(), key=lambda x: x.timestamp_ms, reverse=True)
84 | if not snapshots:
85 | click.echo(f"No snapshots found for table {table_identifier}.")
86 | return False
87 | for snap in snapshots:
88 | click.echo(f"commit {snap.snapshot_id:<20}")
89 |         ts = datetime.datetime.fromtimestamp(int(snap.timestamp_ms) / 1000, tz=datetime.timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC')
90 | click.echo(f" Table: {table_identifier:<25}")
91 | click.echo(f" Date: {ts:<25}")
92 | click.echo(f" Operation: {str(snap.summary.operation):<15}")
93 |             click.echo("  Summary:")
94 | summary = snap.summary.additional_properties
95 | max_key_len = max(len(str(k)) for k in summary.keys()) if summary else 0
96 | for k, v in summary.items():
97 | click.echo(f" {k.ljust(max_key_len)} : {v}")
98 |         click.echo(" ")
99 | return True
100 |
101 | @click.group(invoke_without_command=True)
102 | @click.option('--version', is_flag=True, callback=print_version, expose_value=False, is_eager=True, help='Show version and exit')
103 | @click.pass_context
104 | def cli(ctx):
105 | """Boring Catalog CLI tool.
106 |
107 | Run 'ice COMMAND --help' for more information on a command.
108 | """
109 | # Show help if no command is provided
110 | if ctx.invoked_subcommand is None:
111 | click.echo(ctx.get_help())
112 | ctx.exit()
113 |
114 | @cli.command()
115 | @click.option('--catalog', help='Custom location for catalog.json (default: warehouse/catalog/catalog_<catalog-name>.json)')
116 | @click.option('--property', '-p', multiple=True, help='Properties in the format key=value')
117 | @click.option('--catalog-name', default=DEFAULT_CATALOG_NAME, show_default=True, help='Name of the catalog (used in file naming and metadata)')
118 | def init(catalog, property, catalog_name):
119 | """Initialize a new Boring Catalog."""
120 |
121 | try:
122 | properties = {}
123 | for prop in property:
124 | try:
125 | key, value = prop.split('=', 1)
126 | properties[key.strip()] = value.strip()
127 | except ValueError:
128 | raise click.ClickException(f"Invalid property format: {prop}. Use key=value format")
129 |
130 |         if not catalog and "warehouse" not in properties:
131 | catalog = f"warehouse/catalog/catalog_{catalog_name}.json"
132 | properties["warehouse"] = "warehouse"
133 |
134 | elif not catalog and "warehouse" in properties:
135 | catalog = f"{properties['warehouse']}/catalog/catalog_{catalog_name}.json"
136 |
137 | # Do NOT save catalog_name in properties anymore
138 | save_index(properties, catalog, catalog_name)
139 |
140 | properties["uri"] = catalog
141 | catalog_instance = BoringCatalog(catalog_name, **properties)
142 |
143 | # Display information in specific order
144 | click.echo(f"Initialized Boring Catalog in {os.path.join('.ice', 'index')}")
145 | click.echo(f"Catalog location: {catalog}")
146 | if "warehouse" in properties:
147 | click.echo(f"Warehouse location: {properties['warehouse']}")
148 | click.echo(f"Catalog name: {catalog_name}")
149 |
150 | except Exception as e:
151 | click.echo(f"Error initializing catalog: {str(e)}", err=True)
152 | raise click.Abort()
153 |
154 | @cli.command(name='list-namespaces')
155 | @click.argument('parent', required=False)
156 | def list_namespaces(parent):
157 | """List all namespaces or child namespaces of PARENT."""
158 | try:
159 | catalog = get_catalog()
160 | namespaces = catalog.list_namespaces(parent if parent else ())
161 |
162 | if not namespaces:
163 | click.echo("No namespaces found.")
164 | return
165 |
166 | click.echo("Namespaces:")
167 | for ns in namespaces:
168 | click.echo(f" {'.'.join(ns)}")
169 | except Exception as e:
170 | click.echo(f"Error listing namespaces: {str(e)}", err=True)
171 | raise click.Abort()
172 |
173 | @cli.command(name='list-tables')
174 | @click.argument('namespace', required=False)
175 | def list_tables(namespace):
176 | """List all tables in the specified NAMESPACE, or all tables in all namespaces if not specified."""
177 | try:
178 | catalog = get_catalog()
179 |
180 | if namespace:
181 | tables = catalog.list_tables(namespace)
182 | if not tables:
183 | click.echo(f"No tables found in namespace '{namespace}'.")
184 | return
185 | click.echo(f"Tables in namespace '{namespace}':")
186 | for table in tables:
187 | table_name = table[-1]
188 | click.echo(f" {table_name}")
189 | else:
190 | namespaces = catalog.list_namespaces()
191 | found_any = False
192 | for ns_tuple in namespaces:
193 | ns = ".".join(ns_tuple)
194 | tables = catalog.list_tables(ns)
195 | if tables:
196 | found_any = True
197 | click.echo(f"Tables in namespace '{ns}':")
198 | for table in tables:
199 | table_name = table[-1]
200 | click.echo(f" {table_name}")
201 | if not found_any:
202 | click.echo("No tables found in any namespace.")
203 | except Exception as e:
204 | click.echo(f"Error listing tables: {str(e)}", err=True)
205 | raise click.Abort()
206 |
207 | @cli.command(context_settings=dict(
208 | ignore_unknown_options=True,
209 | allow_extra_args=True,
210 | ))
211 | @click.option('--catalog-path', help='Optional path to a catalog.json')
212 | @click.argument('duckdb_args', nargs=-1)
213 | def duck(catalog_path=None, duckdb_args=()):
214 | """Open DuckDB CLI with catalog configuration. Optionally provide a path to a catalog.json. Extra arguments are passed to DuckDB CLI."""
215 | try:
216 | if catalog_path:
217 | properties = {"uri": os.path.abspath(catalog_path)}
218 | catalog = BoringCatalog(DEFAULT_CATALOG_NAME, **properties)
219 | else:
220 | config = load_index()
221 | if not config:
222 | raise click.ClickException(
223 | "No catalog configuration found. Run 'ice init' first."
224 | )
225 | catalog = get_catalog()
226 |
227 | if len(catalog.list_namespaces()) == 0:
228 | raise click.ClickException("No namespaces found in catalog. Run 'ice create-namespace' to create a namespace.")
229 |
230 | if len(catalog.catalog.get("tables", {}).keys()) == 0:
231 | raise click.ClickException("No tables found in catalog. Run 'ice commit' to create a table.")
232 |
233 | # Get SQL template and substitute variables
234 | template_str = get_sql_template().template
235 | # Add S3 configuration at the beginning of the script
236 |         if catalog.uri.startswith("s3"):
237 | s3_config = (
238 | ".mode list\n"
239 | ".header off\n"
240 | "SELECT 'boring-catalog: Loading s3 secrets...' ;\n"
241 | ".mode line\n"
242 | "CREATE OR REPLACE SECRET secret (TYPE s3, PROVIDER credential_chain);\n"
243 | )
244 | # Insert the S3 configuration right after the first comment line
245 | lines = template_str.split('\n')
246 | template_str = lines[0] + '\n' + s3_config + '\n'.join(lines[1:])
247 | template = Template(template_str)
248 |
249 | sql = template.substitute(CATALOG_JSON=catalog.uri)
250 |
251 |         # Write the SQL to a temporary file
252 |         with tempfile.NamedTemporaryFile(mode='w', suffix='.sql', delete=False) as f:
253 |             f.write(sql)
254 |
255 |         try:
256 |             # Start DuckDB CLI with the initialization script and extra args
257 |             cmd = ['duckdb', '--init', f.name] + list(duckdb_args)
258 |             subprocess.run(cmd)
259 |         finally:
260 |             # Clean up the temporary init script even if DuckDB exits with an error
261 |             os.unlink(f.name)
262 |
263 | except Exception as e:
264 | click.echo(f"Error starting DuckDB CLI: {str(e)}", err=True)
265 | raise click.Abort()
266 |
267 | @cli.command(name='create-namespace')
268 | @click.argument('namespace', required=True)
269 | @click.option('--property', '-p', multiple=True, help='Properties in the format key=value')
270 | def create_namespace(namespace, property):
271 | """Create a new namespace in the catalog.
272 |
273 | NAMESPACE is the name of the namespace to create (e.g. 'my_namespace' or 'parent.child')
274 | """
275 | try:
276 | catalog = get_catalog()
277 |
278 | # Parse properties if provided
279 | properties = {}
280 | for prop in property:
281 | try:
282 | key, value = prop.split('=', 1)
283 | properties[key.strip()] = value.strip()
284 | except ValueError:
285 | raise click.ClickException(f"Invalid property format: {prop}. Use key=value format")
286 |
287 | # Create the namespace
288 | catalog.create_namespace(namespace, properties)
289 | click.echo(f"Created namespace: {namespace}")
290 | if properties:
291 | click.echo("Properties:")
292 | for key, value in properties.items():
293 | click.echo(f" {key}: {value}")
294 |
295 | except Exception as e:
296 | click.echo(f"Error creating namespace: {str(e)}", err=True)
297 | raise click.Abort()
298 |
299 | # Add a utility function to resolve table identifier with namespace logic
300 |
301 | def resolve_table_identifier_with_namespace(catalog, table_identifier):
302 | """Resolve table identifier to include namespace, creating default if needed (like in commit)."""
303 | if len(table_identifier.split(".")) == 1:
304 | namespaces = catalog.list_namespaces()
305 | if len(namespaces) == 0:
306 | click.echo(f"No namespace found, creating and using default namespace: {DEFAULT_NAMESPACE}")
307 | namespace = DEFAULT_NAMESPACE
308 | catalog.create_namespace(namespace)
309 | elif len(namespaces) == 1:
310 | namespace = namespaces[0][0]
311 | else:
312 | raise click.ClickException("No namespace specified. Please specify a namespace for the table.")
313 | table_identifier = f"{namespace}.{table_identifier}"
314 | return table_identifier
315 |
316 | @cli.command(name='commit')
317 | @click.argument('table_identifier', required=True)
318 | @click.option('--source', required=True, help='Parquet file URI to commit as a new snapshot')
319 | @click.option('--mode', default='append', help='Write mode for the commit', type=click.Choice(['append', 'overwrite']))
320 | def commit(table_identifier, source, mode):
321 | """Commit a new snapshot to a table from a Parquet file."""
322 | try:
323 | catalog = get_catalog()
324 | table_identifier = resolve_table_identifier_with_namespace(catalog, table_identifier)
325 | df = pq.read_table(source)
326 | if not catalog.table_exists(table_identifier):
327 | click.echo(f"Table {table_identifier} does not exist in the catalog. Creating it now...")
328 | catalog.create_table(table_identifier, schema=df.schema)
329 | table = catalog.load_table(table_identifier)
330 | if mode == "append":
331 | table.append(df)
332 | elif mode == "overwrite":
333 | table.overwrite(df)
334 | else:
335 | raise click.ClickException(f"Invalid mode: {mode}. Use 'append' or 'overwrite'.")
336 | click.echo(f"Committed {source} to table {table_identifier}")
337 | except Exception as e:
338 | click.echo(f"Error committing file to table: {str(e)}", err=True)
339 | raise click.Abort()
340 |
341 | @cli.command(name='log')
342 | @click.argument('table_identifier', required=False)
343 | def log_snapshots(table_identifier):
344 |     """Print the snapshot log for a table, or for every table in the default namespace if no table is given."""
345 | try:
346 | catalog = get_catalog()
347 | if not table_identifier:
348 | # Default to the default namespace, create if needed
349 | namespaces = catalog.list_namespaces()
350 | if not any(ns[0] == DEFAULT_NAMESPACE for ns in namespaces):
351 | click.echo(f"No namespace found, creating and using default namespace: {DEFAULT_NAMESPACE}")
352 | catalog.create_namespace(DEFAULT_NAMESPACE)
353 | tables = catalog.list_tables(DEFAULT_NAMESPACE)
354 | if not tables:
355 | click.echo(f"No tables found in default namespace '{DEFAULT_NAMESPACE}'.")
356 | return
357 | found_any = False
358 | for table in tables:
359 | table_identifier_full = f"{DEFAULT_NAMESPACE}.{table[-1]}"
360 | if print_table_log(catalog, table_identifier_full, label=f"=== Log for table: {table_identifier_full} ==="):
361 | found_any = True
362 | if not found_any:
363 | click.echo("No snapshots found for any table in the default namespace.")
364 | return
365 | # If a table_identifier is provided, resolve it as in commit
366 | table_identifier = resolve_table_identifier_with_namespace(catalog, table_identifier)
367 | if not catalog._table_exists(table_identifier):
368 | raise click.ClickException(f"Table {table_identifier} does not exist in the catalog.")
369 | print_table_log(catalog, table_identifier)
370 | except Exception as e:
371 | click.echo(f"Error loading catalog or snapshots: {str(e)}", err=True)
372 | raise click.Abort()
373 |
374 | @cli.command(name='catalog')
375 | def print_catalog():
376 | """Print the current catalog.json as JSON."""
377 | try:
378 | catalog = get_catalog()
379 | catalog_json = catalog.catalog
380 | click.echo(json.dumps(catalog_json, indent=2))
381 | except Exception as e:
382 | click.echo(f"Error printing catalog: {str(e)}", err=True)
383 | raise click.Abort()
384 |
385 | if __name__ == '__main__':
386 | cli()
--------------------------------------------------------------------------------
/src/boringcatalog/duckdb_init.sql:
--------------------------------------------------------------------------------
1 | -- Initialization script variables
2 | SET VARIABLE catalog_json = '${CATALOG_JSON}';
3 | SET VARIABLE tmp_file = '/tmp/iceberg_init.sql';
4 |
5 | .mode list
6 | .header off
7 | SELECT 'boring-catalog: Loading extensions...';
8 | INSTALL iceberg;
9 | LOAD iceberg;
10 |
11 | SELECT 'boring-catalog: Init schemas and tables...' ;
12 | -- Create schemas
13 | CREATE SCHEMA IF NOT EXISTS catalog;
25 | CREATE OR REPLACE TABLE catalog.namespaces AS
26 | SELECT
27 | namespace,
28 | unnest(properties.properties)
29 | FROM (
30 | UNPIVOT (
31 | SELECT
32 | unnest(namespaces)
33 | FROM read_json(getvariable('catalog_json'))
34 | ) ON COLUMNS(*) INTO name namespace value properties
35 | );
36 | CREATE OR REPLACE TABLE catalog.tables AS
37 | SELECT
38 | properties.namespace as namespace,
39 | table_name as table_name,
40 | unnest(properties)
41 | FROM (
42 | UNPIVOT (
43 | SELECT unnest(tables)
44 | FROM read_json(getvariable('catalog_json'))
45 | ) ON COLUMNS(*) INTO name table_name value properties
46 | );
47 |
48 |
49 | .mode list
50 | .header off
51 | .once getvariable("tmp_file")
52 | select 'CREATE SCHEMA IF NOT EXISTS ' || i || ';'
53 | from (select namespace from catalog.namespaces) x(i);
54 | .read getvariable("tmp_file")
55 |
56 |
57 | .mode list
58 | .header off
59 | .once getvariable("tmp_file")
60 | select 'CREATE OR REPLACE VIEW ' || j || ' AS SELECT * FROM iceberg_scan(''' || k || ''');'
61 | from (select table_name, metadata_location from catalog.tables) x(j,k);
62 | .read getvariable("tmp_file")
63 |
64 | .mode list
65 | .header off
66 | .once getvariable("tmp_file")
67 | select 'CREATE TABLE ' || j || '_metadata AS SELECT * FROM iceberg_metadata(''' || k || ''');'
68 | from (select table_name, metadata_location from catalog.tables) x(j,k);
69 | .read getvariable("tmp_file")
70 |
71 | .mode list
72 | .header off
73 | .once getvariable("tmp_file")
74 | select 'CREATE TABLE ' || j || '_snapshots AS SELECT unnest(snapshots, recursive:=true) from read_json(''' || k || ''');'
75 | from (select table_name, metadata_location from catalog.tables) x(j,k);
76 |
77 | .read getvariable("tmp_file")
78 |
79 | SELECT '' ;
80 | SELECT 'Everything is ready! ' ;
81 | SELECT '' ;
82 | SELECT 'Here are some commands to help you get started:' ;
83 | SELECT ' > show; -- show all tables' ;
84 | SELECT ' > select * from catalog.namespaces; -- list namespaces' ;
85 | SELECT ' > select * from catalog.tables; -- list tables' ;
86 | SELECT '   > select * from <namespace>.<table>;   -- query iceberg table' ;
87 |
88 | SELECT '' ;
89 |
90 | .mode duckbox
91 | .prompt 'ice ➜ '
--------------------------------------------------------------------------------
/tests/test_catalog.py:
--------------------------------------------------------------------------------
1 | # (Move file from src/boringcatalog/test_catalog.py to tests/test_catalog.py)
2 | import os
3 | import subprocess
4 | import sys
5 | import pytest
6 | import pyarrow as pa
7 | import pyarrow.parquet as pq
8 | import pandas as pd
9 | import json
10 | from boringcatalog import BoringCatalog
11 | import shutil
12 | import logging
13 |
14 | @pytest.fixture(scope="function")
15 | def tmp_catalog_dir(tmp_path):
16 | return tmp_path
17 |
18 | @pytest.fixture(scope="function")
19 | def dummy_parquet(tmp_path):
20 | # Create a small dummy parquet file
21 | df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
22 | table = pa.Table.from_pandas(df)
23 | parquet_path = tmp_path / "dummy.parquet"
24 | pq.write_table(table, parquet_path)
25 | return parquet_path
26 |
27 | def run_cli(args, cwd):
28 | cmd = [sys.executable, "-m", "boringcatalog.cli"] + args
29 | result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
30 | print("STDOUT:\n", result.stdout)
31 | print("STDERR:\n", result.stderr)
32 | return result
33 |
34 | @pytest.mark.parametrize("args,expected_index,do_workflow", [
35 | # ice init (no args)
36 | ([], {
37 | "catalog_uri": "warehouse/catalog/catalog_boring.json",
38 | "properties": {"warehouse": "warehouse"}
39 | }, True),
40 |     # ice init -p warehouse=warehouse3
41 | (["-p", "warehouse=warehouse3"], {
42 | "catalog_uri": "warehouse3/catalog/catalog_boring.json",
43 | "properties": {"warehouse": "warehouse3"}
44 | }, True),
45 | # ice init --catalog warehouse2/catalog_boring.json
46 | (["--catalog", "warehouse2/catalog_boring.json"], {
47 | "catalog_uri": "warehouse2/catalog_boring.json",
48 | "properties": {}
49 | }, False),
50 | # ice init --catalog tt/catalog.json -p warehouse=warehouse4
51 | (["--catalog", "tt/catalog.json", "-p", "warehouse=warehouse4"], {
52 | "catalog_uri": "tt/catalog.json",
53 | "properties": {"warehouse": "warehouse4"}
54 | }, False),
55 | # ice init -p warehouse=tttrr
56 | (["-p", "warehouse=tttrr"], {
57 | "catalog_uri": "tttrr/catalog/catalog_boring.json",
58 | "properties": {"warehouse": "tttrr"}
59 | }, False),
60 | ])
61 | def test_ice_init_variants(tmp_path, args, expected_index, do_workflow, caplog):
62 | # Clean up .ice if it exists
63 | ice_dir = tmp_path / ".ice"
64 | if ice_dir.exists():
65 | shutil.rmtree(ice_dir)
66 | # If warehouse is needed, create it
67 | warehouse = expected_index["properties"].get("warehouse")
68 | if warehouse:
69 | warehouse_dir = tmp_path / warehouse
70 | warehouse_dir.mkdir(parents=True, exist_ok=True)
71 | # Run CLI
72 | result = run_cli(["init"] + args, cwd=tmp_path)
73 | assert result.returncode == 0
74 | index_path = tmp_path / ".ice" / "index"
75 | assert index_path.exists(), f".ice/index not created for args {args}"
76 | # Check content
77 | with open(index_path) as f:
78 | index = json.load(f)
79 | assert index["catalog_uri"] == expected_index["catalog_uri"], f"catalog_uri mismatch for args {args}"
80 | # Only check properties equality if warehouse is specified in expected_index
81 | if expected_index["properties"]:
82 | assert index["properties"] == expected_index["properties"], f"properties mismatch for args {args}"
83 | # Check BoringCatalog usage
84 | os.chdir(tmp_path)
85 | caplog.set_level(logging.INFO)
86 | catalog = BoringCatalog()
87 | # If warehouse is not specified, it should default to the catalog folder
88 | if not expected_index["properties"].get("warehouse"):
89 | expected_warehouse = str(os.path.dirname(index["catalog_uri"]))
90 | assert catalog.properties["warehouse"] == expected_warehouse
91 | assert f"Using catalog folder to store iceberg data: {expected_warehouse}" in caplog.text
92 | namespaces = catalog.list_namespaces()
93 | assert isinstance(namespaces, list)
94 | # If do_workflow, run commit, log, catalog commands
95 | if do_workflow:
96 | # Create dummy parquet
97 | df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
98 | table = pa.Table.from_pandas(df)
99 | parquet_path = tmp_path / "dummy.parquet"
100 | pq.write_table(table, parquet_path)
101 | # ice commit my_table --source dummy.parquet
102 | result = run_cli([
103 | "commit", "my_table", "--source", str(parquet_path)
104 | ], cwd=tmp_path)
105 | assert result.returncode == 0
106 | assert "Committed" in result.stdout
107 | # ice log my_table
108 | result = run_cli(["log", "my_table"], cwd=tmp_path)
109 | assert result.returncode == 0
110 | assert "commit" in result.stdout
111 | # ice catalog
112 | result = run_cli(["catalog"], cwd=tmp_path)
113 | assert result.returncode == 0
114 | assert '"tables"' in result.stdout
115 |
116 |     # ice duck (just check it starts; don't wait for interactive input)
117 |     # We skip actually running duckdb interactively in CI
118 |     # result = run_cli(["duck"], cwd=tmp_path)
119 |     # assert result.returncode == 0
120 |
121 | def test_custom_catalog_name(tmp_path):
122 | # Clean up .ice if it exists
123 | ice_dir = tmp_path / ".ice"
124 | if ice_dir.exists():
125 | shutil.rmtree(ice_dir)
126 | warehouse_dir = tmp_path / "customwarehouse"
127 | warehouse_dir.mkdir(parents=True, exist_ok=True)
128 | custom_name = "mycat"
129 | # Run CLI with custom catalog name
130 | result = run_cli([
131 | "init", "-p", "warehouse=customwarehouse", "--catalog-name", custom_name
132 | ], cwd=tmp_path)
133 | assert result.returncode == 0
134 | index_path = tmp_path / ".ice" / "index"
135 | assert index_path.exists(), ".ice/index not created for custom catalog name"
136 | with open(index_path) as f:
137 | index = json.load(f)
138 | assert index["catalog_uri"].endswith(f"catalog_{custom_name}.json")
139 | assert index["catalog_name"] == custom_name
140 | # Check BoringCatalog instance uses the custom name
141 | os.chdir(tmp_path)
142 | catalog = BoringCatalog()
143 | assert catalog.name == custom_name
144 | # Check catalog.json content
145 | with open(index["catalog_uri"]) as f:
146 | catalog_json = json.load(f)
147 | assert catalog_json["catalog_name"] == custom_name
148 |
149 | def test_custom_catalog_file_path(tmp_path):
150 | # Test initializing with a fully custom catalog file path
151 | ice_dir = tmp_path / ".ice"
152 | if ice_dir.exists():
153 | shutil.rmtree(ice_dir)
154 | custom_catalog_path = tmp_path / "mydir" / "mycustom.json"
155 | custom_catalog_path.parent.mkdir(parents=True, exist_ok=True)
156 | result = run_cli([
157 | "init", "--catalog", str(custom_catalog_path), "--catalog-name", "specialcat"
158 | ], cwd=tmp_path)
159 | assert result.returncode == 0
160 | index_path = tmp_path / ".ice" / "index"
161 | assert index_path.exists(), ".ice/index not created for custom catalog path"
162 | with open(index_path) as f:
163 | index = json.load(f)
164 | assert index["catalog_uri"] == str(custom_catalog_path)
165 | assert index["catalog_name"] == "specialcat"
166 | assert custom_catalog_path.exists(), "Custom catalog file was not created"
167 | # Check BoringCatalog loads from this path
168 | os.chdir(tmp_path)
169 | catalog = BoringCatalog()
170 | assert catalog.name == "specialcat"
171 | assert catalog.uri == str(custom_catalog_path)
172 |
173 |
174 | def test_reinit_overwrite_behavior(tmp_path):
175 | # Test running ice init twice in the same directory
176 | ice_dir = tmp_path / ".ice"
177 | if ice_dir.exists():
178 | shutil.rmtree(ice_dir)
179 | warehouse_dir = tmp_path / "warehouse"
180 | warehouse_dir.mkdir(parents=True, exist_ok=True)
181 | # First init
182 | result1 = run_cli(["init", "-p", "warehouse=warehouse"], cwd=tmp_path)
183 | assert result1.returncode == 0
184 | index_path = tmp_path / ".ice" / "index"
185 | assert index_path.exists()
186 | with open(index_path) as f:
187 | index1 = json.load(f)
188 | # Second init (should overwrite or succeed)
189 | result2 = run_cli(["init", "-p", "warehouse=warehouse"], cwd=tmp_path)
190 | # Accept both overwrite and success (should not crash)
191 | assert result2.returncode == 0
192 | with open(index_path) as f:
193 | index2 = json.load(f)
194 | # The index file should still be valid and point to the same warehouse
195 | assert index2["properties"]["warehouse"] == "warehouse"
196 |
197 |
198 | def test_manual_index_loading(tmp_path):
199 | # Test loading a catalog from a manually created .ice/index file
200 | ice_dir = tmp_path / ".ice"
201 | ice_dir.mkdir(exist_ok=True)
202 | custom_catalog_path = tmp_path / "manualcat.json"
203 | # Write a minimal catalog file
204 | with open(custom_catalog_path, "w") as f:
205 | json.dump({"catalog_name": "manualcat", "namespaces": {}, "tables": {}}, f)
206 | # Write a custom .ice/index
207 | index = {
208 | "catalog_uri": str(custom_catalog_path),
209 | "catalog_name": "manualcat",
210 | "properties": {"warehouse": "manualwarehouse"}
211 | }
212 | with open(ice_dir / "index", "w") as f:
213 | json.dump(index, f)
214 | os.chdir(tmp_path)
215 | catalog = BoringCatalog()
216 | assert catalog.name == "manualcat"
217 | assert catalog.uri == str(custom_catalog_path)
218 | assert catalog.properties["warehouse"] == "manualwarehouse"
219 | # Now test missing catalog_name (should default to 'boring')
220 | index2 = {
221 | "catalog_uri": str(custom_catalog_path),
222 | "properties": {"warehouse": "manualwarehouse"}
223 | }
224 | with open(ice_dir / "index", "w") as f:
225 | json.dump(index2, f)
226 | catalog2 = BoringCatalog()
227 | assert catalog2.name == "boring"
228 | assert catalog2.uri == str(custom_catalog_path)
229 | assert catalog2.properties["warehouse"] == "manualwarehouse"
230 |
--------------------------------------------------------------------------------