├── .gitignore ├── .python-version ├── README.md ├── docs └── boringdata.png ├── pyproject.toml ├── src └── boringcatalog │ ├── __init__.py │ ├── catalog.py │ ├── cli.py │ └── duckdb_init.sql ├── tests └── test_catalog.py └── uv.lock /.gitignore: -------------------------------------------------------------------------------- 1 | # Python-generated files 2 | __pycache__/ 3 | *.py[oc] 4 | build/ 5 | dist/ 6 | wheels/ 7 | *.egg-info 8 | .env 9 | # Virtual environments 10 | .venv 11 | -------------------------------------------------------------------------------- /.python-version: -------------------------------------------------------------------------------- 1 | 3.10 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **[boringdata.io](https://boringdata.io) — Kickstart your Iceberg journey with our data stack templates.** 2 | 3 | Boring Data 4 | 5 | ---- 6 | # Boring Catalog 7 | 8 | A lightweight, file-based Iceberg catalog implementation using a single JSON file (e.g., on S3, local disk, or any fsspec-compatible storage). 9 | 10 | ## Why Boring Catalog? 
11 | - No need to host or maintain a dedicated catalog service 12 | - Easy to use, easy to understand, perfect to get started with Iceberg 13 | - DuckDB CLI interface to easily explore your Iceberg tables and metadata 14 | 15 | ## How It Works 16 | Boring Catalog stores all Iceberg catalog state in a single JSON file: 17 | - Namespaces and tables are tracked in this file 18 | - S3 conditional writes prevent concurrent modifications when the catalog is stored on S3 19 | - The `.ice/index` file in your project directory stores the configuration for your catalog, including: 20 | - `catalog_uri`: the path to your catalog JSON file 21 | - `catalog_name`: the logical name of your catalog 22 | - `properties`: additional properties (e.g., warehouse location) 23 | 24 | ## Installation 25 | ```bash 26 | pip install boringcatalog 27 | ``` 28 | 29 | ## Quickstart 30 | 31 | ### Initialize a Catalog 32 | ```bash 33 | ice init 34 | ``` 35 | 36 | That's it! Your catalog is now ready to use. 37 | 38 | Two files are created: 39 | - `warehouse/catalog/catalog_boring.json` = the catalog file 40 | - `.ice/index` = points to the catalog location (similar to a git index file, but for Iceberg) 41 | 42 | 43 | *Note: You can also specify a remote location for your Iceberg data and catalog file:* 44 | ```bash 45 | ice init -p warehouse=s3://mybucket/mywarehouse 46 | ``` 47 | More details in the [Custom Init and Catalog Location](#custom-init-and-catalog-location) section. 48 | 49 | *Note: If you are using an S3 path (e.g., `s3://...`) for your catalog file or warehouse, make sure your CLI environment is authenticated with AWS. 
For example, you can set your AWS profile with:* 50 | 51 | ```bash 52 | export AWS_PROFILE=your-profile 53 | ``` 54 | 55 | *You must have valid AWS credentials configured for the CLI to access S3 resources.* 56 | 57 | You can then start using the catalog: 58 | 59 | ### Commit a table 60 | ```bash 61 | # Get some data 62 | curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet 63 | 64 | # Commit the table 65 | ice commit my_table --source /tmp/yellow_tripdata_2023-01.parquet 66 | ``` 67 | 68 | ### Check the commit history 69 | 70 | ```bash 71 | ice log 72 | ``` 73 | 74 | ### Explore your Iceberg tables (data and metadata) with DuckDB 75 | ```bash 76 | ice duck 77 | ``` 78 | This opens an interactive DuckDB session with pointers to all your tables and namespaces. 79 | 80 | Example DuckDB queries: 81 | ``` 82 | show; -- show all tables 83 | select * from catalog.namespaces; -- list namespaces 84 | select * from catalog.tables; -- list tables 85 | select * from <namespace>.<table>; -- query an Iceberg table 86 | ``` 87 | 88 | ## Python Usage 89 | 90 | ```python 91 | from boringcatalog import BoringCatalog 92 | 93 | # Auto-detects .ice/index in the current working directory 94 | catalog = BoringCatalog() 95 | 96 | # Or specify a catalog explicitly 97 | catalog = BoringCatalog(name="mycat", uri="path/to/catalog.json") 98 | 99 | # Interact with your Iceberg catalog 100 | catalog.create_namespace("my_namespace") 101 | 102 | # Create a table (a schema is required) and append data from a Parquet file 103 | import pyarrow.parquet as pq 104 | df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet") 105 | table = catalog.create_table("my_namespace.my_table", schema=df.schema) 106 | table.append(df) 107 | table = catalog.load_table("my_namespace.my_table") 108 | ``` 109 | 110 | 111 | ## Custom Init and Catalog Location 112 | 113 | You can configure your Iceberg catalog in several ways, depending on where you want to store your catalog metadata (the JSON file) and your Iceberg data (the 
warehouse): 114 | - The `warehouse` property determines where your Iceberg tables' data will be stored. 115 | - The `--catalog` option lets you specify the exact path for your catalog JSON file. 116 | - If you use both, the catalog file will be created at the path you specify, and the warehouse will be used for table data. 117 | 118 | ### Examples 119 | | Command Example | Catalog File Location | Warehouse/Data Location | Use Case | 120 | |-----------------|----------------------|------------------------|----------| 121 | | `ice init` | `warehouse/catalog/catalog_boring.json` | `warehouse/` | Local, simple | 122 | | `ice init -p warehouse=...` | `<warehouse>/catalog/catalog_boring.json` | `<warehouse>/` | Custom warehouse | 123 | | `ice init --catalog ...` | `<custom path>.json` | (defined when you create a table) | Custom catalog file | 124 | | `ice init --catalog ... -p warehouse=...` | `<custom path>.json` | `<warehouse>/` | Full control | 125 | | `ice init --catalog ... --catalog-name ...` | `<custom path>.json` | (defined when you create a table) | Custom name & file | 126 | 127 | ### Edge Cases & Manual Editing 128 | - **Custom Catalog Name:** By default, the catalog is named `"boring"`, but you can set a custom name with `--catalog-name`. This name is used in the catalog JSON and for file naming if you don't specify a custom path. 129 | - **Re-initialization:** If you run `ice init` multiple times in the same directory, the `.ice/index` file will be overwritten with the new configuration. This is useful if you want to re-point your project to a different catalog, but be aware that it will not migrate or merge any existing data. 130 | - **Manual Editing:** Advanced users can manually edit `.ice/index` to point to a different catalog file or change the catalog name. If you do this, make sure the `catalog_uri` and `catalog_name` fields are consistent with your actual catalog JSON file. If you set a `warehouse` property but do not update `catalog_uri`, Boring Catalog will always use the `catalog_uri` from the index file. 
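For reference, here is roughly what the resulting `.ice/index` file looks like after a plain `ice init` (a sketch of the default values; the paths change according to the `--catalog`, `--catalog-name`, and `warehouse` options described above):

```json
{
  "catalog_uri": "warehouse/catalog/catalog_boring.json",
  "catalog_name": "boring",
  "properties": {
    "warehouse": "warehouse"
  }
}
```

When hand-editing this file, keep `catalog_uri` pointing at a real catalog JSON file, since Boring Catalog always resolves the catalog from `catalog_uri` first.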
131 | 132 | ## Roadmap 133 | - [ ] Improve CLI to allow MERGE operations, partition specs, etc. 134 | - [ ] Improve CLI to get info about table schema / partition spec / etc. 135 | - [ ] Expose a REST API for integration with AWS, Snowflake, etc. 136 | -------------------------------------------------------------------------------- /docs/boringdata.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/boringdata/boring-catalog/4c85dbddb9039f1d03a941da39f952366fe5050a/docs/boringdata.png -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | name = "boringcatalog" 3 | version = "0.4.0" 4 | description = "A lightweight, file-based Iceberg catalog implementation" 5 | readme = "README.md" 6 | authors = [ 7 | { name = "huraultj", email = "julien.hurault@sumeo.io" } 8 | ] 9 | requires-python = ">=3.10" 10 | dependencies = [ 11 | "s3fs>=2023.12.0", 12 | "click>=8.0.0", 13 | "duckdb>=0.9.0", 14 | "pyiceberg[pyarrow]>=0.9.0" 15 | ] 16 | urls = {Homepage = "https://github.com/boringdata/boring-catalog"} 17 | 18 | [project.scripts] 19 | ice = "boringcatalog.cli:cli" 20 | 21 | [project.optional-dependencies] 22 | test = [ 23 | "pytest>=7.0.0", 24 | "pandas>=2.0.0", 25 | "pyarrow>=14.0.0" 26 | ] 27 | 28 | [build-system] 29 | requires = ["hatchling"] 30 | build-backend = "hatchling.build" 31 | 32 | [tool.pytest.ini_options] 33 | testpaths = ["tests"] 34 | python_files = ["test_*.py"] 35 | addopts = "-v --tb=short" 36 | -------------------------------------------------------------------------------- /src/boringcatalog/__init__.py: -------------------------------------------------------------------------------- 1 | """A lightweight, file-based Iceberg catalog implementation.""" 2 | 3 | from .catalog import BoringCatalog 4 | 5 | __all__ = ["BoringCatalog"] 6 | 7 | 
-------------------------------------------------------------------------------- /src/boringcatalog/catalog.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, List, Optional, Set, Tuple, Union, Any 2 | import uuid 3 | import json 4 | import os 5 | import tempfile 6 | import fsspec 7 | import logging 8 | from pyiceberg.io import load_file_io 9 | from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec 10 | from pyiceberg.schema import Schema 11 | from pyiceberg.serializers import FromInputFile 12 | from pyiceberg.table import CommitTableResponse, Table 13 | from pyiceberg.table.locations import load_location_provider 14 | from pyiceberg.table.metadata import new_table_metadata 15 | from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder 16 | from pyiceberg.table.update import TableRequirement, TableUpdate 17 | from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties 18 | from pyiceberg.types import strtobool 19 | from pyiceberg.catalog import ( 20 | Catalog, 21 | MetastoreCatalog, 22 | METADATA_LOCATION, 23 | PREVIOUS_METADATA_LOCATION, 24 | TABLE_TYPE, 25 | ICEBERG, 26 | PropertiesUpdateSummary, 27 | ) 28 | from pyiceberg.exceptions import ( 29 | NamespaceAlreadyExistsError, 30 | NamespaceNotEmptyError, 31 | NoSuchNamespaceError, 32 | NoSuchTableError, 33 | TableAlreadyExistsError, 34 | NoSuchPropertyException, 35 | NoSuchIcebergTableError, 36 | CommitFailedException, 37 | ) 38 | 39 | 40 | from time import perf_counter 41 | # Set up logging 42 | logger = logging.getLogger(__name__) 43 | 44 | DEFAULT_INIT_CATALOG_TABLES = "true" 45 | DEFAULT_CATALOG_NAME = "boring" 46 | class ConcurrentModificationError(CommitFailedException): 47 | """Raised when a concurrent modification is detected.""" 48 | pass 49 | 50 | class BoringCatalog(MetastoreCatalog): 51 | """A simple file-based Iceberg catalog implementation.""" 52 | 53 | def __init__(self, name: str = None, 
**properties: str): 54 | # If name or properties are not provided, try to read them from .ice/index once 55 | index_path = os.path.join(os.getcwd(), ".ice/index") 56 | index = None 57 | if (name is None or not properties) and os.path.exists(index_path): 58 | with open(index_path, 'r') as f: 59 | index = json.load(f) 60 | if name is None: 61 | name = index.get("catalog_name", DEFAULT_CATALOG_NAME) 62 | if not properties: 63 | properties = index.get("properties", {}) 64 | if name is None: 65 | name = DEFAULT_CATALOG_NAME 66 | super().__init__(name, **properties) 67 | 68 | if index is not None and "catalog_uri" in index: 69 | self.uri = index["catalog_uri"] 70 | self.properties = index["properties"] 71 | elif self.properties.get("uri"): 72 | self.uri = self.properties.get("uri") 73 | elif self.properties.get("warehouse"): 74 | self.uri = os.path.join(self.properties.get("warehouse"), "catalog", f"catalog_{name}.json") 75 | else: 76 | raise ValueError("Either provide a 'uri' or 'warehouse' property to initialize BoringCatalog") 77 | 78 | # Always infer warehouse if missing and uri is set 79 | if self.uri and not self.properties.get("warehouse"): 80 | warehouse_path = os.path.dirname(self.uri) 81 | self.properties["warehouse"] = warehouse_path 82 | logging.info(f"No --warehouse specified for the catalog. 
Using catalog folder to store Iceberg data: {warehouse_path}") 83 | 84 | init_catalog_tables = strtobool(self.properties.get("init_catalog_tables", DEFAULT_INIT_CATALOG_TABLES)) 85 | 86 | if init_catalog_tables: 87 | self._ensure_tables_exist() 88 | 89 | @property 90 | def catalog(self): 91 | catalog, _ = self._read_catalog_json() 92 | return catalog 93 | 94 | def latest_snapshot(self, table_identifier: str): 95 | """Return the current (most recent) snapshot of a table, or None.""" 96 | table = self.load_table(table_identifier) 97 | snapshot = table.current_snapshot() 98 | return snapshot 99 | 100 | 101 | def _ensure_tables_exist(self): 102 | """Ensure catalog directory and catalog.json exist.""" 103 | try: 104 | 105 | io = load_file_io(properties=self.properties, location=self.uri) 106 | 107 | # Check if catalog file exists 108 | input_file = io.new_input(self.uri) 109 | if not input_file.exists(): 110 | # Create initial catalog structure 111 | initial_catalog = { 112 | "catalog_name": self.name, 113 | "namespaces": {}, 114 | "tables": {} 115 | } 116 | 117 | # Write the initial catalog file 118 | with io.new_output(self.uri).create(overwrite=True) as f: 119 | f.write(json.dumps(initial_catalog, indent=2).encode('utf-8')) 120 | 121 | except Exception as e: 122 | raise ValueError(f"Failed to initialize catalog at {self.uri}: {str(e)}") 123 | 124 | def _read_catalog_json(self): 125 | """Read catalog.json using FileIO, returning (data, etag).""" 126 | try: 127 | io = load_file_io(properties=self.properties, location=self.uri) 128 | input_file = io.new_input(self.uri) 129 | 130 | if not input_file.exists(): 131 | return {"catalog_name": self.name, "namespaces": {}, "tables": {}}, None 132 | 133 | with input_file.open() as f: 134 | data = json.loads(f.read().decode('utf-8')) 135 | 136 | # Get metadata for ETag 137 | metadata = input_file.metadata() if hasattr(input_file, 'metadata') else {} 138 | etag = metadata.get("ETag") 139 | return data, etag 140 | 141 | 
except Exception as e: 142 | if 'No such file' in str(e) or 'not found' in str(e) or '404' in str(e): 143 | return {"catalog_name": self.name, "namespaces": {}, "tables": {}}, None 144 | raise 145 | 146 | def _write_catalog_json(self, data, etag=None): 147 | """Write catalog.json using FileIO, using ETag for concurrency if provided.""" 148 | try: 149 | io = load_file_io(properties=self.properties, location=self.uri) 150 | 151 | # Create output file with ETag check if provided 152 | output_file = io.new_output(self.uri) 153 | if etag is not None and hasattr(output_file, 'set_metadata'): 154 | output_file.set_metadata({"if_match": etag}) 155 | 156 | with output_file.create(overwrite=True) as f: 157 | f.write(json.dumps(data, indent=2).encode('utf-8')) 158 | 159 | except Exception as e: 160 | if 'PreconditionFailed' in str(e) or '412' in str(e): 161 | raise ConcurrentModificationError("catalog.json was modified concurrently") 162 | raise 163 | 164 | def _table_key(self, namespace: str, table_name: str) -> str: 165 | return f"{namespace}.{table_name}" 166 | 167 | def create_table( 168 | self, 169 | identifier: Union[str, Identifier], 170 | schema: Union[Schema, "pa.Schema"], 171 | location: Optional[str] = None, 172 | partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC, 173 | sort_order: SortOrder = UNSORTED_SORT_ORDER, 174 | properties: Properties = EMPTY_DICT, 175 | ) -> Table: 176 | """Create an Iceberg table.""" 177 | schema: Schema = self._convert_schema_if_needed(schema) # type: ignore 178 | namespace_tuple = Catalog.namespace_from(identifier) 179 | namespace = Catalog.namespace_to_string(namespace_tuple) 180 | table_name = Catalog.table_name_from(identifier) 181 | table_key = self._table_key(namespace, table_name) 182 | 183 | data, etag = self._read_catalog_json() 184 | if namespace not in data["namespaces"]: 185 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace}") 186 | if table_key in data.get("tables", {}): 187 | raise 
TableAlreadyExistsError(f"Table {namespace}.{table_name} already exists") 188 | 189 | location = self._resolve_table_location(location, namespace, table_name) 190 | location_provider = load_location_provider(table_location=location, table_properties=properties) 191 | metadata_location = location_provider.new_table_metadata_file_location() 192 | 193 | metadata = new_table_metadata( 194 | location=location, schema=schema, partition_spec=partition_spec, sort_order=sort_order, properties=properties 195 | ) 196 | io = load_file_io(properties=self.properties, location=metadata_location) 197 | self._write_metadata(metadata, io, metadata_location) 198 | 199 | # Add table entry to catalog.json 200 | if "tables" not in data: 201 | data["tables"] = {} 202 | data["tables"][table_key] = { 203 | "namespace": namespace, 204 | "name": table_name, 205 | "metadata_location": metadata_location 206 | } 207 | 208 | self._write_catalog_json(data, etag) 209 | 210 | return self.load_table(identifier) 211 | 212 | def load_table(self, identifier: Union[str, Identifier], catalog_name: str = None) -> Table: 213 | """Load the table's metadata and return the table instance using catalog.json.""" 214 | namespace_tuple = Catalog.namespace_from(identifier) 215 | namespace = Catalog.namespace_to_string(namespace_tuple) 216 | table_name = Catalog.table_name_from(identifier) 217 | table_key = self._table_key(namespace, table_name) 218 | data, _ = self._read_catalog_json() 219 | table_entry = data.get("tables", {}).get(table_key) 220 | if not table_entry: 221 | raise NoSuchTableError(f"Table does not exist: {namespace}.{table_name}") 222 | metadata_location = table_entry["metadata_location"] 223 | io = load_file_io(properties=self.properties, location=metadata_location) 224 | file = io.new_input(metadata_location) 225 | metadata = FromInputFile.table_metadata(file) 226 | return Table( 227 | identifier=Catalog.identifier_to_tuple(namespace) + (table_name,), 228 | metadata=metadata, 229 | 
metadata_location=metadata_location, 230 | io=self._load_file_io(metadata.properties, metadata_location), 231 | catalog=self 232 | ) 233 | 234 | def drop_table(self, identifier: Union[str, Identifier]) -> None: 235 | """Drop a table.""" 236 | namespace_tuple = Catalog.namespace_from(identifier) 237 | namespace = Catalog.namespace_to_string(namespace_tuple) 238 | table_name = Catalog.table_name_from(identifier) 239 | table_key = self._table_key(namespace, table_name) 240 | data, etag = self._read_catalog_json() 241 | if table_key not in data.get("tables", {}): 242 | raise NoSuchTableError(f"Table does not exist: {namespace}.{table_name}") 243 | del data["tables"][table_key] 244 | self._write_catalog_json(data, etag) 245 | 246 | def rename_table(self, from_identifier: Union[str, Identifier], to_identifier: Union[str, Identifier]) -> Table: 247 | """Rename a table.""" 248 | from_namespace_tuple = Catalog.namespace_from(from_identifier) 249 | from_namespace = Catalog.namespace_to_string(from_namespace_tuple) 250 | from_table_name = Catalog.table_name_from(from_identifier) 251 | from_table_key = self._table_key(from_namespace, from_table_name) 252 | 253 | to_namespace_tuple = Catalog.namespace_from(to_identifier) 254 | to_namespace = Catalog.namespace_to_string(to_namespace_tuple) 255 | to_table_name = Catalog.table_name_from(to_identifier) 256 | to_table_key = self._table_key(to_namespace, to_table_name) 257 | 258 | data, etag = self._read_catalog_json() 259 | if not self._namespace_exists(to_namespace): 260 | raise NoSuchNamespaceError(f"Namespace does not exist: {to_namespace}") 261 | 262 | if from_table_key not in data.get("tables", {}): 263 | raise NoSuchTableError(f"Table does not exist: {from_namespace}.{from_table_name}") 264 | 265 | if to_table_key in data.get("tables", {}): 266 | raise TableAlreadyExistsError(f"Table {to_namespace}.{to_table_name} already exists") 267 | 268 | table_entry = data["tables"][from_table_key] 269 | table_entry["namespace"] = 
to_namespace 270 | table_entry["name"] = to_table_name 271 | data["tables"][to_table_key] = table_entry 272 | del data["tables"][from_table_key] 273 | 274 | self._write_catalog_json(data, etag) 275 | return self.load_table(to_identifier) 276 | 277 | def create_namespace(self, namespace: Union[str, Identifier], properties: Properties = EMPTY_DICT) -> None: 278 | """Create a namespace in the catalog.json file.""" 279 | namespace_str = Catalog.namespace_to_string(namespace) 280 | data, etag = self._read_catalog_json() 281 | if namespace_str in data["namespaces"]: 282 | raise NamespaceAlreadyExistsError(f"Namespace already exists: {namespace_str}") 283 | data["namespaces"][namespace_str] = {"properties": properties or {"exists": "true"}} 284 | self._write_catalog_json(data, etag) 285 | 286 | def drop_namespace(self, namespace: Union[str, Identifier]) -> None: 287 | """Drop a namespace from catalog.json.""" 288 | namespace_str = Catalog.namespace_to_string(namespace) 289 | data, etag = self._read_catalog_json() 290 | if namespace_str not in data["namespaces"]: 291 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace_str}") 292 | if any(tbl["namespace"] == namespace_str for tbl in data["tables"].values()): 293 | raise NamespaceNotEmptyError(f"Namespace {namespace_str} is not empty.") 294 | del data["namespaces"][namespace_str] 295 | self._write_catalog_json(data, etag) 296 | 297 | def list_tables(self, namespace: Union[str, Identifier]) -> List[Identifier]: 298 | """List tables under the given namespace from catalog.json.""" 299 | namespace_str = Catalog.namespace_to_string(namespace) 300 | data, _ = self._read_catalog_json() 301 | if namespace_str and namespace_str not in data["namespaces"]: 302 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace_str}") 303 | return [ 304 | Catalog.identifier_to_tuple(tbl["namespace"]) + (tbl["name"],) 305 | for tbl in data.get("tables", {}).values() 306 | if tbl["namespace"] == namespace_str 307 | ] 308 | 
309 | def list_namespaces(self, namespace: Union[str, Identifier] = ()) -> List[Identifier]: 310 | """List namespaces from catalog.json.""" 311 | data, _ = self._read_catalog_json() 312 | all_namespaces = list(data["namespaces"].keys()) 313 | if not namespace: 314 | return [Catalog.identifier_to_tuple(ns) for ns in all_namespaces] 315 | ns_tuple = Catalog.identifier_to_tuple(namespace) 316 | ns_prefix = Catalog.namespace_to_string(namespace) 317 | # Only return direct children 318 | result = [] 319 | for ns in all_namespaces: 320 | ns_parts = Catalog.identifier_to_tuple(ns) 321 | if ns_parts[:len(ns_tuple)] == ns_tuple and len(ns_parts) == len(ns_tuple) + 1: 322 | result.append(ns_parts) 323 | return result 324 | 325 | def load_namespace_properties(self, namespace: Union[str, Identifier]) -> Properties: 326 | """Get properties for a namespace from catalog.json.""" 327 | namespace_str = Catalog.namespace_to_string(namespace) 328 | data, _ = self._read_catalog_json() 329 | if namespace_str not in data["namespaces"]: 330 | raise NoSuchNamespaceError(f"Namespace {namespace_str} does not exist") 331 | return data["namespaces"][namespace_str].get("properties", {}) 332 | 333 | def _namespace_exists(self, namespace: Union[str, Identifier]) -> bool: 334 | """Check if a namespace exists in catalog.json.""" 335 | namespace_str = Catalog.namespace_to_string(namespace) 336 | data, _ = self._read_catalog_json() 337 | return namespace_str in data["namespaces"] 338 | 339 | def _table_exists(self, identifier: Union[str, Identifier]) -> bool: 340 | """Check if a table exists in catalog.json.""" 341 | namespace_tuple = Catalog.namespace_from(identifier) 342 | namespace = Catalog.namespace_to_string(namespace_tuple) 343 | table_name = Catalog.table_name_from(identifier) 344 | table_key = self._table_key(namespace, table_name) 345 | data, _ = self._read_catalog_json() 346 | return table_key in data.get("tables", {}) 347 | 348 | def list_views(self, namespace: Union[str, Identifier]) -> 
List[Identifier]: 349 | return [] 350 | 351 | def drop_view(self, identifier: Union[str, Identifier]) -> None: 352 | raise NotImplementedError("Views are not supported") 353 | 354 | def view_exists(self, identifier: Union[str, Identifier]) -> bool: 355 | return False 356 | 357 | def commit_table( 358 | self, table: Table, requirements: Tuple[TableRequirement, ...], updates: Tuple[TableUpdate, ...] 359 | ) -> CommitTableResponse: 360 | """Commit updates to a table.""" 361 | table_identifier = table.name() 362 | namespace_tuple = Catalog.namespace_from(table_identifier) 363 | namespace = Catalog.namespace_to_string(namespace_tuple) 364 | table_name = Catalog.table_name_from(table_identifier) 365 | 366 | current_table: Optional[Table] 367 | try: 368 | current_table = self.load_table(table_identifier) 369 | except NoSuchTableError: 370 | current_table = None 371 | 372 | updated_staged_table = self._update_and_stage_table(current_table, table.name(), requirements, updates) 373 | if current_table and updated_staged_table.metadata == current_table.metadata: 374 | return CommitTableResponse(metadata=current_table.metadata, metadata_location=current_table.metadata_location) 375 | 376 | self._write_metadata( 377 | metadata=updated_staged_table.metadata, 378 | io=updated_staged_table.io, 379 | metadata_path=updated_staged_table.metadata_location, 380 | ) 381 | 382 | try: 383 | data, etag = self._read_catalog_json() 384 | table_key = self._table_key(namespace, table_name) 385 | 386 | if current_table: 387 | if data["tables"][table_key]["metadata_location"] != current_table.metadata_location: 388 | raise CommitFailedException(f"Table has been updated by another process: {namespace}.{table_name}") 389 | data["tables"][table_key]["previous_metadata_location"] = current_table.metadata_location 390 | else: 391 | if table_key in data["tables"]: 392 | raise TableAlreadyExistsError(f"Table {namespace}.{table_name} already exists") 393 | data["tables"][table_key] = { 394 | "namespace": 
namespace, 395 | "name": table_name, 396 | "previous_metadata_location": None 397 | } 398 | 399 | data["tables"][table_key]["metadata_location"] = updated_staged_table.metadata_location 400 | self._write_catalog_json(data, etag) 401 | 402 | except Exception as e: 403 | try: 404 | updated_staged_table.io.delete(updated_staged_table.metadata_location) 405 | except Exception: 406 | pass 407 | raise e 408 | 409 | return CommitTableResponse( 410 | metadata=updated_staged_table.metadata, 411 | metadata_location=updated_staged_table.metadata_location 412 | ) 413 | 414 | def register_table(self, identifier: Union[str, Identifier], metadata_location: str) -> Table: 415 | """Register a new table using existing metadata.""" 416 | namespace_tuple = Catalog.namespace_from(identifier) 417 | namespace = Catalog.namespace_to_string(namespace_tuple) 418 | table_name = Catalog.table_name_from(identifier) 419 | table_key = self._table_key(namespace, table_name) 420 | 421 | if not self._namespace_exists(namespace): 422 | raise NoSuchNamespaceError(f"Namespace does not exist: {namespace}") 423 | 424 | data, etag = self._read_catalog_json() 425 | if table_key in data.get("tables", {}): 426 | raise TableAlreadyExistsError(f"Table {namespace}.{table_name} already exists") 427 | 428 | data["tables"][table_key] = { 429 | "namespace": namespace, 430 | "name": table_name, 431 | "metadata_location": metadata_location, 432 | "previous_metadata_location": None 433 | } 434 | self._write_catalog_json(data, etag) 435 | 436 | return self.load_table(identifier) 437 | 438 | def update_namespace_properties( 439 | self, namespace: Union[str, Identifier], removals: Optional[Set[str]] = None, updates: Properties = EMPTY_DICT 440 | ) -> PropertiesUpdateSummary: 441 | """Remove provided property keys and update properties for a namespace in catalog.json.""" 442 | namespace_str = Catalog.namespace_to_string(namespace) 443 | data, etag = self._read_catalog_json() 444 | if namespace_str not in 
data["namespaces"]: 445 | raise NoSuchNamespaceError(f"Namespace {namespace_str} does not exist") 446 | current_properties = data["namespaces"][namespace_str].get("properties", {}) 447 | removed_keys: List[str] = [] 448 | missing_keys: List[str] = [] 449 | if removals: 450 | for key in removals: 451 | if key in current_properties: 452 | current_properties.pop(key) 453 | removed_keys.append(key) 454 | else: 455 | missing_keys.append(key) 456 | if updates: 457 | current_properties.update(updates) 458 | data["namespaces"][namespace_str]["properties"] = current_properties 459 | self._write_catalog_json(data, etag) 460 | return PropertiesUpdateSummary(removed=removed_keys, updated=list(updates.keys()), missing=missing_keys) 461 | -------------------------------------------------------------------------------- /src/boringcatalog/cli.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import click 4 | import duckdb 5 | import subprocess 6 | import tempfile 7 | import logging 8 | from string import Template 9 | from .catalog import BoringCatalog 10 | import pyarrow.parquet as pq 11 | import datetime 12 | # Configure logging to display in CLI 13 | logging.basicConfig( 14 | format='%(message)s', 15 | level=logging.INFO 16 | ) 17 | 18 | DEFAULT_NAMESPACE = "ice_default" 19 | DEFAULT_CATALOG_NAME = "boring" 20 | # Silence pyiceberg logs 21 | logging.getLogger('pyiceberg').setLevel(logging.WARNING) 22 | 23 | def ensure_ice_dir(): 24 | """Ensure .ice directory exists and return its path.""" 25 | ice_dir = os.path.abspath('.ice') 26 | os.makedirs(ice_dir, exist_ok=True) 27 | return ice_dir 28 | 29 | def load_index(): 30 | """Load configuration from .ice/index if it exists.""" 31 | index_path = os.path.join(ensure_ice_dir(), 'index') 32 | try: 33 | with open(index_path, 'r') as f: 34 | return json.load(f) 35 | except (FileNotFoundError, json.JSONDecodeError): 36 | return None 37 | 38 | def save_index(properties, catalog_uri=None, catalog_name=None): 39 | """Save configuration to .ice/index with separate catalog_uri, catalog_name, and properties sections.""" 40 | config = { 41 | 
"catalog_uri": catalog_uri, 42 | "catalog_name": catalog_name, 43 | "properties": properties 44 | } 45 | index_path = os.path.join(ensure_ice_dir(), 'index') 46 | with open(index_path, 'w') as f: 47 | json.dump(config, f, indent=2) 48 | 49 | def get_catalog(): 50 | """Get catalog instance from stored configuration.""" 51 | config = load_index() 52 | if not config: 53 | raise click.ClickException( 54 | "No catalog configuration found. Run 'ice init' first." 55 | ) 56 | 57 | properties = config.get("properties", {}) 58 | if config.get("catalog_uri"): 59 | properties["uri"] = config["catalog_uri"] 60 | # Use catalog_name from top-level field if present, else default 61 | catalog_name = config.get("catalog_name", DEFAULT_CATALOG_NAME) 62 | return BoringCatalog(catalog_name, **properties) 63 | 64 | def print_version(ctx, param, value): 65 | if not value or ctx.resilient_parsing: 66 | return 67 | click.echo('Boring Catalog version 0.4.0') 68 | ctx.exit() 69 | 70 | def get_sql_template(): 71 | """Read the SQL template file.""" 72 | template_path = os.path.join(os.path.dirname(__file__), 'duckdb_init.sql') 73 | with open(template_path, 'r') as f: 74 | return Template(f.read()) 75 | 76 | def print_table_log(catalog, table_identifier, label=None): 77 | """Print the log (snapshots) for a given table identifier.""" 78 | if not catalog._table_exists(table_identifier): 79 | return False 80 | if label: 81 | click.echo(label) 82 | table = catalog.load_table(table_identifier) 83 | snapshots = sorted(table.snapshots(), key=lambda x: x.timestamp_ms, reverse=True) 84 | if not snapshots: 85 | click.echo(f"No snapshots found for table {table_identifier}.") 86 | return False 87 | for snap in snapshots: 88 | click.echo(f"commit {snap.snapshot_id:<20}") 89 | ts = datetime.datetime.fromtimestamp(int(snap.timestamp_ms) / 1000, tz=datetime.timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC') 90 | click.echo(f" Table: {table_identifier:<25}") 91 | click.echo(f" Date: {ts:<25}") 92 | click.echo(f" Operation: 
{str(snap.summary.operation):<15}") 93 | click.echo(f" Summary:") 94 | summary = snap.summary.additional_properties 95 | max_key_len = max(len(str(k)) for k in summary.keys()) if summary else 0 96 | for k, v in summary.items(): 97 | click.echo(f" {k.ljust(max_key_len)} : {v}") 98 | click.echo(f" ") 99 | return True 100 | 101 | @click.group(invoke_without_command=True) 102 | @click.option('--version', is_flag=True, callback=print_version, expose_value=False, is_eager=True, help='Show version and exit') 103 | @click.pass_context 104 | def cli(ctx): 105 | """Boring Catalog CLI tool. 106 | 107 | Run 'ice COMMAND --help' for more information on a command. 108 | """ 109 | # Show help if no command is provided 110 | if ctx.invoked_subcommand is None: 111 | click.echo(ctx.get_help()) 112 | ctx.exit() 113 | 114 | @cli.command() 115 | @click.option('--catalog', help='Custom location for catalog.json (default: warehouse/catalog/catalog_.json)') 116 | @click.option('--property', '-p', multiple=True, help='Properties in the format key=value') 117 | @click.option('--catalog-name', default=DEFAULT_CATALOG_NAME, show_default=True, help='Name of the catalog (used in file naming and metadata)') 118 | def init(catalog, property, catalog_name): 119 | """Initialize a new Boring Catalog.""" 120 | 121 | try: 122 | properties = {} 123 | for prop in property: 124 | try: 125 | key, value = prop.split('=', 1) 126 | properties[key.strip()] = value.strip() 127 | except ValueError: 128 | raise click.ClickException(f"Invalid property format: {prop}. 
Use key=value format") 129 | 130 | if not catalog and not "warehouse" in properties: 131 | catalog = f"warehouse/catalog/catalog_{catalog_name}.json" 132 | properties["warehouse"] = "warehouse" 133 | 134 | elif not catalog and "warehouse" in properties: 135 | catalog = f"{properties['warehouse']}/catalog/catalog_{catalog_name}.json" 136 | 137 | # Do NOT save catalog_name in properties anymore 138 | save_index(properties, catalog, catalog_name) 139 | 140 | properties["uri"] = catalog 141 | catalog_instance = BoringCatalog(catalog_name, **properties) 142 | 143 | # Display information in specific order 144 | click.echo(f"Initialized Boring Catalog in {os.path.join('.ice', 'index')}") 145 | click.echo(f"Catalog location: {catalog}") 146 | if "warehouse" in properties: 147 | click.echo(f"Warehouse location: {properties['warehouse']}") 148 | click.echo(f"Catalog name: {catalog_name}") 149 | 150 | except Exception as e: 151 | click.echo(f"Error initializing catalog: {str(e)}", err=True) 152 | raise click.Abort() 153 | 154 | @cli.command(name='list-namespaces') 155 | @click.argument('parent', required=False) 156 | def list_namespaces(parent): 157 | """List all namespaces or child namespaces of PARENT.""" 158 | try: 159 | catalog = get_catalog() 160 | namespaces = catalog.list_namespaces(parent if parent else ()) 161 | 162 | if not namespaces: 163 | click.echo("No namespaces found.") 164 | return 165 | 166 | click.echo("Namespaces:") 167 | for ns in namespaces: 168 | click.echo(f" {'.'.join(ns)}") 169 | except Exception as e: 170 | click.echo(f"Error listing namespaces: {str(e)}", err=True) 171 | raise click.Abort() 172 | 173 | @cli.command(name='list-tables') 174 | @click.argument('namespace', required=False) 175 | def list_tables(namespace): 176 | """List all tables in the specified NAMESPACE, or all tables in all namespaces if not specified.""" 177 | try: 178 | catalog = get_catalog() 179 | 180 | if namespace: 181 | tables = catalog.list_tables(namespace) 182 | if not 
tables: 183 | click.echo(f"No tables found in namespace '{namespace}'.") 184 | return 185 | click.echo(f"Tables in namespace '{namespace}':") 186 | for table in tables: 187 | table_name = table[-1] 188 | click.echo(f" {table_name}") 189 | else: 190 | namespaces = catalog.list_namespaces() 191 | found_any = False 192 | for ns_tuple in namespaces: 193 | ns = ".".join(ns_tuple) 194 | tables = catalog.list_tables(ns) 195 | if tables: 196 | found_any = True 197 | click.echo(f"Tables in namespace '{ns}':") 198 | for table in tables: 199 | table_name = table[-1] 200 | click.echo(f" {table_name}") 201 | if not found_any: 202 | click.echo("No tables found in any namespace.") 203 | except Exception as e: 204 | click.echo(f"Error listing tables: {str(e)}", err=True) 205 | raise click.Abort() 206 | 207 | @cli.command(context_settings=dict( 208 | ignore_unknown_options=True, 209 | allow_extra_args=True, 210 | )) 211 | @click.option('--catalog-path', help='Optional path to a catalog.json') 212 | @click.argument('duckdb_args', nargs=-1) 213 | def duck(catalog_path=None, duckdb_args=()): 214 | """Open DuckDB CLI with catalog configuration. Optionally provide a path to a catalog.json. Extra arguments are passed to DuckDB CLI.""" 215 | try: 216 | if catalog_path: 217 | properties = {"uri": os.path.abspath(catalog_path)} 218 | catalog = BoringCatalog(DEFAULT_CATALOG_NAME, **properties) 219 | else: 220 | config = load_index() 221 | if not config: 222 | raise click.ClickException( 223 | "No catalog configuration found. Run 'ice init' first." 224 | ) 225 | catalog = get_catalog() 226 | 227 | if len(catalog.list_namespaces()) == 0: 228 | raise click.ClickException("No namespaces found in catalog. Run 'ice create-namespace' to create a namespace.") 229 | 230 | if len(catalog.catalog.get("tables", {}).keys()) == 0: 231 | raise click.ClickException("No tables found in catalog. 
Run 'ice commit' to create a table.") 232 | 233 | # Get SQL template and substitute variables 234 | template_str = get_sql_template().template 235 | # Add S3 configuration at the beginning of the script 236 | if "s3" in catalog.uri: 237 | s3_config = ( 238 | ".mode list\n" 239 | ".header off\n" 240 | "SELECT 'boring-catalog: Loading s3 secrets...' ;\n" 241 | ".mode line\n" 242 | "CREATE OR REPLACE SECRET secret (TYPE s3, PROVIDER credential_chain);\n" 243 | ) 244 | # Insert the S3 configuration right after the first comment line 245 | lines = template_str.split('\n') 246 | template_str = lines[0] + '\n' + s3_config + '\n'.join(lines[1:]) 247 | template = Template(template_str) 248 | 249 | sql = template.substitute(CATALOG_JSON=catalog.uri) 250 | 251 | # Write the SQL to a temporary file 252 | with tempfile.NamedTemporaryFile(mode='w', suffix='.sql', delete=False) as f: 253 | f.write(sql) 254 | 255 | # Start DuckDB CLI with the initialization script and extra args 256 | cmd = ['duckdb', '--init', f.name] + list(duckdb_args) 257 | 258 | subprocess.run(cmd) 259 | 260 | # Clean up 261 | os.unlink(f.name) 262 | 263 | except Exception as e: 264 | click.echo(f"Error starting DuckDB CLI: {str(e)}", err=True) 265 | raise click.Abort() 266 | 267 | @cli.command(name='create-namespace') 268 | @click.argument('namespace', required=True) 269 | @click.option('--property', '-p', multiple=True, help='Properties in the format key=value') 270 | def create_namespace(namespace, property): 271 | """Create a new namespace in the catalog. 272 | 273 | NAMESPACE is the name of the namespace to create (e.g. 'my_namespace' or 'parent.child') 274 | """ 275 | try: 276 | catalog = get_catalog() 277 | 278 | # Parse properties if provided 279 | properties = {} 280 | for prop in property: 281 | try: 282 | key, value = prop.split('=', 1) 283 | properties[key.strip()] = value.strip() 284 | except ValueError: 285 | raise click.ClickException(f"Invalid property format: {prop}. 
Use key=value format") 286 | 287 | # Create the namespace 288 | catalog.create_namespace(namespace, properties) 289 | click.echo(f"Created namespace: {namespace}") 290 | if properties: 291 | click.echo("Properties:") 292 | for key, value in properties.items(): 293 | click.echo(f" {key}: {value}") 294 | 295 | except Exception as e: 296 | click.echo(f"Error creating namespace: {str(e)}", err=True) 297 | raise click.Abort() 298 | 299 | # Add a utility function to resolve table identifier with namespace logic 300 | 301 | def resolve_table_identifier_with_namespace(catalog, table_identifier): 302 | """Resolve table identifier to include namespace, creating default if needed (like in commit).""" 303 | if len(table_identifier.split(".")) == 1: 304 | namespaces = catalog.list_namespaces() 305 | if len(namespaces) == 0: 306 | click.echo(f"No namespace found, creating and using default namespace: {DEFAULT_NAMESPACE}") 307 | namespace = DEFAULT_NAMESPACE 308 | catalog.create_namespace(namespace) 309 | elif len(namespaces) == 1: 310 | namespace = namespaces[0][0] 311 | else: 312 | raise click.ClickException("No namespace specified. 
Please specify a namespace for the table.") 313 | table_identifier = f"{namespace}.{table_identifier}" 314 | return table_identifier 315 | 316 | @cli.command(name='commit') 317 | @click.argument('table_identifier', required=True) 318 | @click.option('--source', required=True, help='Parquet file URI to commit as a new snapshot') 319 | @click.option('--mode', default='append', help='Mode to commit the file', type=click.Choice(['append', 'overwrite'])) 320 | def commit(table_identifier, source, mode): 321 | """Commit a new snapshot to a table from a Parquet file.""" 322 | try: 323 | catalog = get_catalog() 324 | table_identifier = resolve_table_identifier_with_namespace(catalog, table_identifier) 325 | df = pq.read_table(source) 326 | if not catalog.table_exists(table_identifier): 327 | click.echo(f"Table {table_identifier} does not exist in the catalog. Creating it now...") 328 | catalog.create_table(table_identifier, schema=df.schema) 329 | table = catalog.load_table(table_identifier) 330 | if mode == "append": 331 | table.append(df) 332 | elif mode == "overwrite": 333 | table.overwrite(df) 334 | else: 335 | raise click.ClickException(f"Invalid mode: {mode}. 
Use 'append' or 'overwrite'.") 336 | click.echo(f"Committed {source} to table {table_identifier}") 337 | except Exception as e: 338 | click.echo(f"Error committing file to table: {str(e)}", err=True) 339 | raise click.Abort() 340 | 341 | @cli.command(name='log') 342 | @click.argument('table_identifier', required=False) 343 | def log_snapshots(table_identifier): 344 | """Print all snapshot entries for a table or all tables in the current catalog or default namespace.""" 345 | try: 346 | catalog = get_catalog() 347 | if not table_identifier: 348 | # Default to the default namespace, create if needed 349 | namespaces = catalog.list_namespaces() 350 | if not any(ns[0] == DEFAULT_NAMESPACE for ns in namespaces): 351 | click.echo(f"No namespace found, creating and using default namespace: {DEFAULT_NAMESPACE}") 352 | catalog.create_namespace(DEFAULT_NAMESPACE) 353 | tables = catalog.list_tables(DEFAULT_NAMESPACE) 354 | if not tables: 355 | click.echo(f"No tables found in default namespace '{DEFAULT_NAMESPACE}'.") 356 | return 357 | found_any = False 358 | for table in tables: 359 | table_identifier_full = f"{DEFAULT_NAMESPACE}.{table[-1]}" 360 | if print_table_log(catalog, table_identifier_full, label=f"=== Log for table: {table_identifier_full} ==="): 361 | found_any = True 362 | if not found_any: 363 | click.echo("No snapshots found for any table in the default namespace.") 364 | return 365 | # If a table_identifier is provided, resolve it as in commit 366 | table_identifier = resolve_table_identifier_with_namespace(catalog, table_identifier) 367 | if not catalog._table_exists(table_identifier): 368 | raise click.ClickException(f"Table {table_identifier} does not exist in the catalog.") 369 | print_table_log(catalog, table_identifier) 370 | except Exception as e: 371 | click.echo(f"Error loading catalog or snapshots: {str(e)}", err=True) 372 | raise click.Abort() 373 | 374 | @cli.command(name='catalog') 375 | def print_catalog(): 376 | """Print the current catalog.json 
as JSON.""" 377 | try: 378 | catalog = get_catalog() 379 | catalog_json = catalog.catalog 380 | click.echo(json.dumps(catalog_json, indent=2)) 381 | except Exception as e: 382 | click.echo(f"Error printing catalog: {str(e)}", err=True) 383 | raise click.Abort() 384 | 385 | if __name__ == '__main__': 386 | cli() -------------------------------------------------------------------------------- /src/boringcatalog/duckdb_init.sql: -------------------------------------------------------------------------------- 1 | -- Install and load extensions 2 | SET VARIABLE catalog_json = '${CATALOG_JSON}'; 3 | SET VARIABLE tmp_file = '/tmp/iceberg_init.sql'; 4 | 5 | .mode list 6 | .header off 7 | SELECT 'boring-catalog: Loading extensions...'; 8 | INSTALL iceberg; 9 | LOAD iceberg; 10 | 11 | SELECT 'boring-catalog: Init schemas and tables...' ; 12 | -- Create schemas 13 | CREATE SCHEMA IF NOT EXISTS catalog; 14 | CREATE TABLE catalog.namespaces AS 15 | SELECT 16 | namespace, 17 | unnest(properties.properties) 18 | FROM ( 19 | UNPIVOT ( 20 | SELECT unnest(namespaces) 21 | FROM read_json(getvariable('catalog_json')) 22 | ) ON COLUMNS(*) INTO name namespace value properties 23 | ); 24 | 25 | CREATE OR REPLACE TABLE catalog.namespaces AS 26 | SELECT 27 | namespace, 28 | unnest(properties.properties) 29 | FROM ( 30 | UNPIVOT ( 31 | SELECT 32 | unnest(namespaces) 33 | FROM read_json(getvariable('catalog_json')) 34 | ) ON COLUMNS(*) INTO name namespace value properties 35 | ); 36 | CREATE OR REPLACE TABLE catalog.tables AS 37 | SELECT 38 | properties.namespace as namespace, 39 | table_name as table_name, 40 | unnest(properties) 41 | FROM ( 42 | UNPIVOT ( 43 | SELECT unnest(tables) 44 | FROM read_json(getvariable('catalog_json')) 45 | ) ON COLUMNS(*) INTO name table_name value properties 46 | ); 47 | 48 | 49 | .mode list 50 | .header off 51 | .once getvariable("tmp_file") 52 | select 'CREATE SCHEMA IF NOT EXISTS ' || i || ';' 53 | from (select namespace from catalog.namespaces) x(i); 54 | 
.read getvariable("tmp_file")


.mode list
.header off
.once getvariable("tmp_file")
select 'CREATE OR REPLACE VIEW ' || j || ' AS SELECT * FROM iceberg_scan(''' || k || ''');'
from (select table_name, metadata_location from catalog.tables) x(j,k);
.read getvariable("tmp_file")

.mode list
.header off
.once getvariable("tmp_file")
select 'CREATE TABLE ' || j || '_metadata AS SELECT * FROM iceberg_metadata(''' || k || ''');'
from (select table_name, metadata_location from catalog.tables) x(j,k);
.read getvariable("tmp_file")

.mode list
.header off
.once getvariable("tmp_file")
select 'CREATE TABLE ' || j || '_snapshots AS SELECT unnest(snapshots, recursive:=true) from read_json(''' || k || ''');'
from (select table_name, metadata_location from catalog.tables) x(j,k);

.read getvariable("tmp_file")

SELECT '' ;
SELECT 'Everything is ready! ' ;
SELECT '' ;
SELECT 'Here are some commands to help you get started:' ;
SELECT '  > show;                              -- show all tables' ;
SELECT '  > select * from catalog.namespaces;  -- list namespaces' ;
SELECT '  > select * from catalog.tables;      -- list tables' ;
SELECT '  > select * from <namespace>.<table>; -- query iceberg table' ;

SELECT '' ;

.mode duckbox
.prompt 'ice ➜ '
--------------------------------------------------------------------------------
/tests/test_catalog.py:
--------------------------------------------------------------------------------
# (Moved from src/boringcatalog/test_catalog.py to tests/test_catalog.py)
import os
import subprocess
import sys
import pytest
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import json
from boringcatalog import BoringCatalog
import shutil
import logging

@pytest.fixture(scope="function")
def tmp_catalog_dir(tmp_path):
    return tmp_path

@pytest.fixture(scope="function")
def dummy_parquet(tmp_path):
    # Create a small dummy parquet file
    df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
    table = pa.Table.from_pandas(df)
    parquet_path = tmp_path / "dummy.parquet"
    pq.write_table(table, parquet_path)
    return parquet_path

def run_cli(args, cwd):
    cmd = [sys.executable, "-m", "boringcatalog.cli"] + args
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    print("STDOUT:\n", result.stdout)
    print("STDERR:\n", result.stderr)
    return result

@pytest.mark.parametrize("args,expected_index,do_workflow", [
    # ice init (no args)
    ([], {
        "catalog_uri": "warehouse/catalog/catalog_boring.json",
        "properties": {"warehouse": "warehouse"}
    }, True),
    # ice init -p warehouse=warehouse3
    (["-p", "warehouse=warehouse3"], {
        "catalog_uri": "warehouse3/catalog/catalog_boring.json",
        "properties": {"warehouse": "warehouse3"}
    }, True),
    # ice init --catalog warehouse2/catalog_boring.json
    (["--catalog", "warehouse2/catalog_boring.json"], {
        "catalog_uri": "warehouse2/catalog_boring.json",
        "properties": {}
    }, False),
    # ice init --catalog tt/catalog.json -p warehouse=warehouse4
    (["--catalog", "tt/catalog.json", "-p", "warehouse=warehouse4"], {
        "catalog_uri": "tt/catalog.json",
        "properties": {"warehouse": "warehouse4"}
    }, False),
    # ice init -p warehouse=tttrr
    (["-p", "warehouse=tttrr"], {
        "catalog_uri": "tttrr/catalog/catalog_boring.json",
        "properties": {"warehouse": "tttrr"}
    }, False),
])
def test_ice_init_variants(tmp_path, args, expected_index, do_workflow, caplog):
    # Clean up .ice if it exists
    ice_dir = tmp_path / ".ice"
    if ice_dir.exists():
        shutil.rmtree(ice_dir)
    # If warehouse is needed, create it
    warehouse = expected_index["properties"].get("warehouse")
    if warehouse:
        warehouse_dir = tmp_path / warehouse
        warehouse_dir.mkdir(parents=True, exist_ok=True)
    # Run CLI
    result = run_cli(["init"] + args, cwd=tmp_path)
    assert result.returncode == 0
    index_path = tmp_path / ".ice" / "index"
    assert index_path.exists(), f".ice/index not created for args {args}"
    # Check content
    with open(index_path) as f:
        index = json.load(f)
    assert index["catalog_uri"] == expected_index["catalog_uri"], f"catalog_uri mismatch for args {args}"
    # Only check properties equality if warehouse is specified in expected_index
    if expected_index["properties"]:
        assert index["properties"] == expected_index["properties"], f"properties mismatch for args {args}"
    # Check BoringCatalog usage
    os.chdir(tmp_path)
    caplog.set_level(logging.INFO)
    catalog = BoringCatalog()
    # If warehouse is not specified, it should default to the catalog folder
    if not expected_index["properties"].get("warehouse"):
        expected_warehouse = str(os.path.dirname(index["catalog_uri"]))
        assert catalog.properties["warehouse"] == expected_warehouse
        assert f"Using catalog folder to store iceberg data: {expected_warehouse}" in caplog.text
    namespaces = catalog.list_namespaces()
    assert isinstance(namespaces, list)
    # If do_workflow, run commit, log, catalog commands
    if do_workflow:
        # Create dummy parquet
        df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
        table = pa.Table.from_pandas(df)
        parquet_path = tmp_path / "dummy.parquet"
        pq.write_table(table, parquet_path)
        # ice commit my_table --source dummy.parquet
        result = run_cli([
            "commit", "my_table", "--source", str(parquet_path)
        ], cwd=tmp_path)
        assert result.returncode == 0
        assert "Committed" in result.stdout
        # ice log my_table
        result = run_cli(["log", "my_table"], cwd=tmp_path)
        assert result.returncode == 0
        assert "commit" in result.stdout
        # ice catalog
        result = run_cli(["catalog"], cwd=tmp_path)
        assert result.returncode == 0
        assert '"tables"' in result.stdout

    # ice duck (just check it starts; skip running duckdb interactively in CI)
    # result = run_cli(["duck"], cwd=tmp_path)
    # assert result.returncode == 0

def test_custom_catalog_name(tmp_path):
    # Clean up .ice if it exists
    ice_dir = tmp_path / ".ice"
    if ice_dir.exists():
        shutil.rmtree(ice_dir)
    warehouse_dir = tmp_path / "customwarehouse"
    warehouse_dir.mkdir(parents=True, exist_ok=True)
    custom_name = "mycat"
    # Run CLI with custom catalog name
    result = run_cli([
        "init", "-p", "warehouse=customwarehouse", "--catalog-name", custom_name
    ], cwd=tmp_path)
    assert result.returncode == 0
    index_path = tmp_path / ".ice" / "index"
    assert index_path.exists(), ".ice/index not created for custom catalog name"
    with open(index_path) as f:
        index = json.load(f)
    assert index["catalog_uri"].endswith(f"catalog_{custom_name}.json")
    assert index["catalog_name"] == custom_name
    # Check BoringCatalog instance uses the custom name
    os.chdir(tmp_path)
    catalog = BoringCatalog()
    assert catalog.name == custom_name
    # Check catalog.json content
    with open(index["catalog_uri"]) as f:
        catalog_json = json.load(f)
    assert catalog_json["catalog_name"] == custom_name

def test_custom_catalog_file_path(tmp_path):
    # Test initializing with a fully custom catalog file path
    ice_dir = tmp_path / ".ice"
    if ice_dir.exists():
        shutil.rmtree(ice_dir)
    custom_catalog_path = tmp_path / "mydir" / "mycustom.json"
    custom_catalog_path.parent.mkdir(parents=True, exist_ok=True)
    result = run_cli([
        "init", "--catalog", str(custom_catalog_path), "--catalog-name", "specialcat"
    ], cwd=tmp_path)
    assert result.returncode == 0
    index_path = tmp_path / ".ice" / "index"
    assert index_path.exists(), ".ice/index not created for custom catalog path"
    with open(index_path) as f:
        index = json.load(f)
    assert index["catalog_uri"] == str(custom_catalog_path)
    assert index["catalog_name"] == "specialcat"
    assert custom_catalog_path.exists(), "Custom catalog file was not created"
    # Check BoringCatalog loads from this path
    os.chdir(tmp_path)
    catalog = BoringCatalog()
    assert catalog.name == "specialcat"
    assert catalog.uri == str(custom_catalog_path)


def test_reinit_overwrite_behavior(tmp_path):
    # Test running ice init twice in the same directory
    ice_dir = tmp_path / ".ice"
    if ice_dir.exists():
        shutil.rmtree(ice_dir)
    warehouse_dir = tmp_path / "warehouse"
    warehouse_dir.mkdir(parents=True, exist_ok=True)
    # First init
    result1 = run_cli(["init", "-p", "warehouse=warehouse"], cwd=tmp_path)
    assert result1.returncode == 0
    index_path = tmp_path / ".ice" / "index"
    assert index_path.exists()
    with open(index_path) as f:
        index1 = json.load(f)
    # Second init (should overwrite or succeed)
    result2 = run_cli(["init", "-p", "warehouse=warehouse"], cwd=tmp_path)
    # Accept both overwrite and success (should not crash)
    assert result2.returncode == 0
    with open(index_path) as f:
        index2 = json.load(f)
    # The index file should still be valid and point to the same warehouse
    assert index2["properties"]["warehouse"] == "warehouse"


def test_manual_index_loading(tmp_path):
    # Test loading a catalog from a manually created .ice/index file
    ice_dir = tmp_path / ".ice"
    ice_dir.mkdir(exist_ok=True)
    custom_catalog_path = tmp_path / "manualcat.json"
    # Write a minimal catalog file
    with open(custom_catalog_path, "w") as f:
        json.dump({"catalog_name": "manualcat", "namespaces": {}, "tables": {}}, f)
    # Write a custom .ice/index
    index = {
        "catalog_uri": str(custom_catalog_path),
        "catalog_name": "manualcat",
        "properties": {"warehouse": "manualwarehouse"}
    }
    with open(ice_dir / "index", "w") as f:
        json.dump(index, f)
    os.chdir(tmp_path)
    catalog = BoringCatalog()
    assert catalog.name == "manualcat"
    assert catalog.uri == str(custom_catalog_path)
    assert catalog.properties["warehouse"] == "manualwarehouse"
    # Now test missing catalog_name (should default to 'boring')
    index2 = {
        "catalog_uri": str(custom_catalog_path),
        "properties": {"warehouse": "manualwarehouse"}
    }
    with open(ice_dir / "index", "w") as f:
        json.dump(index2, f)
    catalog2 = BoringCatalog()
    assert catalog2.name == "boring"
    assert catalog2.uri == str(custom_catalog_path)
    assert catalog2.properties["warehouse"] == "manualwarehouse"
--------------------------------------------------------------------------------