├── geo_transformer ├── __init__.py ├── tests │ ├── __init__.py │ ├── data │ │ └── test_points.txt.gz │ ├── conftest.py │ ├── test_cli.py │ ├── test_trie.py │ ├── test_io.py │ └── test_transformer.py ├── data │ └── test_points.txt.gz ├── models.py ├── app.py ├── trie.py ├── transformer.py └── io.py ├── pyproject.toml ├── requirements.in ├── .coveragerc ├── requirements-dev.in ├── .github └── workflows │ ├── black.yml │ └── test.yml ├── requirements.txt ├── Makefile ├── setup.py ├── LICENSE ├── requirements-dev.txt ├── .gitignore └── README.md /geo_transformer/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /geo_transformer/tests/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.black] 2 | line-length = 150 -------------------------------------------------------------------------------- /requirements.in: -------------------------------------------------------------------------------- 1 | geolib==1.0.7 2 | typer==0.4.1 -------------------------------------------------------------------------------- /.coveragerc: -------------------------------------------------------------------------------- 1 | [run] 2 | omit = 3 | */tests/* 4 | **/__init__.py -------------------------------------------------------------------------------- /geo_transformer/data/test_points.txt.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gaarv/stuart-data-challenge/main/geo_transformer/data/test_points.txt.gz -------------------------------------------------------------------------------- /geo_transformer/tests/data/test_points.txt.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gaarv/stuart-data-challenge/main/geo_transformer/tests/data/test_points.txt.gz -------------------------------------------------------------------------------- /requirements-dev.in: -------------------------------------------------------------------------------- 1 | geolib==1.0.7 2 | typer==0.4.1 3 | 4 | pip-tools==6.6.0 5 | black 6 | pytest==7.1.2 7 | pytest-cov==3.0.0 8 | pytest-benchmark==3.4.1 9 | pyinstaller==5.1 -------------------------------------------------------------------------------- /.github/workflows/black.yml: -------------------------------------------------------------------------------- 1 | name: Lint 2 | 3 | on: [push, pull_request] 4 | 5 | jobs: 6 | lint: 7 | runs-on: ubuntu-latest 8 | steps: 9 | - uses: actions/checkout@v3 10 | - uses: psf/black@stable 11 | with: 12 | options: "--check" 13 | src: "./geo_transformer" -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile with python 3.9 3 | # To update, run: 4 | # 5 | # pip-compile requirements.in 6 | # 7 | click==8.1.3 8 | # via typer 9 | colorama==0.4.4 10 | # via click 11 | future==0.18.2 12 | # via geolib 13 | geolib==1.0.7 14 | # via -r requirements.in 15 | typer==0.4.1 16 | # via -r requirements.in 17 | -------------------------------------------------------------------------------- 
/Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: all 2 | 3 | install: 4 | pip install -r requirements.txt 5 | pip install -U . 6 | 7 | install-dev: 8 | pip install -r requirements-dev.txt 9 | pip install -U . 10 | 11 | test: 12 | python -m pytest --cov -v -s -k "not benchmark" 13 | 14 | benchmark: 15 | python -m pytest -v -s -k "benchmark" 16 | 17 | standalone: 18 | pyinstaller --onefile geo_transformer/app.py --name geo-transformer -------------------------------------------------------------------------------- /geo_transformer/models.py: -------------------------------------------------------------------------------- 1 | from typing import NamedTuple 2 | 3 | 4 | class Location(NamedTuple): 5 | """A location represented by latitude and longitude.""" 6 | 7 | lat: float 8 | lng: float 9 | 10 | 11 | class Geohash(NamedTuple): 12 | """A object containing original location, geohash encoding and a unique prefix identifier.""" 13 | 14 | location: Location 15 | geohash: str 16 | uniq: str 17 | -------------------------------------------------------------------------------- /.github/workflows/test.yml: -------------------------------------------------------------------------------- 1 | name: Test 2 | 3 | on: [push, pull_request] 4 | 5 | jobs: 6 | test: 7 | runs-on: ubuntu-latest 8 | steps: 9 | - name: Checkout 10 | uses: actions/checkout@v3 11 | 12 | - name: Set up Python 3.9 13 | uses: actions/setup-python@v3 14 | with: 15 | python-version: "3.9" 16 | 17 | - name: Install dependencies and package 18 | run: make install-dev 19 | 20 | - name: Run test suite 21 | run: make test 22 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name="geo_transformer", 5 | version="1.0.0", 6 | description="Test challenge for Stuart", 7 | url="https://github.com/StuartHiring/python-test-sebastienhoarau", 8 | author="Sebastien Hoarau", 9 | author_email="sebastien.h.data.eng@gmail.com", 10 | classifiers=[ 11 | "License :: OSI Approved :: MIT License", 12 | "Programming Language :: Python :: 3.9", 13 | "Programming Language :: Python :: 3 :: Only", 14 | ], 15 | packages=find_packages(), 16 | python_requires=">=3.9, <3.10", 17 | package_data={ 18 | "geo_transformer": [ 19 | "data/test_points.txt.gz", 20 | ], 21 | }, 22 | ) 23 | -------------------------------------------------------------------------------- /geo_transformer/tests/conftest.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | from typing import Optional 3 | 4 | import pytest 5 | from geo_transformer.io import extract_from_file, load_locations 6 | 7 | TEST_FILE = Path("geo_transformer/tests/data/test_points.txt.gz") 8 | 9 | 10 | EXPECTED_TEST_OUTPUT = """lat,lng,geohash,uniq 11 | 41.388828145321,2.1689976634898,sp3e3qe7mkcb,sp3e3 12 | 41.390743,2.138067,sp3e2wuys9dr,sp3e2wuy 13 | 41.390853,2.138177,sp3e2wuzpnhr,sp3e2wuz 14 | """ 15 | 16 | 17 | @pytest.fixture(scope="session") 18 | def extracted_file(): 19 | return extract_from_file(TEST_FILE) 20 | 21 | 22 | @pytest.fixture(scope="function") 23 | def data_test_locations(extracted_file: Optional[Path]): 24 | if extracted_file: 25 | data_points = load_locations(extracted_file) 26 | return data_points 27 | else: 28 | return [] 29 | 
-------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Sebastien Hoarau 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /geo_transformer/tests/test_cli.py: -------------------------------------------------------------------------------- 1 | import tempfile 2 | from pathlib import Path 3 | 4 | from geo_transformer.app import app 5 | from geo_transformer.tests.conftest import EXPECTED_TEST_OUTPUT, TEST_FILE 6 | from typer.testing import CliRunner 7 | 8 | runner = CliRunner() 9 | 10 | 11 | def test_app_stdout(): 12 | result = runner.invoke(app, [TEST_FILE.as_posix()]) 13 | assert result.exit_code == 0 14 | assert result.output == EXPECTED_TEST_OUTPUT 15 | 16 | 17 | def test_app_output_to_file(): 18 | with tempfile.NamedTemporaryFile(suffix=".csv") as output_file: 19 | result = runner.invoke(app, [TEST_FILE.as_posix(), "--output-file", output_file.name]) 20 | assert result.exit_code == 0 21 | assert Path(output_file.name).read_text() == EXPECTED_TEST_OUTPUT 22 | 23 | 24 | def test_app_non_existing_file(): 25 | test_file = "non_existing_file.csv" 26 | result = runner.invoke(app, [test_file]) 27 | assert result.exit_code == 1 28 | assert result.output == f"Input file {test_file} does not exist\n" 29 | 30 | 31 | def test_app_non_gz_file(): 32 | with tempfile.NamedTemporaryFile(suffix=".gz") as test_file: 33 | result = runner.invoke(app, [test_file.name]) 34 | assert result.exit_code == 1 35 | assert result.output == f"Error while extracting archive {test_file.name}: Input file {test_file.name} is not a valid gzip file\n" 36 | -------------------------------------------------------------------------------- /geo_transformer/tests/test_trie.py: -------------------------------------------------------------------------------- 1 | from geo_transformer.trie import Trie 2 | 3 | 4 | def test_insert_in_trie(): 5 | trie = Trie() 6 | words = ["ab", "bc", "ac"] 7 | for word in words: 8 | trie.insert(word) 9 | 10 | root_children = sorted(list(trie.root.children.keys())) 11 | a_children = sorted(list(trie.root.children["a"].children.keys())) 12 | b_children = sorted(list(trie.root.children["b"].children.keys())) 13 | 14 | assert root_children == ["a", "b"] 15 | assert a_children == ["b", "c"] 16 | assert b_children == ["c"] 17 | 18 | 
19 | def test_search_in_trie_existing_path(): 20 | trie = Trie() 21 | words = ["whatever", "whenever", "wherever"] 22 | for word in words: 23 | trie.insert(word) 24 | path = trie.search("whenever") 25 | path = [node.char for node in path] 26 | assert path == ["w", "h", "e", "n", "e", "v", "e", "r"] 27 | 28 | 29 | def test_search_in_trie_partial_existing_path(): 30 | trie = Trie() 31 | words = ["whatever", "whenever", "wherever"] 32 | for word in words: 33 | trie.insert(word) 34 | path = trie.search("whocares") 35 | path = [node.char for node in path] 36 | assert path == ["w", "h"] 37 | 38 | 39 | def test_search_in_trie_empty_path(): 40 | trie = Trie() 41 | words = ["whatever", "whenever", "wherever"] 42 | for word in words: 43 | trie.insert(word) 44 | path = trie.search("null") 45 | path = [node.char for node in path] 46 | assert path == [] 47 | -------------------------------------------------------------------------------- /geo_transformer/tests/test_io.py: -------------------------------------------------------------------------------- 1 | import tempfile 2 | from pathlib import Path 3 | from typing import Generator 4 | 5 | from geo_transformer.io import extract_from_file, load_locations, write_to_file 6 | from geo_transformer.models import Location 7 | from geo_transformer.tests.conftest import TEST_FILE, EXPECTED_TEST_OUTPUT 8 | from geo_transformer import transformer 9 | 10 | 11 | def test_extract_from_file(): 12 | extracted_file = extract_from_file(TEST_FILE) 13 | assert extracted_file is not None 14 | assert extracted_file.exists() 15 | assert len(extracted_file.open("r").readlines()) == 4 16 | 17 | 18 | def test_extract_from_non_existing_file(): 19 | extracted_file = extract_from_file(Path("/path/to/nofile.gz")) 20 | assert extracted_file is None 21 | 22 | 23 | def test_extract_from_non_gzip_file(): 24 | with tempfile.NamedTemporaryFile(suffix=".gz") as test_file: 25 | extracted_file = extract_from_file(Path(test_file.name)) 26 | assert extracted_file is None 27 | 28 | 29 | def test_load_points(extracted_file): 30 | data_points = load_locations(extracted_file) 31 | assert len(list(data_points)) == 3 32 | 33 | 34 | def test_write_to_file(data_test_locations: Generator[Location, None, None]): 35 | with tempfile.NamedTemporaryFile(suffix=".csv") as output_file: 36 | locations = list(data_test_locations) # copy generator to list 37 | index = transformer.build(transformer.encode(location.lat, location.lng) for location in locations) 38 | geohashs = transformer.transform(locations, index) # type: ignore 39 | write_to_file(Path(output_file.name), geohashs) 40 | assert Path(output_file.name).read_text() == EXPECTED_TEST_OUTPUT 41 | -------------------------------------------------------------------------------- /requirements-dev.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile with python 3.9 3 | # To update, run: 4 | # 5 | # pip-compile requirements-dev.in 6 | # 7 | altgraph==0.17.2 8 | # via pyinstaller 9 | attrs==21.4.0 10 | # via pytest 11 | black==22.3.0 12 | # via -r requirements-dev.in 13 | click==8.1.3 14 | # via 15 | # black 16 | # pip-tools 17 | # typer 18 | coverage[toml]==6.3.2 19 | # via pytest-cov 20 | future==0.18.2 21 | # via geolib 22 | geolib==1.0.7 23 | # via -r requirements-dev.in 24 | iniconfig==1.1.1 25 | # via pytest 26 | mypy-extensions==0.4.3 27 | # via black 28 | packaging==21.3 29 | # via pytest 30 | pathspec==0.9.0 31 | # via black 32 | pep517==0.12.0 33 | # via pip-tools 
34 | pip-tools==6.6.0 35 | # via -r requirements-dev.in 36 | platformdirs==2.5.2 37 | # via black 38 | pluggy==1.0.0 39 | # via pytest 40 | py==1.11.0 41 | # via pytest 42 | py-cpuinfo==8.0.0 43 | # via pytest-benchmark 44 | pyinstaller==5.1 45 | # via -r requirements-dev.in 46 | pyinstaller-hooks-contrib==2022.5 47 | # via pyinstaller 48 | pyparsing==3.0.8 49 | # via packaging 50 | pytest==7.1.2 51 | # via 52 | # -r requirements-dev.in 53 | # pytest-benchmark 54 | # pytest-cov 55 | pytest-benchmark==3.4.1 56 | # via -r requirements-dev.in 57 | pytest-cov==3.0.0 58 | # via -r requirements-dev.in 59 | tomli==2.0.1 60 | # via 61 | # black 62 | # coverage 63 | # pep517 64 | # pytest 65 | typer==0.4.1 66 | # via -r requirements-dev.in 67 | typing-extensions==4.2.0 68 | # via black 69 | wheel==0.37.1 70 | # via pip-tools 71 | 72 | # The following packages are considered to be unsafe in a requirements file: 73 | # pip 74 | # setuptools 75 | -------------------------------------------------------------------------------- /geo_transformer/app.py: -------------------------------------------------------------------------------- 1 | from functools import partial 2 | from pathlib import Path 3 | from typing import Optional 4 | 5 | import typer 6 | 7 | from geo_transformer import io, transformer 8 | 9 | app = typer.Typer() 10 | 11 | 12 | @app.command() 13 | def main( 14 | input_file: Path = typer.Argument(..., help="Path to input file, gzip compressed."), 15 | output_file: Optional[Path] = typer.Option(None, help="Path to output file. If not provided, output will be printed to stdout."), 16 | verbose: bool = typer.Option(False, help="Provide additional information while running the program."), 17 | ): 18 | if input_file.is_file(): 19 | extracted_file = io.extract_from_file(input_file) 20 | 21 | if extracted_file: 22 | stream_locations = partial(io.load_locations, extracted_file) # use partial as generator factory 23 | geohashed_locations = (transformer.encode(location.lat, location.lng) for location in stream_locations()) 24 | 25 | if verbose: 26 | typer.secho(f"Loading locations from {extracted_file.as_posix()}", fg=typer.colors.GREEN, err=True) 27 | 28 | index = transformer.build(geohashed_locations) 29 | geohashs = transformer.transform(stream_locations(), index) 30 | 31 | if not output_file: 32 | io.print_to_console(geohashs) 33 | else: 34 | if verbose: 35 | typer.secho(f"Saving geohashes to {output_file.as_posix()}", fg=typer.colors.GREEN, err=True) 36 | io.write_to_file(output_file, geohashs) 37 | else: 38 | raise typer.Exit(code=1) 39 | else: 40 | typer.secho(f"Input file {input_file} does not exist", fg=typer.colors.RED, err=True) 41 | raise typer.Exit(code=1) 42 | 43 | 44 | if __name__ == "__main__": 45 | app() 46 | -------------------------------------------------------------------------------- /geo_transformer/trie.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, List 2 | 3 | 4 | class TrieNode: 5 | """ 6 | A node in a Trie. Contains a character and a dictionary of children nodes. 7 | """ 8 | 9 | def __init__(self, char: str): 10 | self.char = char 11 | self.children: Dict[str, TrieNode] = {} 12 | 13 | @property 14 | def is_leaf(self) -> bool: 15 | return any(self.children) 16 | 17 | 18 | class Trie: 19 | """ 20 | A Trie is a tree data structure that stores a set of strings. See https://en.wikipedia.org/wiki/Trie. 
21 | """ 22 | 23 | def __init__(self): 24 | self.root = TrieNode("") 25 | 26 | def insert(self, s: str) -> None: 27 | """Insert a given string into the Trie. 28 | 29 | Args: 30 | s (str): the string to insert 31 | """ 32 | current_node = self.root 33 | for char in s: 34 | if char not in current_node.children: 35 | child = TrieNode(char) 36 | current_node.children[char] = child 37 | current_node = child 38 | else: 39 | current_node = current_node.children[char] 40 | 41 | def search(self, s: str) -> List[TrieNode]: 42 | """Search for a given string in the Trie. 43 | 44 | Args: 45 | s (str): string to search 46 | 47 | Returns: 48 | List[TrieNode]: list of TrieNodes corresponding to the path of the string in the Trie. 49 | If an exact match is not found, returns the partial path or an empty list if no match is found. 50 | """ 51 | path = [] 52 | node = self.root 53 | for char in s: 54 | node = node.children.get(char) 55 | if node: 56 | path.append(node) 57 | else: 58 | break 59 | return path 60 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | .benchmarks 54 | 55 | # Translations 56 | *.mo 57 | *.pot 58 | 59 | # Django stuff: 60 | *.log 61 | local_settings.py 62 | db.sqlite3 63 | db.sqlite3-journal 64 | 65 | # Flask stuff: 66 | instance/ 67 | .webassets-cache 68 | 69 | # Scrapy stuff: 70 | .scrapy 71 | 72 | # Sphinx documentation 73 | docs/_build/ 74 | 75 | # PyBuilder 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | .python-version 87 | 88 | # pipenv 89 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 90 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 91 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 92 | # install all needed dependencies. 93 | #Pipfile.lock 94 | 95 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow
96 | __pypackages__/
97 | 
98 | # Celery stuff
99 | celerybeat-schedule
100 | celerybeat.pid
101 | 
102 | # SageMath parsed files
103 | *.sage.py
104 | 
105 | # Environments
106 | .env
107 | .venv
108 | env/
109 | venv/
110 | ENV/
111 | env.bak/
112 | venv.bak/
113 | 
114 | # Spyder project settings
115 | .spyderproject
116 | .spyproject
117 | 
118 | # Rope project settings
119 | .ropeproject
120 | 
121 | # mkdocs documentation
122 | /site
123 | 
124 | # mypy
125 | .mypy_cache/
126 | .dmypy.json
127 | dmypy.json
128 | 
129 | # Pyre type checker
130 | .pyre/
131 | 
--------------------------------------------------------------------------------
/geo_transformer/transformer.py:
--------------------------------------------------------------------------------
1 | from typing import Generator
2 | 
3 | from geolib import geohash as geohashlib
4 | 
5 | from geo_transformer.models import Geohash, Location
6 | from geo_transformer.trie import Trie
7 | 
8 | 
9 | def encode(latitude: float, longitude: float, precision: int = 12) -> str:
10 |     """Encode latitude and longitude to geohash.
11 | 
12 |     Args:
13 |         latitude (float): latitude of a given location
14 |         longitude (float): longitude of a given location
15 |         precision (int, optional): geohash precision, from 1 (lowest) to 12 (highest), inclusive. Defaults to 12.
16 | 
17 |     Returns:
18 |         str: geohash as string
19 |     """
20 |     geohash = geohashlib.encode(latitude, longitude, precision)
21 |     return geohash
22 | 
23 | 
24 | def build(geohashs: Generator[str, None, None]) -> Trie:
25 |     """Helper function to build a trie from a geohash generator.
26 | 
27 |     Args:
28 |         geohashs (Generator[str, None, None]): geohashes to insert into the trie
29 | 
30 |     Returns:
31 |         Trie: a trie with all geohashes inserted
32 |     """
33 |     trie = Trie()
34 |     for geohash in geohashs:
35 |         trie.insert(geohash)
36 |     return trie
37 | 
38 | 
39 | def query_unique_prefix(geohash: str, trie: Trie) -> str:
40 |     """Query a trie to obtain a unique prefix for a given geohash.
41 | 
42 |     Args:
43 |         geohash (str): geohash to query
44 |         trie (Trie): Trie to query against
45 | 
46 |     Returns:
47 |         str: unique prefix found in the trie
48 |     """
49 |     unique_index = 0
50 | 
51 |     # reverse the path found
52 |     reversed_path = trie.search(geohash)[::-1]
53 |     for i in range(len(reversed_path)):
54 |         unique_index = i
55 |         # stop when the parent node of the current child has more than one child, or when the first character has been reached
56 |         if i + 1 >= len(reversed_path) or len(reversed_path[i + 1].children) > 1:
57 |             break
58 | 
59 |     # slice the path to obtain the unique prefix and reverse it back
60 |     path = reversed_path[unique_index:][::-1]
61 |     return "".join([node.char for node in path])
62 | 
63 | 
64 | def transform(locations: Generator[Location, None, None], trie: Trie) -> Generator[Geohash, None, None]:
65 |     """Transform locations to Geohash objects. Adds geohash encoding and a unique prefix identifier to input locations.
66 | 67 | Args: 68 | locations (Generator[Location, None, None]): Locations to transform 69 | 70 | Yields: 71 | Generator[Geohash, None, None]: GeoHash objects 72 | """ 73 | for location in locations: 74 | geohash = encode(location.lat, location.lng) 75 | unique_prefix = query_unique_prefix(geohash, trie) 76 | yield Geohash(location=location, geohash=geohash, uniq=unique_prefix) 77 | -------------------------------------------------------------------------------- /geo_transformer/io.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import gzip 3 | import tempfile 4 | from pathlib import Path 5 | from typing import Generator, Optional 6 | 7 | import typer 8 | 9 | from geo_transformer.models import Geohash, Location 10 | 11 | GZIP_MAGIC_BYTES = "1f8b" 12 | CSV_HEADER = ["lat", "lng", "geohash", "uniq"] 13 | 14 | 15 | def extract_from_file(archive: Path) -> Optional[Path]: 16 | """Extract gzip file containing raw data 17 | 18 | Args: 19 | archive (Path): path to the archive, ie. "data/input.gz" 20 | 21 | Returns: 22 | Path: extracted file absolute path 23 | """ 24 | extracted_file = None 25 | try: 26 | magic_bytes = archive.open("rb").read(2).hex() 27 | if magic_bytes != GZIP_MAGIC_BYTES: 28 | raise Exception(f"Input file {archive.as_posix()} is not a valid gzip file") 29 | dest = Path(tempfile.mkdtemp()).joinpath(archive.stem) 30 | content = gzip.decompress(archive.open("rb").read()) 31 | dest.write_bytes(content) 32 | extracted_file = dest 33 | except Exception as e: 34 | typer.secho(f"Error while extracting archive {archive.as_posix()}: {e}", fg=typer.colors.RED, err=True) 35 | return extracted_file 36 | 37 | 38 | def load_locations(csv_file: Path) -> Generator[Location, None, None]: 39 | """Load points from CSV file as Location objects 40 | 41 | Args: 42 | csv_file (Path): path to the CSV file, ie. "data/input.csv" 43 | 44 | Yields: 45 | Generator[Location, None, None]: Location objects 46 | """ 47 | try: 48 | with csv_file.open("r") as csvfile: 49 | csvreader = csv.reader(csvfile) 50 | next(csvreader) # skip header 51 | for row in csvreader: 52 | yield Location(lat=float(row[0]), lng=float(row[1])) 53 | except Exception as e: 54 | typer.secho(f"Error while loading points: {e}", fg=typer.colors.RED, err=True) 55 | yield from () 56 | 57 | 58 | def print_to_console(geohashs: Generator[Geohash, None, None]) -> None: 59 | """Print Geohash objects to console as CSV 60 | 61 | Args: 62 | locations (Generator[Geohash, None, None]): Geohash objects 63 | """ 64 | typer.echo(",".join(CSV_HEADER)) # print header 65 | for geohash in geohashs: 66 | typer.secho(f"{geohash.location.lat},{geohash.location.lng},{geohash.geohash},{geohash.uniq}") 67 | 68 | 69 | def write_to_file(output_file: Path, geohashs: Generator[Geohash, None, None]) -> None: 70 | """Write Geohash objects to CSV file 71 | 72 | Args: 73 | output_file (Path): path to the output file, ie. 
"data/output.csv" 74 | geohashs (Generator[Geohash, None, None]): Geohash objects 75 | """ 76 | with output_file.open("w") as csvfile: 77 | csvwriter = csv.writer(csvfile) 78 | csvwriter.writerow(CSV_HEADER) 79 | for geohash in geohashs: 80 | csvwriter.writerow([geohash.location.lat, geohash.location.lng, geohash.geohash, geohash.uniq]) 81 | -------------------------------------------------------------------------------- /geo_transformer/tests/test_transformer.py: -------------------------------------------------------------------------------- 1 | import random 2 | import string 3 | from typing import Generator 4 | 5 | import pytest 6 | from geo_transformer.models import Geohash, Location 7 | from geo_transformer.transformer import build, encode, query_unique_prefix, transform 8 | 9 | 10 | def test_geohash_instance(data_test_locations: Generator[Location, None, None]): 11 | for location in data_test_locations: 12 | assert isinstance(Geohash(location=location, geohash=encode(location.lat, location.lng), uniq=""), Geohash) 13 | 14 | 15 | def test_geohash_encode(data_test_locations: Generator[Location, None, None]): 16 | geohashs = [encode(location.lat, location.lng) for location in data_test_locations] 17 | assert geohashs == ["sp3e3qe7mkcb", "sp3e2wuys9dr", "sp3e2wuzpnhr"] 18 | 19 | 20 | def test_query_unique_prefix(data_test_locations: Generator[Location, None, None]): 21 | index = build(encode(location.lat, location.lng) for location in data_test_locations) 22 | prefix = query_unique_prefix("sp3e3qe7mkcb", index) 23 | assert prefix == "sp3e3" 24 | 25 | 26 | def test_transformer(data_test_locations: Generator[Location, None, None]): 27 | locations = list(data_test_locations) # copy generator to list 28 | index = build(encode(location.lat, location.lng) for location in locations) 29 | results = transform(locations, index) # type: ignore 30 | prefixes = [geohash.uniq for geohash in results] 31 | assert len(list(prefixes)) == 3 32 | assert prefixes == ["sp3e3", "sp3e2wuy", "sp3e2wuz"] 33 | 34 | 35 | @pytest.mark.benchmark(group="geohash") 36 | def test_geohash_encode_benchmark(data_test_locations: Generator[Location, None, None], benchmark): 37 | location = list(data_test_locations)[0] 38 | geohash = benchmark(encode, location.lat, location.lng) 39 | assert geohash == "sp3e3qe7mkcb" 40 | 41 | 42 | @pytest.mark.benchmark(group="trie-insert") 43 | def test_build_index_benchmark_1_000(benchmark): 44 | geohashs = generate_fake_random_geohash(1_000) 45 | benchmark(build, geohashs) 46 | 47 | 48 | @pytest.mark.benchmark(group="trie-insert") 49 | def test_build_index_benchmark_10_000(benchmark): 50 | geohashs = generate_fake_random_geohash(10_000) 51 | benchmark(build, geohashs) 52 | 53 | 54 | @pytest.mark.benchmark(group="trie-insert") 55 | def test_build_index_benchmark_100_000(benchmark): 56 | geohashs = generate_fake_random_geohash(100_000) 57 | benchmark(build, geohashs) 58 | 59 | 60 | @pytest.mark.benchmark(group="trie-insert") 61 | def test_build_index_benchmark_1_000_000(benchmark): 62 | geohashs = generate_fake_random_geohash(1_000_000) 63 | benchmark(build, geohashs) 64 | 65 | 66 | @pytest.mark.benchmark(group="trie-query") 67 | def test_query_unique_prefix_benchmark_small_trie(data_test_locations: Generator[Location, None, None], benchmark): 68 | index = build(encode(location.lat, location.lng) for location in data_test_locations) 69 | prefix = benchmark(query_unique_prefix, "sp3e3qe7mkcb", index) 70 | assert prefix == "sp3e3" 71 | 72 | 73 | @pytest.mark.benchmark(group="trie-query") 74 | 
def test_query_unique_prefix_benchmark_large_trie(benchmark):
75 |     geohashs = generate_fake_random_geohash(1_000_000)
76 |     index = build(geohashs)
77 |     benchmark(query_unique_prefix, next(generate_fake_random_geohash(1)), index)  # query with a single random geohash string
78 | 
79 | 
80 | def generate_fake_random_geohash(size: int) -> Generator[str, None, None]:
81 |     """Generate fake (invalid) geohash-like strings of 12 random characters.
82 | 
83 |     Args:
84 |         size (int): number of fake geohashes to generate
85 | 
86 |     Yields:
87 |         Generator[str, None, None]: fake geohash strings
88 |     """
89 |     for _ in range(size):
90 |         yield "".join(random.choice(string.ascii_lowercase) for _ in range(12))
91 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![Test](https://github.com/gaarv/stuart-data-challenge/actions/workflows/test.yml/badge.svg)](https://github.com/gaarv/stuart-data-challenge/actions/workflows/test.yml)
2 | 
3 | :globe_with_meridians: Unique Geohash :globe_with_meridians:
4 | ===
5 | 
6 | # Requirements
7 | 
8 | * make (if not present, `apt install make` on Debian Linux or `brew install make` on Mac)
9 | * the current sources (python-test-sebastienhoarau)
10 | 
11 | A Makefile provides convenient shortcuts for most tasks.
12 | 
13 | # Setup
14 | 
15 | Create a dedicated Python virtual environment, e.g. with [conda](https://docs.conda.io/en/latest/miniconda.html):
16 | 
17 |     conda create -n geo-transformer python=3.9 pip
18 | 
19 | Activate the virtual environment:
20 | 
21 |     conda activate geo-transformer
22 | 
23 | Install the minimum requirements and the package locally with:
24 | 
25 |     make install
26 | 
27 | at the root of the project.
28 | 
29 | # Usage
30 | 
31 | Change to the `geo_transformer` directory:
32 | 
33 |     cd geo_transformer
34 | 
35 | then run:
36 | 
37 |     python app.py data/test_points.txt.gz
38 | 
39 | to use the provided sample data and print output to the console.
40 | 
41 | You can also use any file respecting the same schema as the one provided, compressed in gzip format, for the `INPUT_FILE` argument.
42 | 
43 | 
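Output can also be written to a CSV file instead of stdout with the `--output-file` option defined in `app.py` (the destination path below is only an example):

    python app.py data/test_points.txt.gz --output-file output.csv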
44 | 
45 | List all available commands with:
46 | 
47 |     python app.py --help
48 | 
49 | # Project structure
50 | 
51 | ```
52 | python-test-sebastienhoarau
53 | ├── LICENSE
54 | ├── Makefile
55 | ├── README.md                # this file
56 | ├── geo_transformer          # main package
57 | │   ├── __init__.py
58 | │   ├── app.py               # application entrypoint
59 | │   ├── data                 # sample application data
60 | │   ├── io.py                # input/output functions
61 | │   ├── models.py            # data structures
62 | │   ├── tests                # tests
63 | │   └── transformer.py       # transformer functions (geohash encoding, unique prefix)
64 | ├── pyproject.toml
65 | ├── requirements-dev.in      # dev requirements
66 | ├── requirements-dev.txt     # compiled dev requirements with pip-compile
67 | ├── requirements.in          # requirements
68 | ├── requirements.txt         # compiled requirements with pip-compile
69 | └── setup.py
70 | ```
71 | 
72 | Python source code is formatted with [Black](https://github.com/psf/black).
73 | 
74 | # Development environment
75 | 
76 | As in [Setup](#setup), while at the root of the project, install the development requirements and the package locally with:
77 | 
78 |     make install-dev
79 | 
80 | # Run tests
81 | 
82 | Tests can be run with:
83 | 
84 |     make test
85 | 
86 | The console prints test output as well as code coverage.
87 | 
88 | # Run benchmarks
89 | 
90 | Benchmarks can be run with:
91 | 
92 |     make benchmark
93 | 
94 | # Creating a standalone binary
95 | 
96 | A standalone binary for your platform can be created with:
97 | 
98 |     make standalone
99 | 
100 | It will be found as `geo-transformer` in the newly created `dist` directory. The produced binary does not require any Python installation or dependencies and can be run just like in [Usage](#usage) with:
101 | 
102 |     ./geo-transformer
103 | 
104 | A downloadable version (ELF / Linux only) is available in the `Releases` section.
105 | 
106 | # Updating dependencies
107 | 
108 | The requirements files `requirements.txt` and `requirements-dev.txt` are generated with `pip-compile` from [pip-tools](https://github.com/jazzband/pip-tools).
109 | 
110 | * update `requirements.in` and `requirements-dev.in` as needed
111 | * run `pip-compile requirements.in` and `pip-compile requirements-dev.in` to update the compiled requirements.
112 | 
113 | # Original Problem Statement
114 | 
115 | Your task is to transform the set of longitude, latitude coordinates provided in the `test_points.txt.gz` file
116 | into corresponding [GeoHash](https://en.wikipedia.org/wiki/Geohash) codes.
117 | For each pair of coordinates only the shortest geohash prefix that uniquely identifies this point must be stored.
118 | For instance, this 3-point dataset will store these unique prefixes:
119 | 
120 | |latitude        | longitude       | geohash      | unique_prefix |
121 | |----------------|-----------------|--------------|---------------|
122 | |41.388828145321 | 2.1689976634898 | sp3e3qe7mkcb | sp3e3         |
123 | |41.390743       | 2.138067        | sp3e2wuys9dr | sp3e2wuy      |
124 | |41.390853       | 2.138177        | sp3e2wuzpnhr | sp3e2wuz      |
125 | 
126 | The solution must be coded in `Python` and you can use any public domain libraries.
127 | It should work with any file respecting the same schema as the one provided.
128 | The executable must output the solution on `stdout` in [CSV format](https://tools.ietf.org/html/rfc4180)
129 | with 4 columns following the structure of the example, *i.e.*:
130 | 
131 | ```csv
132 | lat,lng,geohash,uniq
133 | 41.388828145321,2.1689976634898,sp3e3qe7mkcb,sp3e3
134 | 41.390743,2.138067,sp3e2wuys9dr,sp3e2wuy
135 | 41.390853,2.138177,sp3e2wuzpnhr,sp3e2wuz
136 | ```
137 | 
138 | ## :nerd_face: What we value in the solution
139 | 
140 | - Good software design
141 | - Proper documentation
142 | - Compliance with Python standards and modern usage (*e.g.* [PEP8](https://www.python.org/dev/peps/pep-0008/))
143 | - Proper use of data structures
144 | - Ergonomics of the command-line interface
145 | - Setup/Launch instructions if required
146 | 
--------------------------------------------------------------------------------
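For reference, the same transformation can also be driven programmatically with the functions defined in `geo_transformer` (a minimal sketch; the input path is illustrative and error handling is omitted):

```python
from pathlib import Path

from geo_transformer import io, transformer

# Decompress the gzip archive to a temporary file (returns None on failure).
extracted = io.extract_from_file(Path("geo_transformer/data/test_points.txt.gz"))
if extracted:
    # First pass: build the trie index from the geohash of every location.
    index = transformer.build(transformer.encode(location.lat, location.lng) for location in io.load_locations(extracted))
    # Second pass: re-stream the locations and attach the geohash and its shortest unique prefix.
    geohashes = transformer.transform(io.load_locations(extracted), index)
    io.print_to_console(geohashes)
```

This mirrors the two-pass approach in `app.py`: the input is streamed once to build the trie index and a second time to emit the `Geohash` records.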