├── geo_transformer
│   ├── __init__.py
│   ├── tests
│   │   ├── __init__.py
│   │   ├── data
│   │   │   └── test_points.txt.gz
│   │   ├── conftest.py
│   │   ├── test_cli.py
│   │   ├── test_trie.py
│   │   ├── test_io.py
│   │   └── test_transformer.py
│   ├── data
│   │   └── test_points.txt.gz
│   ├── models.py
│   ├── app.py
│   ├── trie.py
│   ├── transformer.py
│   └── io.py
├── pyproject.toml
├── requirements.in
├── .coveragerc
├── requirements-dev.in
├── .github
│   └── workflows
│       ├── black.yml
│       └── test.yml
├── requirements.txt
├── Makefile
├── setup.py
├── LICENSE
├── requirements-dev.txt
├── .gitignore
└── README.md
/geo_transformer/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/geo_transformer/tests/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [tool.black]
2 | line-length = 150
--------------------------------------------------------------------------------
/requirements.in:
--------------------------------------------------------------------------------
1 | geolib==1.0.7
2 | typer==0.4.1
--------------------------------------------------------------------------------
/.coveragerc:
--------------------------------------------------------------------------------
1 | [run]
2 | omit =
3 |     */tests/*
4 |     **/__init__.py
--------------------------------------------------------------------------------
/geo_transformer/data/test_points.txt.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Gaarv/stuart-data-challenge/main/geo_transformer/data/test_points.txt.gz
--------------------------------------------------------------------------------
/geo_transformer/tests/data/test_points.txt.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Gaarv/stuart-data-challenge/main/geo_transformer/tests/data/test_points.txt.gz
--------------------------------------------------------------------------------
/requirements-dev.in:
--------------------------------------------------------------------------------
1 | geolib==1.0.7
2 | typer==0.4.1
3 |
4 | pip-tools==6.6.0
5 | black
6 | pytest==7.1.2
7 | pytest-cov==3.0.0
8 | pytest-benchmark==3.4.1
9 | pyinstaller==5.1
--------------------------------------------------------------------------------
/.github/workflows/black.yml:
--------------------------------------------------------------------------------
1 | name: Lint
2 |
3 | on: [push, pull_request]
4 |
5 | jobs:
6 |   lint:
7 |     runs-on: ubuntu-latest
8 |     steps:
9 |       - uses: actions/checkout@v3
10 |       - uses: psf/black@stable
11 |         with:
12 |           options: "--check"
13 |           src: "./geo_transformer"
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | #
2 | # This file is autogenerated by pip-compile with python 3.9
3 | # To update, run:
4 | #
5 | # pip-compile requirements.in
6 | #
7 | click==8.1.3
8 | # via typer
9 | colorama==0.4.4
10 | # via click
11 | future==0.18.2
12 | # via geolib
13 | geolib==1.0.7
14 | # via -r requirements.in
15 | typer==0.4.1
16 | # via -r requirements.in
17 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | .PHONY: install install-dev test benchmark standalone
2 |
3 | install:
4 | 	pip install -r requirements.txt
5 | 	pip install -U .
6 |
7 | install-dev:
8 | 	pip install -r requirements-dev.txt
9 | 	pip install -U .
10 |
11 | test:
12 | 	python -m pytest --cov -v -s -k "not benchmark"
13 |
14 | benchmark:
15 | 	python -m pytest -v -s -k "benchmark"
16 |
17 | standalone:
18 | 	pyinstaller --onefile geo_transformer/app.py --name geo-transformer
--------------------------------------------------------------------------------
/geo_transformer/models.py:
--------------------------------------------------------------------------------
1 | from typing import NamedTuple
2 |
3 |
4 | class Location(NamedTuple):
5 |     """A location represented by latitude and longitude."""
6 |
7 |     lat: float
8 |     lng: float
9 |
10 |
11 | class Geohash(NamedTuple):
12 |     """An object containing the original location, its geohash encoding and a unique prefix identifier."""
13 |
14 |     location: Location
15 |     geohash: str
16 |     uniq: str
17 |
--------------------------------------------------------------------------------
/.github/workflows/test.yml:
--------------------------------------------------------------------------------
1 | name: Test
2 |
3 | on: [push, pull_request]
4 |
5 | jobs:
6 |   test:
7 |     runs-on: ubuntu-latest
8 |     steps:
9 |       - name: Checkout
10 |         uses: actions/checkout@v3
11 |
12 |       - name: Set up Python 3.9
13 |         uses: actions/setup-python@v3
14 |         with:
15 |           python-version: "3.9"
16 |
17 |       - name: Install dependencies and package
18 |         run: make install-dev
19 |
20 |       - name: Run test suite
21 |         run: make test
22 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 | setup(
4 |     name="geo_transformer",
5 |     version="1.0.0",
6 |     description="Test challenge for Stuart",
7 |     url="https://github.com/StuartHiring/python-test-sebastienhoarau",
8 |     author="Sebastien Hoarau",
9 |     author_email="sebastien.h.data.eng@gmail.com",
10 |     classifiers=[
11 |         "License :: OSI Approved :: MIT License",
12 |         "Programming Language :: Python :: 3.9",
13 |         "Programming Language :: Python :: 3 :: Only",
14 |     ],
15 |     packages=find_packages(),
16 |     python_requires=">=3.9, <3.10",
17 |     package_data={
18 |         "geo_transformer": [
19 |             "data/test_points.txt.gz",
20 |         ],
21 |     },
22 | )
23 |
--------------------------------------------------------------------------------
/geo_transformer/tests/conftest.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | from typing import Optional
3 |
4 | import pytest
5 | from geo_transformer.io import extract_from_file, load_locations
6 |
7 | TEST_FILE = Path("geo_transformer/tests/data/test_points.txt.gz")
8 |
9 |
10 | EXPECTED_TEST_OUTPUT = """lat,lng,geohash,uniq
11 | 41.388828145321,2.1689976634898,sp3e3qe7mkcb,sp3e3
12 | 41.390743,2.138067,sp3e2wuys9dr,sp3e2wuy
13 | 41.390853,2.138177,sp3e2wuzpnhr,sp3e2wuz
14 | """
15 |
16 |
17 | @pytest.fixture(scope="session")
18 | def extracted_file():
19 |     return extract_from_file(TEST_FILE)
20 |
21 |
22 | @pytest.fixture(scope="function")
23 | def data_test_locations(extracted_file: Optional[Path]):
24 |     if extracted_file:
25 |         data_points = load_locations(extracted_file)
26 |         return data_points
27 |     else:
28 |         return []
29 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Sebastien Hoarau
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/geo_transformer/tests/test_cli.py:
--------------------------------------------------------------------------------
1 | import tempfile
2 | from pathlib import Path
3 |
4 | from geo_transformer.app import app
5 | from geo_transformer.tests.conftest import EXPECTED_TEST_OUTPUT, TEST_FILE
6 | from typer.testing import CliRunner
7 |
8 | runner = CliRunner()
9 |
10 |
11 | def test_app_stdout():
12 |     result = runner.invoke(app, [TEST_FILE.as_posix()])
13 |     assert result.exit_code == 0
14 |     assert result.output == EXPECTED_TEST_OUTPUT
15 |
16 |
17 | def test_app_output_to_file():
18 |     with tempfile.NamedTemporaryFile(suffix=".csv") as output_file:
19 |         result = runner.invoke(app, [TEST_FILE.as_posix(), "--output-file", output_file.name])
20 |         assert result.exit_code == 0
21 |         assert Path(output_file.name).read_text() == EXPECTED_TEST_OUTPUT
22 |
23 |
24 | def test_app_non_existing_file():
25 |     test_file = "non_existing_file.csv"
26 |     result = runner.invoke(app, [test_file])
27 |     assert result.exit_code == 1
28 |     assert result.output == f"Input file {test_file} does not exist\n"
29 |
30 |
31 | def test_app_non_gz_file():
32 |     with tempfile.NamedTemporaryFile(suffix=".gz") as test_file:
33 |         result = runner.invoke(app, [test_file.name])
34 |         assert result.exit_code == 1
35 |         assert result.output == f"Error while extracting archive {test_file.name}: Input file {test_file.name} is not a valid gzip file\n"
36 |
--------------------------------------------------------------------------------
/geo_transformer/tests/test_trie.py:
--------------------------------------------------------------------------------
1 | from geo_transformer.trie import Trie
2 |
3 |
4 | def test_insert_in_trie():
5 |     trie = Trie()
6 |     words = ["ab", "bc", "ac"]
7 |     for word in words:
8 |         trie.insert(word)
9 |
10 |     root_children = sorted(list(trie.root.children.keys()))
11 |     a_children = sorted(list(trie.root.children["a"].children.keys()))
12 |     b_children = sorted(list(trie.root.children["b"].children.keys()))
13 |
14 |     assert root_children == ["a", "b"]
15 |     assert a_children == ["b", "c"]
16 |     assert b_children == ["c"]
17 |
18 |
19 | def test_search_in_trie_existing_path():
20 |     trie = Trie()
21 |     words = ["whatever", "whenever", "wherever"]
22 |     for word in words:
23 |         trie.insert(word)
24 |     path = trie.search("whenever")
25 |     path = [node.char for node in path]
26 |     assert path == ["w", "h", "e", "n", "e", "v", "e", "r"]
27 |
28 |
29 | def test_search_in_trie_partial_existing_path():
30 |     trie = Trie()
31 |     words = ["whatever", "whenever", "wherever"]
32 |     for word in words:
33 |         trie.insert(word)
34 |     path = trie.search("whocares")
35 |     path = [node.char for node in path]
36 |     assert path == ["w", "h"]
37 |
38 |
39 | def test_search_in_trie_empty_path():
40 |     trie = Trie()
41 |     words = ["whatever", "whenever", "wherever"]
42 |     for word in words:
43 |         trie.insert(word)
44 |     path = trie.search("null")
45 |     path = [node.char for node in path]
46 |     assert path == []
47 |
--------------------------------------------------------------------------------
/geo_transformer/tests/test_io.py:
--------------------------------------------------------------------------------
1 | import tempfile
2 | from pathlib import Path
3 | from typing import Generator
4 |
5 | from geo_transformer.io import extract_from_file, load_locations, write_to_file
6 | from geo_transformer.models import Location
7 | from geo_transformer.tests.conftest import TEST_FILE, EXPECTED_TEST_OUTPUT
8 | from geo_transformer import transformer
9 |
10 |
11 | def test_extract_from_file():
12 |     extracted_file = extract_from_file(TEST_FILE)
13 |     assert extracted_file is not None
14 |     assert extracted_file.exists()
15 |     assert len(extracted_file.open("r").readlines()) == 4
16 |
17 |
18 | def test_extract_from_non_existing_file():
19 |     extracted_file = extract_from_file(Path("/path/to/nofile.gz"))
20 |     assert extracted_file is None
21 |
22 |
23 | def test_extract_from_non_gzip_file():
24 |     with tempfile.NamedTemporaryFile(suffix=".gz") as test_file:
25 |         extracted_file = extract_from_file(Path(test_file.name))
26 |         assert extracted_file is None
27 |
28 |
29 | def test_load_points(extracted_file):
30 |     data_points = load_locations(extracted_file)
31 |     assert len(list(data_points)) == 3
32 |
33 |
34 | def test_write_to_file(data_test_locations: Generator[Location, None, None]):
35 |     with tempfile.NamedTemporaryFile(suffix=".csv") as output_file:
36 |         locations = list(data_test_locations)  # copy generator to list
37 |         index = transformer.build(transformer.encode(location.lat, location.lng) for location in locations)
38 |         geohashs = transformer.transform(locations, index)  # type: ignore
39 |         write_to_file(Path(output_file.name), geohashs)
40 |         assert Path(output_file.name).read_text() == EXPECTED_TEST_OUTPUT
41 |
--------------------------------------------------------------------------------
/requirements-dev.txt:
--------------------------------------------------------------------------------
1 | #
2 | # This file is autogenerated by pip-compile with python 3.9
3 | # To update, run:
4 | #
5 | # pip-compile requirements-dev.in
6 | #
7 | altgraph==0.17.2
8 | # via pyinstaller
9 | attrs==21.4.0
10 | # via pytest
11 | black==22.3.0
12 | # via -r requirements-dev.in
13 | click==8.1.3
14 | # via
15 | # black
16 | # pip-tools
17 | # typer
18 | coverage[toml]==6.3.2
19 | # via pytest-cov
20 | future==0.18.2
21 | # via geolib
22 | geolib==1.0.7
23 | # via -r requirements-dev.in
24 | iniconfig==1.1.1
25 | # via pytest
26 | mypy-extensions==0.4.3
27 | # via black
28 | packaging==21.3
29 | # via pytest
30 | pathspec==0.9.0
31 | # via black
32 | pep517==0.12.0
33 | # via pip-tools
34 | pip-tools==6.6.0
35 | # via -r requirements-dev.in
36 | platformdirs==2.5.2
37 | # via black
38 | pluggy==1.0.0
39 | # via pytest
40 | py==1.11.0
41 | # via pytest
42 | py-cpuinfo==8.0.0
43 | # via pytest-benchmark
44 | pyinstaller==5.1
45 | # via -r requirements-dev.in
46 | pyinstaller-hooks-contrib==2022.5
47 | # via pyinstaller
48 | pyparsing==3.0.8
49 | # via packaging
50 | pytest==7.1.2
51 | # via
52 | # -r requirements-dev.in
53 | # pytest-benchmark
54 | # pytest-cov
55 | pytest-benchmark==3.4.1
56 | # via -r requirements-dev.in
57 | pytest-cov==3.0.0
58 | # via -r requirements-dev.in
59 | tomli==2.0.1
60 | # via
61 | # black
62 | # coverage
63 | # pep517
64 | # pytest
65 | typer==0.4.1
66 | # via -r requirements-dev.in
67 | typing-extensions==4.2.0
68 | # via black
69 | wheel==0.37.1
70 | # via pip-tools
71 |
72 | # The following packages are considered to be unsafe in a requirements file:
73 | # pip
74 | # setuptools
75 |
--------------------------------------------------------------------------------
/geo_transformer/app.py:
--------------------------------------------------------------------------------
1 | from functools import partial
2 | from pathlib import Path
3 | from typing import Optional
4 |
5 | import typer
6 |
7 | from geo_transformer import io, transformer
8 |
9 | app = typer.Typer()
10 |
11 |
12 | @app.command()
13 | def main(
14 |     input_file: Path = typer.Argument(..., help="Path to input file, gzip compressed."),
15 |     output_file: Optional[Path] = typer.Option(None, help="Path to output file. If not provided, output will be printed to stdout."),
16 |     verbose: bool = typer.Option(False, help="Provide additional information while running the program."),
17 | ):
18 |     if input_file.is_file():
19 |         extracted_file = io.extract_from_file(input_file)
20 |
21 |         if extracted_file:
22 |             stream_locations = partial(io.load_locations, extracted_file)  # use partial as generator factory
23 |             geohashed_locations = (transformer.encode(location.lat, location.lng) for location in stream_locations())
24 |
25 |             if verbose:
26 |                 typer.secho(f"Loading locations from {extracted_file.as_posix()}", fg=typer.colors.GREEN, err=True)
27 |
28 |             index = transformer.build(geohashed_locations)
29 |             geohashs = transformer.transform(stream_locations(), index)
30 |
31 |             if not output_file:
32 |                 io.print_to_console(geohashs)
33 |             else:
34 |                 if verbose:
35 |                     typer.secho(f"Saving geohashes to {output_file.as_posix()}", fg=typer.colors.GREEN, err=True)
36 |                 io.write_to_file(output_file, geohashs)
37 |         else:
38 |             raise typer.Exit(code=1)
39 |     else:
40 |         typer.secho(f"Input file {input_file} does not exist", fg=typer.colors.RED, err=True)
41 |         raise typer.Exit(code=1)
42 |
43 |
44 | if __name__ == "__main__":
45 |     app()
46 |
--------------------------------------------------------------------------------
/geo_transformer/trie.py:
--------------------------------------------------------------------------------
1 | from typing import Dict, List
2 |
3 |
4 | class TrieNode:
5 |     """
6 |     A node in a Trie. Contains a character and a dictionary of children nodes.
7 |     """
8 |
9 |     def __init__(self, char: str):
10 |         self.char = char
11 |         self.children: Dict[str, TrieNode] = {}
12 |
13 |     @property
14 |     def is_leaf(self) -> bool:
15 |         return not self.children  # a leaf is a node without children
16 |
17 |
18 | class Trie:
19 |     """
20 |     A Trie is a tree data structure that stores a set of strings. See https://en.wikipedia.org/wiki/Trie.
21 |     """
22 |
23 |     def __init__(self):
24 |         self.root = TrieNode("")
25 |
26 |     def insert(self, s: str) -> None:
27 |         """Insert a given string into the Trie.
28 |
29 |         Args:
30 |             s (str): the string to insert
31 |         """
32 |         current_node = self.root
33 |         for char in s:
34 |             if char not in current_node.children:
35 |                 child = TrieNode(char)
36 |                 current_node.children[char] = child
37 |                 current_node = child
38 |             else:
39 |                 current_node = current_node.children[char]
40 |
41 |     def search(self, s: str) -> List[TrieNode]:
42 |         """Search for a given string in the Trie.
43 |
44 |         Args:
45 |             s (str): string to search
46 |
47 |         Returns:
48 |             List[TrieNode]: list of TrieNodes corresponding to the path of the string in the Trie.
49 |             If an exact match is not found, returns the partial path or an empty list if no match is found.
50 |         """
51 |         path = []
52 |         node = self.root
53 |         for char in s:
54 |             node = node.children.get(char)
55 |             if node:
56 |                 path.append(node)
57 |             else:
58 |                 break
59 |         return path
60 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 | .benchmarks
54 |
55 | # Translations
56 | *.mo
57 | *.pot
58 |
59 | # Django stuff:
60 | *.log
61 | local_settings.py
62 | db.sqlite3
63 | db.sqlite3-journal
64 |
65 | # Flask stuff:
66 | instance/
67 | .webassets-cache
68 |
69 | # Scrapy stuff:
70 | .scrapy
71 |
72 | # Sphinx documentation
73 | docs/_build/
74 |
75 | # PyBuilder
76 | target/
77 |
78 | # Jupyter Notebook
79 | .ipynb_checkpoints
80 |
81 | # IPython
82 | profile_default/
83 | ipython_config.py
84 |
85 | # pyenv
86 | .python-version
87 |
88 | # pipenv
89 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
90 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
91 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
92 | # install all needed dependencies.
93 | #Pipfile.lock
94 |
95 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
96 | __pypackages__/
97 |
98 | # Celery stuff
99 | celerybeat-schedule
100 | celerybeat.pid
101 |
102 | # SageMath parsed files
103 | *.sage.py
104 |
105 | # Environments
106 | .env
107 | .venv
108 | env/
109 | venv/
110 | ENV/
111 | env.bak/
112 | venv.bak/
113 |
114 | # Spyder project settings
115 | .spyderproject
116 | .spyproject
117 |
118 | # Rope project settings
119 | .ropeproject
120 |
121 | # mkdocs documentation
122 | /site
123 |
124 | # mypy
125 | .mypy_cache/
126 | .dmypy.json
127 | dmypy.json
128 |
129 | # Pyre type checker
130 | .pyre/
131 |
--------------------------------------------------------------------------------
/geo_transformer/transformer.py:
--------------------------------------------------------------------------------
1 | from typing import Generator
2 |
3 | from geolib import geohash as geohashlib
4 |
5 | from geo_transformer.models import Geohash, Location
6 | from geo_transformer.trie import Trie
7 |
8 |
9 | def encode(latitude: float, longitude: float, precision: int = 12) -> str:
10 |     """Encode latitude and longitude to geohash.
11 |
12 |     Args:
13 |         latitude (float): latitude of a given location
14 |         longitude (float): longitude of a given location
15 |         precision (int, optional): geohash precision, from 1 (lowest) to 12 (highest), inclusive. Defaults to 12.
16 |
17 |     Returns:
18 |         str: geohash as string
19 |     """
20 |     geohash = geohashlib.encode(latitude, longitude, precision)
21 |     return geohash
22 |
23 |
24 | def build(geohashs: Generator[str, None, None]) -> Trie:
25 |     """Helper function to build a trie from a geohash generator.
26 |
27 |     Args:
28 |         geohashs (Generator[str, None, None]): geohashes to insert into the trie
29 |
30 |     Returns:
31 |         Trie: a trie with all geohashes inserted
32 |     """
33 |     trie = Trie()
34 |     for geohash in geohashs:
35 |         trie.insert(geohash)
36 |     return trie
37 |
38 |
39 | def query_unique_prefix(geohash: str, trie: Trie) -> str:
40 |     """Query a trie to obtain a unique prefix for a given geohash.
41 |
42 |     Args:
43 |         geohash (str): geohash to query
44 |         trie (Trie): Trie to query against
45 |
46 |     Returns:
47 |         str: unique prefix found in the trie
48 |     """
49 |     unique_index = 0
50 |
51 |     # reverse the path found
52 |     reversed_path = trie.search(geohash)[::-1]
53 |     for i in range(len(reversed_path)):
54 |         unique_index = i
55 |         # stop when the parent node is in the path and has more than one child (the bound check avoids an IndexError)
56 |         if i + 1 < len(reversed_path) and len(reversed_path[i + 1].children) > 1:
57 |             break
58 |
59 |     # slice the path to obtain the unique prefix and reverse it back
60 |     path = reversed_path[unique_index:][::-1]
61 |     return "".join([node.char for node in path])
62 |
63 |
64 | def transform(locations: Generator[Location, None, None], trie: Trie) -> Generator[Geohash, None, None]:
65 |     """Transform locations to geohashes. Adds geohash encoding and a unique prefix identifier to input locations.
66 |
67 |     Args:
68 |         locations (Generator[Location, None, None]): Locations to transform
69 |
70 |     Yields:
71 |         Generator[Geohash, None, None]: Geohash objects
72 |     """
73 |     for location in locations:
74 |         geohash = encode(location.lat, location.lng)
75 |         unique_prefix = query_unique_prefix(geohash, trie)
76 |         yield Geohash(location=location, geohash=geohash, uniq=unique_prefix)
77 |
--------------------------------------------------------------------------------
/geo_transformer/io.py:
--------------------------------------------------------------------------------
1 | import csv
2 | import gzip
3 | import tempfile
4 | from pathlib import Path
5 | from typing import Generator, Optional
6 |
7 | import typer
8 |
9 | from geo_transformer.models import Geohash, Location
10 |
11 | GZIP_MAGIC_BYTES = "1f8b"
12 | CSV_HEADER = ["lat", "lng", "geohash", "uniq"]
13 |
14 |
15 | def extract_from_file(archive: Path) -> Optional[Path]:
16 |     """Extract gzip file containing raw data
17 |
18 |     Args:
19 |         archive (Path): path to the archive, e.g. "data/input.gz"
20 |
21 |     Returns:
22 |         Optional[Path]: path to the extracted file, or None if extraction failed
23 |     """
24 |     extracted_file = None
25 |     try:
26 |         raw = archive.read_bytes()  # read once; read_bytes() closes the file handle
27 |         if raw[:2].hex() != GZIP_MAGIC_BYTES:
28 |             raise Exception(f"Input file {archive.as_posix()} is not a valid gzip file")
29 |         dest = Path(tempfile.mkdtemp()).joinpath(archive.stem)
30 |         content = gzip.decompress(raw)
31 |         dest.write_bytes(content)
32 |         extracted_file = dest
33 |     except Exception as e:
34 |         typer.secho(f"Error while extracting archive {archive.as_posix()}: {e}", fg=typer.colors.RED, err=True)
35 |     return extracted_file
36 |
37 |
38 | def load_locations(csv_file: Path) -> Generator[Location, None, None]:
39 |     """Load points from CSV file as Location objects
40 |
41 |     Args:
42 |         csv_file (Path): path to the CSV file, e.g. "data/input.csv"
43 |
44 |     Yields:
45 |         Generator[Location, None, None]: Location objects
46 |     """
47 |     try:
48 |         with csv_file.open("r") as csvfile:
49 |             csvreader = csv.reader(csvfile)
50 |             next(csvreader)  # skip header
51 |             for row in csvreader:
52 |                 yield Location(lat=float(row[0]), lng=float(row[1]))
53 |     except Exception as e:
54 |         typer.secho(f"Error while loading points: {e}", fg=typer.colors.RED, err=True)
55 |         yield from ()
56 |
57 |
58 | def print_to_console(geohashs: Generator[Geohash, None, None]) -> None:
59 |     """Print Geohash objects to console as CSV
60 |
61 |     Args:
62 |         geohashs (Generator[Geohash, None, None]): Geohash objects
63 |     """
64 |     typer.echo(",".join(CSV_HEADER))  # print header
65 |     for geohash in geohashs:
66 |         typer.secho(f"{geohash.location.lat},{geohash.location.lng},{geohash.geohash},{geohash.uniq}")
67 |
68 |
69 | def write_to_file(output_file: Path, geohashs: Generator[Geohash, None, None]) -> None:
70 |     """Write Geohash objects to CSV file
71 |
72 |     Args:
73 |         output_file (Path): path to the output file, e.g. "data/output.csv"
74 |         geohashs (Generator[Geohash, None, None]): Geohash objects
75 |     """
76 |     with output_file.open("w") as csvfile:
77 |         csvwriter = csv.writer(csvfile)
78 |         csvwriter.writerow(CSV_HEADER)
79 |         for geohash in geohashs:
80 |             csvwriter.writerow([geohash.location.lat, geohash.location.lng, geohash.geohash, geohash.uniq])
81 |
--------------------------------------------------------------------------------
/geo_transformer/tests/test_transformer.py:
--------------------------------------------------------------------------------
1 | import random
2 | import string
3 | from typing import Generator
4 |
5 | import pytest
6 | from geo_transformer.models import Geohash, Location
7 | from geo_transformer.transformer import build, encode, query_unique_prefix, transform
8 |
9 |
10 | def test_geohash_instance(data_test_locations: Generator[Location, None, None]):
11 |     for location in data_test_locations:
12 |         assert isinstance(Geohash(location=location, geohash=encode(location.lat, location.lng), uniq=""), Geohash)
13 |
14 |
15 | def test_geohash_encode(data_test_locations: Generator[Location, None, None]):
16 |     geohashs = [encode(location.lat, location.lng) for location in data_test_locations]
17 |     assert geohashs == ["sp3e3qe7mkcb", "sp3e2wuys9dr", "sp3e2wuzpnhr"]
18 |
19 |
20 | def test_query_unique_prefix(data_test_locations: Generator[Location, None, None]):
21 |     index = build(encode(location.lat, location.lng) for location in data_test_locations)
22 |     prefix = query_unique_prefix("sp3e3qe7mkcb", index)
23 |     assert prefix == "sp3e3"
24 |
25 |
26 | def test_transformer(data_test_locations: Generator[Location, None, None]):
27 |     locations = list(data_test_locations)  # copy generator to list
28 |     index = build(encode(location.lat, location.lng) for location in locations)
29 |     results = transform(locations, index)  # type: ignore
30 |     prefixes = [geohash.uniq for geohash in results]
31 |     assert len(list(prefixes)) == 3
32 |     assert prefixes == ["sp3e3", "sp3e2wuy", "sp3e2wuz"]
33 |
34 |
35 | @pytest.mark.benchmark(group="geohash")
36 | def test_geohash_encode_benchmark(data_test_locations: Generator[Location, None, None], benchmark):
37 |     location = list(data_test_locations)[0]
38 |     geohash = benchmark(encode, location.lat, location.lng)
39 |     assert geohash == "sp3e3qe7mkcb"
40 |
41 |
42 | @pytest.mark.benchmark(group="trie-insert")
43 | def test_build_index_benchmark_1_000(benchmark):
44 |     geohashs = list(generate_fake_random_geohash(1_000))  # a list can be re-iterated on every benchmark round
45 |     benchmark(build, geohashs)
46 |
47 |
48 | @pytest.mark.benchmark(group="trie-insert")
49 | def test_build_index_benchmark_10_000(benchmark):
50 |     geohashs = list(generate_fake_random_geohash(10_000))
51 |     benchmark(build, geohashs)
52 |
53 |
54 | @pytest.mark.benchmark(group="trie-insert")
55 | def test_build_index_benchmark_100_000(benchmark):
56 |     geohashs = list(generate_fake_random_geohash(100_000))
57 |     benchmark(build, geohashs)
58 |
59 |
60 | @pytest.mark.benchmark(group="trie-insert")
61 | def test_build_index_benchmark_1_000_000(benchmark):
62 |     geohashs = list(generate_fake_random_geohash(1_000_000))
63 |     benchmark(build, geohashs)
64 |
65 |
66 | @pytest.mark.benchmark(group="trie-query")
67 | def test_query_unique_prefix_benchmark_small_trie(data_test_locations: Generator[Location, None, None], benchmark):
68 |     index = build(encode(location.lat, location.lng) for location in data_test_locations)
69 |     prefix = benchmark(query_unique_prefix, "sp3e3qe7mkcb", index)
70 |     assert prefix == "sp3e3"
71 |
72 |
73 | @pytest.mark.benchmark(group="trie-query")
74 | def test_query_unique_prefix_benchmark_large_trie(benchmark):
75 |     geohashs = generate_fake_random_geohash(1_000_000)
76 |     index = build(geohashs)
77 |     benchmark(query_unique_prefix, next(generate_fake_random_geohash(1)), index)  # query with a string, not a generator
78 |
79 |
80 | def generate_fake_random_geohash(size: int) -> Generator[str, None, None]:
81 |     """Generate fake (invalid) geohash-like strings of 12 random characters
82 |
83 |     Args:
84 |         size (int): number of fake geohashs to generate
85 |
86 |     Yields:
87 |         Generator[str, None, None]: fake geohashs strings
88 |     """
89 |     for _ in range(size):
90 |         yield "".join(random.choice(string.ascii_lowercase) for _ in range(12))
91 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [](https://github.com/gaarv/stuart-data-challenge/actions/workflows/test.yml)
2 |
3 | :globe_with_meridians: Unique Geohash :globe_with_meridians:
4 | ===
5 |
6 | # Requirements
7 |
8 | * make (if not present, `apt install make` on Debian-based Linux or `brew install make` on macOS)
9 | * the current sources (python-test-sebastienhoarau)
10 |
11 | A Makefile provides convenient shortcuts for most tasks.
12 |
13 | # Setup
14 |
15 | Create a dedicated Python virtual environment, e.g. with [conda](https://docs.conda.io/en/latest/miniconda.html):
16 |
17 |     conda create -n geo-transformer python=3.9 pip
18 |
19 | Activate the virtual environment:
20 |
21 |     conda activate geo-transformer
22 |
23 | Install the minimum requirements and the package locally with:
24 |
25 |     make install
26 |
27 | at the root of the project.
28 |
29 | # Usage
30 |
31 | Change to the `geo_transformer` directory:
32 |
33 |     cd geo_transformer
34 |
35 | then run:
36 |
37 |     python app.py data/test_points.txt.gz
38 |
39 | to use the provided sample data and print output to the console.
40 |
41 | You can also use any file respecting the same schema as the one provided, compressed in gzip format for the `INPUT_FILE` argument.
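
As a minimal sketch of preparing such an input, the snippet below writes a small gzip-compressed CSV using only the Python standard library. The header names (`lat`, `lng`), the file name `my_points.txt.gz` and the temporary directory are assumptions for illustration; the loader in `io.py` only skips the first line and reads the first two columns as floats.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Hypothetical sample rows; the header names are an assumption.
rows = [("lat", "lng"), (41.390743, 2.138067), (41.390853, 2.138177)]

# Hypothetical output location inside a fresh temporary directory.
input_file = Path(tempfile.mkdtemp()) / "my_points.txt.gz"

# Text mode ("wt") compresses on the fly while csv.writer formats the rows.
with gzip.open(input_file, "wt", newline="") as archive:
    csv.writer(archive, lineterminator="\n").writerows(rows)

print(input_file)
```

The printed path can then be passed as the `INPUT_FILE` argument of `app.py`.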
42 |
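As an illustration of what "the same schema" could look like in practice, here is a minimal sketch of reading such a file. It assumes, hypothetically, a comma-separated file with a `lat,lng` header row; the column names and the `read_points` helper are not taken from this project, so check them against the real sample before relying on them.

```python
import csv
import gzip
from typing import Iterator, Tuple


def read_points(path: str) -> Iterator[Tuple[float, float]]:
    """Yield (lat, lng) pairs from a gzip-compressed CSV file.

    Assumption: the file has a header row containing `lat` and `lng` columns.
    """
    # "rt" opens the gzip stream in text mode so csv can read it line by line.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            yield float(row["lat"]), float(row["lng"])
```

Because the function is a generator, rows are decompressed and parsed lazily, so large inputs do not need to fit in memory at once.
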
43 |
44 |
45 | List all available commands with:
46 |
47 |     python app.py --help
48 |
49 | # Project structure
50 |
51 | ```
52 | python-test-sebastienhoarau
53 | ├── LICENSE
54 | ├── Makefile
55 | ├── README.md # this file
56 | ├── geo_transformer # main package
57 | │ ├── __init__.py
58 | │ ├── app.py # application entrypoint
59 | │ ├── data # sample application data
60 | │ ├── io.py # input/output functions
61 | │ ├── models.py # data structures
62 | │ ├── tests # tests
63 | │ └── transformer.py # transformer functions (geohash encoding, unique prefix)
64 | ├── pyproject.toml
65 | ├── requirements-dev.in # dev requirements
66 | ├── requirements-dev.txt # compiled dev requirements with pip-compile
67 | ├── requirements.in # requirements
68 | ├── requirements.txt # compiled requirements with pip-compile
69 | └── setup.py
70 | ```
71 |
72 | Python source code is formatted with [Black](https://github.com/psf/black).
73 |
74 | # Development environment
75 |
76 | Similar to [Setup](#setup), while at the root of the project, install development requirements and package locally with:
77 |
78 |     make install-dev
79 |
80 | # Run tests
81 |
82 | Tests can be run with:
83 |
84 |     make test
85 |
86 | The console prints the test output as well as the code coverage.
87 |
88 | # Run benchmarks
89 |
90 | Benchmarks can be run with:
91 |
92 |     make benchmark
93 |
94 | # Creating a standalone binary
95 |
96 | A standalone binary for your platform can be created with:
97 |
98 |     make standalone
99 |
100 | The binary is written as `geo-transformer` to a newly created `dist` directory. It does not require a Python installation or any dependencies, and can be run just like in [Usage](#usage) with:
101 |
102 |     ./geo-transformer
103 |
104 | A downloadable version (ELF / Linux only) is available in the `Releases` section.
105 |
106 | # Updating dependencies
107 |
108 | The requirements files `requirements.txt` and `requirements-dev.txt` are generated with `pip-compile` from [pip-tools](https://github.com/jazzband/pip-tools).
109 |
110 | * update `requirements.in` and `requirements-dev.in` as needed
111 | * run `pip-compile requirements.in` and `pip-compile requirements-dev.in` to update the compiled requirements.
112 |
113 | # Original Problem Statement
114 |
115 | Your task is to transform the set of longitude, latitude coordinates provided in the `test_points.txt.gz` file
116 | into corresponding [GeoHash](https://en.wikipedia.org/wiki/Geohash) codes.
117 | For each pair of coordinates only the shortest geohash prefix that uniquely identifies this point must be stored.
118 | For instance, this 3-point dataset yields these unique prefixes:
119 |
120 | |latitude | longitude | geohash | unique_prefix |
121 | |----------------|-----------------|--------------|---------------|
122 | |41.388828145321 | 2.1689976634898 | sp3e3qe7mkcb | sp3e3 |
123 | |41.390743 | 2.138067 | sp3e2wuys9dr | sp3e2wuy |
124 | |41.390853 | 2.138177 | sp3e2wuzpnhr | sp3e2wuz |
125 |
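The table above can be reproduced with a small sketch. It counts how often each prefix occurs across the dataset and keeps, for every geohash, the first prefix that occurs only once. This is an illustration of the idea under simple assumptions, not the implementation used in this repository (which ships its own `trie.py`); the function name `shortest_unique_prefixes` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def shortest_unique_prefixes(geohashes: Iterable[str]) -> Dict[str, str]:
    """Map each geohash to its shortest prefix shared with no other geohash."""
    # Drop exact duplicates while keeping the input order.
    hashes: List[str] = list(dict.fromkeys(geohashes))
    # Count how many distinct geohashes start with each possible prefix.
    prefix_counts = Counter(h[:i] for h in hashes for i in range(1, len(h) + 1))
    result = {}
    for h in hashes:
        # The first prefix seen exactly once identifies this geohash alone;
        # fall back to the full geohash if no such prefix exists.
        candidates = (h[:i] for i in range(1, len(h) + 1) if prefix_counts[h[:i]] == 1)
        result[h] = next(candidates, h)
    return result


if __name__ == "__main__":
    sample = ["sp3e3qe7mkcb", "sp3e2wuys9dr", "sp3e2wuzpnhr"]
    for geohash, prefix in shortest_unique_prefixes(sample).items():
        print(geohash, prefix)
```

The counter stores up to 12 prefixes per geohash, which is simple but memory-hungry; a trie shares common prefixes between entries and is likely a better fit for large inputs.
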
126 | The solution must be coded in `Python` and you can use any public domain libraries.
127 | It should work with any file respecting the same schema as the one provided.
128 | The executable must output the solution on `stdout` in [CSV format](https://tools.ietf.org/html/rfc4180)
129 | with 4 columns following the structure of the example, *ie*:
130 |
131 | ```csv
132 | lat,lng,geohash,uniq
133 | 41.388828145321,2.1689976634898,sp3e3qe7mkcb,sp3e3
134 | 41.390743,2.138067,sp3e2wuys9dr,sp3e2wuy
135 | 41.390853,2.138177,sp3e2wuzpnhr,sp3e2wuz
136 | ```
137 |
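One way to produce output in this shape is Python's standard `csv` module, sketched below. The `write_csv` helper and the `Row` alias are hypothetical names introduced for illustration; coordinates are kept as strings here on the assumption that the output should echo the input values unchanged.

```python
import csv
import sys
from typing import Iterable, TextIO, Tuple

# One output record: lat, lng, geohash, uniq (all kept as text).
Row = Tuple[str, str, str, str]


def write_csv(rows: Iterable[Row], out: TextIO = sys.stdout) -> None:
    """Write the header and the given rows as CSV to `out` (stdout by default)."""
    # The default "excel" dialect ends lines with "\r\n", as RFC 4180 specifies.
    writer = csv.writer(out)
    writer.writerow(["lat", "lng", "geohash", "uniq"])
    writer.writerows(rows)


if __name__ == "__main__":
    write_csv([("41.390743", "2.138067", "sp3e2wuys9dr", "sp3e2wuy")])
```

Passing the output stream as a parameter keeps the function easy to test with an in-memory buffer such as `io.StringIO`.
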
138 | ## :nerd_face: What we value in the solution
139 |
140 | - Good software design
141 | - Proper documentation
142 | - Compliance with Python standards and modern usage (*e.g.*: [PEP8](https://www.python.org/dev/peps/pep-0008/))
143 | - Proper use of data structures
144 | - Ergonomics of the command-line interface
145 | - Setup/Launch instructions if required
146 |
--------------------------------------------------------------------------------