├── .python-version
├── .vscode
│   └── settings.json
├── README.md
├── bin
│   ├── run-main.sh
│   └── run-test.sh
├── data
│   └── .gitignore
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── pytest.ini
└── src
    ├── .gitignore
    ├── cinema
    │   ├── core
    │   │   ├── domain
    │   │   │   ├── movie_type.py
    │   │   │   └── services
    │   │   │       └── movie_domain_service.py
    │   │   ├── ports
    │   │   │   ├── primary
    │   │   │   │   ├── average_movie_budget_command.py
    │   │   │   │   ├── count_movie_command.py
    │   │   │   │   ├── most_expensive_movie_command.py
    │   │   │   │   └── use_cases.py
    │   │   │   └── secondary
    │   │   │       └── movie_repository.py
    │   │   └── use_cases
    │   │       ├── average_budget_use_case.py
    │   │       ├── count_movie_use_case.py
    │   │       └── most_expensive_movies.py
    │   ├── primary_adapters
    │   │   ├── cli
    │   │   │   └── main.py
    │   │   └── tests
    │   │       ├── average_movie_budget_use_case_test.py
    │   │       ├── most_expensive_movies_use_case_test.py
    │   │       └── movie_count_use_case_test.py
    │   └── secondary_adapters
    │       └── repositories
    │           └── movie
    │               ├── in_memory_movie_repository.py
    │               └── kaggle_movie_repository.py
    └── main.py
--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------
3.9.13
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
{
  "[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter",
    "editor.formatOnSave": true
  },
  "python.formatting.provider": "none",
  "editor.formatOnSave": true,
  "python.testing.pytestArgs": ["src"],
  "python.testing.unittestEnabled": false,
  "python.testing.pytestEnabled": true,
  "python.analysis.typeCheckingMode": "strict",
  "files.exclude": {
    "**/__pycache__": true
  }
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# TCM Labs' Python/Spark data engineering white project

This is a Python/Spark white project highlighting how to craft a highly maintainable data project that runs both locally, on any developer machine, and remotely on any Spark cluster (including Databricks clusters). The same approach also works with Pandas.

In particular, if you're using notebooks in production and are unhappy with the quality of the output data, you'll find valuable insights here.

## What problems does this white project intend to solve?

This project is here to help you if you struggle with some of these problems:

- poor data quality
- lack of testing
- costly maintenance
- slow iteration speed
- difficult collaboration
- fear of deploying to production
- etc.

This project shows how battle-tested software engineering architecture known to be highly effective in backend or frontend environments can also be applied to data projects, using Spark and/or Pandas.

## Architecture, tests, feedback loop, code quality and environments

Let's zoom into 6 very specific problems that your data engineering project may have:

### Problem #1: "We don't know where and what the business rules are"

This is a **software architecture** problem.

How do we solve this?

- Quickly find such business rules and modify them
- Allow business people and non-software engineers to read, and possibly contribute to, the project in very isolated places

### Problem #2: "We don't know if the code/business rules behave as expected"

This is a **testing** problem.

How do we solve this?

This repository shows how to write thorough Spark unit tests:

- Ensure business use cases behave as expected: given a known input, we deterministically get the same output
- Prevent regressions, so that previously working features don't break when new features or bug fixes are merged
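Here is a condensed sketch of one of this repository's own tests (see `src/cinema/primary_adapters/tests/movie_count_use_case_test.py` for the full version): the use case runs against a deterministic in-memory repository, so the assertion always holds.

```python
from pyspark.sql import SparkSession

from cinema.core.ports.primary.count_movie_command import CountMovieCommand
from cinema.core.use_cases.count_movie_use_case import CountMovieUseCase
from cinema.secondary_adapters.repositories.movie.in_memory_movie_repository import (
    InMemoryMovieRepository,
)


def test_total_number_of_movies(spark_session: SparkSession):
    # The in-memory repository always returns the same three movies,
    # so this test is deterministic and needs no external data.
    movie_repository = InMemoryMovieRepository(spark_session)
    count_movie_use_case = CountMovieUseCase(movie_repository)

    assert count_movie_use_case.run(CountMovieCommand()) == 3
```

The `spark_session` fixture comes from the `pytest-spark` plugin declared in `pyproject.toml`.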
### Problem #3: "We're slow to iterate, developing a new feature or fixing a bug takes ages"

This is a **feedback loop** problem.

How do we solve this?

- Give developers a very fast local feedback loop by letting them run tests locally in less than a minute
- Enforce continuous integration (CI) by preventing non-functioning code from being merged into the `main` branch

### Problem #4: "We always spend a lot of time understanding what the code does, and people complain it's cryptic and hard to decipher"

This is a **code quality** problem.

How do we solve this? We automate the boring yet very important stuff with tools such as:

- `black` handles code formatting
- `flake8` handles linting
- Python type hints enable static type checking (run in `strict` mode, see `.vscode/settings.json`)

### Problem #5: "We had a bug in production because the installed library version didn't match what we're using in development"

This is a **server/machine provisioning** problem.

How do we solve this?

- We make it next to impossible to deploy an application to production without the appropriate dependencies, pinned to the exact expected versions
- We also ensure that all developers work with the same dependency tree on their machines

How? We use the following tools:

- `poetry` for managing Python dependencies
- `pyenv` for managing Python binaries

The same approach would also work with `pip` + `virtualenv`, or with `conda`.

### Problem #6: "We have duplicated code everywhere, and updating all of this takes ages"

This is a **software engineering** problem.

How do we solve this?

- We make domain code explicit in domain services
- We make orchestration logic, i.e. what to do sequentially or in parallel, explicit in application services

Code is data, and duplicated data tends to go out of sync. Duplicated code leads to inconsistent code, which in turn leads to inconsistent results.

## Highlighted software engineering concepts

This repository uses the following **software engineering** concepts:

- [Hexagonal architecture (= ports and adapters)](https://alistair.cockburn.us/hexagonal-architecture); you may be familiar with [Clean architecture](https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html)
- Inversion of Control (IoC) and Dependency Injection (DI) mechanisms
- Dependency Inversion Principle (DIP)

We also borrow useful concepts from **Domain-Driven Design (DDD)**, especially:

- Repository pattern
- Application services
- Domain services

Also, we use ideas from Domain-Specific Languages (DSLs) to hide the Spark/Pandas implementation details and focus on the _what_ rather than the _how_. This part is not mandatory.
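For example, the `MovieDomainService` DSL from `src/cinema/core/domain/services/movie_domain_service.py` lets a use case read like the business question it answers. In this sketch, `movies_df` is a placeholder for any Spark DataFrame of movies, e.g. the one returned by a repository:

```python
from cinema.core.domain.movie_type import Movie
from cinema.core.domain.services.movie_domain_service import MovieDomainService

# 'movies_df' is assumed to be a Spark DataFrame of movies (placeholder here)
movies = MovieDomainService[Movie](movies_df)

# Reads as the business question, not as Spark plumbing:
two_most_expensive = movies.most_expensive(count=2).dangerously_convert_to_list()
average_budget = movies.average_budget()
```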
## Maintainers

Provided to you by [TCM Labs](https://www.tcmlabs.fr/), an expert IT consulting firm based in Paris, France.

We believe that there's absolutely no difference between data, backend and frontend engineering.

Lead maintainer:

- Jean-Baptiste Musso // jeanbaptiste (at) tcmlabs.fr

## Contributing

Feel free to open issues and pull requests. Contributions are welcome.
--------------------------------------------------------------------------------
/bin/run-main.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash

poetry run python src/main.py
--------------------------------------------------------------------------------
/bin/run-test.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash

poetry run python -m pytest --verbose --capture=no "$@"
--------------------------------------------------------------------------------
/data/.gitignore:
--------------------------------------------------------------------------------
*
!.gitignore
--------------------------------------------------------------------------------
/poetry.lock:
--------------------------------------------------------------------------------
1 | [[package]]
2 | name = "atomicwrites"
3 | version = "1.4.1"
4 | description = "Atomic file writes."
5 | category = "dev"
6 | optional = false
7 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
8 |
9 | [[package]]
10 | name = "attrs"
11 | version = "21.4.0"
12 | description = "Classes Without Boilerplate"
13 | category = "dev"
14 | optional = false
15 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
16 |
17 | [package.extras]
18 | dev = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "mypy", "pytest-mypy-plugins", "zope.interface", "furo", "sphinx", "sphinx-notfound-page", "pre-commit", "cloudpickle"]
19 | docs = ["furo", "sphinx", "zope.interface", "sphinx-notfound-page"]
20 | tests = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "mypy", "pytest-mypy-plugins", "zope.interface", "cloudpickle"]
21 | tests_no_zope = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "mypy", "pytest-mypy-plugins", "cloudpickle"]
22 |
23 | [[package]]
24 | name = "black"
25 | version = "22.6.0"
26 | description = "The uncompromising code formatter."
27 | category = "dev"
28 | optional = false
29 | python-versions = ">=3.6.2"
30 |
31 | [package.dependencies]
32 | click = ">=8.0.0"
33 | mypy-extensions = ">=0.4.3"
34 | pathspec = ">=0.9.0"
35 | platformdirs = ">=2"
36 | tomli = {version = ">=1.1.0", markers = "python_full_version < \"3.11.0a7\""}
37 | typing-extensions = {version = ">=3.10.0.0", markers = "python_version < \"3.10\""}
38 |
39 | [package.extras]
40 | colorama = ["colorama (>=0.4.3)"]
41 | d = ["aiohttp (>=3.7.4)"]
42 | jupyter = ["ipython (>=7.8.0)", "tokenize-rt (>=3.2.0)"]
43 | uvloop = ["uvloop (>=0.15.2)"]
44 |
45 | [[package]]
46 | name = "click"
47 | version = "8.1.3"
48 | description = "Composable command line interface toolkit"
49 | category = "dev"
50 | optional = false
51 | python-versions = ">=3.7"
52 |
53 | [package.dependencies]
54 | colorama = {version = "*", markers = "platform_system == \"Windows\""}
55 |
56 | [[package]]
57 | name = "colorama"
58 | version = "0.4.5"
59 | description = "Cross-platform colored terminal text.
60 | category = "dev" 61 | optional = false 62 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*" 63 | 64 | [[package]] 65 | name = "docopt" 66 | version = "0.6.2" 67 | description = "Pythonic argument parser, that will make you smile" 68 | category = "dev" 69 | optional = false 70 | python-versions = "*" 71 | 72 | [[package]] 73 | name = "findspark" 74 | version = "2.0.1" 75 | description = "Find pyspark to make it importable." 76 | category = "dev" 77 | optional = false 78 | python-versions = "*" 79 | 80 | [[package]] 81 | name = "iniconfig" 82 | version = "1.1.1" 83 | description = "iniconfig: brain-dead simple config-ini parsing" 84 | category = "dev" 85 | optional = false 86 | python-versions = "*" 87 | 88 | [[package]] 89 | name = "mypy-extensions" 90 | version = "0.4.3" 91 | description = "Experimental type system extensions for programs checked with the mypy typechecker." 92 | category = "dev" 93 | optional = false 94 | python-versions = "*" 95 | 96 | [[package]] 97 | name = "packaging" 98 | version = "21.3" 99 | description = "Core utilities for Python packages" 100 | category = "dev" 101 | optional = false 102 | python-versions = ">=3.6" 103 | 104 | [package.dependencies] 105 | pyparsing = ">=2.0.2,<3.0.5 || >3.0.5" 106 | 107 | [[package]] 108 | name = "pathspec" 109 | version = "0.9.0" 110 | description = "Utility library for gitignore style pattern matching of file paths." 111 | category = "dev" 112 | optional = false 113 | python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,>=2.7" 114 | 115 | [[package]] 116 | name = "platformdirs" 117 | version = "2.5.2" 118 | description = "A small Python module for determining appropriate platform-specific dirs, e.g. a \"user data dir\"." 119 | category = "dev" 120 | optional = false 121 | python-versions = ">=3.7" 122 | 123 | [package.extras] 124 | docs = ["furo (>=2021.7.5b38)", "proselint (>=0.10.2)", "sphinx-autodoc-typehints (>=1.12)", "sphinx (>=4)"] 125 | test = ["appdirs (==1.4.4)", "pytest-cov (>=2.7)", "pytest-mock (>=3.6)", "pytest (>=6)"] 126 | 127 | [[package]] 128 | name = "pluggy" 129 | version = "1.0.0" 130 | description = "plugin and hook calling mechanisms for python" 131 | category = "dev" 132 | optional = false 133 | python-versions = ">=3.6" 134 | 135 | [package.extras] 136 | dev = ["pre-commit", "tox"] 137 | testing = ["pytest", "pytest-benchmark"] 138 | 139 | [[package]] 140 | name = "py" 141 | version = "1.11.0" 142 | description = "library with cross-python path, ini-parsing, io, code, log facilities" 143 | category = "dev" 144 | optional = false 145 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*" 146 | 147 | [[package]] 148 | name = "py4j" 149 | version = "0.10.9.5" 150 | description = "Enables Python programs to dynamically access arbitrary Java objects" 151 | category = "main" 152 | optional = false 153 | python-versions = "*" 154 | 155 | [[package]] 156 | name = "pyparsing" 157 | version = "3.0.9" 158 | description = "pyparsing module - Classes and methods to define and execute parsing grammars" 159 | category = "dev" 160 | optional = false 161 | python-versions = ">=3.6.8" 162 | 163 | [package.extras] 164 | diagrams = ["railroad-diagrams", "jinja2"] 165 | 166 | [[package]] 167 | name = "pyspark" 168 | version = "3.3.0" 169 | description = "Apache Spark Python API" 170 | category = "main" 171 | optional = false 172 | python-versions = ">=3.7" 173 | 174 | [package.dependencies] 175 | py4j = "0.10.9.5" 176 | 177 | [package.extras] 178 | ml = ["numpy (>=1.15)"] 
179 | mllib = ["numpy (>=1.15)"] 180 | pandas_on_spark = ["numpy (>=1.15)", "pandas (>=1.0.5)", "pyarrow (>=1.0.0)"] 181 | sql = ["pandas (>=1.0.5)", "pyarrow (>=1.0.0)"] 182 | 183 | [[package]] 184 | name = "pytest" 185 | version = "7.1.2" 186 | description = "pytest: simple powerful testing with Python" 187 | category = "dev" 188 | optional = false 189 | python-versions = ">=3.7" 190 | 191 | [package.dependencies] 192 | atomicwrites = {version = ">=1.0", markers = "sys_platform == \"win32\""} 193 | attrs = ">=19.2.0" 194 | colorama = {version = "*", markers = "sys_platform == \"win32\""} 195 | iniconfig = "*" 196 | packaging = "*" 197 | pluggy = ">=0.12,<2.0" 198 | py = ">=1.8.2" 199 | tomli = ">=1.0.0" 200 | 201 | [package.extras] 202 | testing = ["argcomplete", "hypothesis (>=3.56)", "mock", "nose", "pygments (>=2.7.2)", "requests", "xmlschema"] 203 | 204 | [[package]] 205 | name = "pytest-spark" 206 | version = "0.6.0" 207 | description = "pytest plugin to run the tests with support of pyspark." 208 | category = "dev" 209 | optional = false 210 | python-versions = "*" 211 | 212 | [package.dependencies] 213 | findspark = "*" 214 | pytest = "*" 215 | 216 | [[package]] 217 | name = "pytest-watch" 218 | version = "4.2.0" 219 | description = "Local continuous test runner with pytest and watchdog." 220 | category = "dev" 221 | optional = false 222 | python-versions = "*" 223 | 224 | [package.dependencies] 225 | colorama = ">=0.3.3" 226 | docopt = ">=0.4.0" 227 | pytest = ">=2.6.4" 228 | watchdog = ">=0.6.0" 229 | 230 | [[package]] 231 | name = "tomli" 232 | version = "2.0.1" 233 | description = "A lil' TOML parser" 234 | category = "dev" 235 | optional = false 236 | python-versions = ">=3.7" 237 | 238 | [[package]] 239 | name = "typing-extensions" 240 | version = "4.3.0" 241 | description = "Backported and Experimental Type Hints for Python 3.7+" 242 | category = "dev" 243 | optional = false 244 | python-versions = ">=3.7" 245 | 246 | [[package]] 247 | name = "watchdog" 248 | version = "2.1.9" 249 | description = "Filesystem events monitoring" 250 | category = "dev" 251 | optional = false 252 | python-versions = ">=3.6" 253 | 254 | [package.extras] 255 | watchmedo = ["PyYAML (>=3.10)"] 256 | 257 | [metadata] 258 | lock-version = "1.1" 259 | python-versions = "^3.9" 260 | content-hash = "8b87ebea17cf625e72e280a4f219dfc7a5722fe420b9beeb77e0460c83df34f0" 261 | 262 | [metadata.files] 263 | atomicwrites = [ 264 | {file = "atomicwrites-1.4.1.tar.gz", hash = "sha256:81b2c9071a49367a7f770170e5eec8cb66567cfbbc8c73d20ce5ca4a8d71cf11"}, 265 | ] 266 | attrs = [ 267 | {file = "attrs-21.4.0-py2.py3-none-any.whl", hash = "sha256:2d27e3784d7a565d36ab851fe94887c5eccd6a463168875832a1be79c82828b4"}, 268 | {file = "attrs-21.4.0.tar.gz", hash = "sha256:626ba8234211db98e869df76230a137c4c40a12d72445c45d5f5b716f076e2fd"}, 269 | ] 270 | black = [ 271 | {file = "black-22.6.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:f586c26118bc6e714ec58c09df0157fe2d9ee195c764f630eb0d8e7ccce72e69"}, 272 | {file = "black-22.6.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:b270a168d69edb8b7ed32c193ef10fd27844e5c60852039599f9184460ce0807"}, 273 | {file = "black-22.6.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:6797f58943fceb1c461fb572edbe828d811e719c24e03375fd25170ada53825e"}, 274 | {file = "black-22.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c85928b9d5f83b23cee7d0efcb310172412fbf7cb9d9ce963bd67fd141781def"}, 275 | {file = "black-22.6.0-cp310-cp310-win_amd64.whl", hash = 
"sha256:f6fe02afde060bbeef044af7996f335fbe90b039ccf3f5eb8f16df8b20f77666"}, 276 | {file = "black-22.6.0-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:cfaf3895a9634e882bf9d2363fed5af8888802d670f58b279b0bece00e9a872d"}, 277 | {file = "black-22.6.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:94783f636bca89f11eb5d50437e8e17fbc6a929a628d82304c80fa9cd945f256"}, 278 | {file = "black-22.6.0-cp36-cp36m-win_amd64.whl", hash = "sha256:2ea29072e954a4d55a2ff58971b83365eba5d3d357352a07a7a4df0d95f51c78"}, 279 | {file = "black-22.6.0-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:e439798f819d49ba1c0bd9664427a05aab79bfba777a6db94fd4e56fae0cb849"}, 280 | {file = "black-22.6.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:187d96c5e713f441a5829e77120c269b6514418f4513a390b0499b0987f2ff1c"}, 281 | {file = "black-22.6.0-cp37-cp37m-win_amd64.whl", hash = "sha256:074458dc2f6e0d3dab7928d4417bb6957bb834434516f21514138437accdbe90"}, 282 | {file = "black-22.6.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:a218d7e5856f91d20f04e931b6f16d15356db1c846ee55f01bac297a705ca24f"}, 283 | {file = "black-22.6.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:568ac3c465b1c8b34b61cd7a4e349e93f91abf0f9371eda1cf87194663ab684e"}, 284 | {file = "black-22.6.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:6c1734ab264b8f7929cef8ae5f900b85d579e6cbfde09d7387da8f04771b51c6"}, 285 | {file = "black-22.6.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c9a3ac16efe9ec7d7381ddebcc022119794872abce99475345c5a61aa18c45ad"}, 286 | {file = "black-22.6.0-cp38-cp38-win_amd64.whl", hash = "sha256:b9fd45787ba8aa3f5e0a0a98920c1012c884622c6c920dbe98dbd05bc7c70fbf"}, 287 | {file = "black-22.6.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:7ba9be198ecca5031cd78745780d65a3f75a34b2ff9be5837045dce55db83d1c"}, 288 | {file = "black-22.6.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:a3db5b6409b96d9bd543323b23ef32a1a2b06416d525d27e0f67e74f1446c8f2"}, 289 | {file = "black-22.6.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:560558527e52ce8afba936fcce93a7411ab40c7d5fe8c2463e279e843c0328ee"}, 290 | {file = "black-22.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b154e6bbde1e79ea3260c4b40c0b7b3109ffcdf7bc4ebf8859169a6af72cd70b"}, 291 | {file = "black-22.6.0-cp39-cp39-win_amd64.whl", hash = "sha256:4af5bc0e1f96be5ae9bd7aaec219c901a94d6caa2484c21983d043371c733fc4"}, 292 | {file = "black-22.6.0-py3-none-any.whl", hash = "sha256:ac609cf8ef5e7115ddd07d85d988d074ed00e10fbc3445aee393e70164a2219c"}, 293 | {file = "black-22.6.0.tar.gz", hash = "sha256:6c6d39e28aed379aec40da1c65434c77d75e65bb59a1e1c283de545fb4e7c6c9"}, 294 | ] 295 | click = [ 296 | {file = "click-8.1.3-py3-none-any.whl", hash = "sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48"}, 297 | {file = "click-8.1.3.tar.gz", hash = "sha256:7682dc8afb30297001674575ea00d1814d808d6a36af415a82bd481d37ba7b8e"}, 298 | ] 299 | colorama = [ 300 | {file = "colorama-0.4.5-py2.py3-none-any.whl", hash = "sha256:854bf444933e37f5824ae7bfc1e98d5bce2ebe4160d46b5edf346a89358e99da"}, 301 | {file = "colorama-0.4.5.tar.gz", hash = "sha256:e6c6b4334fc50988a639d9b98aa429a0b57da6e17b9a44f0451f930b6967b7a4"}, 302 | ] 303 | docopt = [ 304 | {file = "docopt-0.6.2.tar.gz", hash = "sha256:49b3a825280bd66b3aa83585ef59c4a8c82f2c8a522dbe754a8bc8d08c85c491"}, 305 | ] 306 | findspark = [ 307 | {file = "findspark-2.0.1-py2.py3-none-any.whl", hash = 
"sha256:e5d5415ff8ced6b173b801e12fc90c1eefca1fb6bf9c19c4fc1f235d4222e753"}, 308 | {file = "findspark-2.0.1.tar.gz", hash = "sha256:aa10a96cb616cab329181d72e8ef13d2dc453b4babd02b5482471a0882c1195e"}, 309 | ] 310 | iniconfig = [ 311 | {file = "iniconfig-1.1.1-py2.py3-none-any.whl", hash = "sha256:011e24c64b7f47f6ebd835bb12a743f2fbe9a26d4cecaa7f53bc4f35ee9da8b3"}, 312 | {file = "iniconfig-1.1.1.tar.gz", hash = "sha256:bc3af051d7d14b2ee5ef9969666def0cd1a000e121eaea580d4a313df4b37f32"}, 313 | ] 314 | mypy-extensions = [ 315 | {file = "mypy_extensions-0.4.3-py2.py3-none-any.whl", hash = "sha256:090fedd75945a69ae91ce1303b5824f428daf5a028d2f6ab8a299250a846f15d"}, 316 | {file = "mypy_extensions-0.4.3.tar.gz", hash = "sha256:2d82818f5bb3e369420cb3c4060a7970edba416647068eb4c5343488a6c604a8"}, 317 | ] 318 | packaging = [ 319 | {file = "packaging-21.3-py3-none-any.whl", hash = "sha256:ef103e05f519cdc783ae24ea4e2e0f508a9c99b2d4969652eed6a2e1ea5bd522"}, 320 | {file = "packaging-21.3.tar.gz", hash = "sha256:dd47c42927d89ab911e606518907cc2d3a1f38bbd026385970643f9c5b8ecfeb"}, 321 | ] 322 | pathspec = [ 323 | {file = "pathspec-0.9.0-py2.py3-none-any.whl", hash = "sha256:7d15c4ddb0b5c802d161efc417ec1a2558ea2653c2e8ad9c19098201dc1c993a"}, 324 | {file = "pathspec-0.9.0.tar.gz", hash = "sha256:e564499435a2673d586f6b2130bb5b95f04a3ba06f81b8f895b651a3c76aabb1"}, 325 | ] 326 | platformdirs = [ 327 | {file = "platformdirs-2.5.2-py3-none-any.whl", hash = "sha256:027d8e83a2d7de06bbac4e5ef7e023c02b863d7ea5d079477e722bb41ab25788"}, 328 | {file = "platformdirs-2.5.2.tar.gz", hash = "sha256:58c8abb07dcb441e6ee4b11d8df0ac856038f944ab98b7be6b27b2a3c7feef19"}, 329 | ] 330 | pluggy = [ 331 | {file = "pluggy-1.0.0-py2.py3-none-any.whl", hash = "sha256:74134bbf457f031a36d68416e1509f34bd5ccc019f0bcc952c7b909d06b37bd3"}, 332 | {file = "pluggy-1.0.0.tar.gz", hash = "sha256:4224373bacce55f955a878bf9cfa763c1e360858e330072059e10bad68531159"}, 333 | ] 334 | py = [ 335 | {file = "py-1.11.0-py2.py3-none-any.whl", hash = "sha256:607c53218732647dff4acdfcd50cb62615cedf612e72d1724fb1a0cc6405b378"}, 336 | {file = "py-1.11.0.tar.gz", hash = "sha256:51c75c4126074b472f746a24399ad32f6053d1b34b68d2fa41e558e6f4a98719"}, 337 | ] 338 | py4j = [ 339 | {file = "py4j-0.10.9.5-py2.py3-none-any.whl", hash = "sha256:52d171a6a2b031d8a5d1de6efe451cf4f5baff1a2819aabc3741c8406539ba04"}, 340 | {file = "py4j-0.10.9.5.tar.gz", hash = "sha256:276a4a3c5a2154df1860ef3303a927460e02e97b047dc0a47c1c3fb8cce34db6"}, 341 | ] 342 | pyparsing = [ 343 | {file = "pyparsing-3.0.9-py3-none-any.whl", hash = "sha256:5026bae9a10eeaefb61dab2f09052b9f4307d44aee4eda64b309723d8d206bbc"}, 344 | {file = "pyparsing-3.0.9.tar.gz", hash = "sha256:2b020ecf7d21b687f219b71ecad3631f644a47f01403fa1d1036b0c6416d70fb"}, 345 | ] 346 | pyspark = [ 347 | {file = "pyspark-3.3.0.tar.gz", hash = "sha256:7ebe8e9505647b4d124d5a82fca60dfd3891021cf8ad6c5ec88777eeece92cf7"}, 348 | ] 349 | pytest = [ 350 | {file = "pytest-7.1.2-py3-none-any.whl", hash = "sha256:13d0e3ccfc2b6e26be000cb6568c832ba67ba32e719443bfe725814d3c42433c"}, 351 | {file = "pytest-7.1.2.tar.gz", hash = "sha256:a06a0425453864a270bc45e71f783330a7428defb4230fb5e6a731fde06ecd45"}, 352 | ] 353 | pytest-spark = [ 354 | {file = "pytest-spark-0.6.0.tar.gz", hash = "sha256:06e3fbfa2e7fa69d2976c10037c9ee3549c80580228bde5b9aa602f44b711f17"}, 355 | {file = "pytest_spark-0.6.0-py3-none-any.whl", hash = "sha256:cabfbcfca6a4876c5e03b151ba9217f3888fe5142154c1e885dd7902afa85a89"}, 356 | ] 357 | pytest-watch = [ 358 | {file = "pytest-watch-4.2.0.tar.gz", 
hash = "sha256:06136f03d5b361718b8d0d234042f7b2f203910d8568f63df2f866b547b3d4b9"}, 359 | ] 360 | tomli = [ 361 | {file = "tomli-2.0.1-py3-none-any.whl", hash = "sha256:939de3e7a6161af0c887ef91b7d41a53e7c5a1ca976325f429cb46ea9bc30ecc"}, 362 | {file = "tomli-2.0.1.tar.gz", hash = "sha256:de526c12914f0c550d15924c62d72abc48d6fe7364aa87328337a31007fe8a4f"}, 363 | ] 364 | typing-extensions = [ 365 | {file = "typing_extensions-4.3.0-py3-none-any.whl", hash = "sha256:25642c956049920a5aa49edcdd6ab1e06d7e5d467fc00e0506c44ac86fbfca02"}, 366 | {file = "typing_extensions-4.3.0.tar.gz", hash = "sha256:e6d2677a32f47fc7eb2795db1dd15c1f34eff616bcaf2cfb5e997f854fa1c4a6"}, 367 | ] 368 | watchdog = [ 369 | {file = "watchdog-2.1.9-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:a735a990a1095f75ca4f36ea2ef2752c99e6ee997c46b0de507ba40a09bf7330"}, 370 | {file = "watchdog-2.1.9-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:6b17d302850c8d412784d9246cfe8d7e3af6bcd45f958abb2d08a6f8bedf695d"}, 371 | {file = "watchdog-2.1.9-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:ee3e38a6cc050a8830089f79cbec8a3878ec2fe5160cdb2dc8ccb6def8552658"}, 372 | {file = "watchdog-2.1.9-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:64a27aed691408a6abd83394b38503e8176f69031ca25d64131d8d640a307591"}, 373 | {file = "watchdog-2.1.9-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:195fc70c6e41237362ba720e9aaf394f8178bfc7fa68207f112d108edef1af33"}, 374 | {file = "watchdog-2.1.9-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:bfc4d351e6348d6ec51df007432e6fe80adb53fd41183716017026af03427846"}, 375 | {file = "watchdog-2.1.9-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:8250546a98388cbc00c3ee3cc5cf96799b5a595270dfcfa855491a64b86ef8c3"}, 376 | {file = "watchdog-2.1.9-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:117ffc6ec261639a0209a3252546b12800670d4bf5f84fbd355957a0595fe654"}, 377 | {file = "watchdog-2.1.9-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:97f9752208f5154e9e7b76acc8c4f5a58801b338de2af14e7e181ee3b28a5d39"}, 378 | {file = "watchdog-2.1.9-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:247dcf1df956daa24828bfea5a138d0e7a7c98b1a47cf1fa5b0c3c16241fcbb7"}, 379 | {file = "watchdog-2.1.9-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:226b3c6c468ce72051a4c15a4cc2ef317c32590d82ba0b330403cafd98a62cfd"}, 380 | {file = "watchdog-2.1.9-pp37-pypy37_pp73-macosx_10_9_x86_64.whl", hash = "sha256:d9820fe47c20c13e3c9dd544d3706a2a26c02b2b43c993b62fcd8011bcc0adb3"}, 381 | {file = "watchdog-2.1.9-pp38-pypy38_pp73-macosx_10_9_x86_64.whl", hash = "sha256:70af927aa1613ded6a68089a9262a009fbdf819f46d09c1a908d4b36e1ba2b2d"}, 382 | {file = "watchdog-2.1.9-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:ed80a1628cee19f5cfc6bb74e173f1b4189eb532e705e2a13e3250312a62e0c9"}, 383 | {file = "watchdog-2.1.9-py3-none-manylinux2014_aarch64.whl", hash = "sha256:9f05a5f7c12452f6a27203f76779ae3f46fa30f1dd833037ea8cbc2887c60213"}, 384 | {file = "watchdog-2.1.9-py3-none-manylinux2014_armv7l.whl", hash = "sha256:255bb5758f7e89b1a13c05a5bceccec2219f8995a3a4c4d6968fe1de6a3b2892"}, 385 | {file = "watchdog-2.1.9-py3-none-manylinux2014_i686.whl", hash = "sha256:d3dda00aca282b26194bdd0adec21e4c21e916956d972369359ba63ade616153"}, 386 | {file = "watchdog-2.1.9-py3-none-manylinux2014_ppc64.whl", hash = "sha256:186f6c55abc5e03872ae14c2f294a153ec7292f807af99f57611acc8caa75306"}, 387 | {file = "watchdog-2.1.9-py3-none-manylinux2014_ppc64le.whl", hash = "sha256:083171652584e1b8829581f965b9b7723ca5f9a2cd7e20271edf264cfd7c1412"}, 388 | 
{file = "watchdog-2.1.9-py3-none-manylinux2014_s390x.whl", hash = "sha256:b530ae007a5f5d50b7fbba96634c7ee21abec70dc3e7f0233339c81943848dc1"}, 389 | {file = "watchdog-2.1.9-py3-none-manylinux2014_x86_64.whl", hash = "sha256:4f4e1c4aa54fb86316a62a87b3378c025e228178d55481d30d857c6c438897d6"}, 390 | {file = "watchdog-2.1.9-py3-none-win32.whl", hash = "sha256:5952135968519e2447a01875a6f5fc8c03190b24d14ee52b0f4b1682259520b1"}, 391 | {file = "watchdog-2.1.9-py3-none-win_amd64.whl", hash = "sha256:7a833211f49143c3d336729b0020ffd1274078e94b0ae42e22f596999f50279c"}, 392 | {file = "watchdog-2.1.9-py3-none-win_ia64.whl", hash = "sha256:ad576a565260d8f99d97f2e64b0f97a48228317095908568a9d5c786c829d428"}, 393 | {file = "watchdog-2.1.9.tar.gz", hash = "sha256:43ce20ebb36a51f21fa376f76d1d4692452b2527ccd601950d69ed36b9e21609"}, 394 | ] 395 | -------------------------------------------------------------------------------- /poetry.toml: -------------------------------------------------------------------------------- 1 | [virtualenvs] 2 | in-project = true 3 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "data-engineering" 3 | version = "0.1.0" 4 | description = "" 5 | authors = ["TCM Labs"] 6 | license = "MIT" 7 | 8 | [tool.poetry.dependencies] 9 | python = "^3.9" 10 | pyspark = "^3.3.0" 11 | 12 | [tool.poetry.dev-dependencies] 13 | black = "^22.6.0" 14 | pytest = "^7.1.2" 15 | pytest-watch = "^4.2.0" 16 | pytest-spark = "^0.6.0" 17 | 18 | [build-system] 19 | requires = ["poetry-core>=1.0.0"] 20 | build-backend = "poetry.core.masonry.api" 21 | -------------------------------------------------------------------------------- /pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | ; see: https://docs.pytest.org/en/latest/explanation/goodpractices.html 3 | pythonpath = "src" 4 | -------------------------------------------------------------------------------- /src/.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ -------------------------------------------------------------------------------- /src/cinema/core/domain/movie_type.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass 2 | from typing import Protocol 3 | 4 | 5 | class WithName(Protocol): 6 | name: str 7 | 8 | 9 | class WithBudget(Protocol): 10 | budget: int 11 | 12 | 13 | @dataclass 14 | class Movie(WithName, WithBudget): 15 | name: str 16 | budget: int 17 | -------------------------------------------------------------------------------- /src/cinema/core/domain/services/movie_domain_service.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | 3 | from typing import Generic, List, TypeVar 4 | 5 | from pyspark.sql.dataframe import DataFrame 6 | from pyspark.sql.functions import col, avg 7 | 8 | from cinema.core.domain.movie_type import Movie, WithBudget, WithName 9 | 10 | T = TypeVar("T", covariant=True) 11 | 12 | 13 | class MovieDomainService(Generic[T]): 14 | df: DataFrame 15 | 16 | def __init__(self, df: DataFrame): 17 | self.df = df 18 | 19 | # Where steps 20 | def most_expensive( 21 | self: MovieDomainService[WithBudget], count: int 22 | ) -> MovieDomainService[T]: 23 | return 
    # Where steps
    def most_expensive(
        self: MovieDomainService[WithBudget], count: int
    ) -> MovieDomainService[T]:
        return MovieDomainService(self.df.orderBy(col("budget").desc()).limit(count))

    # Select
    def names(self: MovieDomainService[WithName]) -> MovieDomainService[WithName]:
        return MovieDomainService(self.df.select("name"))

    def budgets(self: MovieDomainService[WithBudget]) -> MovieDomainService[WithBudget]:
        return MovieDomainService(self.df.select("budget"))

    # Aggregations
    def average_budget(self: MovieDomainService[WithBudget]) -> float:
        return self.df.agg(avg(col("budget"))).collect()[0]["avg(budget)"]

    def count(self: MovieDomainService[T]) -> int:
        return self.df.count()

    # Terminal steps
    def dangerously_convert_to_list(self: MovieDomainService[Movie]) -> List[Movie]:
        # Note: Spark being a big data tool, we're not supposed to convert
        # a DataFrame to a list.
        # Such a conversion is most likely an anti-pattern unless you're
        # absolutely sure that the Spark queries return a very small number of
        # elements.
        movies = [Movie(**row.asDict()) for row in self.df.collect()]

        return movies
--------------------------------------------------------------------------------
/src/cinema/core/ports/primary/average_movie_budget_command.py:
--------------------------------------------------------------------------------
from dataclasses import dataclass


@dataclass
class AverageMovieBudgetCommand:  # TODO: make this generic
    pass
--------------------------------------------------------------------------------
/src/cinema/core/ports/primary/count_movie_command.py:
--------------------------------------------------------------------------------
from dataclasses import dataclass


@dataclass
class CountMovieCommand:  # TODO: make this generic
    pass
--------------------------------------------------------------------------------
/src/cinema/core/ports/primary/most_expensive_movie_command.py:
--------------------------------------------------------------------------------
from dataclasses import dataclass


@dataclass
class MostExpensiveMoviesCommand:  # TODO: make this generic
    count: int
--------------------------------------------------------------------------------
/src/cinema/core/ports/primary/use_cases.py:
--------------------------------------------------------------------------------
from abc import abstractmethod
from typing import Generic, Protocol, TypeVar


C = TypeVar("C", contravariant=True)  # Command
R = TypeVar("R", covariant=True)  # Response


class UseCase(Generic[C, R], Protocol):
    @abstractmethod
    def run(self, command: C) -> R:
        pass
--------------------------------------------------------------------------------
/src/cinema/core/ports/secondary/movie_repository.py:
--------------------------------------------------------------------------------
from abc import abstractmethod
from typing import Protocol

from pyspark.sql.dataframe import DataFrame


class MovieRepository(Protocol):
    @abstractmethod
    def find_all_movies(self) -> DataFrame:
        raise NotImplementedError()
--------------------------------------------------------------------------------
/src/cinema/core/use_cases/average_budget_use_case.py:
--------------------------------------------------------------------------------
from cinema.core.domain.movie_type import Movie
from cinema.core.domain.services.movie_domain_service import MovieDomainService
from cinema.core.ports.primary.average_movie_budget_command import (
    AverageMovieBudgetCommand,
)
from cinema.core.ports.primary.use_cases import UseCase
from cinema.core.ports.secondary.movie_repository import MovieRepository


class AverageMovieBudgetUseCase(UseCase[AverageMovieBudgetCommand, float]):
    # https://stackoverflow.com/questions/72141966/infer-type-from-subclass-method-return-type

    def __init__(self, movie_repository: MovieRepository) -> None:
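        # The MovieRepository port is injected through the constructor
        # (Dependency Injection): the use case never knows whether movies
        # come from memory, a CSV file, or anywhere else.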
        self._movie_repository = movie_repository

    def run(self, command: AverageMovieBudgetCommand):
        # TODO: find out why we need to type 'command' argument again
        movies = MovieDomainService[Movie](self._movie_repository.find_all_movies())

        average_movie_budget = movies.average_budget()

        return average_movie_budget
--------------------------------------------------------------------------------
/src/cinema/core/use_cases/count_movie_use_case.py:
--------------------------------------------------------------------------------
from cinema.core.domain.movie_type import Movie
from cinema.core.domain.services.movie_domain_service import MovieDomainService
from cinema.core.ports.primary.count_movie_command import CountMovieCommand
from cinema.core.ports.primary.use_cases import UseCase
from cinema.core.ports.secondary.movie_repository import MovieRepository


class CountMovieUseCase(UseCase[CountMovieCommand, int]):
    # https://stackoverflow.com/questions/72141966/infer-type-from-subclass-method-return-type

    def __init__(self, movie_repository: MovieRepository) -> None:
        self._movie_repository = movie_repository

    def run(self, command: CountMovieCommand):
        # TODO: find out why we need to type 'command' argument again
        movies = MovieDomainService[Movie](self._movie_repository.find_all_movies())

        total_number_of_movies = movies.count()

        return total_number_of_movies
--------------------------------------------------------------------------------
/src/cinema/core/use_cases/most_expensive_movies.py:
--------------------------------------------------------------------------------
from typing import List

from cinema.core.domain.movie_type import Movie
from cinema.core.domain.services.movie_domain_service import MovieDomainService
from cinema.core.ports.primary.most_expensive_movie_command import (
    MostExpensiveMoviesCommand,
)
from cinema.core.ports.primary.use_cases import UseCase
from cinema.core.ports.secondary.movie_repository import MovieRepository


class MostExpensiveMoviesUseCase(UseCase[MostExpensiveMoviesCommand, List[Movie]]):
    # https://stackoverflow.com/questions/72141966/infer-type-from-subclass-method-return-type

    def __init__(self, movie_repository: MovieRepository) -> None:
        self._movie_repository = movie_repository

    def run(self, command: MostExpensiveMoviesCommand):
        # TODO: find out why we need to type 'command' argument again
        movies = MovieDomainService[Movie](self._movie_repository.find_all_movies())

        most_expensive_movies = movies.most_expensive(
            count=command.count
        ).dangerously_convert_to_list()

        return most_expensive_movies
--------------------------------------------------------------------------------
/src/cinema/primary_adapters/cli/main.py:
--------------------------------------------------------------------------------
from pyspark.sql import SparkSession

from cinema.core.ports.primary.most_expensive_movie_command import (
    MostExpensiveMoviesCommand,
)
from cinema.core.use_cases.most_expensive_movies import MostExpensiveMoviesUseCase
from cinema.secondary_adapters.repositories.movie.in_memory_movie_repository import (
    InMemoryMovieRepository,
)


def run_movie_application():
    # Wire up dependencies
    spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()

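    # This function is the composition root: concrete adapters are chosen here.
    # Swap InMemoryMovieRepository for KaggleFileSystemMovieRepository to run
    # the very same use case against the Kaggle dataset instead.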
    movie_repository = InMemoryMovieRepository(spark)
    use_case = MostExpensiveMoviesUseCase(movie_repository)

    # Run use case
    top_two_most_expensive_movies = use_case.run(MostExpensiveMoviesCommand(count=2))

    return top_two_most_expensive_movies
--------------------------------------------------------------------------------
/src/cinema/primary_adapters/tests/average_movie_budget_use_case_test.py:
--------------------------------------------------------------------------------
from pyspark.sql import SparkSession

from cinema.core.ports.primary.average_movie_budget_command import (
    AverageMovieBudgetCommand,
)
from cinema.core.use_cases.average_budget_use_case import AverageMovieBudgetUseCase
from cinema.secondary_adapters.repositories.movie.in_memory_movie_repository import (
    InMemoryMovieRepository,
)


class TestAverageMovieBudgetUseCase:
    def test_average_movie_budget(self, spark_session: SparkSession):
        movie_repository = InMemoryMovieRepository(spark_session)
        average_movie_budget_use_case = AverageMovieBudgetUseCase(movie_repository)

        average_movie_budget_command = AverageMovieBudgetCommand()

        average_budget = average_movie_budget_use_case.run(average_movie_budget_command)

        # (120_000_000 + 80_000_000 + 205_000_000) / 3 == 135_000_000
        assert average_budget == 135_000_000
--------------------------------------------------------------------------------
/src/cinema/primary_adapters/tests/most_expensive_movies_use_case_test.py:
--------------------------------------------------------------------------------
from pyspark.sql import SparkSession

from cinema.core.domain.movie_type import Movie
from cinema.core.ports.primary.most_expensive_movie_command import (
    MostExpensiveMoviesCommand,
)
from cinema.core.use_cases.most_expensive_movies import MostExpensiveMoviesUseCase
from cinema.secondary_adapters.repositories.movie.in_memory_movie_repository import (
    InMemoryMovieRepository,
)
from cinema.secondary_adapters.repositories.movie.kaggle_movie_repository import (
    KaggleFileSystemMovieRepository,
)


class TestMostExpensiveMoviesUseCase:
    def test_most_expensive_movies(self, spark_session: SparkSession):
        movie_repository = InMemoryMovieRepository(spark_session)
        most_expensive_movies_use_case = MostExpensiveMoviesUseCase(movie_repository)

        most_expensive_movie_ever_command = MostExpensiveMoviesCommand(count=1)

        movies = most_expensive_movies_use_case.run(most_expensive_movie_ever_command)

        assert movies == [
            Movie(
                name="Strange Universe",
                budget=205_000_000,
            )
        ]

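    # Integration-style test: it expects the Kaggle movies dataset CSV to be
    # downloaded locally under data/kaggle_movies_dataset/ (the data/ folder
    # is git-ignored, see data/.gitignore).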
    def test_most_expensive_movies_kaggle_dataset(self, spark_session: SparkSession):
        movie_repository = KaggleFileSystemMovieRepository(spark_session)
        most_expensive_movies_use_case = MostExpensiveMoviesUseCase(movie_repository)

        most_expensive_movie_ever_command = MostExpensiveMoviesCommand(count=1)

        [movie] = most_expensive_movies_use_case.run(most_expensive_movie_ever_command)

        assert movie.name == "Pirates of the Caribbean: On Stranger Tides"
        assert movie.budget == 380_000_000
--------------------------------------------------------------------------------
/src/cinema/primary_adapters/tests/movie_count_use_case_test.py:
--------------------------------------------------------------------------------
from pyspark.sql import SparkSession

from cinema.core.ports.primary.count_movie_command import CountMovieCommand
from cinema.core.use_cases.count_movie_use_case import CountMovieUseCase
from cinema.secondary_adapters.repositories.movie.in_memory_movie_repository import (
    InMemoryMovieRepository,
)


class TestMovieCountUseCase:
    def test_total_number_of_movies(self, spark_session: SparkSession):
        movie_repository = InMemoryMovieRepository(spark_session)
        count_movie_use_case = CountMovieUseCase(movie_repository)

        count_movie_command = CountMovieCommand()

        total_number_of_movies = count_movie_use_case.run(count_movie_command)

        assert total_number_of_movies == 3
--------------------------------------------------------------------------------
/src/cinema/secondary_adapters/repositories/movie/in_memory_movie_repository.py:
--------------------------------------------------------------------------------
from pyspark.sql import Row, SparkSession

from cinema.core.ports.secondary.movie_repository import MovieRepository


class InMemoryMovieRepository(MovieRepository):
    _spark: SparkSession

    def __init__(self, spark: SparkSession) -> None:
        super().__init__()
        self._spark = spark

    def find_all_movies(self):
        movies = self._spark.createDataFrame(
            [
                Row(
                    name="Greatest Heroes I",
                    budget=120_000_000,
                ),
                Row(
                    name="From the future, to the past",
                    budget=80_000_000,
                ),
                Row(
                    name="Strange Universe",
                    budget=205_000_000,
                ),
            ]
        )

        return movies
--------------------------------------------------------------------------------
/src/cinema/secondary_adapters/repositories/movie/kaggle_movie_repository.py:
--------------------------------------------------------------------------------
from typing import final

from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, FloatType, IntegerType, StringType, StructType

from cinema.core.ports.secondary.movie_repository import MovieRepository

movies_metadata_raw_csv_schema = (
    StructType()
    .add("adult", StringType(), nullable=False)
    .add("belongs_to_collection", StringType(), nullable=False)
    .add("budget", IntegerType(), nullable=False)
    .add("genres", StringType(), nullable=False)
    .add("homepage", StringType(), nullable=False)
    .add("id", StringType(), nullable=False)
    .add("imdb_id", StringType(), nullable=False)
    .add("original_language", StringType(), nullable=False)
    .add("original_title", StringType(), nullable=False)
    .add("overview", StringType(), nullable=False)
    .add("popularity", FloatType(), nullable=False)
    .add("poster_path", StringType(), nullable=False)
    .add("production_companies", StringType(), nullable=False)
    .add("production_countries", StringType(), nullable=False)
    .add("release_date", DateType(), nullable=False)
    .add("revenue", IntegerType(), nullable=False)
    .add("runtime", FloatType(), nullable=False)
    .add("spoken_languages", StringType(), nullable=False)
    .add("status", StringType(), nullable=False)
    .add("tagline", StringType(), nullable=False)
    .add("title", StringType(), nullable=False)
    .add("video", StringType(), nullable=False)
    .add("vote_average", FloatType(), nullable=False)
    .add("vote_count", IntegerType(), nullable=False)
)


@final
class KaggleFileSystemMovieRepository(MovieRepository):
    _spark: SparkSession

    def __init__(self, spark: SparkSession) -> None:
        self._spark = spark

    def find_all_movies(self):
        movies = (
            self._spark.read.csv(
                "data/kaggle_movies_dataset/movies_metadata.csv",
                # See: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
                schema=movies_metadata_raw_csv_schema,
                header=True,
                inferSchema=False,
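                # PERMISSIVE keeps malformed rows (setting unparseable fields
                # to null), while FAILFAST raises on the first malformed record.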
                mode="PERMISSIVE",  # stricter: FAILFAST
                sep=",",
            )
            .withColumnRenamed("original_title", "name")
            .select("name", "budget")
        )

        return movies
--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
from cinema.primary_adapters.cli.main import run_movie_application

if __name__ == "__main__":
    result = run_movie_application()
    print(result)
--------------------------------------------------------------------------------