├── .python-version
├── .vscode
    └── settings.json
├── README.md
├── bin
    ├── run-main.sh
    └── run-test.sh
├── data
    └── .gitignore
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── pytest.ini
└── src
    ├── .gitignore
    ├── cinema
        ├── core
        │   ├── domain
        │   │   ├── movie_type.py
        │   │   └── services
        │   │   │   └── movie_domain_service.py
        │   ├── ports
        │   │   ├── primary
        │   │   │   ├── average_movie_budget_command.py
        │   │   │   ├── count_movie_command.py
        │   │   │   ├── most_expensive_movie_command.py
        │   │   │   └── use_cases.py
        │   │   └── secondary
        │   │   │   └── movie_repository.py
        │   └── use_cases
        │   │   ├── average_budget_use_case.py
        │   │   ├── count_movie_use_case.py
        │   │   └── most_expensive_movies.py
        ├── primary_adapters
        │   ├── cli
        │   │   └── main.py
        │   └── tests
        │   │   ├── average_movie_budget_use_case_test.py
        │   │   ├── most_expensive_movies_use_case_test.py
        │   │   └── movie_count_use_case_test.py
        └── secondary_adapters
        │   └── repositories
        │       └── movie
        │           ├── in_memory_movie_repository.py
        │           └── kaggle_movie_repository.py
    └── main.py


/.python-version:
--------------------------------------------------------------------------------
1 | 3.9.13
2 | 


--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
 1 | {
 2 |   "[python]": {
 3 |     "editor.defaultFormatter": "ms-python.black-formatter",
 4 |     "editor.formatOnSave": true
 5 |   },
 6 |   "python.formatting.provider": "none",
 7 |   "editor.formatOnSave": true,
 8 |   "python.testing.pytestArgs": ["src"],
 9 |   "python.testing.unittestEnabled": false,
10 |   "python.testing.pytestEnabled": true,
11 |   "python.analysis.typeCheckingMode": "strict",
12 |   "files.exclude": {
13 |     "**/__pycache__": true
14 |   }
15 | }
16 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # TCM Labs' Python/Spark data engineering white project
  2 | 
  3 | This is a Python/Spark white project in which we highlight how to craft a highly maintainable data project which can run both locally, on any developer machine, and remotely on any Spark cluster (including Databricks clusters). This also works with Pandas.
  4 | 
  5 | Especially, if you're using Notebooks in production and are unhappy about the quality of the output data, you'll find valuable insights here.
  6 | 
  7 | ## What problems does this white project intend to solve?
  8 | 
  9 | This project is here to help you if you struggle with some of these problems:
 10 | 
 11 | - poor data quality
 12 | - lack of testing
 13 | - costly maintainance
 14 | - slow iteration speed
 15 | - difficult collaboration
 16 | - deployment to production fear
 17 | - etc.
 18 | 
 19 | This project shows how battle-tested software engineering architecture known to be highly effective in backend or frontend environments can also be applied to data projects, using Spark and/or Pandas.
 20 | 
 21 | ## Architecture, tests, feedback loop, code quality and environments
 22 | 
 23 | Let's zoom into 6 very specific problems that your data engineering project may have:
 24 | 
 25 | ### Problem #1: "We don't know where and what the business rules are"
 26 | 
 27 | This is a **software architecture** problem.
 28 | 
 29 | - Quickly find such business rules and modify them
 30 | - Allow business/non-software engineer people to read and possible contribute to the project in very isolated places
 31 | 
 32 | ### Problem #2: "We don't know if the code/business rules behave as expected"
 33 | 
 34 | This is a **testing** problem.
 35 | 
 36 | How do we solve this?
 37 | 
 38 | This repository shows how to write thorough Spark unit tests:
 39 | 
 40 | - Ensure business use cases behaves as expected: given a known input, we should determnistically return the same output
 41 | - Prevent any regression, so that all previously working feature don't break when new features or bug fixes are merged
 42 | 
 43 | ### Problem #3: "We're slow to iterate, developing a new feature or fixing a bug takes age"
 44 | 
 45 | This is a **feedback loop** problem.
 46 | 
 47 | How do we solve this?
 48 | 
 49 | - Allow developers get a very fast local feedback loop by allowing them to run tests locally in less than a minute
 50 | - Allow continuous integration (CI), by preventing non-functioning code to be merged into `main` branch
 51 | 
 52 | ### Problem #4: "We always spend a lot of time understanding what the code does, and people complain it's cryptic and hard to decypher"
 53 | 
 54 | This is a **code quality** problem.
 55 | 
 56 | How do we solve this? We automate the boring yet very important stuffs with tools such as:
 57 | 
 58 | - `black` handles code formatting
 59 | - `flake8` handles linting
 60 | - `Python Type Hints` handles static type checking
 61 | 
 62 | ### Problem #5: "We had a bug in production because the installed library version didn't match what we're using in development"
 63 | 
 64 | This is a **server/machine provisioning** problem.
 65 | 
 66 | How do we solve this?
 67 | 
 68 | - We make it next to impossible to deploy an application in production without the appropriate dependencies and expected exact versions
 69 | - We also allow all developers to work with the same dependency trees, on their machine.
 70 | 
 71 | How? We use the following tools:
 72 | 
 73 | - `poetry` for managing Python dependencies
 74 | - `pyenv` for managing Python binaries
 75 | 
 76 | That could work with `pip` + `virtualenv` or `conda`
 77 | 
 78 | ### Problem #6: "We have duplicated code everywhere, and updating all of this takes age"
 79 | 
 80 | This is a **software engineering** problem.
 81 | 
 82 | How do we solve this?
 83 | 
 84 | - We make domain code explicit in domain services
 85 | - We make orchestration logic, ie. what do to sequentially or in parallel, explicit in application services
 86 | 
 87 | Code is data, and duplicated data tend to go out of sync. Duplicated code will lead to inconsistent code which will lead to inconsistent results.
 88 | 
 89 | ## Highlighted software engineering concepts
 90 | 
 91 | This repository uses the following **software engineering** concepts:
 92 | 
 93 | - [Hexagonal architecture (= ports and adapter)](https://alistair.cockburn.us/hexagonal-architecture); you may be familiar with [Clean architecture](https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html)
 94 | - Inversion of Control (IoC) and Dependency Injection mecanisms (DI)
 95 | - Dependency Inversion Principle (DIP)
 96 | 
 97 | We also borrow useful concepts from **Domain-Driven Design (DDD)**, especially:
 98 | 
 99 | - Repository pattern
100 | - Application services
101 | - Domain services
102 | 
103 | Also, we use Domain-Specific Languages (DSLs) ideas to hide the Spark/Pandas implementations details, and focus on the _what_ rather than the _how_. This part is not mandatory.
104 | 
105 | ## Maintainers
106 | 
107 | Provided to you by [TCM Labs](https://www.tcmlabs.fr/), an expert IT Consulting firm based in Paris, France.
108 | 
109 | We believe that there's absolutely no difference between data, backend and frontend engineering.
110 | 
111 | Lead maintainer:
112 | 
113 | - Jean-Baptiste Musso // jeanbaptiste (at) tcmlabs.fr
114 | 
115 | ## Contributing
116 | 
117 | Feel free to open issues and pull requests. Contributions are welcomed.
118 | 


--------------------------------------------------------------------------------
/bin/run-main.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | 
3 | poetry run python src/main.py


--------------------------------------------------------------------------------
/bin/run-test.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | 
3 | python -m pytest --verbose --capture=no "$@"
4 | 


--------------------------------------------------------------------------------
/data/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore


--------------------------------------------------------------------------------
/poetry.lock:
--------------------------------------------------------------------------------
  1 | [[package]]
  2 | name = "atomicwrites"
  3 | version = "1.4.1"
  4 | description = "Atomic file writes."
  5 | category = "dev"
  6 | optional = false
  7 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
  8 | 
  9 | [[package]]
 10 | name = "attrs"
 11 | version = "21.4.0"
 12 | description = "Classes Without Boilerplate"
 13 | category = "dev"
 14 | optional = false
 15 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
 16 | 
 17 | [package.extras]
 18 | dev = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "mypy", "pytest-mypy-plugins", "zope.interface", "furo", "sphinx", "sphinx-notfound-page", "pre-commit", "cloudpickle"]
 19 | docs = ["furo", "sphinx", "zope.interface", "sphinx-notfound-page"]
 20 | tests = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "mypy", "pytest-mypy-plugins", "zope.interface", "cloudpickle"]
 21 | tests_no_zope = ["coverage[toml] (>=5.0.2)", "hypothesis", "pympler", "pytest (>=4.3.0)", "six", "mypy", "pytest-mypy-plugins", "cloudpickle"]
 22 | 
 23 | [[package]]
 24 | name = "black"
 25 | version = "22.6.0"
 26 | description = "The uncompromising code formatter."
 27 | category = "dev"
 28 | optional = false
 29 | python-versions = ">=3.6.2"
 30 | 
 31 | [package.dependencies]
 32 | click = ">=8.0.0"
 33 | mypy-extensions = ">=0.4.3"
 34 | pathspec = ">=0.9.0"
 35 | platformdirs = ">=2"
 36 | tomli = {version = ">=1.1.0", markers = "python_full_version < \"3.11.0a7\""}
 37 | typing-extensions = {version = ">=3.10.0.0", markers = "python_version < \"3.10\""}
 38 | 
 39 | [package.extras]
 40 | colorama = ["colorama (>=0.4.3)"]
 41 | d = ["aiohttp (>=3.7.4)"]
 42 | jupyter = ["ipython (>=7.8.0)", "tokenize-rt (>=3.2.0)"]
 43 | uvloop = ["uvloop (>=0.15.2)"]
 44 | 
 45 | [[package]]
 46 | name = "click"
 47 | version = "8.1.3"
 48 | description = "Composable command line interface toolkit"
 49 | category = "dev"
 50 | optional = false
 51 | python-versions = ">=3.7"
 52 | 
 53 | [package.dependencies]
 54 | colorama = {version = "*", markers = "platform_system == \"Windows\""}
 55 | 
 56 | [[package]]
 57 | name = "colorama"
 58 | version = "0.4.5"
 59 | description = "Cross-platform colored terminal text."
 60 | category = "dev"
 61 | optional = false
 62 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
 63 | 
 64 | [[package]]
 65 | name = "docopt"
 66 | version = "0.6.2"
 67 | description = "Pythonic argument parser, that will make you smile"
 68 | category = "dev"
 69 | optional = false
 70 | python-versions = "*"
 71 | 
 72 | [[package]]
 73 | name = "findspark"
 74 | version = "2.0.1"
 75 | description = "Find pyspark to make it importable."
 76 | category = "dev"
 77 | optional = false
 78 | python-versions = "*"
 79 | 
 80 | [[package]]
 81 | name = "iniconfig"
 82 | version = "1.1.1"
 83 | description = "iniconfig: brain-dead simple config-ini parsing"
 84 | category = "dev"
 85 | optional = false
 86 | python-versions = "*"
 87 | 
 88 | [[package]]
 89 | name = "mypy-extensions"
 90 | version = "0.4.3"
 91 | description = "Experimental type system extensions for programs checked with the mypy typechecker."
 92 | category = "dev"
 93 | optional = false
 94 | python-versions = "*"
 95 | 
 96 | [[package]]
 97 | name = "packaging"
 98 | version = "21.3"
 99 | description = "Core utilities for Python packages"
100 | category = "dev"
101 | optional = false
102 | python-versions = ">=3.6"
103 | 
104 | [package.dependencies]
105 | pyparsing = ">=2.0.2,<3.0.5 || >3.0.5"
106 | 
107 | [[package]]
108 | name = "pathspec"
109 | version = "0.9.0"
110 | description = "Utility library for gitignore style pattern matching of file paths."
111 | category = "dev"
112 | optional = false
113 | python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,>=2.7"
114 | 
115 | [[package]]
116 | name = "platformdirs"
117 | version = "2.5.2"
118 | description = "A small Python module for determining appropriate platform-specific dirs, e.g. a \"user data dir\"."
119 | category = "dev"
120 | optional = false
121 | python-versions = ">=3.7"
122 | 
123 | [package.extras]
124 | docs = ["furo (>=2021.7.5b38)", "proselint (>=0.10.2)", "sphinx-autodoc-typehints (>=1.12)", "sphinx (>=4)"]
125 | test = ["appdirs (==1.4.4)", "pytest-cov (>=2.7)", "pytest-mock (>=3.6)", "pytest (>=6)"]
126 | 
127 | [[package]]
128 | name = "pluggy"
129 | version = "1.0.0"
130 | description = "plugin and hook calling mechanisms for python"
131 | category = "dev"
132 | optional = false
133 | python-versions = ">=3.6"
134 | 
135 | [package.extras]
136 | dev = ["pre-commit", "tox"]
137 | testing = ["pytest", "pytest-benchmark"]
138 | 
139 | [[package]]
140 | name = "py"
141 | version = "1.11.0"
142 | description = "library with cross-python path, ini-parsing, io, code, log facilities"
143 | category = "dev"
144 | optional = false
145 | python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
146 | 
147 | [[package]]
148 | name = "py4j"
149 | version = "0.10.9.5"
150 | description = "Enables Python programs to dynamically access arbitrary Java objects"
151 | category = "main"
152 | optional = false
153 | python-versions = "*"
154 | 
155 | [[package]]
156 | name = "pyparsing"
157 | version = "3.0.9"
158 | description = "pyparsing module - Classes and methods to define and execute parsing grammars"
159 | category = "dev"
160 | optional = false
161 | python-versions = ">=3.6.8"
162 | 
163 | [package.extras]
164 | diagrams = ["railroad-diagrams", "jinja2"]
165 | 
166 | [[package]]
167 | name = "pyspark"
168 | version = "3.3.0"
169 | description = "Apache Spark Python API"
170 | category = "main"
171 | optional = false
172 | python-versions = ">=3.7"
173 | 
174 | [package.dependencies]
175 | py4j = "0.10.9.5"
176 | 
177 | [package.extras]
178 | ml = ["numpy (>=1.15)"]
179 | mllib = ["numpy (>=1.15)"]
180 | pandas_on_spark = ["numpy (>=1.15)", "pandas (>=1.0.5)", "pyarrow (>=1.0.0)"]
181 | sql = ["pandas (>=1.0.5)", "pyarrow (>=1.0.0)"]
182 | 
183 | [[package]]
184 | name = "pytest"
185 | version = "7.1.2"
186 | description = "pytest: simple powerful testing with Python"
187 | category = "dev"
188 | optional = false
189 | python-versions = ">=3.7"
190 | 
191 | [package.dependencies]
192 | atomicwrites = {version = ">=1.0", markers = "sys_platform == \"win32\""}
193 | attrs = ">=19.2.0"
194 | colorama = {version = "*", markers = "sys_platform == \"win32\""}
195 | iniconfig = "*"
196 | packaging = "*"
197 | pluggy = ">=0.12,<2.0"
198 | py = ">=1.8.2"
199 | tomli = ">=1.0.0"
200 | 
201 | [package.extras]
202 | testing = ["argcomplete", "hypothesis (>=3.56)", "mock", "nose", "pygments (>=2.7.2)", "requests", "xmlschema"]
203 | 
204 | [[package]]
205 | name = "pytest-spark"
206 | version = "0.6.0"
207 | description = "pytest plugin to run the tests with support of pyspark."
208 | category = "dev"
209 | optional = false
210 | python-versions = "*"
211 | 
212 | [package.dependencies]
213 | findspark = "*"
214 | pytest = "*"
215 | 
216 | [[package]]
217 | name = "pytest-watch"
218 | version = "4.2.0"
219 | description = "Local continuous test runner with pytest and watchdog."
220 | category = "dev"
221 | optional = false
222 | python-versions = "*"
223 | 
224 | [package.dependencies]
225 | colorama = ">=0.3.3"
226 | docopt = ">=0.4.0"
227 | pytest = ">=2.6.4"
228 | watchdog = ">=0.6.0"
229 | 
230 | [[package]]
231 | name = "tomli"
232 | version = "2.0.1"
233 | description = "A lil' TOML parser"
234 | category = "dev"
235 | optional = false
236 | python-versions = ">=3.7"
237 | 
238 | [[package]]
239 | name = "typing-extensions"
240 | version = "4.3.0"
241 | description = "Backported and Experimental Type Hints for Python 3.7+"
242 | category = "dev"
243 | optional = false
244 | python-versions = ">=3.7"
245 | 
246 | [[package]]
247 | name = "watchdog"
248 | version = "2.1.9"
249 | description = "Filesystem events monitoring"
250 | category = "dev"
251 | optional = false
252 | python-versions = ">=3.6"
253 | 
254 | [package.extras]
255 | watchmedo = ["PyYAML (>=3.10)"]
256 | 
257 | [metadata]
258 | lock-version = "1.1"
259 | python-versions = "^3.9"
260 | content-hash = "8b87ebea17cf625e72e280a4f219dfc7a5722fe420b9beeb77e0460c83df34f0"
261 | 
262 | [metadata.files]
263 | atomicwrites = [
264 |     {file = "atomicwrites-1.4.1.tar.gz", hash = "sha256:81b2c9071a49367a7f770170e5eec8cb66567cfbbc8c73d20ce5ca4a8d71cf11"},
265 | ]
266 | attrs = [
267 |     {file = "attrs-21.4.0-py2.py3-none-any.whl", hash = "sha256:2d27e3784d7a565d36ab851fe94887c5eccd6a463168875832a1be79c82828b4"},
268 |     {file = "attrs-21.4.0.tar.gz", hash = "sha256:626ba8234211db98e869df76230a137c4c40a12d72445c45d5f5b716f076e2fd"},
269 | ]
270 | black = [
271 |     {file = "black-22.6.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:f586c26118bc6e714ec58c09df0157fe2d9ee195c764f630eb0d8e7ccce72e69"},
272 |     {file = "black-22.6.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:b270a168d69edb8b7ed32c193ef10fd27844e5c60852039599f9184460ce0807"},
273 |     {file = "black-22.6.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:6797f58943fceb1c461fb572edbe828d811e719c24e03375fd25170ada53825e"},
274 |     {file = "black-22.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c85928b9d5f83b23cee7d0efcb310172412fbf7cb9d9ce963bd67fd141781def"},
275 |     {file = "black-22.6.0-cp310-cp310-win_amd64.whl", hash = "sha256:f6fe02afde060bbeef044af7996f335fbe90b039ccf3f5eb8f16df8b20f77666"},
276 |     {file = "black-22.6.0-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:cfaf3895a9634e882bf9d2363fed5af8888802d670f58b279b0bece00e9a872d"},
277 |     {file = "black-22.6.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:94783f636bca89f11eb5d50437e8e17fbc6a929a628d82304c80fa9cd945f256"},
278 |     {file = "black-22.6.0-cp36-cp36m-win_amd64.whl", hash = "sha256:2ea29072e954a4d55a2ff58971b83365eba5d3d357352a07a7a4df0d95f51c78"},
279 |     {file = "black-22.6.0-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:e439798f819d49ba1c0bd9664427a05aab79bfba777a6db94fd4e56fae0cb849"},
280 |     {file = "black-22.6.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:187d96c5e713f441a5829e77120c269b6514418f4513a390b0499b0987f2ff1c"},
281 |     {file = "black-22.6.0-cp37-cp37m-win_amd64.whl", hash = "sha256:074458dc2f6e0d3dab7928d4417bb6957bb834434516f21514138437accdbe90"},
282 |     {file = "black-22.6.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:a218d7e5856f91d20f04e931b6f16d15356db1c846ee55f01bac297a705ca24f"},
283 |     {file = "black-22.6.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:568ac3c465b1c8b34b61cd7a4e349e93f91abf0f9371eda1cf87194663ab684e"},
284 |     {file = "black-22.6.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:6c1734ab264b8f7929cef8ae5f900b85d579e6cbfde09d7387da8f04771b51c6"},
285 |     {file = "black-22.6.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c9a3ac16efe9ec7d7381ddebcc022119794872abce99475345c5a61aa18c45ad"},
286 |     {file = "black-22.6.0-cp38-cp38-win_amd64.whl", hash = "sha256:b9fd45787ba8aa3f5e0a0a98920c1012c884622c6c920dbe98dbd05bc7c70fbf"},
287 |     {file = "black-22.6.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:7ba9be198ecca5031cd78745780d65a3f75a34b2ff9be5837045dce55db83d1c"},
288 |     {file = "black-22.6.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:a3db5b6409b96d9bd543323b23ef32a1a2b06416d525d27e0f67e74f1446c8f2"},
289 |     {file = "black-22.6.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:560558527e52ce8afba936fcce93a7411ab40c7d5fe8c2463e279e843c0328ee"},
290 |     {file = "black-22.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b154e6bbde1e79ea3260c4b40c0b7b3109ffcdf7bc4ebf8859169a6af72cd70b"},
291 |     {file = "black-22.6.0-cp39-cp39-win_amd64.whl", hash = "sha256:4af5bc0e1f96be5ae9bd7aaec219c901a94d6caa2484c21983d043371c733fc4"},
292 |     {file = "black-22.6.0-py3-none-any.whl", hash = "sha256:ac609cf8ef5e7115ddd07d85d988d074ed00e10fbc3445aee393e70164a2219c"},
293 |     {file = "black-22.6.0.tar.gz", hash = "sha256:6c6d39e28aed379aec40da1c65434c77d75e65bb59a1e1c283de545fb4e7c6c9"},
294 | ]
295 | click = [
296 |     {file = "click-8.1.3-py3-none-any.whl", hash = "sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48"},
297 |     {file = "click-8.1.3.tar.gz", hash = "sha256:7682dc8afb30297001674575ea00d1814d808d6a36af415a82bd481d37ba7b8e"},
298 | ]
299 | colorama = [
300 |     {file = "colorama-0.4.5-py2.py3-none-any.whl", hash = "sha256:854bf444933e37f5824ae7bfc1e98d5bce2ebe4160d46b5edf346a89358e99da"},
301 |     {file = "colorama-0.4.5.tar.gz", hash = "sha256:e6c6b4334fc50988a639d9b98aa429a0b57da6e17b9a44f0451f930b6967b7a4"},
302 | ]
303 | docopt = [
304 |     {file = "docopt-0.6.2.tar.gz", hash = "sha256:49b3a825280bd66b3aa83585ef59c4a8c82f2c8a522dbe754a8bc8d08c85c491"},
305 | ]
306 | findspark = [
307 |     {file = "findspark-2.0.1-py2.py3-none-any.whl", hash = "sha256:e5d5415ff8ced6b173b801e12fc90c1eefca1fb6bf9c19c4fc1f235d4222e753"},
308 |     {file = "findspark-2.0.1.tar.gz", hash = "sha256:aa10a96cb616cab329181d72e8ef13d2dc453b4babd02b5482471a0882c1195e"},
309 | ]
310 | iniconfig = [
311 |     {file = "iniconfig-1.1.1-py2.py3-none-any.whl", hash = "sha256:011e24c64b7f47f6ebd835bb12a743f2fbe9a26d4cecaa7f53bc4f35ee9da8b3"},
312 |     {file = "iniconfig-1.1.1.tar.gz", hash = "sha256:bc3af051d7d14b2ee5ef9969666def0cd1a000e121eaea580d4a313df4b37f32"},
313 | ]
314 | mypy-extensions = [
315 |     {file = "mypy_extensions-0.4.3-py2.py3-none-any.whl", hash = "sha256:090fedd75945a69ae91ce1303b5824f428daf5a028d2f6ab8a299250a846f15d"},
316 |     {file = "mypy_extensions-0.4.3.tar.gz", hash = "sha256:2d82818f5bb3e369420cb3c4060a7970edba416647068eb4c5343488a6c604a8"},
317 | ]
318 | packaging = [
319 |     {file = "packaging-21.3-py3-none-any.whl", hash = "sha256:ef103e05f519cdc783ae24ea4e2e0f508a9c99b2d4969652eed6a2e1ea5bd522"},
320 |     {file = "packaging-21.3.tar.gz", hash = "sha256:dd47c42927d89ab911e606518907cc2d3a1f38bbd026385970643f9c5b8ecfeb"},
321 | ]
322 | pathspec = [
323 |     {file = "pathspec-0.9.0-py2.py3-none-any.whl", hash = "sha256:7d15c4ddb0b5c802d161efc417ec1a2558ea2653c2e8ad9c19098201dc1c993a"},
324 |     {file = "pathspec-0.9.0.tar.gz", hash = "sha256:e564499435a2673d586f6b2130bb5b95f04a3ba06f81b8f895b651a3c76aabb1"},
325 | ]
326 | platformdirs = [
327 |     {file = "platformdirs-2.5.2-py3-none-any.whl", hash = "sha256:027d8e83a2d7de06bbac4e5ef7e023c02b863d7ea5d079477e722bb41ab25788"},
328 |     {file = "platformdirs-2.5.2.tar.gz", hash = "sha256:58c8abb07dcb441e6ee4b11d8df0ac856038f944ab98b7be6b27b2a3c7feef19"},
329 | ]
330 | pluggy = [
331 |     {file = "pluggy-1.0.0-py2.py3-none-any.whl", hash = "sha256:74134bbf457f031a36d68416e1509f34bd5ccc019f0bcc952c7b909d06b37bd3"},
332 |     {file = "pluggy-1.0.0.tar.gz", hash = "sha256:4224373bacce55f955a878bf9cfa763c1e360858e330072059e10bad68531159"},
333 | ]
334 | py = [
335 |     {file = "py-1.11.0-py2.py3-none-any.whl", hash = "sha256:607c53218732647dff4acdfcd50cb62615cedf612e72d1724fb1a0cc6405b378"},
336 |     {file = "py-1.11.0.tar.gz", hash = "sha256:51c75c4126074b472f746a24399ad32f6053d1b34b68d2fa41e558e6f4a98719"},
337 | ]
338 | py4j = [
339 |     {file = "py4j-0.10.9.5-py2.py3-none-any.whl", hash = "sha256:52d171a6a2b031d8a5d1de6efe451cf4f5baff1a2819aabc3741c8406539ba04"},
340 |     {file = "py4j-0.10.9.5.tar.gz", hash = "sha256:276a4a3c5a2154df1860ef3303a927460e02e97b047dc0a47c1c3fb8cce34db6"},
341 | ]
342 | pyparsing = [
343 |     {file = "pyparsing-3.0.9-py3-none-any.whl", hash = "sha256:5026bae9a10eeaefb61dab2f09052b9f4307d44aee4eda64b309723d8d206bbc"},
344 |     {file = "pyparsing-3.0.9.tar.gz", hash = "sha256:2b020ecf7d21b687f219b71ecad3631f644a47f01403fa1d1036b0c6416d70fb"},
345 | ]
346 | pyspark = [
347 |     {file = "pyspark-3.3.0.tar.gz", hash = "sha256:7ebe8e9505647b4d124d5a82fca60dfd3891021cf8ad6c5ec88777eeece92cf7"},
348 | ]
349 | pytest = [
350 |     {file = "pytest-7.1.2-py3-none-any.whl", hash = "sha256:13d0e3ccfc2b6e26be000cb6568c832ba67ba32e719443bfe725814d3c42433c"},
351 |     {file = "pytest-7.1.2.tar.gz", hash = "sha256:a06a0425453864a270bc45e71f783330a7428defb4230fb5e6a731fde06ecd45"},
352 | ]
353 | pytest-spark = [
354 |     {file = "pytest-spark-0.6.0.tar.gz", hash = "sha256:06e3fbfa2e7fa69d2976c10037c9ee3549c80580228bde5b9aa602f44b711f17"},
355 |     {file = "pytest_spark-0.6.0-py3-none-any.whl", hash = "sha256:cabfbcfca6a4876c5e03b151ba9217f3888fe5142154c1e885dd7902afa85a89"},
356 | ]
357 | pytest-watch = [
358 |     {file = "pytest-watch-4.2.0.tar.gz", hash = "sha256:06136f03d5b361718b8d0d234042f7b2f203910d8568f63df2f866b547b3d4b9"},
359 | ]
360 | tomli = [
361 |     {file = "tomli-2.0.1-py3-none-any.whl", hash = "sha256:939de3e7a6161af0c887ef91b7d41a53e7c5a1ca976325f429cb46ea9bc30ecc"},
362 |     {file = "tomli-2.0.1.tar.gz", hash = "sha256:de526c12914f0c550d15924c62d72abc48d6fe7364aa87328337a31007fe8a4f"},
363 | ]
364 | typing-extensions = [
365 |     {file = "typing_extensions-4.3.0-py3-none-any.whl", hash = "sha256:25642c956049920a5aa49edcdd6ab1e06d7e5d467fc00e0506c44ac86fbfca02"},
366 |     {file = "typing_extensions-4.3.0.tar.gz", hash = "sha256:e6d2677a32f47fc7eb2795db1dd15c1f34eff616bcaf2cfb5e997f854fa1c4a6"},
367 | ]
368 | watchdog = [
369 |     {file = "watchdog-2.1.9-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:a735a990a1095f75ca4f36ea2ef2752c99e6ee997c46b0de507ba40a09bf7330"},
370 |     {file = "watchdog-2.1.9-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:6b17d302850c8d412784d9246cfe8d7e3af6bcd45f958abb2d08a6f8bedf695d"},
371 |     {file = "watchdog-2.1.9-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:ee3e38a6cc050a8830089f79cbec8a3878ec2fe5160cdb2dc8ccb6def8552658"},
372 |     {file = "watchdog-2.1.9-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:64a27aed691408a6abd83394b38503e8176f69031ca25d64131d8d640a307591"},
373 |     {file = "watchdog-2.1.9-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:195fc70c6e41237362ba720e9aaf394f8178bfc7fa68207f112d108edef1af33"},
374 |     {file = "watchdog-2.1.9-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:bfc4d351e6348d6ec51df007432e6fe80adb53fd41183716017026af03427846"},
375 |     {file = "watchdog-2.1.9-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:8250546a98388cbc00c3ee3cc5cf96799b5a595270dfcfa855491a64b86ef8c3"},
376 |     {file = "watchdog-2.1.9-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:117ffc6ec261639a0209a3252546b12800670d4bf5f84fbd355957a0595fe654"},
377 |     {file = "watchdog-2.1.9-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:97f9752208f5154e9e7b76acc8c4f5a58801b338de2af14e7e181ee3b28a5d39"},
378 |     {file = "watchdog-2.1.9-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:247dcf1df956daa24828bfea5a138d0e7a7c98b1a47cf1fa5b0c3c16241fcbb7"},
379 |     {file = "watchdog-2.1.9-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:226b3c6c468ce72051a4c15a4cc2ef317c32590d82ba0b330403cafd98a62cfd"},
380 |     {file = "watchdog-2.1.9-pp37-pypy37_pp73-macosx_10_9_x86_64.whl", hash = "sha256:d9820fe47c20c13e3c9dd544d3706a2a26c02b2b43c993b62fcd8011bcc0adb3"},
381 |     {file = "watchdog-2.1.9-pp38-pypy38_pp73-macosx_10_9_x86_64.whl", hash = "sha256:70af927aa1613ded6a68089a9262a009fbdf819f46d09c1a908d4b36e1ba2b2d"},
382 |     {file = "watchdog-2.1.9-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:ed80a1628cee19f5cfc6bb74e173f1b4189eb532e705e2a13e3250312a62e0c9"},
383 |     {file = "watchdog-2.1.9-py3-none-manylinux2014_aarch64.whl", hash = "sha256:9f05a5f7c12452f6a27203f76779ae3f46fa30f1dd833037ea8cbc2887c60213"},
384 |     {file = "watchdog-2.1.9-py3-none-manylinux2014_armv7l.whl", hash = "sha256:255bb5758f7e89b1a13c05a5bceccec2219f8995a3a4c4d6968fe1de6a3b2892"},
385 |     {file = "watchdog-2.1.9-py3-none-manylinux2014_i686.whl", hash = "sha256:d3dda00aca282b26194bdd0adec21e4c21e916956d972369359ba63ade616153"},
386 |     {file = "watchdog-2.1.9-py3-none-manylinux2014_ppc64.whl", hash = "sha256:186f6c55abc5e03872ae14c2f294a153ec7292f807af99f57611acc8caa75306"},
387 |     {file = "watchdog-2.1.9-py3-none-manylinux2014_ppc64le.whl", hash = "sha256:083171652584e1b8829581f965b9b7723ca5f9a2cd7e20271edf264cfd7c1412"},
388 |     {file = "watchdog-2.1.9-py3-none-manylinux2014_s390x.whl", hash = "sha256:b530ae007a5f5d50b7fbba96634c7ee21abec70dc3e7f0233339c81943848dc1"},
389 |     {file = "watchdog-2.1.9-py3-none-manylinux2014_x86_64.whl", hash = "sha256:4f4e1c4aa54fb86316a62a87b3378c025e228178d55481d30d857c6c438897d6"},
390 |     {file = "watchdog-2.1.9-py3-none-win32.whl", hash = "sha256:5952135968519e2447a01875a6f5fc8c03190b24d14ee52b0f4b1682259520b1"},
391 |     {file = "watchdog-2.1.9-py3-none-win_amd64.whl", hash = "sha256:7a833211f49143c3d336729b0020ffd1274078e94b0ae42e22f596999f50279c"},
392 |     {file = "watchdog-2.1.9-py3-none-win_ia64.whl", hash = "sha256:ad576a565260d8f99d97f2e64b0f97a48228317095908568a9d5c786c829d428"},
393 |     {file = "watchdog-2.1.9.tar.gz", hash = "sha256:43ce20ebb36a51f21fa376f76d1d4692452b2527ccd601950d69ed36b9e21609"},
394 | ]
395 | 


--------------------------------------------------------------------------------
/poetry.toml:
--------------------------------------------------------------------------------
1 | [virtualenvs]
2 | in-project = true
3 | 


--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
 1 | [tool.poetry]
 2 | name = "data-engineering"
 3 | version = "0.1.0"
 4 | description = ""
 5 | authors = ["TCM Labs"]
 6 | license = "MIT"
 7 | 
 8 | [tool.poetry.dependencies]
 9 | python = "^3.9"
10 | pyspark = "^3.3.0"
11 | 
12 | [tool.poetry.dev-dependencies]
13 | black = "^22.6.0"
14 | pytest = "^7.1.2"
15 | pytest-watch = "^4.2.0"
16 | pytest-spark = "^0.6.0"
17 | 
18 | [build-system]
19 | requires = ["poetry-core>=1.0.0"]
20 | build-backend = "poetry.core.masonry.api"
21 | 


--------------------------------------------------------------------------------
/pytest.ini:
--------------------------------------------------------------------------------
1 | [pytest]
2 | ; see: https://docs.pytest.org/en/latest/explanation/goodpractices.html
3 | pythonpath = "src"
4 | 


--------------------------------------------------------------------------------
/src/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__


--------------------------------------------------------------------------------
/src/cinema/core/domain/movie_type.py:
--------------------------------------------------------------------------------
 1 | from dataclasses import dataclass
 2 | from typing import Protocol
 3 | 
 4 | 
 5 | class WithName(Protocol):
 6 |     name: str
 7 | 
 8 | 
 9 | class WithBudget(Protocol):
10 |     budget: int
11 | 
12 | 
13 | @dataclass
14 | class Movie(WithName, WithBudget):
15 |     name: str
16 |     budget: int
17 | 


--------------------------------------------------------------------------------
/src/cinema/core/domain/services/movie_domain_service.py:
--------------------------------------------------------------------------------
 1 | from __future__ import annotations
 2 | 
 3 | from typing import Generic, List, TypeVar
 4 | 
 5 | from pyspark.sql.dataframe import DataFrame
 6 | from pyspark.sql.functions import col, avg
 7 | 
 8 | from cinema.core.domain.movie_type import Movie, WithBudget, WithName
 9 | 
10 | T = TypeVar("T", covariant=True)
11 | 
12 | 
13 | class MovieDomainService(Generic[T]):
14 |     df: DataFrame
15 | 
16 |     def __init__(self, df: DataFrame):
17 |         self.df = df
18 | 
19 |     # Where steps
20 |     def most_expensive(
21 |         self: MovieDomainService[WithBudget], count: int
22 |     ) -> MovieDomainService[T]:
23 |         return MovieDomainService(self.df.orderBy(col("budget").desc()).limit(count))
24 | 
25 |     # Select
26 |     def names(self: MovieDomainService[WithName]) -> MovieDomainService[WithName]:
27 |         return MovieDomainService(self.df.select("name"))
28 | 
29 |     def budgets(self: MovieDomainService[WithBudget]) -> MovieDomainService[WithBudget]:
30 |         return MovieDomainService(self.df.select("budget"))
31 | 
32 |     # Aggregations
33 |     def average_budget(self: MovieDomainService[WithBudget]) -> float:
34 |         return self.df.agg(avg(col("budget"))).collect()[0]["avg(budget)"]
35 | 
36 |     def count(self: MovieDomainService[T]) -> int:
37 |         return self.df.count()
38 | 
39 |     # Terminal steps
40 |     def dangerously_convert_to_list(self: MovieDomainService[Movie]) -> List[Movie]:
41 |         # Note: Spark being a big data tool, we're not supposed to convert
42 |         # a DataFrame to a list.
43 |         # Such conversion is most likely an anti-pattern unless you're
44 |         # absolutely sure that the Spark queries return a very small number of
45 |         # elements
46 |         movies = [Movie(**row.asDict()) for row in self.df.collect()]
47 | 
48 |         return movies
49 | 


--------------------------------------------------------------------------------
/src/cinema/core/ports/primary/average_movie_budget_command.py:
--------------------------------------------------------------------------------
1 | from dataclasses import dataclass
2 | 
3 | 
4 | @dataclass
5 | class AverageMovieBudgetCommand:  # TODO: make this generic
6 |     pass
7 | 


--------------------------------------------------------------------------------
/src/cinema/core/ports/primary/count_movie_command.py:
--------------------------------------------------------------------------------
1 | from dataclasses import dataclass
2 | 
3 | 
4 | @dataclass
5 | class CountMovieCommand:  # TODO: make this generic
6 |     pass
7 | 


--------------------------------------------------------------------------------
/src/cinema/core/ports/primary/most_expensive_movie_command.py:
--------------------------------------------------------------------------------
1 | from dataclasses import dataclass
2 | 
3 | 
4 | @dataclass
5 | class MostExpensiveMoviesCommand:  # TODO: make this generic
6 |     count: int
7 | 


--------------------------------------------------------------------------------
/src/cinema/core/ports/primary/use_cases.py:
--------------------------------------------------------------------------------
 1 | from abc import abstractmethod
 2 | from typing import Generic, Protocol, TypeVar
 3 | 
 4 | 
 5 | C = TypeVar("C", contravariant=True)  # Command
 6 | R = TypeVar("R", covariant=True)  # Response
 7 | 
 8 | 
 9 | class UseCase(Generic[C, R], Protocol):
10 |     @abstractmethod
11 |     def run(self, command: C) -> R:
12 |         pass
13 | 


--------------------------------------------------------------------------------
/src/cinema/core/ports/secondary/movie_repository.py:
--------------------------------------------------------------------------------
 1 | from abc import abstractmethod
 2 | from typing import Protocol
 3 | 
 4 | from pyspark.sql.dataframe import DataFrame
 5 | 
 6 | 
 7 | class MovieRepository(Protocol):
 8 |     @abstractmethod
 9 |     def find_all_movies(self) -> DataFrame:
10 |         raise NotImplementedError()
11 | 


--------------------------------------------------------------------------------
/src/cinema/core/use_cases/average_budget_use_case.py:
--------------------------------------------------------------------------------
 1 | from cinema.core.domain.movie_type import Movie
 2 | from cinema.core.domain.services.movie_domain_service import MovieDomainService
 3 | from cinema.core.ports.primary.average_movie_budget_command import (
 4 |     AverageMovieBudgetCommand,
 5 | )
 6 | from cinema.core.ports.primary.use_cases import UseCase
 7 | from cinema.core.ports.secondary.movie_repository import MovieRepository
 8 | 
 9 | 
10 | class AverageMovieBudgetUseCase(UseCase[AverageMovieBudgetCommand, float]):
11 |     # https://stackoverflow.com/questions/72141966/infer-type-from-subclass-method-return-type
12 | 
13 |     def __init__(self, movie_repository: MovieRepository) -> None:
14 |         self._movie_repository = movie_repository
15 | 
16 |     def run(self, command: AverageMovieBudgetCommand):
17 |         # TODO: find out why we need to type 'command' argument again
18 |         movies = MovieDomainService[Movie](self._movie_repository.find_all_movies())
19 | 
20 |         average_movie_budget = movies.average_budget()
21 | 
22 |         return average_movie_budget
23 | 


--------------------------------------------------------------------------------
/src/cinema/core/use_cases/count_movie_use_case.py:
--------------------------------------------------------------------------------
 1 | from cinema.core.domain.movie_type import Movie
 2 | from cinema.core.domain.services.movie_domain_service import MovieDomainService
 3 | from cinema.core.ports.primary.count_movie_command import CountMovieCommand
 4 | from cinema.core.ports.primary.use_cases import UseCase
 5 | from cinema.core.ports.secondary.movie_repository import MovieRepository
 6 | 
 7 | 
 8 | class CountMovieUseCase(UseCase[CountMovieCommand, int]):
 9 |     # https://stackoverflow.com/questions/72141966/infer-type-from-subclass-method-return-type
10 | 
11 |     def __init__(self, movie_repository: MovieRepository) -> None:
12 |         self._movie_repository = movie_repository
13 | 
14 |     def run(self, command: CountMovieCommand):
15 |         # TODO: find out why we need to type 'command' argument again
16 |         movies = MovieDomainService[Movie](self._movie_repository.find_all_movies())
17 | 
18 |         average_movie_budget = movies.count()
19 | 
20 |         return average_movie_budget
21 | 


--------------------------------------------------------------------------------
/src/cinema/core/use_cases/most_expensive_movies.py:
--------------------------------------------------------------------------------
 1 | from typing import List
 2 | 
 3 | from cinema.core.domain.movie_type import Movie
 4 | from cinema.core.domain.services.movie_domain_service import MovieDomainService
 5 | from cinema.core.ports.primary.most_expensive_movie_command import (
 6 |     MostExpensiveMoviesCommand,
 7 | )
 8 | from cinema.core.ports.primary.use_cases import UseCase
 9 | from cinema.core.ports.secondary.movie_repository import MovieRepository
10 | 
11 | 
12 | class MostExpensiveMoviesUseCase(UseCase[MostExpensiveMoviesCommand, List[Movie]]):
13 |     # https://stackoverflow.com/questions/72141966/infer-type-from-subclass-method-return-type
14 | 
15 |     def __init__(self, movie_repository: MovieRepository) -> None:
16 |         self._movie_repository = movie_repository
17 | 
18 |     def run(self, command: MostExpensiveMoviesCommand):
19 |         # TODO: find out why we need to type 'command' argument again
20 |         movies = MovieDomainService[Movie](self._movie_repository.find_all_movies())
21 | 
22 |         most_expensive_movies = movies.most_expensive(
23 |             count=command.count
24 |         ).dangerously_convert_to_list()
25 | 
26 |         return most_expensive_movies
27 | 


--------------------------------------------------------------------------------
/src/cinema/primary_adapters/cli/main.py:
--------------------------------------------------------------------------------
 1 | from pyspark.sql import SparkSession
 2 | from cinema.core.ports.primary.most_expensive_movie_command import (
 3 |     MostExpensiveMoviesCommand,
 4 | )
 5 | 
 6 | 
 7 | from cinema.secondary_adapters.repositories.movie.in_memory_movie_repository import (
 8 |     InMemoryMovieRepository,
 9 | )
10 | 
11 | from cinema.core.use_cases.most_expensive_movies import MostExpensiveMoviesUseCase
12 | 
13 | 
14 | def run_movie_application():
15 |     # Wire-up dependencies
16 |     spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()
17 | 
18 |     movie_repository = InMemoryMovieRepository(spark)
19 |     use_case = MostExpensiveMoviesUseCase(movie_repository)
20 | 
21 |     # Run use case
22 |     top_two_most_expensive_movies = use_case.run(MostExpensiveMoviesCommand(count=2))
23 | 
24 |     return top_two_most_expensive_movies
25 | 


--------------------------------------------------------------------------------
/src/cinema/primary_adapters/tests/average_movie_budget_use_case_test.py:
--------------------------------------------------------------------------------
 1 | from pyspark.sql import SparkSession
 2 | from cinema.core.ports.primary.average_movie_budget_command import (
 3 |     AverageMovieBudgetCommand,
 4 | )
 5 | from cinema.core.use_cases.average_budget_use_case import AverageMovieBudgetUseCase
 6 | from cinema.secondary_adapters.repositories.movie.in_memory_movie_repository import (
 7 |     InMemoryMovieRepository,
 8 | )
 9 | 
10 | 
11 | class TestAverageMovieBudgetUseCase:
12 |     def test_average_movie_budget(self, spark_session: SparkSession):
13 |         movie_repository = InMemoryMovieRepository(spark_session)
14 |         average_movie_budget_use_case = AverageMovieBudgetUseCase(movie_repository)
15 | 
16 |         average_movie_budget_command = AverageMovieBudgetCommand()
17 | 
18 |         average_budget = average_movie_budget_use_case.run(average_movie_budget_command)
19 | 
20 |         assert average_budget == 135_000_000
21 | 


--------------------------------------------------------------------------------
/src/cinema/primary_adapters/tests/most_expensive_movies_use_case_test.py:
--------------------------------------------------------------------------------
 1 | from cinema.core.domain.movie_type import Movie
 2 | from cinema.core.ports.primary.most_expensive_movie_command import (
 3 |     MostExpensiveMoviesCommand,
 4 | )
 5 | from cinema.core.use_cases.most_expensive_movies import MostExpensiveMoviesUseCase
 6 | from cinema.secondary_adapters.repositories.movie.kaggle_movie_repository import (
 7 |     KaggleFileSystemMovieRepository,
 8 | )
 9 | from cinema.secondary_adapters.repositories.movie.in_memory_movie_repository import (
10 |     InMemoryMovieRepository,
11 | )
12 | from pyspark.sql import SparkSession
13 | 
14 | 
15 | class TestMostExpensiveMoviesUseCase:
16 |     def test_most_expensive_movies(self, spark_session: SparkSession):
17 |         movie_repository = InMemoryMovieRepository(spark_session)
18 |         most_expensive_movies_use_case = MostExpensiveMoviesUseCase(movie_repository)
19 | 
20 |         most_expensive_movie_ever_command = MostExpensiveMoviesCommand(count=1)
21 | 
22 |         movie = most_expensive_movies_use_case.run(most_expensive_movie_ever_command)
23 | 
24 |         assert movie == [
25 |             Movie(
26 |                 name="Strange Universe",
27 |                 budget=205_000_000,
28 |             )
29 |         ]
30 | 
31 |     def test_most_expensive_movies_kaggle_dataset(self, spark_session: SparkSession):
32 |         movie_repository = KaggleFileSystemMovieRepository(spark_session)
33 |         most_expensive_movies_use_case = MostExpensiveMoviesUseCase(movie_repository)
34 | 
35 |         most_expensive_movie_ever_command = MostExpensiveMoviesCommand(count=1)
36 | 
37 |         [movie] = most_expensive_movies_use_case.run(most_expensive_movie_ever_command)
38 | 
39 |         assert movie.name == "Pirates of the Caribbean: On Stranger Tides"
40 |         assert movie.budget == 380_000_000
41 | 


--------------------------------------------------------------------------------
/src/cinema/primary_adapters/tests/movie_count_use_case_test.py:
--------------------------------------------------------------------------------
 1 | from pyspark.sql import SparkSession
 2 | from cinema.core.ports.primary.count_movie_command import CountMovieCommand
 3 | 
 4 | from cinema.core.use_cases.count_movie_use_case import CountMovieUseCase
 5 | from cinema.secondary_adapters.repositories.movie.in_memory_movie_repository import (
 6 |     InMemoryMovieRepository,
 7 | )
 8 | 
 9 | 
10 | class TestMovieCountUseCase:
11 |     def test_total_number_of_movies(self, spark_session: SparkSession):
12 |         movie_repository = InMemoryMovieRepository(spark_session)
13 |         count_movie_use_case = CountMovieUseCase(movie_repository)
14 | 
15 |         average_movie_budget_command = CountMovieCommand()
16 | 
17 |         total_number_of_movies = count_movie_use_case.run(average_movie_budget_command)
18 | 
19 |         assert total_number_of_movies == 3
20 | 


--------------------------------------------------------------------------------
/src/cinema/secondary_adapters/repositories/movie/in_memory_movie_repository.py:
--------------------------------------------------------------------------------
 1 | from cinema.core.ports.secondary.movie_repository import MovieRepository
 2 | from pyspark.sql import Row, SparkSession
 3 | 
 4 | 
 5 | class InMemoryMovieRepository(MovieRepository):
 6 |     spark: SparkSession
 7 | 
 8 |     def __init__(self, spark: SparkSession) -> None:
 9 |         super().__init__()
10 |         self._spark = spark
11 | 
12 |     def find_all_movies(self):
13 |         movies = self._spark.createDataFrame(
14 |             [
15 |                 Row(
16 |                     name="Greatest Heroes I",
17 |                     budget=120_000_000,
18 |                 ),
19 |                 Row(
20 |                     name="From the future, to the past",
21 |                     budget=80_000_000,
22 |                 ),
23 |                 Row(
24 |                     name="Strange Universe",
25 |                     budget=205_000_000,
26 |                 ),
27 |             ]
28 |         )
29 | 
30 |         return movies
31 | 


--------------------------------------------------------------------------------
/src/cinema/secondary_adapters/repositories/movie/kaggle_movie_repository.py:
--------------------------------------------------------------------------------
 1 | from typing import final
 2 | 
 3 | from cinema.core.ports.secondary.movie_repository import MovieRepository
 4 | from pyspark.sql import SparkSession
 5 | from pyspark.sql.types import DateType, FloatType, IntegerType, StringType, StructType
 6 | 
 7 | movies_metadata_raw_csv_schema = (
 8 |     StructType()
 9 |     .add("adult", StringType(), nullable=False)
10 |     .add("belongs_to_collection", StringType(), nullable=False)
11 |     .add("budget", IntegerType(), nullable=False)
12 |     .add("genres", StringType(), nullable=False)
13 |     .add("homepage", StringType(), nullable=False)
14 |     .add("id", StringType(), nullable=False)
15 |     .add("imdb_id", StringType(), nullable=False)
16 |     .add("original_language", StringType(), nullable=False)
17 |     .add("original_title", StringType(), nullable=False)
18 |     .add("overview", StringType(), nullable=False)
19 |     .add("popularity", FloatType(), nullable=False)
20 |     .add("poster_path", StringType(), nullable=False)
21 |     .add("production_companies", StringType(), nullable=False)
22 |     .add("production_countries", StringType(), nullable=False)
23 |     .add("release_date", DateType(), nullable=False)
24 |     .add("revenue", IntegerType(), nullable=False)
25 |     .add("runtime", FloatType(), nullable=False)
26 |     .add("spoken_languages", StringType(), nullable=False)
27 |     .add("status", StringType(), nullable=False)
28 |     .add("tagline", StringType(), nullable=False)
29 |     .add("title", StringType(), nullable=False)
30 |     .add("video", StringType(), nullable=False)
31 |     .add("vote_average", FloatType(), nullable=False)
32 |     .add("vote_count", IntegerType(), nullable=False)
33 | )
34 | 
35 | 
36 | @final
37 | class KaggleFileSystemMovieRepository(MovieRepository):
38 |     spark: SparkSession
39 | 
40 |     def __init__(self, spark: SparkSession) -> None:
41 |         self._spark = spark
42 | 
43 |     def find_all_movies(self):
44 |         movies = (
45 |             self._spark.read.csv(
46 |                 "data/kaggle_movies_dataset/movies_metadata.csv",
47 |                 # See: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
48 |                 schema=movies_metadata_raw_csv_schema,
49 |                 header=True,
50 |                 inferSchema=False,
51 |                 mode="PERMISSIVE",  # stricter: FAILFAST
52 |                 sep=",",
53 |             )
54 |             .withColumnRenamed("original_title", "name")
55 |             .select("name", "budget")
56 |         )
57 | 
58 |         return movies
59 | 


--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
1 | from cinema.primary_adapters.cli.main import run_movie_application
2 | 
3 | if __name__ == "__main__":
4 |     result = run_movie_application()
5 |     print(result)
6 | 


--------------------------------------------------------------------------------