├── .github └── ISSUE_TEMPLATE │ ├── exercise.md │ └── updates-post-testing.md ├── .gitignore ├── CODE_OF_CONDUCT.md ├── LICENSE ├── README.md ├── assets └── 42ai_logo.png ├── module00.pdf ├── module00 ├── assets │ ├── client_server.png │ └── tables.png ├── ex00 │ └── ex00.md ├── ex01 │ └── ex01.md ├── ex02 │ └── ex02.md ├── ex03 │ ├── ex03.md │ └── psycopg2_basics.md ├── ex04 │ └── ex04.md ├── ex05 │ └── ex05.md ├── ex06 │ └── ex06.md ├── ex07 │ └── ex07.md ├── ex08 │ └── ex08.md ├── ex09 │ └── ex09.md ├── ex10 │ └── ex10.md ├── ex11 │ └── ex11.md ├── ex12 │ └── ex12.md ├── ex13 │ └── ex13.md ├── ex14 │ └── ex14.md ├── module00.md └── resources │ ├── Pipfile │ ├── db │ ├── Dockerfile │ ├── init.sql │ └── pg_hba.conf │ ├── docker-compose.yml │ ├── docker_install.sh │ └── psycopg2_documentation.pdf ├── module01.pdf ├── module01 ├── assets │ └── dashboard.png ├── ex00 │ └── ex00.md ├── ex01 │ └── ex01.md ├── ex02 │ └── ex02.md ├── ex03 │ └── ex03.md ├── ex04 │ └── ex04.md ├── ex05 │ └── ex05.md ├── ex06 │ └── ex06.md ├── ex07 │ └── ex07.md ├── ex07bis │ └── ex07bis.md ├── ex08 │ └── ex08.md ├── ex09 │ └── ex09.md ├── ex10 │ └── ex10.md ├── module01.md └── resources │ └── ingest-pipeline.conf ├── module02.pdf ├── module02 ├── assets │ ├── access_key.png │ ├── aws_regions.png │ ├── terraform_1.png │ ├── terraform_2.png │ ├── terraform_3.png │ ├── terraform_4.png │ ├── terraform_5.png │ └── terraform_6.png ├── ex00 │ └── ex00.md ├── ex01 │ └── ex01.md ├── ex02 │ └── ex02.md ├── ex03 │ └── ex03.md ├── ex04 │ └── ex04.md ├── ex05 │ └── ex05.md ├── ex06 │ └── ex06.md ├── ex07 │ └── ex07.md ├── ex08 │ └── ex08.md ├── ex09 │ └── ex09.md ├── ex10 │ └── ex10.md ├── ex11 │ └── ex11.md ├── ex12 │ └── ex12.md └── module02.md └── resources └── appstore_games.csv.zip /.github/ISSUE_TEMPLATE/exercise.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Exercise 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: fixme 6 | assignees: '' 7 | 8 | --- 9 | 10 | * Day: xx 11 | * Exercise: xx 12 | 13 | A clear and concise description of what the problem/misunderstanding is. 14 | 15 | **Examples** 16 | If applicable, add examples to help explain your problem. 17 | 18 | ```python 19 | print("Code example") 20 | ``` 21 | 22 | **Screenshots** 23 | If applicable, add screenshots to help explain your problem. 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/updates-post-testing.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Updates post-testing 3 | about: Updates post-testing for a whole day 4 | title: '' 5 | labels: fixme 6 | assignees: '' 7 | 8 | --- 9 | 10 | ## Global notes: 11 | 12 | - [ ] 1. 13 | - [ ] 2. 14 | 15 | ### ex00: 16 | - [ ] 1. 17 | - [ ] 2. 18 | 19 | ### ex01: 20 | - [ ] 1. 21 | - [ ] 2. 22 | 23 | ### ex02: 24 | - [ ] 1. 25 | - [ ] 2. 26 | 27 | ### ex03: 28 | - [ ] 1. 29 | - [ ] 2. 30 | 31 | ### ex04: 32 | - [ ] 1. 33 | - [ ] 2. 34 | 35 | ### ex05: 36 | - [ ] 1. 37 | - [ ] 2. 38 | 39 | ### ex06: 40 | - [ ] 1. 41 | - [ ] 2. 
42 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | 106 | # MACOS stuff 107 | *.DS_STORE 108 | 109 | # TMP files 110 | *.swp -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as 6 | contributors and maintainers pledge to making participation in our project and 7 | our community a harassment-free experience for everyone, regardless of age, body 8 | size, disability, ethnicity, sex characteristics, gender identity and expression, 9 | level of experience, education, socio-economic status, nationality, personal 10 | appearance, race, religion, or sexual identity and orientation. 
11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or 41 | reject comments, commits, code, wiki edits, issues, and other contributions 42 | that are not aligned to this Code of Conduct, or to ban temporarily or 43 | permanently any contributor for other behaviors that they deem inappropriate, 44 | threatening, offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies both within project spaces and in public spaces 49 | when an individual is representing the project or its community. Examples of 50 | representing a project or community include using an official project e-mail 51 | address, posting via an official social media account, or acting as an appointed 52 | representative at an online or offline event. Representation of a project may be 53 | further defined and clarified by project maintainers. 54 | 55 | ## Enforcement 56 | 57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 58 | reported by contacting the project team at contact@42ai.fr. All 59 | complaints will be reviewed and investigated and will result in a response that 60 | is deemed necessary and appropriate to the circumstances. The project team is 61 | obligated to maintain confidentiality with regard to the reporter of an incident. 62 | Further details of specific enforcement policies may be posted separately. 63 | 64 | Project maintainers who do not follow or enforce the Code of Conduct in good 65 | faith may face temporary or permanent repercussions as determined by other 66 | members of the project's leadership. 
67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 72 | 73 | [homepage]: https://www.contributor-covenant.org 74 | 75 | For answers to common questions about this code of conduct, see 76 | https://www.contributor-covenant.org/faq 77 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial-ShareAlike 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. 
More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International 58 | Public License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial-ShareAlike 4.0 International Public License 63 | ("Public License"). To the extent this Public License may be 64 | interpreted as a contract, You are granted the Licensed Rights in 65 | consideration of Your acceptance of these terms and conditions, and the 66 | Licensor grants You such rights in consideration of benefits the 67 | Licensor receives from making the Licensed Material available under 68 | these terms and conditions. 69 | 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Adapter's License means the license You apply to Your Copyright 84 | and Similar Rights in Your contributions to Adapted Material in 85 | accordance with the terms and conditions of this Public License. 86 | 87 | c. BY-NC-SA Compatible License means a license listed at 88 | creativecommons.org/compatiblelicenses, approved by Creative 89 | Commons as essentially the equivalent of this Public License. 90 | 91 | d. Copyright and Similar Rights means copyright and/or similar rights 92 | closely related to copyright including, without limitation, 93 | performance, broadcast, sound recording, and Sui Generis Database 94 | Rights, without regard to how the rights are labeled or 95 | categorized. For purposes of this Public License, the rights 96 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 97 | Rights. 98 | 99 | e. Effective Technological Measures means those measures that, in the 100 | absence of proper authority, may not be circumvented under laws 101 | fulfilling obligations under Article 11 of the WIPO Copyright 102 | Treaty adopted on December 20, 1996, and/or similar international 103 | agreements. 104 | 105 | f. Exceptions and Limitations means fair use, fair dealing, and/or 106 | any other exception or limitation to Copyright and Similar Rights 107 | that applies to Your use of the Licensed Material. 108 | 109 | g. License Elements means the license attributes listed in the name 110 | of a Creative Commons Public License. The License Elements of this 111 | Public License are Attribution, NonCommercial, and ShareAlike. 112 | 113 | h. Licensed Material means the artistic or literary work, database, 114 | or other material to which the Licensor applied this Public 115 | License. 116 | 117 | i. 
Licensed Rights means the rights granted to You subject to the 118 | terms and conditions of this Public License, which are limited to 119 | all Copyright and Similar Rights that apply to Your use of the 120 | Licensed Material and that the Licensor has authority to license. 121 | 122 | j. Licensor means the individual(s) or entity(ies) granting rights 123 | under this Public License. 124 | 125 | k. NonCommercial means not primarily intended for or directed towards 126 | commercial advantage or monetary compensation. For purposes of 127 | this Public License, the exchange of the Licensed Material for 128 | other material subject to Copyright and Similar Rights by digital 129 | file-sharing or similar means is NonCommercial provided there is 130 | no payment of monetary compensation in connection with the 131 | exchange. 132 | 133 | l. Share means to provide material to the public by any means or 134 | process that requires permission under the Licensed Rights, such 135 | as reproduction, public display, public performance, distribution, 136 | dissemination, communication, or importation, and to make material 137 | available to the public including in ways that members of the 138 | public may access the material from a place and at a time 139 | individually chosen by them. 140 | 141 | m. Sui Generis Database Rights means rights other than copyright 142 | resulting from Directive 96/9/EC of the European Parliament and of 143 | the Council of 11 March 1996 on the legal protection of databases, 144 | as amended and/or succeeded, as well as other essentially 145 | equivalent rights anywhere in the world. 146 | 147 | n. You means the individual or entity exercising the Licensed Rights 148 | under this Public License. Your has a corresponding meaning. 149 | 150 | 151 | Section 2 -- Scope. 152 | 153 | a. License grant. 154 | 155 | 1. Subject to the terms and conditions of this Public License, 156 | the Licensor hereby grants You a worldwide, royalty-free, 157 | non-sublicensable, non-exclusive, irrevocable license to 158 | exercise the Licensed Rights in the Licensed Material to: 159 | 160 | a. reproduce and Share the Licensed Material, in whole or 161 | in part, for NonCommercial purposes only; and 162 | 163 | b. produce, reproduce, and Share Adapted Material for 164 | NonCommercial purposes only. 165 | 166 | 2. Exceptions and Limitations. For the avoidance of doubt, where 167 | Exceptions and Limitations apply to Your use, this Public 168 | License does not apply, and You do not need to comply with 169 | its terms and conditions. 170 | 171 | 3. Term. The term of this Public License is specified in Section 172 | 6(a). 173 | 174 | 4. Media and formats; technical modifications allowed. The 175 | Licensor authorizes You to exercise the Licensed Rights in 176 | all media and formats whether now known or hereafter created, 177 | and to make technical modifications necessary to do so. The 178 | Licensor waives and/or agrees not to assert any right or 179 | authority to forbid You from making technical modifications 180 | necessary to exercise the Licensed Rights, including 181 | technical modifications necessary to circumvent Effective 182 | Technological Measures. For purposes of this Public License, 183 | simply making modifications authorized by this Section 2(a) 184 | (4) never produces Adapted Material. 185 | 186 | 5. Downstream recipients. 187 | 188 | a. Offer from the Licensor -- Licensed Material. 
Every 189 | recipient of the Licensed Material automatically 190 | receives an offer from the Licensor to exercise the 191 | Licensed Rights under the terms and conditions of this 192 | Public License. 193 | 194 | b. Additional offer from the Licensor -- Adapted Material. 195 | Every recipient of Adapted Material from You 196 | automatically receives an offer from the Licensor to 197 | exercise the Licensed Rights in the Adapted Material 198 | under the conditions of the Adapter's License You apply. 199 | 200 | c. No downstream restrictions. You may not offer or impose 201 | any additional or different terms or conditions on, or 202 | apply any Effective Technological Measures to, the 203 | Licensed Material if doing so restricts exercise of the 204 | Licensed Rights by any recipient of the Licensed 205 | Material. 206 | 207 | 6. No endorsement. Nothing in this Public License constitutes or 208 | may be construed as permission to assert or imply that You 209 | are, or that Your use of the Licensed Material is, connected 210 | with, or sponsored, endorsed, or granted official status by, 211 | the Licensor or others designated to receive attribution as 212 | provided in Section 3(a)(1)(A)(i). 213 | 214 | b. Other rights. 215 | 216 | 1. Moral rights, such as the right of integrity, are not 217 | licensed under this Public License, nor are publicity, 218 | privacy, and/or other similar personality rights; however, to 219 | the extent possible, the Licensor waives and/or agrees not to 220 | assert any such rights held by the Licensor to the limited 221 | extent necessary to allow You to exercise the Licensed 222 | Rights, but not otherwise. 223 | 224 | 2. Patent and trademark rights are not licensed under this 225 | Public License. 226 | 227 | 3. To the extent possible, the Licensor waives any right to 228 | collect royalties from You for the exercise of the Licensed 229 | Rights, whether directly or through a collecting society 230 | under any voluntary or waivable statutory or compulsory 231 | licensing scheme. In all other cases the Licensor expressly 232 | reserves any right to collect such royalties, including when 233 | the Licensed Material is used other than for NonCommercial 234 | purposes. 235 | 236 | 237 | Section 3 -- License Conditions. 238 | 239 | Your exercise of the Licensed Rights is expressly made subject to the 240 | following conditions. 241 | 242 | a. Attribution. 243 | 244 | 1. If You Share the Licensed Material (including in modified 245 | form), You must: 246 | 247 | a. retain the following if it is supplied by the Licensor 248 | with the Licensed Material: 249 | 250 | i. identification of the creator(s) of the Licensed 251 | Material and any others designated to receive 252 | attribution, in any reasonable manner requested by 253 | the Licensor (including by pseudonym if 254 | designated); 255 | 256 | ii. a copyright notice; 257 | 258 | iii. a notice that refers to this Public License; 259 | 260 | iv. a notice that refers to the disclaimer of 261 | warranties; 262 | 263 | v. a URI or hyperlink to the Licensed Material to the 264 | extent reasonably practicable; 265 | 266 | b. indicate if You modified the Licensed Material and 267 | retain an indication of any previous modifications; and 268 | 269 | c. indicate the Licensed Material is licensed under this 270 | Public License, and include the text of, or the URI or 271 | hyperlink to, this Public License. 272 | 273 | 2. 
You may satisfy the conditions in Section 3(a)(1) in any 274 | reasonable manner based on the medium, means, and context in 275 | which You Share the Licensed Material. For example, it may be 276 | reasonable to satisfy the conditions by providing a URI or 277 | hyperlink to a resource that includes the required 278 | information. 279 | 3. If requested by the Licensor, You must remove any of the 280 | information required by Section 3(a)(1)(A) to the extent 281 | reasonably practicable. 282 | 283 | b. ShareAlike. 284 | 285 | In addition to the conditions in Section 3(a), if You Share 286 | Adapted Material You produce, the following conditions also apply. 287 | 288 | 1. The Adapter's License You apply must be a Creative Commons 289 | license with the same License Elements, this version or 290 | later, or a BY-NC-SA Compatible License. 291 | 292 | 2. You must include the text of, or the URI or hyperlink to, the 293 | Adapter's License You apply. You may satisfy this condition 294 | in any reasonable manner based on the medium, means, and 295 | context in which You Share Adapted Material. 296 | 297 | 3. You may not offer or impose any additional or different terms 298 | or conditions on, or apply any Effective Technological 299 | Measures to, Adapted Material that restrict exercise of the 300 | rights granted under the Adapter's License You apply. 301 | 302 | 303 | Section 4 -- Sui Generis Database Rights. 304 | 305 | Where the Licensed Rights include Sui Generis Database Rights that 306 | apply to Your use of the Licensed Material: 307 | 308 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 309 | to extract, reuse, reproduce, and Share all or a substantial 310 | portion of the contents of the database for NonCommercial purposes 311 | only; 312 | 313 | b. if You include all or a substantial portion of the database 314 | contents in a database in which You have Sui Generis Database 315 | Rights, then the database in which You have Sui Generis Database 316 | Rights (but not its individual contents) is Adapted Material, 317 | including for purposes of Section 3(b); and 318 | 319 | c. You must comply with the conditions in Section 3(a) if You Share 320 | all or a substantial portion of the contents of the database. 321 | 322 | For the avoidance of doubt, this Section 4 supplements and does not 323 | replace Your obligations under this Public License where the Licensed 324 | Rights include other Copyright and Similar Rights. 325 | 326 | 327 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 328 | 329 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 330 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 331 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 332 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 333 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 334 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 335 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 336 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 337 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 338 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 339 | 340 | b. 
TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 341 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 342 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 343 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 344 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 345 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 346 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 347 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 348 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 349 | 350 | c. The disclaimer of warranties and limitation of liability provided 351 | above shall be interpreted in a manner that, to the extent 352 | possible, most closely approximates an absolute disclaimer and 353 | waiver of all liability. 354 | 355 | 356 | Section 6 -- Term and Termination. 357 | 358 | a. This Public License applies for the term of the Copyright and 359 | Similar Rights licensed here. However, if You fail to comply with 360 | this Public License, then Your rights under this Public License 361 | terminate automatically. 362 | 363 | b. Where Your right to use the Licensed Material has terminated under 364 | Section 6(a), it reinstates: 365 | 366 | 1. automatically as of the date the violation is cured, provided 367 | it is cured within 30 days of Your discovery of the 368 | violation; or 369 | 370 | 2. upon express reinstatement by the Licensor. 371 | 372 | For the avoidance of doubt, this Section 6(b) does not affect any 373 | right the Licensor may have to seek remedies for Your violations 374 | of this Public License. 375 | 376 | c. For the avoidance of doubt, the Licensor may also offer the 377 | Licensed Material under separate terms or conditions or stop 378 | distributing the Licensed Material at any time; however, doing so 379 | will not terminate this Public License. 380 | 381 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 382 | License. 383 | 384 | 385 | Section 7 -- Other Terms and Conditions. 386 | 387 | a. The Licensor shall not be bound by any additional or different 388 | terms or conditions communicated by You unless expressly agreed. 389 | 390 | b. Any arrangements, understandings, or agreements regarding the 391 | Licensed Material not stated herein are separate from and 392 | independent of the terms and conditions of this Public License. 393 | 394 | 395 | Section 8 -- Interpretation. 396 | 397 | a. For the avoidance of doubt, this Public License does not, and 398 | shall not be interpreted to, reduce, limit, restrict, or impose 399 | conditions on any use of the Licensed Material that could lawfully 400 | be made without permission under this Public License. 401 | 402 | b. To the extent possible, if any provision of this Public License is 403 | deemed unenforceable, it shall be automatically reformed to the 404 | minimum extent necessary to make it enforceable. If the provision 405 | cannot be reformed, it shall be severed from this Public License 406 | without affecting the enforceability of the remaining terms and 407 | conditions. 408 | 409 | c. No term or condition of this Public License will be waived and no 410 | failure to comply consented to unless expressly agreed to by the 411 | Licensor. 412 | 413 | d. 
Nothing in this Public License constitutes or may be interpreted 414 | as a limitation upon, or waiver of, any privileges and immunities 415 | that apply to the Licensor or You, including from the legal 416 | processes of any jurisdiction or authority. 417 | 418 | ======================================================================= 419 | 420 | Creative Commons is not a party to its public 421 | licenses. Notwithstanding, Creative Commons may elect to apply one of 422 | its public licenses to material it publishes and in those instances 423 | will be considered the “Licensor.” The text of the Creative Commons 424 | public licenses is dedicated to the public domain under the CC0 Public 425 | Domain Dedication. Except for the limited purpose of indicating that 426 | material is shared under a Creative Commons public license or as 427 | otherwise permitted by the Creative Commons policies published at 428 | creativecommons.org/policies, Creative Commons does not authorize the 429 | use of the trademark "Creative Commons" or any other trademark or logo 430 | of Creative Commons without its prior written consent including, 431 | without limitation, in connection with any unauthorized modifications 432 | to any of its public licenses or any other arrangements, 433 | understandings, or agreements concerning use of licensed material. For 434 | the avoidance of doubt, this paragraph does not form part of the 435 | public licenses. 436 | 437 | Creative Commons may be contacted at creativecommons.org. 438 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

2 | ![42 AI Logo](assets/42ai_logo.png)
3 | 
4 | 
5 | 
6 | # Bootcamp Data Engineering
7 | 
8 | 
9 | One week to learn Data Engineering :rocket:
10 | 
11 | 
12 | 
13 | 
14 | ### Table of Contents
15 | 
16 | - [Curriculum](#curriculum)
17 | - [Module00 - PostgreSQL](#module00---postgresql)
18 | - [Module01 - Elasticsearch](#module01---elasticsearch)
19 | - [Module02 - AWS](#module02---aws)
20 | - [Module03 - Hadoop](#module03---hadoop)
21 | - [Module04 - Spark](#module04---spark)
22 | - [Acknowledgements](#acknowledgements)
23 | - [Contributors](#contributors)
24 | 
25 | This project is a Data Engineering bootcamp created by [42 AI](http://www.42ai.fr).
26 | 
27 | Prior Python programming experience is required (see the Python bootcamp)! Your mission, should you choose to accept it, is to come and learn some of the essential knowledge for Data Engineering in a single week. You will start with SQL and NoSQL languages and then get acquainted with some useful tools/frameworks for Data Engineering like Airflow, AWS and Spark.
28 | 
29 | 42 Artificial Intelligence is a student organization of the Paris campus of the school 42. Our purpose is to foster discussion, learning, and interest in the field of artificial intelligence, by organizing various activities such as lectures and workshops.
30 | 
31 | 
32 | ## Curriculum
33 | 
34 | ### Module00 - PostgreSQL
35 | **Let's get started with PostgreSQL!** :link:
36 | > Filter Data, Normalize Data, Populate tables, Data Analysis ...
37 | 
38 | ### Module01 - Elasticsearch
39 | **Get acquainted with Elasticsearch** :mag_right:
40 | > Elasticsearch setup, Data Analysis, Aggregations, Kibana and Monitoring ...
41 | 
42 | ### Module02 - AWS
43 | **Start exploring the cloud on AWS!** :cloud:
44 | > Discover AWS, Flask APIs and infrastructure provisioning with Terraform ...
45 | 
46 | ### Module03 - Hadoop
47 | **Work in progress**
48 | 
49 | ### Module04 - Spark
50 | **Work in progress**
51 | 
52 | ## Acknowledgements
53 | 
54 | ### Contributors
55 | 
56 | * Francois-Xavier Babin (fbabin@student.42.fr)
57 | * Jeremy Jauzion (jjauzion@student.42.fr)
58 | * Myriam Benzarti (mybenzar@student.42.fr)
59 | * Mehdi Aissa Belaloui (mbelalou@student.42.fr)
60 | * Eren Ozdek (eozdek@student.42.fr)
61 | 
-------------------------------------------------------------------------------- /assets/42ai_logo.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/assets/42ai_logo.png
-------------------------------------------------------------------------------- /module00.pdf: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module00.pdf
-------------------------------------------------------------------------------- /module00/assets/client_server.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module00/assets/client_server.png
-------------------------------------------------------------------------------- /module00/assets/tables.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module00/assets/tables.png
-------------------------------------------------------------------------------- /module00/ex00/ex00.md: --------------------------------------------------------------------------------
1 | # Exercise 00 - Setup
2 | 
3 | | | |
4 | | --------------------: | ---- |
5 | | Turn-in directory : | ex00 |
6 | | Files to turn in : | None |
7 | | Forbidden functions : | None |
8 | | Remarks : | n/a |
9 | 
10 | ## The client-server architecture
11 | 
12 | PostgreSQL is an open-source database which follows a client-server architecture. It is divided into three components :
13 | - a **client**, a program on the user's machine which communicates the user's query to the server and receives the server's answers.
14 | - a **server**, a program running in the background that manages access to a specific resource, service or network. The server will understand the client's query and apply it to the database. Then it will send an answer to the client.
15 | - a **database system**, where the data is stored.
16 | 
17 | ![client-server architecture](../assets/client_server.png){width=600px}
18 | 
19 | ps: the client and the server can be located on the same machine.
20 | 
21 | In the case of PostgreSQL, we are going to use `psql` as the client and `pg_ctl` to manage the server.
22 | 
23 | ## PostgreSQL install
24 | 
25 | The first thing we need to do is install PostgreSQL.
26 | 
27 | ```bash
28 | brew install postgresql
29 | ```
30 | 
31 | nb: if you notice any problem with brew, you can reinstall it with the following command.
32 | 
33 | ```bash
34 | rm -rf $HOME/.brew && git clone --depth=1 https://github.com/Homebrew/brew $HOME/.brew && echo 'export PATH=$HOME/.brew/bin:$PATH' >> $HOME/.zshrc && source $HOME/.zshrc && brew update
35 | ```
36 | 
37 | The next thing we need to do is export a variable `PGDATA`. We can add the following line to our `.zshrc` file.
38 | 
39 | ```bash
40 | export PGDATA=$HOME/.brew/var/postgres
41 | ```
42 | 
43 | and source the `.zshrc`.
44 | 
45 | ```bash
46 | source ~/.zshrc
47 | ```
48 | 
49 | Now, we can start the PostgreSQL server. As explained above, the server runs in the background and gives us access to the database.
50 | 
51 | We can start the server.
52 | 
53 | ```bash
54 | $> pg_ctl start
55 | waiting for server to start....2019-12-08 15:58:21.171 CET [84406] LOG: starting PostgreSQL 12.1 on x86_64-apple-darwin18.6.0, compiled by Apple LLVM version 10.0.1 (clang-1001.0.46.4), 64-bit
56 | 2019-12-08 15:58:21.173 CET [84406] LOG: listening on IPv6 address "::1", port 5432
57 | 2019-12-08 15:58:21.173 CET [84406] LOG: listening on IPv4 address "127.0.0.1", port 5432
58 | 2019-12-08 15:58:21.174 CET [84406] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
59 | 2019-12-08 15:58:21.192 CET [84407] LOG: database system was shut down at 2019-12-08 15:49:49 CET
60 | 2019-12-08 15:58:21.201 CET [84406] LOG: database system is ready to accept connections
61 | done
62 | server started
63 | ```
64 | 
65 | We can see that PostgreSQL is listening on port `5432`.
66 | 
67 | `pg_ctl stop` can stop the server.
68 | 
69 | A server program is often associated with a client. Our client here is called `psql`. In the beginning, only one database exists, `postgres`. We must connect to that database first to access the PostgreSQL console.
70 | 
71 | ```bash
72 | $> psql -d postgres
73 | psql (12.1)
74 | Type "help" for help.
75 | 
76 | postgres=#
77 | ```
78 | 
79 | `\?` allows you to see all the possible commands in the PostgreSQL console.
80 | The first thing we can do is list the databases with `\l`.
81 | 
82 | ```txt
83 | postgres=# \l
84 | List of databases
85 | Name | Owner | Encoding | Collate | Ctype | Access privileges
86 | -----------+--------+----------+---------+-------+-------------------
87 | postgres | fbabin | UTF8 | C | C |
88 | template0 | fbabin | UTF8 | C | C | =c/fbabin +
89 | | | | | | fbabin=CTc/fbabin
90 | template1 | fbabin | UTF8 | C | C | =c/fbabin +
91 | | | | | | fbabin=CTc/fbabin
92 | (3 rows)
93 | ```
94 | 
95 | We are going to create a database for the day.
96 | ```bash
97 | postgres=# CREATE DATABASE appstore_games;
98 | ```
99 | Add a user with a very strong password!
100 | ```bash
101 | postgres=# CREATE USER postgres_user WITH PASSWORD '12345';
102 | ```
103 | We must alter the database (change its attributes) to make our new user its owner.
104 | ```bash
105 | postgres=# ALTER DATABASE appstore_games OWNER TO postgres_user;
106 | ```
107 | The last thing we need to do is edit the `~/.brew/var/postgres/pg_hba.conf` file to modify the following line.
108 | ```
109 | host all all 127.0.0.1/32 trust
110 | ```
111 | to
112 | ```
113 | host all all 127.0.0.1/32 md5
114 | ```
115 | This modification will force the use of the password to connect to the database.
116 | 
117 | We are ready to use Postgres!
118 | 
119 | ## Pyenv install
120 | 
121 | Dealing with Python is often hell when it comes to Python versions and library versions. This problem often shows up when several people are working on the same server with different library needs.
122 | Furthermore, you don't want to mess with the system Python. That's why virtual environments and a separate Python install are the preferred solution.
123 | 
124 | You can install pyenv with brew using the following command.
125 | 
126 | ```txt
127 | brew install pyenv
128 | ```
129 | 
130 | All the Python version candidates can then be listed.
131 | 
132 | ```txt
133 | pyenv install --list | grep " 3\.[678]"
134 | ```
135 | ... and installed. For the day we are going to choose version `3.8.0`.
136 | 
137 | ```txt
138 | pyenv install -v 3.8.0
139 | ```
140 | 
141 | Finally, the installed version can be activated with this command.
142 | 
143 | ```txt
144 | pyenv global 3.8.0
145 | ```
146 | 
147 | Don't forget to add those lines to your `.zshrc` file in order to activate your Python environment each time you open a terminal.
148 | 
149 | ```txt
150 | export PATH="/home/misteir/.pyenv/bin:$PATH"
151 | eval "$(pyenv init -)"
152 | eval "$(pyenv virtualenv-init -)"
153 | 
154 | pyenv global 3.8.0 #activate the python 3.8.0 as default python
155 | ```
156 | 
157 | ## Pipenv install
158 | 
159 | Pipenv is a tool to handle the package versions of an environment. It is very similar to a `requirements.txt` file with some extra metadata.
160 | 
161 | Pipenv can be installed with this simple command.
162 | 
163 | ```txt
164 | pip install pipenv
165 | ```
166 | 
167 | You can find a TOML file for the day named `Pipfile`.
168 | 
169 | ```txt
170 | [[source]]
171 | url = "https://pypi.python.org/simple"
172 | verify_ssl = true
173 | name = "pypi"
174 | 
175 | [packages]
176 | jupyter = "*"
177 | numpy = "*"
178 | pandas = "*"
179 | psycopg2 = "*"
180 | 
181 | [requires]
182 | python_version = "3.8.0"
183 | ```
184 | 
185 | To set up your environment, just follow these two steps.
186 | 
187 | ```txt
188 | pipenv install
189 | pipenv shell
190 | ```
191 | 
192 | You now have PostgreSQL, a virtual Python environment and the requirements installed, ready for the day!
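
Optionally, you can check that everything is wired together before moving on. Here is a minimal sanity check (a sketch only, run from inside `pipenv shell`, assuming the `appstore_games` database, the `postgres_user` user and the password created above):

```python
import psycopg2

# Connection parameters match the database, user and password created earlier in this exercise.
conn = psycopg2.connect(database="appstore_games", host="localhost",
                        user="postgres_user", password="12345")
curr = conn.cursor()
curr.execute("SELECT version();")
print(curr.fetchone()[0])  # should print something like "PostgreSQL 12.1 ..."
conn.close()
```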
193 | 
-------------------------------------------------------------------------------- /module00/ex01/ex01.md: --------------------------------------------------------------------------------
1 | # Exercise 01 - Clean
2 | 
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex01 |
6 | | Files to turn in : | clean.py |
7 | | Forbidden function : | None |
8 | | Remarks : | n/a |
9 | 
10 | 
11 | ## Objective
12 | 
13 | You must clean the given CSV dataset to insert it into a PostgreSQL table.
14 | 
15 | ## Instructions
16 | 
17 | The `appstore_games.csv.zip` file is available in the resources; unzip it before use.
18 | 
19 | We are going to keep the following columns: `ID`, `Name`, `Average User Rating`, `User Rating Count`, `Price`, `Description`, `Developer`, `Age Rating`, `Languages`, `Size`, `Primary Genre`, `Genres`, `Original Release Date`, `Current Version Release Date`.
20 | 
21 | 1) You need to implement the function `df_nan_filter`. It takes a pandas dataframe as input and applies the following replacements for NaN values :
22 | * remove the row if `Size` is NaN.
23 | * set `Languages` as "EN" if NaN.
24 | * set `Price` as 0.0 if NaN.
25 | * set `Average User Rating` as the median of the column if NaN.
26 | * set `User Rating Count` as 1 if NaN.
27 | 
28 | ```python
29 | def df_nan_filter(df):
30 |     """Apply filters on NaN values
31 |     Args:
32 |         df: pandas dataframe.
33 |     Returns:
34 |         Filtered Dataframe.
35 |     Raises:
36 |         This function shouldn't raise any Exception.
37 |     """
38 | ```
39 | 
40 | 2) Create the function `change_date_format` that will change the date format from `dd/mm/yyyy` to `yyyy-mm-dd`.
41 | 
42 | ```python
43 | def change_date_format(date: str):
44 |     """Change date format from dd/mm/yyyy to yyyy-mm-dd
45 |     Args:
46 |         date: a string representing the date.
47 |     Returns:
48 |         The date in the format yyyy-mm-dd.
49 |     Raises:
50 |         This function shouldn't raise any Exception.
51 |     """
52 | ```
53 | 
54 | Your function must work with the following commands.
55 | 
56 | ```python
57 | df["Original Release Date"] = df["Original Release Date"].apply(lambda x: change_date_format(x))
58 | df["Current Version Release Date"] = df["Current Version Release Date"].apply(lambda x: change_date_format(x))
59 | ```
60 | 
61 | 3) You need to apply the following function to the `Description` column.
62 | 
63 | ```python
64 | import re
65 | 
66 | def string_filter(s: str):
67 |     """Apply filters in order to clean the string.
68 |     Args:
69 |         s: string.
70 |     Returns:
71 |         Filtered String.
72 |     Raises:
73 |         This function shouldn't raise any Exception.
74 |     """
75 |     # filter : \\t, \\n, \\U1a1b2c3d4, \\u1a2b, \\x1a
76 |     # turn \' into '
77 |     # replace remaining \\ with \
78 |     # turn multiple spaces into one space
79 |     s = re.sub(r'''\\+(t|n|U[a-z0-9]{8}|u[a-z0-9]{4}|x[a-z0-9]{2}|[\.]{2})''', ' ', s)
80 |     s = s.replace('\\\'', '\'').replace('\\\\', '\\')
81 |     s = re.sub(r' +', ' ', s)
82 |     return (s)
83 | ```
84 | 
85 | 4) Remove the `ID` duplicates.
86 | 
87 | 5) Convert the data type of the columns `Age Rating`, `User Rating Count` and `Size` to int.
88 | 
89 | 6) Remove the rows whose `Name` is shorter than 4 characters.
90 | 
91 | You must apply these steps to create a script producing the file `appstore_games.cleaned.csv`.
92 | 
93 | ## Examples
94 | 
95 | The following example does not show the true dataset nor the exact values obtained after the filters.
96 | 97 | ```txt 98 | >>> df = pd.read_csv("appstore_games.csv") 99 | >>> df.head(1) 100 | Average User Rating User Rating Count Price Languages 101 | 1 NaN NaN NaN NaN 102 | >>> df = nan_filter(df) 103 | >>> df.head(1) 104 | Age User Rating User Rating Count Price Languages 105 | 4 1 15 EN 106 | ``` 107 | 108 | ```python 109 | for e in df: 110 | print("'{}' :: {}".format(e, df.loc[0, e])) 111 | ``` 112 | 113 | With the above code, you should obtain something similar to this output for the values of the first row. The output shape is (16809, 14). 114 | 115 | \clearpage 116 | 117 | ```txt 118 | 'ID' :: 284921427 119 | 'Name' :: Sudoku 120 | 'Average User Rating' :: 4.0 121 | 'User Rating Count' :: 3553 122 | 'Price' :: 2.99 123 | 'Description' :: Join over 21,000,000 of our fans and download one of our Sudoku games today! Makers of the Best Sudoku Game of 2008, Sudoku (Free), we offer you the best selling Sudoku game for iPhone with great features and 1000 unique puzzles! Sudoku will give you many hours of fun and puzzle solving. Enjoy the challenge of solving Sudoku puzzles whenever or wherever you are using your iPhone or iPod Touch. OPTIONS All options are on by default, but you can turn them off in the Options menu Show Incorrect :: Shows incorrect answers in red. Smart Buttons :: Disables the number button when that number is completed on the game board. Smart Notes :: Removes the number from the notes in the box, column, and row that contains the cell with your correct answer. FEATURES 1000 unique handcrafted puzzles ALL puzzles solvable WITHOUT guessing Four different skill levels Challenge a friend Multiple color schemes ALL notes: tap the All notes button on to show all the possible answers for each square. Tap the All notes button off to remove the notes. Hints: shows the answer for the selected square or a random square when one is not selected Pause the game at any time and resume where you left off Best times, progress statistics, and much more Do you want more? Try one of our other versions of sudoku which have all the same great features! * Try Color Sudoku for a fun twist to solving sudoku puzzles. * For advanced puzzle solving, try Expert Sudoku to challenge your sudoku solving skills. 124 | 'Developer' :: Mighty Mighty Good Games 125 | 'Age Rating' :: 4 126 | 'Languages' :: DA, NL, EN, FI, FR, DE, IT, JA, KO, NB, PL, PT, RU, ZH, ES, SV, ZH 127 | 'Size' :: 15853568 128 | 'Primary Genre' :: Games 129 | 'Genres' :: Games, Strategy, Puzzle 130 | 'Original Release Date' :: 2008-07-11 131 | 'Current Version Release Date' :: 2017-05-30 132 | ``` 133 | -------------------------------------------------------------------------------- /module00/ex02/ex02.md: -------------------------------------------------------------------------------- 1 | # Exercise 02 - Normalize 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex02 | 6 | | Files to turn in : | normalize.py | 7 | | Forbidden function : | None | 8 | | Remarks : | n/a | 9 | 10 | 11 | ## Objective 12 | 13 | You must normalize the given CSV dataset to insert it into a PostgreSQL table. 14 | 15 | ## Instructions 16 | 17 | We are going to use the previously cleaned dataset and apply the `1NF normalization` rule to it. 18 | 19 | ### 1NF normalization 20 | * Each column should contain atomic values (list entries like `x, y` violate this rule). 21 | * Each column should contain values of the same type. 22 | * Each column should have unique names. 
23 | * Order in which data is saved does not matter.
24 | 
25 | This rule is normally applied to a database, but we are going to use this data as database tables in the next exercises.
26 | 
27 | The only rule that the current data does not follow concerns lists of values in columns. Not respecting this rule complicates queries a lot (querying on a list is not convenient).
28 | 
29 | 
30 | The two columns that don't respect this rule are `Languages` and `Genres`. In order to respect the 1NF rule you have to create 3 dataframes (that are going to become PostgreSQL tables) :
31 | 
32 | * **df** : `ID`, `Name`, `Average User Rating`, `User Rating Count`, `Price`, `Description`, `Developer`, `Age Rating`, `Size`, `Original Release Date`, `Current Version Release Date`
33 | * **df_genres** : `ID`, `Primary Genre`, `Genre`
34 | * **df_languages** : `ID`, `Language`
35 | 
36 | We want to go from this form ...
37 | 
38 | ```txt
39 | +----------+-----------+
40 | |ID |Language |
41 | +----------+-----------+
42 | |284921427 |DA, NL, EN |
43 | +----------+-----------+
44 | ```
45 | 
46 | ... to this one.
47 | 
48 | ```txt
49 | +----------+---------+
50 | |ID |Language |
51 | +----------+---------+
52 | |284921427 |DA |
53 | |284921427 |NL |
54 | |284921427 |EN |
55 | +----------+---------+
56 | ```
57 | 
58 | To do that we can use the `explode` function of pandas. This function only works with lists, so we have to convert the string `DA, NL, EN` to a list format like `[DA, NL, EN]`.
59 | 
60 | 1) Create the 3 dataframes (with the corresponding columns)
61 | 
62 | 2) Convert multi-word genres to a single-word format (ex: `Arcade & Adventure` to `Arcade_&_Adventure`)
63 | 
64 | 3) Convert strings to list format (for columns containing lists) and remove the 'Games' genre from each list (it is irrelevant information as it is in every list)
65 | 
66 | 4) Use the `explode` function of pandas (the index of the dataframes will be broken)
67 | 5) Reset the index of the dataframes (`reset_index` function)
68 | 
69 | 6) Save the dataframes into the files :
70 | * `appstore_games.normalized.csv` (shape : (16809, 11))
71 | * `appstore_games_genres.normalized.csv` (shape : (44252, 3))
72 | * `appstore_games_languages.normalized.csv` (shape : (54695, 2))
73 | 
74 | ## Examples
75 | 
76 | ```txt
77 | +----------+---------+
78 | |ID |Language |
79 | +----------+---------+
80 | |284921427 |DA |
81 | |284921427 |NL |
82 | |284921427 |EN |
83 | |284921427 |FI |
84 | |284921427 |FR |
85 | |... |... |
86 | +----------+---------+
87 | Only showing 5 lines !
88 | ```
89 | 
90 | ```txt
91 | +----------+--------------+--------------+
92 | |ID |Primary Genre |Genre |
93 | +----------+--------------+--------------+
94 | |284921427 |Games |Strategy |
95 | |284921427 |Games |Puzzle |
96 | |284926400 |Games |Strategy |
97 | |284926400 |Games |Board |
98 | |284946595 |Games |Board |
99 | |... |... |... |
100 | +----------+--------------+--------------+
101 | ```
-------------------------------------------------------------------------------- /module00/ex03/ex03.md: --------------------------------------------------------------------------------
1 | # Exercise 03 - Populate
2 | | | |
3 | | -----------------------:| ------------------ |
4 | | Turn-in directory : | ex03 |
5 | | Files to turn in : | populate.py |
6 | | Forbidden function : | None |
7 | | Remarks : | n/a |
8 | 
9 | ## Objective
10 | 
11 | You must insert :
12 | * `appstore_games.normalized.csv`
13 | * `appstore_games_genres.normalized.csv`
14 | * `appstore_games_languages.normalized.csv`
15 | 
16 | data into PostgreSQL tables.
17 | 
18 | ## Instructions
19 | 
20 | You can read the psycopg2_basics documentation (some included functions will help you with this exercise).
21 | 
22 | 1) You first need to create 3 functions.
23 | - `create_appstore_games`
24 | - `create_appstore_games_genres`
25 | - `create_appstore_games_languages`
26 | 
27 | ... to create the following tables :
28 | 
29 | ![tables](../assets/tables.png){width=450px}
30 | 
31 | nb: a foreign key is a reference to an existing column in another table.
32 | 
33 | 2) You will have to create the 3 populate functions
34 | 
35 | * `populate_appstore_games`
36 | * `populate_appstore_games_genres`
37 | * `populate_appstore_games_languages`
38 | 
39 | ... to insert data into the different tables.
40 | 
41 | Before you do anything, you must ensure PostgreSQL is running.
42 | 
43 | ## Examples
44 | 
45 | In the end, displaying your tables should show the following output :
46 | 
47 | * `appstore_games_genres`
48 | 
49 | ```txt
50 | +---+----------+--------------+---------+
51 | |id |game_id |primary_genre |genre |
52 | +---+----------+--------------+---------+
53 | |0 |284921427 |Games |Strategy |
54 | |1 |284921427 |Games |Puzzle |
55 | |2 |284926400 |Games |Strategy |
56 | |3 |284926400 |Games |Board |
57 | |4 |284946595 |Games |Board |
58 | |5 |284946595 |Games |Strategy |
59 | |6 |285755462 |Games |Strategy |
60 | |7 |285755462 |Games |Puzzle |
61 | |8 |285831220 |Games |Strategy |
62 | |9 |285831220 |Games |Board |
63 | |.. |... |... |... |
64 | +---+----------+--------------+---------+
65 | ```
66 | 
67 | * `appstore_games_languages`
68 | 
69 | ```txt
70 | +---+----------+---------+
71 | |id |game_id |language |
72 | +---+----------+---------+
73 | |0 |284921427 |DA |
74 | |1 |284921427 |NL |
75 | |2 |284921427 |EN |
76 | |3 |284921427 |FI |
77 | |4 |284921427 |FR |
78 | |5 |284921427 |DE |
79 | |6 |284921427 |IT |
80 | |7 |284921427 |JA |
81 | |8 |284921427 |KO |
82 | |9 |284921427 |NB |
83 | |.. |... |... |
84 | +---+----------+---------+
85 | ```
-------------------------------------------------------------------------------- /module00/ex03/psycopg2_basics.md: --------------------------------------------------------------------------------
1 | # Psycopg2 basics
2 | 
3 | Psycopg is a very popular PostgreSQL database adapter for the Python programming language. Its full documentation can be seen **[here](https://pypi.org/project/psycopg2/)**.
4 | 
5 | The function `connect()` creates a new database session and returns a new connection instance.
6 | 
7 | ```python
8 | import psycopg2
9 | 
10 | def get_connection():
11 |     conn = psycopg2.connect(
12 |         database="appstore_games",
13 |         host="localhost",
14 |         user="postgres_user",
15 |         password="12345"
16 |     )
17 |     return (conn)
18 | ```
19 | 
20 | Cursors allow Python code to execute PostgreSQL commands in a database session.
21 | 
22 | ```python
23 | curr = conn.cursor()
24 | ```
25 | 
26 | Tables can be created with the cursor.
27 | 
28 | ```python
29 | curr.execute("""CREATE TABLE members (
30 |     id serial PRIMARY KEY,
31 |     firstname varchar(32),
32 |     lastname varchar(32),
33 |     birthdate date
34 |     )
35 | """)
36 | ```
37 | 
38 | It's also possible to remove a table.
39 | 
40 | ```python
41 | curr.execute("DROP TABLE members")
42 | ```
43 | 
44 | To make the changes persistent in the database, we need to commit the transaction (queries are executed as transactions). Finally, we can close the connection.
45 | 
46 | ```python
47 | conn.commit()
48 | conn.close()
49 | ```
50 | 
51 | This gives the following full code.
52 | 
53 | ```python
54 | import psycopg2
55 | 
56 | def get_connection():
57 |     conn = psycopg2.connect(
58 |         database="appstore_games",
59 |         host="localhost",
60 |         user="postgres_user",
61 |         password="12345"
62 |     )
63 |     return (conn)
64 | 
65 | if __name__ == "__main__":
66 |     conn = get_connection()
67 |     curr = conn.cursor()
68 |     curr.execute("""CREATE TABLE members (
69 |         id serial PRIMARY KEY,
70 |         firstname varchar(32),
71 |         lastname varchar(32),
72 |         birthdate date
73 |         )
74 |     """)
75 |     conn.commit()
76 |     conn.close()
77 | ```
78 | 
79 | ## Inserting data
80 | 
81 | Data can be inserted into a table with the following syntax.
82 | 
83 | ```python
84 | curr.execute("""
85 |     INSERT INTO members(firstname, lastname, birthdate) VALUES
86 |     ('Eric', 'Clapton', '1945-03-30'),
87 |     ('Joe', 'Bonamassa', '1977-05-08')
88 | """)
89 | ```
90 | 
91 | ## Deleting data
92 | 
93 | Data can also be deleted.
94 | 
95 | ```python
96 | curr.execute("""DELETE FROM members
97 |     WHERE lastname LIKE 'Clapton'
98 | """)
99 | ```
100 | 
101 | # Useful functions
102 | 
103 | ## Get connection
104 | 
105 | ```python
106 | def get_connection():
107 |     conn = psycopg2.connect(
108 |         database="appstore_games",
109 |         host="localhost",
110 |         user="postgres_user",
111 |         password="12345"
112 |     )
113 |     return (conn)
114 | ```
115 | 
116 | ## Showing table content
117 | 
118 | We must use the `fetchall` function to gather all the results in a list of tuples.
119 | 
120 | ```python
121 | from psycopg2.extensions import AsIs  # needed to pass an identifier (table name) into the query
122 | 
123 | def display_table(table: str):
124 |     conn = get_connection()
125 |     curr = conn.cursor()
126 |     curr.execute("""SELECT * FROM %(table)s
127 |                     LIMIT 10""", {"table": AsIs(table)})
128 |     response = curr.fetchall()
129 |     for row in response:
130 |         print(row)
131 |     conn.close()
132 | ```
133 | 
134 | ## Create a table
135 | 
136 | ```python
137 | def create_table():
138 |     conn = get_connection()
139 |     curr = conn.cursor()
140 |     curr.execute("""CREATE TABLE test(
141 |         FirstName varchar PRIMARY KEY,
142 |         LastName varchar,
143 |         Age int
144 |     );""")
145 |     conn.commit()
146 |     conn.close()
147 | ```
148 | 
149 | ## Drop table
150 | 
151 | ```python
152 | def delete_table(table: str):
153 |     conn = get_connection()
154 |     curr = conn.cursor()
155 |     curr.execute("DROP TABLE IF EXISTS %(table)s;", {"table": AsIs(table)})
156 |     conn.commit()
157 |     conn.close()
158 | ```
159 | 
160 | ## Inserting data into a table
161 | 
162 | ```python
163 | def populate_table():
164 |     conn = get_connection()
165 |     curr = conn.cursor()
166 |     curr.execute("""INSERT INTO test
167 |                     (FirstName,
168 |                     LastName,
169 |                     Age) VALUES
170 |                     (%s, %s, %s)""",
171 |                  ('Michelle',
172 |                   'Dupont',
173 |                   '33'))
174 |     conn.commit()
175 |     conn.close()
176 | ```
-------------------------------------------------------------------------------- /module00/ex04/ex04.md: --------------------------------------------------------------------------------
1 | # Exercise 04 - Top100
2 | 
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex04 |
6 | | Files to turn in : | top100.py |
7 | | Forbidden functions : | None |
8 | | Remarks : | n/a |
9 | 
10 | ## Objective
11 | 
12 | You must show the Name of the top 100 games with the best user rating.
13 | 
14 | ## Instructions
15 | 
16 | You must create a program using the function `get_top_100`.
17 | 
18 | This function must show the Name of the top 100 games, ordered by `Avg_user_rating` first and then by `Name`.
19 | 
20 | The names of games not starting with a letter must be ignored. Then, you must show the first 100 games starting with letters.
21 | 
22 | **You must only use PostgreSQL for your queries !**
23 | 
24 | ## Example
25 | 
26 | ```txt
27 | >> get_top_100()
28 | AFK Arena
29 | APORIA
30 | AbsoluteShell
31 | Action Craft Mini Blockheads Match 3 Skins Survival Game
32 | Adrift by Tack
33 | Agadmator Chess Clock
34 | Age Of Magic
35 | Age of Giants: Tribal Warlords
36 | Age of War Empires: Order Rise
37 | Alicia Quatermain 2 (Platinum)
38 | ...
39 | ```
40 | 
41 | As you guessed, you should have 100 hits.
-------------------------------------------------------------------------------- /module00/ex05/ex05.md: --------------------------------------------------------------------------------
1 | # Exercise 05 - Name_lang
2 | 
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex05 |
6 | | Files to turn in : | name_lang.py |
7 | | Forbidden functions : | None |
8 | | Remarks : | n/a |
9 | 
10 | ## Objective
11 | 
12 | You must show the Name and Language of games priced strictly between 5 and 10 euros (both bounds excluded).
13 | 
14 | ## Instructions
15 | 
16 | You must create a program using the function `get_name_lang` that will show the Name and Language of games priced strictly between 5 and 10 euros.
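
For reference, one possible shape for such a query through psycopg2 is sketched below. This is not the official solution: the lowercase column names (`id`, `name`, `price`, `game_id`, `language`) are assumptions that depend on how you created your tables in ex03, so adapt them to your own schema.

```python
import psycopg2

def get_name_lang():
    # Assumed schema: appstore_games(id, name, price, ...) and
    # appstore_games_languages(game_id, language); adjust names to your ex03 tables.
    conn = psycopg2.connect(database="appstore_games", host="localhost",
                            user="postgres_user", password="12345")
    curr = conn.cursor()
    curr.execute("""SELECT g.name, l.language
                    FROM appstore_games AS g
                    JOIN appstore_games_languages AS l ON l.game_id = g.id
                    WHERE g.price > 5 AND g.price < 10""")
    for name, language in curr.fetchall():
        print("{}, {}".format(name, language))
    conn.close()
```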
17 | 18 | **You must only use PostgreSQL for your queries !** 19 | 20 | 21 | ## Example 22 | 23 | ```txt 24 | >> get_name_lang() 25 | Chess Genius, EN 26 | Chess Genius, FR 27 | Chess Genius, DE 28 | Chess Genius, IT 29 | Chess Genius, ES 30 | Chess - tChess Pro, EN 31 | Chess - tChess Pro, FR 32 | Chess - tChess Pro, DE 33 | Chess - tChess Pro, JA 34 | Chess - tChess Pro, KO 35 | ... 36 | ``` 37 | 38 | You should have 634 hits. -------------------------------------------------------------------------------- /module00/ex06/ex06.md: -------------------------------------------------------------------------------- 1 | # Exercise 06 - K-first 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex06 | 6 | | Files to turn in : | k_first.py | 7 | | Forbidden functions : | None | 8 | | Remarks : | n/a | 9 | 10 | ## Objective 11 | 12 | You must show the name of developers starting with 'K' and involved in casual games. 13 | 14 | ## Instructions 15 | 16 | You must create a program using the function `get_k_first` that shows the name of developers starting with 'K' (case sensitive) and involved in casual games. 17 | 18 | **You must only use PostgreSQL for your queries !** 19 | 20 | 21 | ## Example 22 | 23 | ```txt 24 | >> get_k_first() 25 | Koh Jing Yu 26 | Kyle Decot 27 | Kashif Tasneem 28 | Kristin Nutting 29 | Kok Leong Tan 30 | Key Player Publishing Limited 31 | KillerBytes 32 | KillerBytes 33 | Khoa Tran 34 | Kwai Ying Cindy Cheung 35 | KG2 Entertainment LLC 36 | Keehan Roberts 37 | ... 38 | ``` 39 | 40 | You should have 40 hits. -------------------------------------------------------------------------------- /module00/ex07/ex07.md: -------------------------------------------------------------------------------- 1 | # Exercise 07 - Seniors 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex07 | 6 | | Files to turn in : | seniors.py | 7 | | Forbidden functions : | None | 8 | | Remarks : | n/a | 9 | 10 | ## Objective 11 | 12 | You must show the Name of developers involved in games released before 01/08/2008 included and updated after 01/01/2018 included. 13 | 14 | ## Instructions 15 | 16 | You must create a program using a function `get_seniors` that shows the Name of developers involved in games released before 01/08/2008 included and updated after 01/01/2018 included. 17 | 18 | **You must only use PostgreSQL for your queries !** 19 | 20 | 21 | ## Example 22 | 23 | ```txt 24 | >> get_seniors() 25 | Kiss The Machine 26 | ... 27 | ``` 28 | 29 | You should have 3 hits. -------------------------------------------------------------------------------- /module00/ex08/ex08.md: -------------------------------------------------------------------------------- 1 | # Exercise 08 - Battle_royale 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex08 | 6 | | Files to turn in : | battle_royale.py | 7 | | Forbidden functions : | None | 8 | | Remarks : | n/a | 9 | 10 | ## Objective 11 | 12 | You must show the name of the games with "battle royale" in their description and with a URL that will redirect to `facebook.com`. 13 | 14 | ## Instructions 15 | 16 | You must create a program using a function `get_battle_royale` that shows the name of the games with "battle royale" (case insensitive) in their description and with a URL that will redirect to `facebook.com`. 
17 | 18 | **You must only use PostgreSQL for your queries !** 19 | 20 | 21 | ## Example 22 | 23 | ```txt 24 | >> get_battle_royale() 25 | Lords Mobile: War Kingdom 26 | Crusaders of Light 27 | Blob io - Throw & split cells 28 | ... 29 | ``` 30 | 31 | You should have 5 hits. -------------------------------------------------------------------------------- /module00/ex09/ex09.md: -------------------------------------------------------------------------------- 1 | # Exercise 09 - Benefits 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex09 | 6 | | Files to turn in : | benefits.py | 7 | | Forbidden function : | None | 8 | | Remarks : | n/a | 9 | 10 | ## Objective 11 | 12 | Show the first 10 games that generated the most benefits. 13 | 14 | ## Instructions 15 | 16 | You must create a program using the function `get_benefits` that will show the first 10 genres that generated the most "benefits". 17 | 18 | Benefits are calculated with the number of users who voted times the price of the game. 19 | 20 | **You must only use PostgreSQL for your queries !** 21 | 22 | 23 | ## Example 24 | 25 | ```txt 26 | >> get_benefits() 27 | Strategy 28 | Entertainment 29 | ... 30 | ``` 31 | 32 | You should have 48 hits. -------------------------------------------------------------------------------- /module00/ex10/ex10.md: -------------------------------------------------------------------------------- 1 | # Exercise 10 - Sweet spot 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex10 | 6 | | Files to turn in : | sweet_spot.py | 7 | | Forbidden function : | None | 8 | | Remarks : | n/a | 9 | 10 | ## Objective 11 | 12 | Find the month where the most important number of games are released. 13 | 14 | ## Instructions 15 | 16 | Find the month where the most important number of games are released. 17 | 18 | **You must only use PostgreSQL for your queries !** 19 | 20 | 21 | ## Example 22 | 23 | This answer may not be the right one. 24 | 25 | ```txt 26 | january 27 | ``` 28 | 29 | You should have 1 hit. -------------------------------------------------------------------------------- /module00/ex11/ex11.md: -------------------------------------------------------------------------------- 1 | # Exercise 11 - Price Analysis 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex11 | 6 | | Files to turn in : | price.py, price.png | 7 | | Forbidden function : | None | 8 | | Remarks : | n/a | 9 | | Allowed python libraries : | matplotlib, numpy | 10 | 11 | ## Objective 12 | 13 | Analyze the price distribution of games by plotting a histogram of the price distribution. 14 | 15 | ## Instructions 16 | 17 | First, you need to write the right query to output a table where you have the distribution of price, i.e. the number of games for each price. 18 | 19 | Then, you can use matplotlib to create a histogram. Your histogram will have to : 20 | - not show games with a price below 1.0 21 | - have a bar plot with 3 euros interval 22 | - have the xlabel `Price` 23 | - have the ylabel `Frequency` 24 | - have the title `Appstore games price` 25 | 26 | You will have to save your histogram in a file named `price.png` 27 | 28 | Finally, you have to use numpy to find the mean and the standard deviation of your data set. 
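As a rough sketch of the plotting and statistics part, assuming your query already returned `(price, number_of_games)` pairs in a list called `rows` (the variable names and sample values below are purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rows = [(1.99, 520), (4.99, 310), (9.99, 120)]  # illustrative values only

# Drop games priced below 1.0 as required
prices = np.array([price for price, count in rows if price >= 1.0])
counts = np.array([count for price, count in rows if price >= 1.0])

# Histogram with 3-euro wide bars
bins = np.arange(1.0, prices.max() + 3.0, 3.0)
plt.hist(prices, bins=bins, weights=counts)
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.title("Appstore games price")
plt.savefig("price.png")

# Each price occurs `count` times, so use weighted statistics
mean = np.average(prices, weights=counts)
std = np.sqrt(np.average((prices - mean) ** 2, weights=counts))
print(f"mean price : {mean}")
print(f"std price : {std}")
```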
29 | 30 | nb: you do not need to worry about the number of decimals printed 31 | 32 | **You can use PostgreSQL and Python (for numpy, matplotlib, bins creation ...)** 33 | 34 | 35 | ## Example 36 | 37 | This answer may not be the right one. 38 | 39 | ```txt 40 | $> python price.py 41 | mean price : 15.04098 42 | std price : 6.03456 43 | ``` -------------------------------------------------------------------------------- /module00/ex12/ex12.md: -------------------------------------------------------------------------------- 1 | # Exercise 12 - Worldwide 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex12 | 6 | | Files to turn in : | worldwide.py | 7 | | Forbidden function : | None | 8 | | Remarks : | n/a | 9 | 10 | ## Objective 11 | 12 | Give the top 5 most played genres among games that have several distinct languages greater or equal to 3. 13 | 14 | ## Instructions 15 | 16 | You must write a query that filters games according to the number of languages they have, and then filter out the ones that have strictly less than 3 languages. Then you need to select the top 5 genres where those games appear. 17 | 18 | **You must only use PostgreSQL for your queries !** 19 | 20 | 21 | ## Example 22 | 23 | ```txt 24 | $> python worldwide.py 25 | Strategy 26 | ... 27 | ``` 28 | 29 | As you guessed, you should have 5 hits. -------------------------------------------------------------------------------- /module00/ex13/ex13.md: -------------------------------------------------------------------------------- 1 | # Exercise 13 - Italian_market 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex13 | 6 | | Files to turn in: | italian_market.py | 7 | | Forbidden functions: | None | 8 | | Remarks: | n/a | 9 | 10 | ## Objective 11 | 12 | Create a script which list the games supporting the Italian language first and Spanish otherwise. 13 | 14 | ## Instructions 15 | 16 | You must write a script which list the games supporting the Italian language first and Spanish otherwise. 17 | 18 | Hint : You should have a look at window functions. 19 | 20 | **You must only use PostgreSQL for your queries !** 21 | 22 | ## Example 23 | 24 | ```txt 25 | $> python italian_market.py 26 | 100 Balls plus 20 27 | 1010 Block King Puzzle 28 | 1010 Fit for Blocks bricks 29 | 1024 - 2048 - 4096 - 8192 30 | ... 31 | ``` 32 | 33 | You should have 2471 hits. 34 | -------------------------------------------------------------------------------- /module00/ex14/ex14.md: -------------------------------------------------------------------------------- 1 | # Exercise 14 - Sample 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex14 | 6 | | Files to turn in: | sample.py | 7 | | Forbidden functions: | None | 8 | | Remarks: | n/a | 9 | 10 | ## Objective 11 | 12 | Create a statistically representative sample of your dataset. 13 | 14 | ## Instructions 15 | 16 | 1) We need to find a good sample size for our dataset. You must find out how representative sample size calculation works. 17 | 18 | Find a sample size calculator online and compute the sample size using the given parameters: 19 | - The margin of error of 5% 20 | - Confidence Level of 95% 21 | - population size (size of appstore_games table) 22 | 23 | Then put the sample size in a variable. 
24 | 25 | 2) Write a PostgreSQL `sample` function that will randomly select a given number of rows (sample_size parameter) 26 | 27 | 3) Use your `sample` function to randomly select a sample and save the result into a CSV file named `appstore_games.sample.csv` 28 | 29 | Hint : you can use `pd.read_sql_query` and `df.to_csv` ! 30 | 31 | **You must only use PostgreSQL for your queries!** 32 | 33 | 34 | ## Bonus 35 | 36 | Write a Python function `sample_size` with the following parameters: 37 | - `population_size` 38 | - `confidence_level` : default value `0.95` 39 | - `margin_error` : default value `0.05` 40 | - `standard_deviation` : default value `0.5` 41 | 42 | This function will compute the sample size needed for the given parameter following the given formula: 43 | 44 | $$ 45 | sample\_size = \frac{\frac{zscore^2 \times std(1 - std)}{margin\_error^2}}{1 + \frac{zscore^2 \times std(1 - std)}{margin\_error^2 \times Population\_size}} 46 | $$ 47 | 48 | The z_score depends on the confidence level following this table: 49 | 50 | \clearpage 51 | 52 | |Confidence_level|Z_score| 53 | |---|---| 54 | |0.80|1.28| 55 | |0.85|1.44| 56 | |0.90|1.65| 57 | |0.95|1.96| 58 | |0.99|2.58| 59 | -------------------------------------------------------------------------------- /module00/module00.md: -------------------------------------------------------------------------------- 1 | # Module00 - SQL with PostgreSQL 2 | 3 | In this module, you will learn how to use a SQL database: PostgreSQL. 4 | 5 | ## Notions of the module 6 | 7 | The purpose of the module is at first to create, administrate and normalize a PostgreSQL Database. Then, we are going to analyse the data and visualize the content of the database. Finally, we will see advanced notions like caching, replication and backups. 8 | 9 | ## General rules 10 | 11 | * The version of Python to use is 3.7, you can check the version of Python with the following command: `python -V`. 12 | * you will follow the **[Pep8 standard](https://www.python.org/dev/peps/pep-0008/)**. 13 | * The exercises are ordered from the easiest to the hardest. 14 | * Your exercises are going to be evaluated by someone else so make sure that variables and functions names are appropriated. 15 | * Your man is the internet. 16 | * You can also ask any question in the dedicated channel in Slack: **[42ai slack](https://42-ai.slack.com)**. 17 | * If you find any issue or mistakes in the subject please create an issue on our dedicated repository on **[Github issues](https://github.com/42-AI/bootcamp_data-engineering/issues")**. 18 | 19 | ## Foreword 20 | 21 | Data Engineering implies many tasks from organizing the data to putting data systems to productions. Data organization is often a mess in companies and our job is to provide a common, well-organized data source. Historically, the organization of the data is used to analyze the business and determine future business decisions. Those data organizations are called [Data warehouses](https://www.tutorialspoint.com/dwh/index.htm) and are used by business intelligence teams (teams in charge of analyzing the business). This organization of the data follows a [star scheme](https://www.tutorialspoint.com/dwh/dwh_schemas.htm) allowing fast analysis. 22 | 23 | Nowadays, we want to meet other cases' needs such as providing data to data science teams or other projects. To do so, we want to deliver a common data organization which won't be project-specific but which will be used by anyone willing to (business intelligence, data scientists, ...). 
This 24 | new data organization is called a [Data Lake](https://medium.com/rock-your-data/getting-started-with-data-lake-4bb13643f9). It contains all the company data. The job of data engineering consists of organizing the data : 25 | - ingestion 26 | - storage 27 | - catalog and search engine associated 28 | 29 | To do that SQL is often used to filter, join, select the data. During the module, you will discover an open-source SQL language, PostgreSQL. 30 | 31 | ### Exercise 00 - Setup 32 | ### Exercise 01 - Clean 33 | ### Exercise 02 - Normalize 34 | ### Exercise 03 - Populate 35 | ### Exercise 04 - Top_100 36 | ### Exercise 05 - Name_lang 37 | ### Exercise 06 - K-first 38 | ### Exercise 07 - Seniors 39 | ### Exercise 08 - Battle_royale 40 | ### Exercise 09 - Benefits 41 | ### Exercise 10 - Sweet_spot 42 | ### Exercise 11 - Price_analysis 43 | ### Exercise 12 - Worldwide 44 | ### Exercise 13 - Italian_market 45 | ### Exercise 14 - Sample 46 | -------------------------------------------------------------------------------- /module00/resources/Pipfile: -------------------------------------------------------------------------------- 1 | [[source]] 2 | url = "https://pypi.python.org/simple" 3 | verify_ssl = true 4 | name = "pypi" 5 | 6 | [packages] 7 | jupyter = "*" 8 | numpy = "*" 9 | pandas = "*" 10 | psycopg2 = "*" 11 | 12 | [requires] 13 | python_version = ">=3.7" 14 | 15 | -------------------------------------------------------------------------------- /module00/resources/db/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM postgres:12.2-alpine 2 | 3 | # run init.sql 4 | ADD init.sql /docker-entrypoint-initdb.d 5 | ADD pg_hba.conf /var/lib/postgresql/data -------------------------------------------------------------------------------- /module00/resources/db/init.sql: -------------------------------------------------------------------------------- 1 | CREATE DATABASE appstore_games; 2 | CREATE USER postgres_user WITH PASSWORD '12345'; 3 | ALTER DATABASE appstore_games OWNER TO postgres_user; -------------------------------------------------------------------------------- /module00/resources/db/pg_hba.conf: -------------------------------------------------------------------------------- 1 | # TYPE DATABASE USER ADDRESS METHOD 2 | 3 | # "local" is for Unix domain socket connections only 4 | local all all trust 5 | # IPv4 local connections: 6 | host all all 127.0.0.1/32 md5 7 | # IPv6 local connections: 8 | host all all ::1/128 trust 9 | # Allow replication connections from localhost, by a user with the 10 | # replication privilege. 
11 | local replication all trust 12 | host replication all 127.0.0.1/32 trust 13 | host replication all ::1/128 trust 14 | 15 | host all all all md5 -------------------------------------------------------------------------------- /module00/resources/docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: '3.4' 2 | 3 | services: 4 | db: 5 | build: 6 | context: ./db 7 | dockerfile: Dockerfile 8 | ports: 9 | - 54320:5432 10 | environment: 11 | - POSTGRES_USER=postgres 12 | - POSTGRES_PASSWORD=postgres 13 | - PGDATA=/var/lib/postgresql/data/pgdata 14 | volumes: 15 | - type: volume 16 | source: db-data 17 | target: /var/lib/postgresql/data 18 | volume: 19 | nocopy: true 20 | 21 | volumes: 22 | db-data: -------------------------------------------------------------------------------- /module00/resources/docker_install.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # **************************************************************************** # 3 | # # 4 | # ::: :::::::: # 5 | # init_docker.sh :+: :+: :+: # 6 | # +:+ +:+ +:+ # 7 | # By: aguiot-- +#+ +:+ +#+ # 8 | # +#+#+#+#+#+ +#+ # 9 | # Created: 2019/11/18 08:17:08 by aguiot-- #+# #+# # 10 | # Updated: 2020/02/20 14:00:32 by aguiot-- ### ########.fr # 11 | # Updated: 2020/02/20 14:34:42 by aguiot-- ### ########.fr # 12 | # # 13 | # **************************************************************************** # 14 | 15 | # https://github.com/alexandregv/42toolbox 16 | 17 | # Ensure USER variabe is set 18 | [ -z "${USER}" ] && export USER=$(whoami) 19 | 20 | ################################################################################ 21 | 22 | # Config 23 | docker_destination="/goinfre/$USER/docker" #=> Select docker destination (goinfre is a good choice) 24 | 25 | ################################################################################ 26 | 27 | # Colors 28 | blue=$'\033[0;34m' 29 | cyan=$'\033[1;96m' 30 | reset=$'\033[0;39m' 31 | 32 | # Uninstall docker, docker-compose and docker-machine if they are installed with brew 33 | brew uninstall -f docker docker-compose docker-machine &>/dev/null ;: 34 | 35 | # Check if Docker is installed with MSC and open MSC if not 36 | if [ ! -d "/Applications/Docker.app" ] && [ ! -d "~/Applications/Docker.app" ]; then 37 | echo "${blue}Please install ${cyan}Docker for Mac ${blue}from the MSC (Managed Software Center)${reset}" 38 | open -a "Managed Software Center" 39 | read -n1 -p "${blue}Press RETURN when you have successfully installed ${cyan}Docker for Mac${blue}...${reset}" 40 | echo "" 41 | fi 42 | 43 | # Kill Docker if started, so it doesn't create files during the process 44 | pkill Docker 45 | 46 | # Ask to reset destination if it already exists 47 | if [ -d "$docker_destination" ]; then 48 | read -n1 -p "${blue}Folder ${cyan}$docker_destination${blue} already exists, do you want to reset it? 
[y/${cyan}N${blue}]${reset} " input 49 | echo "" 50 | if [ -n "$input" ] && [ "$input" = "y" ]; then 51 | rm -rf "$docker_destination"/{com.docker.{docker,helper},.docker} &>/dev/null ;: 52 | fi 53 | fi 54 | 55 | # Unlinks all symlinks, if they are 56 | unlink ~/Library/Containers/com.docker.docker &>/dev/null ;: 57 | unlink ~/Library/Containers/com.docker.helper &>/dev/null ;: 58 | unlink ~/.docker &>/dev/null ;: 59 | 60 | # Delete directories if they were not symlinks 61 | rm -rf ~/Library/Containers/com.docker.{docker,helper} ~/.docker &>/dev/null ;: 62 | 63 | # Create destination directories in case they don't already exist 64 | mkdir -p "$docker_destination"/{com.docker.{docker,helper},.docker} 65 | 66 | # Make symlinks 67 | ln -sf "$docker_destination"/com.docker.docker ~/Library/Containers/com.docker.docker 68 | ln -sf "$docker_destination"/com.docker.helper ~/Library/Containers/com.docker.helper 69 | ln -sf "$docker_destination"/.docker ~/.docker 70 | 71 | # Start Docker for Mac 72 | open -g -a Docker 73 | 74 | echo "${cyan}Docker${blue} is now starting! Please report any bug to: ${cyan}aguiot--${reset}" 75 | -------------------------------------------------------------------------------- /module00/resources/psycopg2_documentation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module00/resources/psycopg2_documentation.pdf -------------------------------------------------------------------------------- /module01.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module01.pdf -------------------------------------------------------------------------------- /module01/assets/dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module01/assets/dashboard.png -------------------------------------------------------------------------------- /module01/ex00/ex00.md: -------------------------------------------------------------------------------- 1 | # Exercise 00 - The setup. 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex00 | 6 | | Files to turn in : | | 7 | | Forbidden function : | None | 8 | | Remarks : | n/a | 9 | 10 | Let's start simple: 11 | 12 | * Download and install Elasticsearch. 13 | - Go to [Elasticsearch download](https://www.elastic.co/downloads/past-releases). 14 | - In the product filter select `Elasticsearch`. 15 | - Choose the version 7.5.2 and download the tar.gz file. 
16 | * Unzip the file 17 | * You should have several directories: 18 | 19 | | Directory | Description | 20 | | --------:| ------------------------------------------------------------------------------------------------- | 21 | | `/bin` | Binary scripts including elasticsearch to start a node and elasticsearch-plugin to install plugins | 22 | | `/config` | Configuration files including elasticsearch.yml | 23 | | `/data` | The location of the data files of each index and shard allocated on the node | 24 | | `/jdk` | The bundled version of OpenJDK from the JDK maintainers (GPLv2+CE) | 25 | | `/lib` | The Java JAR files of Elasticsearch | 26 | | `/logs` | Elasticsearch log files location | 27 | | `/modules` | Contains various Elasticsearch modules | 28 | | `/plugins` | Plugin files location. Each plugin will be contained in a subdirectory | 29 | 30 | * Start your cluster by running the `./elasticsearch` in the `/bin` folder and wait a few seconds for the node to start. 31 | 32 | Ok so now your cluster should be running and listening on `http://localhost:9200`. 33 | Elasticsearch works with a REST API, which means that to query your cluster you just have to send an HTTP request to the good endpoints (we will come to that). 34 | 35 | Check you can access the cluster: 36 | 37 | ``` 38 | curl http://localhost:9200 39 | ``` 40 | You can do the same in a web browser. 41 | 42 | You should see something like this: 43 | 44 | ``` 45 | { 46 | "name" : "e3r4p23.42.fr", 47 | "cluster_name" : "elasticsearch", 48 | "cluster_uuid" : "SZdgmzxFSnW2IMVxvVj-9w", 49 | "version" : { 50 | "number" : "7.5.2", 51 | "build_flavor" : "default", 52 | "build_type" : "tar", 53 | "build_hash" : "e9ccaed468e2fac2275a3761849cbee64b39519f", 54 | "build_date" : "2019-11-26T01:06:52.518245Z", 55 | "build_snapshot" : false, 56 | "lucene_version" : "8.3.0", 57 | "minimum_wire_compatibility_version" : "6.8.0", 58 | "minimum_index_compatibility_version" : "6.0.0-beta1" 59 | }, 60 | "tagline" : "You Know, for Search" 61 | } 62 | ``` 63 | 64 | If not, feel free to look at the doc :) (or ask your neighbors, or google...) [Elasticsearch setup](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html). 65 | 66 | Now stop the cluster (ctrl-c). Change the configuration so that your cluster name is `"my-cluster"` and the node name is `"node1"`. 67 | 68 | Restart your cluster and check the new names with 69 | 70 | ``` 71 | curl http://localhost:9200 72 | ``` 73 | -------------------------------------------------------------------------------- /module01/ex01/ex01.md: -------------------------------------------------------------------------------- 1 | # Exercise 01 - The CRUDité. 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex01 | 6 | | Files to turn in : | create-doc.sh; ex01-queries.txt | 7 | | Forbidden function : | None | 8 | | Remarks : | n/a | 9 | 10 | Now we are going to see how to perform basic CRUD operation on Elasticsearch. 11 | 12 | ## Create 13 | 14 | I'm gonna make it easy for you: Here is a curl request that creates a document with id=1 into an index named "twitter" and containing 3 fields: 15 | 16 | ``` 17 | curl -X PUT "http://localhost:9200/twitter/_doc/1?pretty" -H 'Content-Type: application/json' -d' 18 | { 19 | "user" : "popol", 20 | "post_date" : "02 12 2015", 21 | "message" : "trying out elasticsearch" 22 | } 23 | ' 24 | ``` 25 | 26 | So, what do we have here? 
27 | 28 | HTTP PUT method (remember, Elasticsearch use a REST API) followed by: 29 | 30 | `ip_of_the_cluster:9200/index_name/_doc/id_of_the_document`, then a header specifying the content-type as a json, and finally the json. 31 | Every document in Elasticsearch is a json, every request to Elasticsearch is sent as a json within an HTTP request. 32 | 33 | Try it out, you should get an answer from the server confirming the creation of the document. 34 | 35 | Let's see another way to create a document: Modify the above request to create a document in the twitter index but this time without specifying the id of the document. The document shall have the following field: 36 | 37 | ``` 38 | { 39 | "user" : "popol", 40 | "post_date" : "20 01 2019", 41 | "message" : "still trying out Elasticsearch" 42 | } 43 | ``` 44 | 45 | **Hint**: try POST instead of PUT 46 | 47 | Run the following command and check you have two hits. 48 | 49 | ``` 50 | curl -XGET "http://localhost:9200/twitter/_search"\?pretty 51 | ``` 52 | 53 | Look at the `_id` of each document and try to understand why those value. 54 | 55 | Save your curl request to a file named `create-doc.sh`. The file shall be executable for the correction and it shall create the two documents above. 56 | 57 | Ok nice, you have just created your two first documents and your first index !! 58 | However, using curl is not very convenient right... Wouldn't it be awesome to have a nice dev tool to write out those requests... Kibana !! 59 | Kibana is the visualization tool of the Elastic Stack. What's the Elastic Stack? -> [ELK stack](https://www.elastic.co/what-is/elk-stack). 60 | 61 | ### Kibana Install 62 | 63 | Let's install Kibana! 64 | 65 | As you did for Elasticsearch, on the same link 66 | - download Kibana v7.5.2 67 | - Unzip the file with tar and run it. 68 | - Wait until Kibana is started. 69 | 70 | You should see something like : 71 | `[16:09:00.957] [info][server][Kibana][http] http server running at http://localhost:5601` 72 | 73 | - Open your browser and go to `http://localhost:5601` 74 | - Click on the dev tool icon on the navigation pane (3rd before last) 75 | Here you can write your query to the cluster in a much nicer environment than curl. You should have a pre-made match_all query. Run it, in the result among other stuff, you should see the documents you have created. 76 | 77 | Try to create the following two documents in Kibana, still in the twitter index: 78 | 79 | ``` 80 | { 81 | "user" : "mimich", 82 | "post_date" : "31 12 2015", 83 | "message" : "Trying out Kibana" 84 | } 85 | ``` 86 | 87 | and: 88 | 89 | ``` 90 | { 91 | "user" : "jean mimich", 92 | "post_date" : "01 01 2016", 93 | "message" : "Trying something else" 94 | } 95 | ``` 96 | 97 | Got it? Great! From now on, all queries shall be done in Kibana. Save every query you run in Kibana in the ex01-queries.txt file. You will be evaluated on this file. 98 | 99 | ## Read 100 | 101 | Now that we got the (very) basis of how to query Elasticsearch, I'm gonna let you search the answer on our own. 102 | 103 | - Write a search query that returns all the documents contained in the 'twitter' index. 104 | You should get 4 hits 105 | - Write a search query that returns all the tweets from 'popol'. 106 | You should get 2 hits 107 | - Write a search query that returns all the tweets containing 'elasticsearch' in their message. 108 | You should get 2 hits 109 | - A little more complicated: write a search query that returns all the tweets from 'mimich' (and only this user!). 
110 | 111 | You should get 1 hit. 112 | 113 | Save all the queries in `ex01-queries.txt`. 114 | 115 | ### Hints 116 | 117 | - look for the keyword field ;) 118 | - [strings are dead long live strings](https://www.elastic.co/fr/blog/strings-are-dead-long-live-strings) 119 | 120 | For help, please refer to the doc (or to your neighbors, or google) [query dsl](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) 121 | 122 | ## Update 123 | 124 | Update the document with id 1 as follow and change the value of the field "message" from "trying out elasticsearch" to "updating the document". 125 | If you did this correctly when you update the document you should see "\_version": 2 in the answer from the cluster 126 | 127 | Save the query in `ex01-queries.txt`. 128 | 129 | ## Delete 130 | 131 | - Run the following command: 132 | 133 | ``` 134 | POST _bulk 135 | {"index": {"_index": "test_delete", "_id":1}} 136 | {"name": "clark kent", "aka": "superman"} 137 | {"index": {"_index": "test_delete"}} 138 | {"name": "louis XV", "aka": "le bien aimé"} 139 | ``` 140 | 141 | It is a bulk indexing algorithm, it allows to index several documents in only one request. 142 | - Delete the document with id 1 of the test_delete index 143 | - Delete the whole test_delete index 144 | 145 | Save all the queries in `ex01-queries.txt`. 146 | -------------------------------------------------------------------------------- /module01/ex02/ex02.md: -------------------------------------------------------------------------------- 1 | # Exercise 02 - Your first Index Your first Mapping. 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex02 | 6 | | Files to turn in: | ex02-queries.txt | 7 | | Forbidden functions: | None | 8 | | Remarks: | you have to put all the request you run in Kibana in ex02-queries.txt | 9 | 10 | 11 | At this point, you should have 4 documents in your twitter index from the previous exercise. You are now going to learn about the mapping. 12 | 13 | Using NoSQL doesn't mean you should not structure your data. If you want to optimize your cluster you must define the mapping of your index. We will see why. You may have noticed that in the previous exercise, every time you created a document, Elasticsearch automatically created the index for you. Well, it also generates a default mapping for this index. 14 | 15 | However, the default mapping is not ideal... 16 | 17 | - We would like to retrieve all the tweets posted in 2016 and beyond. Try the following search query: 18 | 19 | ``` 20 | GET twitter/_search 21 | { 22 | "query": { 23 | "range": { 24 | "post_date": { 25 | "gte": "01 01 2016" 26 | } 27 | } 28 | } 29 | } 30 | ``` 31 | 32 | Do you have good results? No... there is a mapping issue. 33 | 34 | - Your objective now is to create a new index called 'twitter_better_mapping' that contains the same 4 documents as the 'twitter' index but with a mapping that comply with those four requirements: 35 | 36 | The following query should only return the tweet posted in 2016 and beyond (2 hits): 37 | 38 | ``` 39 | GET twitter_better_mapping/_search 40 | { 41 | "query": { 42 | "range": { 43 | "post_date": { 44 | "gte": "01 01 2016" 45 | } 46 | } 47 | } 48 | } 49 | ``` 50 | 51 | The following query should return only 1 hit. 
52 | 53 | ``` 54 | GET twitter_better_mapping/_search 55 | { 56 | "query": { 57 | "match": { 58 | "user": "mimich" 59 | } 60 | } 61 | } 62 | ``` 63 | 64 | The mapping must be strict (if you try to index a document with a field not defined in the mapping, you get an error). 65 | 66 | The size of the twitter_better_mapping index should be less than 5 kb (with four documents). What was the size of the original index? 67 | 68 | ### Hints 69 | 70 | - You can't modify the mapping of an existing index, so you have to define the mapping when you create the index, before indexing any document in the index. 71 | - The easiest way to write a mapping is to start from the default mapping Elasticsearch creates. Index a document sample into a temporary index, retrieve the default mapping of this index and copy and modify it to create a new index. Here you already have the twitter index with a default mapping. Write a request to get this mapping and start from here. 72 | - You will notice that by default ES creates two fields for every string field: my-field as "text" and my-field.keyword as "keyword" type. The "text" type takes computing power at indexing and costs storage space. The "keyword" type is light but does not offer all the search power of the "text" type. Some fields might need both, some might need just one... optimize your index! 73 | - Once you have created the new index with a better mapping, you can index the documents manually as you did in the previous exercise or you can use the reindex API (see Elastic Doc). 74 | -------------------------------------------------------------------------------- /module01/ex03/ex03.md: -------------------------------------------------------------------------------- 1 | # Exercise 03 - Text Analyzer. 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex03 | 6 | | Files to turn in: | ex03-queries.txt | 7 | | Forbidden functions: | None | 8 | | Remarks: | | 9 | 10 | 11 | So by now you already know that mapping a field as a keyword or as a text makes a big difference. This is because Elasticsearch analyses all the text fields at ingestion so the text is easier to search. 12 | 13 | - Let's see an example. Ingest the two following documents in an index named school. 14 | 15 | ``` 16 | POST school/_doc 17 | { 18 | "school": "42", 19 | "text" : "42 is a school where you write a lot of programs" 20 | } 21 | 22 | POST school/_doc 23 | { 24 | "school": "ICART", 25 | "text" : "The school of art management and art market management" 26 | } 27 | ``` 28 | 29 | We created an index that contains random schools. Let's look for programming schools in it. 30 | 31 | 32 | - Try this request. 33 | 34 | ``` 35 | GET school/_search 36 | { 37 | "query": 38 | { 39 | "match": { 40 | "text": "programming" 41 | } 42 | } 43 | } 44 | ``` 45 | 46 | No results... and yet, you have probably noticed that there is a document talking about a famous programming school. 47 | It's a shame that we can't get it when we execute our request using the keyword `programming`. 48 | 49 | 50 | - Your mission is to rectify this! Modify the school index mapping to create a shool_bis index that returns the good result to the following query: 51 | 52 | ``` 53 | GET school_bis/_search 54 | { 55 | "query": 56 | { 57 | "match": { 58 | "text": "programming" 59 | } 60 | } 61 | } 62 | ``` 63 | 64 | ### Hints 65 | 66 | - Look for the text analyzer section in the documentation. 67 | - There is a key notion to understand Elasticsearch: the **inverted index**. 
Take the time to understand how the analyzer creates **token** and how this works with the inverted index. 68 | -------------------------------------------------------------------------------- /module01/ex04/ex04.md: -------------------------------------------------------------------------------- 1 | # Exercise 04 - Ingest dataset 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex04 | 6 | | Files to turn in: | | 7 | | Forbidden functions: | None | 8 | | Remarks: | n/a | 9 | 10 | Now that you know the basics of how Elasticsearch works, you are ready to work with a real dataset !! 11 | And to make this fun you gonna use the same dataset as for the SQL module so you can understand the differences between SQL and NoSQL. 12 | 13 | There are many ways you can ingest data into Elasticsearch. In the previous exercise, you've seen how to create a document manually. 14 | You could do this for every line of the CSV, with a python script for instance that parses the CSV and create a document for each line. There is an Elasticsearch client API for many languages that helps to connect to the cluster (to avoid writing HTTP requests in python): [Elasticsearch client](https://www.elastic.co/guide/en/elasticsearch/client/index.html). 15 | 16 | But there is an easier way: Logstash. Logstash is the ETL (Extract Transform Load) tool of the Elasticsearch stack. We don't want you to spend to much time learning how to use Logstash so we will guide you step by step: 17 | 18 | - Download [logstash](https://www.elastic.co/downloads/logstash) 19 | 20 | - Un-tar the file (still in your `/goinfre`). 21 | 22 | - Move the 'ingest-pipeline.conf' to the `config/` in the logstash directory (unzip appstore_games.csv.zip if you have not already). 23 | 24 | This file describes all the operations that logstash shall do to ingest the data. Let's take a look at the file: 25 | 26 | The file is split into three parts : 27 | - `input`: definition of the inputs. 28 | - `filter`: operation to perform on the inputs. 29 | - `output`: definition of the outputs. 30 | 31 | ``` 32 | input { 33 | file { 34 | path => "/absolute/path/to/appstore_games.csv" 35 | start_position => "beginning" 36 | sincedb_path => "sincedb_file.txt" 37 | } 38 | } 39 | ``` 40 | 41 | - `file`: our input will be a file, could be something else (stdin, data stream, ...). 42 | - `path`: location of the input file. 43 | - `start_position`: where to start reading the file. 44 | - `sincedb_path`: logstash stores its position in the input file, so if new lines are added, only new lines will be processed (ie, if you want to re-run the ingest, delete the sincedb_file). 45 | 46 | ``` 47 | filter { 48 | csv { 49 | separator => "," 50 | columns => ["URL","ID","Name","Subtitle","Icon URL","Average User Rating","User Rating Count","Price","In-app Purchases","Description","Developer","Age Rating","Languages","Size","Primary Genre","Genres","Original Release Date","Current Version Release Date"] 51 | remove_field => ["message", "host", "path", "@timestamp"] 52 | skip_header => true 53 | } 54 | mutate { 55 | gsub => [ "Description", "\\n", " 56 | "] 57 | gsub => [ "Description", "\\t", " "] 58 | gsub => [ "Description", "\\u2022", "•"] 59 | gsub => [ "Description", "\\u2013", "–"] 60 | split => { "Genres" => "," } 61 | split => { "Languages" => "," } 62 | } 63 | } 64 | ``` 65 | 66 | - `csv`: we use the csv plugin to parse the file. 67 | - `separator`: split each line on the comma. 
68 | - `column`: name of the columns (will create one field in the index mapping per column) 69 | - `remove_field`: here we remove 4 fields, those 4 fields are added by logstash to the raw data but we don't need them. 70 | - `skip_header`: skip the first line 71 | - `mutate`: When logstash parse the field it escape any '\' it found. This changes a '\\n', '\\t', '\\u2022' and '\\u2013' into a '\\\\n', '\\\\t', '\\\\u2022', '\\\\u2013' respectively, which is not what we want. The mutate plugin is used here to fix this. 72 | - `gsub`: substitute '\\n' by a new line and the '\\\\u20xx' by its unicode character. 73 | - `split`: split the "Genres" and "Languages" field on the "," instead of a single string like "FR, EN, KR" we will have ["EN", "FR", "KR] 74 | 75 | ``` 76 | output { 77 | elasticsearch { 78 | hosts => "http://localhost:9200" 79 | index => "appstore_games_tmp" 80 | } 81 | stdout { 82 | codec => "dots" 83 | } 84 | } 85 | ``` 86 | - `elasticsearch`: we want to output to an Elasticsearch cluster. 87 | - `hosts`: Ip of the cluster. 88 | - `index`: Name of the index where to put the data (index will be created if not existing, otherwise data are added to the cluster). 89 | - `stdout`: we also want an output on stdout to follow the ingestion process. 90 | - `codec => "dots"`: print one dot '.' for every document ingested. 91 | 92 | So, all we do here is create one document for each line of the csv. Then, for each line, split on comma and put the value in a field of the document with the name defined in 'columns'. Exactly what you would have done with Python but in much less line of code. 93 | 94 | Now, let's run Logstash: 95 | - To run logstash you will need to install a JDK or JRE. On a 42 computer, you can do this from the MSC. 96 | - Edit the ingest-pipeline.conf with the path to the appstore_games.csv 97 | - `./bin/logstash -f config/ingest-pipeline.conf` 98 | 99 | You should have 17007 documents in your index. 100 | -------------------------------------------------------------------------------- /module01/ex05/ex05.md: -------------------------------------------------------------------------------- 1 | # Exercise 05 - Search - Developers 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex05 | 6 | | Files to turn in: | ex05-queries.txt | 7 | | Forbidden functions: | None | 8 | | Remarks : | | 9 | 10 | Let's start with some queries you already did for the SQL module. 11 | 12 | We are looking for developers involved in games released before 01/08/2008 included and updated after 01/01/2018 included. 13 | 14 | - Write a query that returns the games matching this criterion. 15 | 16 | Your query shall also filter the "_source" to only return the following fields: `Developer`, `Original Release Date`, `Current Version Release Date`. 17 | 18 | You should get 3 hits. 
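If you prefer querying the cluster from Python instead of Kibana, the `_source` filtering looks like the sketch below (the index name comes from the previous exercise; `match_all` is only a placeholder for the query you have to write):

```python
import requests

body = {
    "_source": ["Developer", "Original Release Date", "Current Version Release Date"],
    "query": {
        "match_all": {}  # placeholder: replace with your own query
    }
}

resp = requests.post("http://localhost:9200/appstore_games_tmp/_search", json=body)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```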
19 | 20 | 21 | ### Hints 22 | 23 | - You might need to adjust the mapping of your index 24 | - Create a new index and use the reindex API to change the mapping rather than using logstash to re-ingest the CSV 25 | - The "bool" query will be useful ;) 26 | -------------------------------------------------------------------------------- /module01/ex06/ex06.md: -------------------------------------------------------------------------------- 1 | # Exercise 06 - Search - Name_Lang 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex06 | 6 | | Files to turn in : | ex06-queries.txt | 7 | | Forbidden function : | None | 8 | | Remarks: | | 9 | 10 | 11 | We are looking for the Name and Language of games between 5 and 10 euros. 12 | 13 | - Write a query that returns the games matching this criterion. 14 | 15 | Your query shall filter the "_source_" to return the following fields only: "Name", "Languages", "Price". 16 | 17 | You should get 192 hits. 18 | 19 | ### Hints 20 | 21 | - You might need to adjust the mapping of your index 22 | - Create a new index and use the reindex API to change the mapping rather than using logstash to re-ingest the CSV. 23 | -------------------------------------------------------------------------------- /module01/ex07/ex07.md: -------------------------------------------------------------------------------- 1 | # Exercise 07 - Search - Game 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex07 | 6 | | Files to turn in: | ex07-queries.txt | 7 | | Forbidden functions: | None | 8 | | Remarks : | | 9 | 10 | Elasticsearch was initially designed for full-text search, so let's try it. 11 | 12 | - I'm looking for a game (and no other genre will be accepted). I'm a big fan of starcraft and I like real-time strategy. Can you write a query to find me a game? 13 | 14 | It's a good time to look at how Elasticsearch scores the documents so you can tune your query to increase the result relevance. 15 | 16 | There isn't one good answer to this exercise, many answers are possible. Just make sure the top hits you got are relevant to what I'm searching for! 17 | -------------------------------------------------------------------------------- /module01/ex07bis/ex07bis.md: -------------------------------------------------------------------------------- 1 | # Exercise 07-bis - Search - Vibrant World 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex07bis | 6 | | Files to turn in : | ex07bis-queries.txt | 7 | | Forbidden function : | None | 8 | | Remarks : | | 9 | 10 | A little more of full-text search: 11 | 12 | - Two games speak of a Vibrant World and have a link to facebook.com (or fb.com) in their description. Find them! 13 | 14 | ### Hint 15 | 16 | The distance between the word "vibrant" and "world" longer than a few words... 17 | 18 | ## BONUS 19 | 20 | There are three games, special Kudos if you find the 3rd one! 21 | -------------------------------------------------------------------------------- /module01/ex08/ex08.md: -------------------------------------------------------------------------------- 1 | # Exercise 08 - Aggregation 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex08 | 6 | | Files to turn in : | ex08-queries.txt | 7 | | Forbidden function : | None | 8 | | Remarks : | | 9 | 10 | Let's do some aggregation! 11 | 12 | - Write a query that returns the top 10 developers in terms of games produced. 
13 | 14 | - Set the size to 0 so the query returns only the aggregation, not the hits. -------------------------------------------------------------------------------- /module01/ex09/ex09.md: -------------------------------------------------------------------------------- 1 | # Exercise 09 - Aggregation in Aggregation 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory : | ex09 | 6 | | Files to turn in : | ex09-queries.txt | 7 | | Forbidden function : | None | 8 | | Remarks : | | 9 | 10 | We would like to know what is the most represented game "Genre" (top 10) and for each of those "genre" the repartition of the "Average User Rating" with a bucket interval of one (ie: for each Genre the number of game with an Avg User Rating of 1, of 2, 3, 4, and 5). 11 | 12 | - This must be done in a single query. -------------------------------------------------------------------------------- /module01/ex10/ex10.md: -------------------------------------------------------------------------------- 1 | # Exercise 10 - Kibana 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex10 | 6 | | Files to turn in: | ex10-queries.txt | 7 | | Forbidden functions: | None | 8 | | Remarks: | | 9 | 10 | 11 | Time to explore Kibana a little bit more. 12 | 13 | 14 | - Your goal is to create a Dashboard with the following visualizations: 15 | - A plot showing the number of games released (Y axis) over time (X axis) 16 | - A histogram that counts the number of games released each year, and for each year the count of the "average user rating" by an interval of 1 17 | - A Pie Chart showing the repartition of Genres 18 | - A cloud of words showing the top developers 19 | 20 | - Once your dashboard has been created, explore the possibilities of Kibana (click on the top developer in the cloud of words for instance). 21 | 22 | 23 | Your Dashboard should look like something like this: 24 | 25 | 26 | ![Dashboard](../assets/dashboard.png){width=550px} 27 | 28 | ### Hints 29 | 30 | - You need to create an index pattern first (Go in the management menu, then index pattern). 31 | - Create each visualization in the visualization tab and then create the dashboard in the dashboard tab. 32 | -------------------------------------------------------------------------------- /module01/module01.md: -------------------------------------------------------------------------------- 1 | # Module01 - Elasticsearch, Logstash, Kibana 2 | 3 | In the module, you will learn how to use a NoSQL database: Elasticsearch. 4 | Wait... Elasticsearch is a database? Well, not exactly it is more than that. It is defined as a search and analytics engine. But let's keep it simple for now, consider it as a database, we will see the rest later. 5 | 6 | ## Notions of the module 7 | 8 | Create an Elasticsearch cluster, create index and mappings, ingest document, search & aggregate, create visuals with Kibana. 9 | 10 | In the first part of this module (ex00 to ex03) you will learn the basics of Elasticsearch. Then you will apply this to a real dataset. 11 | 12 | ## General rules 13 | 14 | * The exercises are ordered from the easiest to the hardest. 15 | * Your exercises are going to be evaluated by someone else, so make sure that your variable names and function names are appropriate and civil. 16 | * Your manual is the internet. 17 | * You can also ask any question in the dedicated channel in Slack: **[42ai slack](https://42-ai.slack.com)**. 
18 | * If you find any issue or mistakes in the subject please create an issue on our dedicated repository on Github: **[Github issues](https://github.com/42-AI/bootcamp_data-engineering/issues)**. 19 | 20 | ## Foreword 21 | 22 | Did you know that Elasticsearch helps you find your soulmate? 23 | 24 | With more than 26 million swipes daily, Tinder connects more people one-on-one and in real-time than any other mobile app in the world. Behind the minimalist UI and elegant "swipe right, swipe left" that Tinder pioneered are tremendous challenges of data science, machine learning, and global scalability. 25 | 26 | Hear how Tinder relies on the Elastic Stack to analyze, visualize, and predict not only a) which people a user will swipe right on, or b) which people will swipe right on that user, but c) when there's a mutual swipe match. Tinder's VP of Engineering will describe how the service is growing into a global platform for social discovery in many facets of life. 27 | 28 | If you wanna know how the magic works: **[elastic tinder](https://www.elastic.co/elasticon/conf/2017/sf/tinder-using-the-elastic-stack-to-make-connections-around-the-world)**. 29 | 30 | ## Helper 31 | 32 | * Your best friend for the module: **[Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html)**. 33 | * We recommend using the `/goinfre` directory for this module as you will need ~3Go for the Elasticsearch cluster. But you are free to do as you wish. Note that using `/sgoinfre` won't work. 34 | * Keep in mind that the `/goinfre` is a local & temporary directory. So if you change computer you will lose your work, if you log out you might lose your work. 35 | 36 | ### Exercise 00 - The setup. 37 | ### Exercise 01 - CRUDité. 38 | ### Exercise 02 - Your first Index Your first Mapping. 39 | ### Exercise 03 - Text Analyzer 40 | ### Exercise 04 - Eat them all! 
41 | ### Exercise 05 - Search - Developers 42 | ### Exercise 06 - Search - Name_lang 43 | ### Exercise 07 - Search - Game 44 | ### Exercise 07bis - Search - Vibrant World 45 | ### Exercise 08 - Aggregation 46 | ### Exercise 09 - Aggregation in Aggregation 47 | ### Exercise 10 - Kibana & Monitoring 48 | -------------------------------------------------------------------------------- /module01/resources/ingest-pipeline.conf: -------------------------------------------------------------------------------- 1 | input 2 | { 3 | file 4 | { 5 | path => "/path/to/the/appstore_games.csv" 6 | start_position => "beginning" 7 | sincedb_path => "my-sincedb" 8 | } 9 | } 10 | 11 | filter 12 | { 13 | csv 14 | { 15 | separator => "," 16 | columns => ["URL","ID","Name","Subtitle","Icon URL","Average User Rating","User Rating Count","Price","In-app Purchases","Description","Developer","Age Rating","Languages","Size","Primary Genre","Genres","Original Release Date","Current Version Release Date"] 17 | remove_field => ["message", "host", "path", "@timestamp"] 18 | skip_header => true 19 | } 20 | 21 | mutate 22 | { 23 | gsub => [ "Description", "\\n", " 24 | "] 25 | gsub => [ "Description", "\\u2022", "•"] 26 | gsub => [ "Description", "\\u2013", "–"] 27 | gsub => [ "Description", "\\t", " "] 28 | split => { "Genres" => "," } 29 | split => { "Languages" => "," } 30 | } 31 | } 32 | 33 | 34 | output 35 | { 36 | elasticsearch 37 | { 38 | hosts => "http://localhost:9200" 39 | index => "appstore_games_tmp" 40 | } 41 | 42 | stdout 43 | { 44 | codec => "dots" 45 | } 46 | } 47 | -------------------------------------------------------------------------------- /module02.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02.pdf -------------------------------------------------------------------------------- /module02/assets/access_key.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/access_key.png -------------------------------------------------------------------------------- /module02/assets/aws_regions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/aws_regions.png -------------------------------------------------------------------------------- /module02/assets/terraform_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_1.png -------------------------------------------------------------------------------- /module02/assets/terraform_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_2.png -------------------------------------------------------------------------------- /module02/assets/terraform_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_3.png 
-------------------------------------------------------------------------------- /module02/assets/terraform_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_4.png -------------------------------------------------------------------------------- /module02/assets/terraform_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_5.png -------------------------------------------------------------------------------- /module02/assets/terraform_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_6.png -------------------------------------------------------------------------------- /module02/ex00/ex00.md: -------------------------------------------------------------------------------- 1 | # Exercise 00 - Setup 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex00 | 6 | | Files to turn in: | | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | In this exercise we are going to setup our account to start working with a cloud provider. 11 | 12 | Don't worry! Even if you enter your card number, this module should not cost you anything. First, indeed, AWS as a free tier usage (check if it the case also for your cloud provider) that allows you to use a small amount of AWS resources for free. This will be sufficient enough for what you are going to do today. By the end of the day, you will have to entirely destroy your infrastructure (don't keep things running) !!! 13 | 14 | ## Exercise 15 | 16 | - Create an account on your cloud provider (all the exercise were made using AWS but you can choose another cloud provider). 17 | - Set up a billing alarm linked to your email that will alert you if the cost of your infrastructure exceeds 1\$. 18 | - Create a new administrator user separated from your root account (you will need to use this user for all the exercises). Save the credentials linked to the administrator user into a file called `credentials.csv`. 19 | 20 | All the mechanisms we are creating now will ensure your access is secured and will allow you to quickly be alerted if you forgot to destroy your infrastructure. -------------------------------------------------------------------------------- /module02/ex01/ex01.md: -------------------------------------------------------------------------------- 1 | # Exercise 01 - Storage 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex01 | 6 | | Files to turn in: | presigned_url.sh | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | 11 | ## AWS CLI 12 | 13 | We are going to use the AWS command-line interface. The first thing we need to do is install it. 14 | 15 | You should be able to run `aws --version` now. 16 | 17 | We can setup our AWS account for the CLI with the command `aws configure`. You will need to enter: 18 | 19 | - access key : in your `credentials.csv` file 20 | - secret access key : in your `credentials.csv` file 21 | - region : `eu-west-1` (Ireland) 22 | - default output format : `None` 23 | 24 | The AWS CLI is now ready! 
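You can quickly check that the credentials were picked up correctly, for example with a small boto3 snippet (boto3 reads the same `~/.aws/credentials` file that `aws configure` just wrote):

```python
import boto3

# STS tells you which account and IAM user the configured credentials belong to
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print(identity["Account"], identity["Arn"])
```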
25 | 26 | ## S3 bucket creation 27 | 28 | Amazon S3 provides developers and IT teams with secure, durable, and highly scalable cloud storage. Amazon S3 is easy-to-use object storage with a simple web service interface that you can use to store and retrieve any amount of data from anywhere on the web. 29 | 30 | A bucket is a container (web folder) for objects (files) stored in Amazon S3. Every Amazon S3 object is contained in a bucket. Buckets form the top-level namespace for Amazon S3, and bucket names are global. This means that your bucket names must be unique globally (across all AWS accounts). The reason for that is that when we create a bucket, it is going to have a web address (ex: `https://s3-eu-west-1.amazonaws.com/example`). 31 | 32 | Even though the namespace for Amazon S3 buckets is global, each Amazon S3 bucket is created in a specific region that you choose. This lets you control where your data is stored. 33 | 34 | With your free usage you can store up to 5 GB of data! 35 | 36 | ## Exercise 37 | 38 | In this exercise, you will learn to create an S3 bucket and use aws-cli. 39 | 40 | - Connect to the console of your administrator user. 41 | - Create an S3 bucket whose name starts with the prefix `module02-` and ends with whatever numbers you want. 42 | - Using aws-cli, copy the `appstore_games.csv` file to the bucket. You can check the file was correctly copied using the AWS console. 43 | - Using aws-cli, create a presigned URL allowing you to download the file. Your presigned URL must have an expiry time of 10 minutes. Your AWS CLI command must be stored in the `presigned_url.sh` script. 44 | -------------------------------------------------------------------------------- /module02/ex02/ex02.md: -------------------------------------------------------------------------------- 1 | # Exercise 02 - Compute 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex02 | 6 | | Files to turn in: | os_name.txt | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing us to quickly scale capacity (up or down) depending on our needs. 11 | 12 | Amazon EC2 allows you to acquire compute through the launching of virtual servers called instances. When you launch an instance, you can make use of the compute as you wish, just as you would with an on-premises server (local servers). Because you are paying for the computing power of the instance, you are charged per hour while the instance is running. When you stop the instance, you are no longer charged. 13 | 14 | Two concepts are key to launching instances on AWS: 15 | - **instance type** : the amount of virtual hardware dedicated to the instance. 16 | - **AMI (Amazon Machine Image)** : the software loaded on the instance (Amazon Linux, Ubuntu, Debian, ...). 17 | 18 | The instance type defines the virtual hardware supporting an Amazon EC2 instance. There are dozens of instance types available, varying in the following dimensions: 19 | 20 | - Virtual CPUs (vCPUs) 21 | - Memory 22 | - Storage (size and type) 23 | - Network performance 24 | 25 | Instance types are grouped into families based on the ratio of these values to each other. Today we are going to use t2.micro instances (they are included in the free usage)! 26 | 27 | One of the impressive features of EC2 is autoscaling. If you have a website with 100 users, it can run on a small instance. If the next day you have 10,000 users, your server can scale up by spinning up new EC2 instances to handle the new load!
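The exercise below uses the AWS console, but for reference, here is a hedged sketch of the equivalent programmatic launch with boto3 (the SDK used from exercise 03 onward); the AMI ID and key pair name are placeholders, not real values:

```python
# Illustrative only: launch a t2.micro programmatically with boto3.
# The AMI ID and key pair name are placeholders you would replace with your own.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",  # an Amazon Linux 2 AMI ID for your region
    InstanceType="t2.micro",
    KeyName="my-key-pair",            # hypothetical key pair name
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```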
28 | 29 | ## Exercise 30 | 31 | In this exercise, you will learn how to create and connect to an EC2 instance. If you are on another cloud provider, aim for Linux-based instances with a very small size (if it can be free, even better). 32 | 33 | Follow these steps for the exercise: 34 | - launch an ec2 instance with the AMI : `Amazon Linux 2 AMI`. 35 | - choose `t2.micro` as instance type. 36 | - create a key pair. 37 | - connect via ssh to your instance using your key pair. 38 | - get and save the OS name of your instance in the `os_name.txt` file. 39 | - terminate your instance. 40 | 41 | Within minutes we have created a server and we can work on it! -------------------------------------------------------------------------------- /module02/ex03/ex03.md: -------------------------------------------------------------------------------- 1 | # Exercise 03 - Flask API - List & Delete 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex03 | 6 | | Files to turn in: | app.py, \*.py | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | Before getting into AWS infrastructure, we are going to discover how to interact with AWS resources using a Python SDK (Software Development Kit) called boto3. We are going to work with a Python micro-framework called Flask to create an API (a programmatic interface) to interact with your s3 bucket. For now, the API will be built locally to ease development. 11 | 12 | NB: to simplify the following exercises, we are going to use Flask's built-in development server directly. If we wanted a more production-ready application, we would add a web server like Nginx/Apache linked with Gunicorn/WSGI. 13 | 14 | ## Exercise 15 | 16 | Create a Flask application `app.py` with three routes: 17 | 18 | - **`/`** 19 | - **status** : `200` 20 | - **message** : `Successfully connected to module02 upload/download API` 21 | - **`/list_files`** : 22 | - **status** : `200` 23 | - **message** : `Successfully listed files on s3 bucket ''` 24 | - **content** : list of files within the s3 bucket 25 | - **`/delete/`** : 26 | - **status** : `200` 27 | - **message** : `Successfully deleted file '' on s3 bucket ''` 28 | 29 | The content you return with your Flask API must be json formatted. You should use boto3 to interact with the s3 bucket you previously created (`module02-...`). -------------------------------------------------------------------------------- /module02/ex04/ex04.md: -------------------------------------------------------------------------------- 1 | # Exercise 04 - Flask API - Download & Upload 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex04 | 6 | | Files to turn in: | app.py, \*.py | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | 11 | We will continue to work on our Flask API and add new functionalities. This time we will work on file download and upload. In order to upload and download files, we are going to use something we have already used: presigned URLs!
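To make the mechanism concrete before you wire it into the API, here is a minimal boto3 sketch of both kinds of presigned URL; the bucket and key names are placeholders and error handling is omitted:

```python
# Sketch: generating presigned URLs with boto3 (bucket and key names are placeholders).
import boto3

s3 = boto3.client("s3")
BUCKET = "module02-123456"  # hypothetical bucket name

# URL that lets whoever holds it download the object, valid for 10 minutes.
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": BUCKET, "Key": "appstore_games.csv"},
    ExpiresIn=600,
)

# URL that lets the holder upload (HTTP PUT) an object under the given key.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": BUCKET, "Key": "new_file.csv"},
    ExpiresIn=600,
)

print(download_url)
print(upload_url)
```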
12 | 13 | ## Exercise 14 | 15 | Create a Flask application `app.py` with two more routes: 16 | 17 | - **`/download/`** : 18 | - **status** : `200` 19 | - **message** : `Successfully downloaded file '' on s3 bucket ''` 20 | - **content** : presigned url to download file 21 | - **`/upload/`** : 22 | - **status** : `200` 23 | - **message** : `Successfully uploaded file '' on s3 bucket ''` 24 | - **content** : presigned url to upload file 25 | 26 | The content you return with your Flask API has to be json formatted. You should use boto3 to interact with the s3 bucket you previously created (`module02-...`). -------------------------------------------------------------------------------- /module02/ex05/ex05.md: -------------------------------------------------------------------------------- 1 | # Exercise 05 - Client 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex05 | 6 | | Files to turn in: | client.py, app.py, \*.py | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | 11 | Our API is finished, but it is not very convenient to request by hand! To ease its use, we are going to create a client that will allow us to interact with the API more easily. 12 | 13 | ## Exercise 14 | 15 | Create a client `client.py` that will call the API you created and show the results in a more human-readable way. The client will have two parameters: 16 | 17 | - **`--ip`**: IP address of the API (the default IP must be defined as `0.0.0.0`) 18 | - **`--filename`**: file name to delete, download or upload. 19 | 20 | ... and the following options: 21 | 22 | - **`ping`**: call the route `/` of the API and print the message. 23 | - **`list`**: call the route `/list_files` of the API and show the files on the bucket. 24 | - **`delete`**: call the route `/delete/` of the API and delete a file on the bucket. 25 | - **`download`**: call the route `/download/` of the API and download a file from the bucket. 26 | - **`upload`**: call the route `/upload/` of the API and upload a file to the bucket. -------------------------------------------------------------------------------- /module02/ex06/ex06.md: -------------------------------------------------------------------------------- 1 | # Exercise 06 - IAM role 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex06 | 6 | | Files to turn in: | 00_variables.tf, 01_networking.tf, 07_iam.tf, 10_terraform.auto.tfvars | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | Terraform is a tool to deploy infrastructure as code. It can be used with multiple cloud providers (AWS, Azure, GCP, ...). We are going to use it to deploy our new API! 11 | 12 | As you already know, we are using the AWS free tier. However, if you let your server run for weeks you will have to pay. We want to avoid this possibility. That's why we are going to use a tool to automatically deploy and destroy our infrastructure: Terraform. 13 | 14 | All potentially critical data **MUST NOT** be deployed using infrastructure as code like Terraform. If it is, it may be destroyed accidentally, and you never want that to happen! 15 | 16 | ## Terraform install 17 | 18 | First, download the terraform software for macOS. 19 | 20 | ``` 21 | brew install terraform 22 | ``` 23 | 24 | You can now run `terraform --version`. Terraform is ready! 25 | 26 | Terraform is composed of three kinds of files: 27 | - `.tfvars` : terraform variables.
28 | - `.tf` : terraform infrastructure description. 29 | - `.tfstate` : describes all the parameters of the stack you applied (updated after each apply) 30 | 31 | You can run `terraform destroy` to delete your stack. 32 | 33 | Enough talking, let's dive into Terraform! 34 | 35 | **For all the following exercises**, all the resources that can be tagged must use the `project_name` variable with the following tag structure: 36 | - `Name`: `-` 37 | - `Project_name`: `` 38 | 39 | Variables must be specified in variable files! 40 | 41 | ## Exercise 42 | 43 | For this first exercise, you will have to use the default VPC (Virtual Private Cloud). A VPC emulates a network within the AWS infrastructure. This default VPC eases the use of AWS services like EC2 (you do not need to know anything about network setup). You will have to work in the Ireland region (this region can be changed depending on your cloud provider and your location). 44 | 45 | The main objective is to create an IAM role for an EC2 instance allowing it to use all actions on s3 buckets (list, copy, ...). In order to create a role in terraform you will have to create: 46 | - a role called `module02_s3FullAccessRole` 47 | - a profile called `module02_s3FullAccessProfile` 48 | - a policy called `module02_s3FullAccessPolicy` 49 | 50 | To test your role, you can create an EC2 instance and link your newly created role to it; if the AWS CLI works from the instance, the exercise is done. You must be able to destroy your stack entirely. -------------------------------------------------------------------------------- /module02/ex07/ex07.md: -------------------------------------------------------------------------------- 1 | # Exercise 07 - Security groups 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex07 | 6 | | Files to turn in: | 08_security_groups.tf, \*.tf, \*.auto.tfvars | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | As you already noticed, Flask uses port 5000. On EC2, all incoming and outgoing traffic is blocked by default (for security reasons). If we want to interact with our API, we will have to allow this traffic. In AWS, we can define traffic rules using security groups. The security group will then be associated with an EC2 instance. 11 | 12 | ## Exercise 13 | 14 | Create a security group that will allow: 15 | - `ssh` incoming traffic (we will use it in the next exercise) 16 | - `tcp` incoming traffic on port `5000` (to interact with our Flask API) 17 | - outgoing traffic to the whole internet 18 | 19 | To test the security group, you can associate it with a newly created EC2 instance (you will need to use an existing key pair). -------------------------------------------------------------------------------- /module02/ex08/ex08.md: -------------------------------------------------------------------------------- 1 | # Exercise 08 - Cloud API 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex08 | 6 | | Files to turn in: | 02_ec2.tf, \*.tf, \*.auto.tfvars | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | As you may have noticed, building a whole infrastructure requires a lot of steps. To ease the deployment, it was split into two parts: an intermediate deployment and the final implementation. The first part consists in deploying one EC2 instance running your working API, so we can make sure everything we did with Terraform works well.
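As a quick smoke test once that instance is up, you can probe the API from your own machine; a minimal sketch with the `requests` library, where the IP address is a placeholder for the public IP output by Terraform:

```python
# Smoke test of the deployed API (the IP below is a placeholder for the
# public IP shown in the `terraform apply` output).
import requests

PUBLIC_IP = "203.0.113.10"  # hypothetical instance IP
response = requests.get(f"http://{PUBLIC_IP}:5000/", timeout=5)
print(response.status_code, response.json())
```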
Good news: by the end of this exercise, you will have completed this first intermediate deployment! 11 | 12 | ## Exercise 13 | 14 | To finalize our intermediate infrastructure we are going to add two components. 15 | 16 | First, you will need a key pair file, which we will call `module02.pem`, provisioned through Terraform. The key pair must use the RSA algorithm and have the appropriate permissions to be functional. 17 | 18 | You must provision an EC2 resource which will use: 19 | - the default vpc 20 | - the role you created 21 | - the security group you created 22 | - the key pair you just created 23 | - a public ip address 24 | - an instance type `t2.micro` with a Linux AMI 25 | 26 | You must create an output that will show the public ip of the instance you created. 27 | 28 | At this point you should be able to ssh into the EC2 instance and use the AWS CLI on s3 buckets. However, our API is still not working! 29 | 30 | First, upload the files of your API onto your s3 bucket (you don't need to upload the client). Those files must stay on the bucket, since they will be used to provision your API. This is not the cleanest solution, but it will be sufficient for the purpose of this module. 31 | 32 | Create a bootstrap script that will: 33 | - install the necessary libraries 34 | - download the files of the API from the s3 bucket 35 | - start the API in the background 36 | 37 | The exercise will be considered valid only if the API is working after a `terraform apply`. You should be able to use your client against the output IP. -------------------------------------------------------------------------------- /module02/ex09/ex09.md: -------------------------------------------------------------------------------- 1 | # Exercise 09 - Network 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex09 | 6 | | Files to turn in: | 00_variables.tf, 01_networking.tf, 10_terraform.auto.tfvars | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | AWS and I lied to you! You thought deploying a server was that simple? A huge part of the required stack for the deployment is hidden! This hidden layer uses a wizard configuration (a default configuration suitable for most users). The default configuration includes: 11 | - network (VPC, subnets, CIDR blocks) 12 | - network components (routing table, Internet gateway, NAT gateway) 13 | - security (NACLs, security groups) 14 | 15 | ## Exercise 16 | 17 | For this new implementation, we are going to recode more parts of our architecture, such as the network. We are going to use our own VPC so we no longer rely on the default one. 18 | 19 | Create a VPC using terraform. You have to respect the following constraints: 20 | 21 | - your vpc is deployed in Ireland (specified as the variable `region`). 22 | - your vpc uses a `10.0.0.0/16` CIDR block. 23 | - your vpc must enable DNS hostnames (this will be useful for the next exercises) 24 | 25 | On your AWS console, you can go to the VPC section to check whether your VPC was correctly created. 26 | 27 | Within our newly created VPC we want to divide the network's IPs into subnets. This can be useful for many different purposes: it helps isolate groups of hosts together and deal with them easily. In AWS, subnets are often associated with different availability zones, which guarantees high availability in case an AWS data center is destroyed.
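To see how the `10.0.0.0/16` block divides into smaller networks, you can experiment with Python's standard `ipaddress` module; a small sketch:

```python
# How a /16 VPC block splits into /24 subnets (standard library only).
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")   # 65,536 addresses in total
subnets = list(vpc.subnets(new_prefix=24))  # 256 possible /24 subnets
print(subnets[1], subnets[2])               # 10.0.1.0/24 10.0.2.0/24
print(subnets[1].num_addresses)             # 256 addresses per subnet
```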
28 | 29 | ![Flask API AWS infrastructure](../assets/terraform_2.png){width=300px} 30 | 31 | Within our previously created VPC, add 2 subnets with the following characteristics: 32 | - they depend on the creation of the VPC (this has to be specified in Terraform) 33 | - your subnets will use `10.0.1.0/24` and `10.0.2.0/24` CIDR blocks. 34 | - your subnets will use `eu-west-1a` and `eu-west-1b` availability zones. 35 | - they must map public IPs on launch. 36 | -------------------------------------------------------------------------------- /module02/ex10/ex10.md: -------------------------------------------------------------------------------- 1 | # Exercise 10 - IGW - Route table 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex10 | 6 | | Files to turn in: | 00_variables.tf, 01_networking.tf, 10_terraform.auto.tfvars | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | Let's continue our infrastructure! We created a network with a VPC and divided it into subnets across two different availability zones. However, our network is still not accessible from the internet (or your IP). First, we need to create an internet gateway and link it with our VPC. This first step will allow us to interact with the internet. 11 | 12 | The subnets we created will be used to host our EC2 instances, but our subnets are currently disconnected from the internet and from other IPs within the VPC. To fix this problem, we will create a route table (it acts like a combination of a switch (when you interact with IPs inside your VPC) and a router (when you want to interact with external IPs)). 13 | 14 | ![Flask API AWS infrastructure](../assets/terraform_3.png){width=300px} 15 | 16 | ## Exercise 17 | 18 | Create an Internet gateway (IGW) which depends on the VPC you created. Your IGW will need the tags: 19 | - `project_name` with the value `module02` 20 | - `Name` with the value `module02-igw` 21 | 22 | Create a route table that depends on the VPC and the IGW. Your route table will have to implement a route linking the `0.0.0.0/0` CIDR block with the IGW. The `0.0.0.0/0` block is really important in the IP lookup process: it means that if the IP you are looking for cannot be found within the VPC, it is searched for on other networks (through the IGW). Your route table will need the following tags: 23 | - `project_name` with the value `module02` 24 | - `Name` with the value `module02-rt` 25 | 26 | You thought you were finished? We now need to associate our subnets with the route table! Create route table associations for both of your subnets. They will depend on the route table and the concerned subnet. -------------------------------------------------------------------------------- /module02/ex11/ex11.md: -------------------------------------------------------------------------------- 1 | # Exercise 11 - Autoscaling 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex11 | 6 | | Files to turn in: | 02_asg.tf, \*.tf, \*.tfvars | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | 11 | Any cloud provider is based on a pay-as-you-go system, which lets our costs follow the number of users we actually serve. If we have 10 users, our t2.micro EC2 may be sufficient for our Flask application, but if tomorrow 1,000,000 users want to try our super API, we have to find a way to scale our infrastructure! 12 | 13 | This can be done through autoscaling groups: the more traffic we have, the more EC2 instances spawn to handle it, and of course those new instances are terminated again when the number of users goes down.
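Once the autoscaling group from the exercise below is applied, you can verify from Python that both instances are registered; a hedged boto3 sketch, where the group name is a placeholder for whatever you chose in your Terraform code:

```python
# Sketch: check that the autoscaling group keeps two instances in service.
# "flask-asg" is a placeholder for the group name defined in your Terraform code.
import boto3

asg = boto3.client("autoscaling", region_name="eu-west-1")
groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=["flask-asg"])
for instance in groups["AutoScalingGroups"][0]["Instances"]:
    print(instance["InstanceId"], instance["LifecycleState"], instance["HealthStatus"])
```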
14 | 15 | ![Flask API AWS infrastructure](../assets/terraform_5.png){width=300px} 16 | 17 | ## Exercise 18 | 19 | Transform your `02_ec2.tf` terraform file into `02_asg.tf`. You will have to transform your code into an autoscaling group. 20 | 21 | You need to implement a launch configuration with your EC2 parameters and add a create-before-destroy lifecycle. 22 | 23 | Create an autoscaling group with: 24 | - a dependency on the launch configuration 25 | - a link to the subnets of our vpc 26 | - the launch configuration you previously created 27 | - a minimum and maximum size of 2 (this will allow us to always keep 2 instances up even if one terminates) 28 | - a tag with: 29 | - `Autoscaling Flask` for the key 30 | - `flask-asg` for the value 31 | - the propagate-at-launch option 32 | 33 | You should see 2 EC2 instances created within your AWS console. 34 | -------------------------------------------------------------------------------- /module02/ex12/ex12.md: -------------------------------------------------------------------------------- 1 | # Exercise 12 - Load balancer 2 | 3 | | | | 4 | | -----------------------:| ------------------ | 5 | | Turn-in directory: | ex12 | 6 | | Files to turn in: | 03_elb.tf, \*.tf, \*.tfvars | 7 | | Forbidden function: | None | 8 | | Remarks: | n/a | 9 | 10 | 11 | Let's finish our infrastructure! With our autoscaling group, we now have 2 instances, but we still need to go to our AWS console to look up the IP of each EC2 instance, which is not convenient! 12 | 13 | A solution is to create a load balancer. A load balancer, as its name indicates, will balance the traffic between EC2 instances (those of our autoscaling group here). 14 | 15 | ![Flask API AWS infrastructure](../assets/terraform_6.png){width=300px} 16 | 17 | ## Exercise 18 | 19 | Create a security group for your load balancer. It must: 20 | - depend on the vpc you created 21 | - allow traffic on port 5000 22 | 23 | Create a load balancer with: 24 | - a health check on port 5000 every 30 sec and a healthy threshold of 2 25 | - a listener on port 5000 26 | - the cross-zone load balancing option 27 | 28 | In your autoscaling group, add your load balancer and a health check type of `ELB`. 29 | 30 | Create a terraform output that will display the DNS name of your load balancer (this output will replace the EC2 public IP output we had). 31 | 32 | You should now be able to use the DNS name of your load balancer to call the API (yes, this should work with the `--ip` option of your client without any other modification)! 33 | 34 | After the `terraform apply` has finished, you will probably have to wait 30 seconds to 1 minute before the API is working. 35 | 36 | **Do not forget to `terraform destroy` at the end of the module!** -------------------------------------------------------------------------------- /module02/module02.md: -------------------------------------------------------------------------------- 1 | # Module02 - Cloud Storage API 2 | 3 | In this module, you will learn how to use a cloud provider. For all the exercises, I took Amazon Web Services (AWS) as an example, but **you are totally free to use any cloud provider you want that is compatible with Terraform** (we advise you to use AWS if you don't have one).
AWS has become the most popular cloud service provider in the world, followed by Microsoft Azure and Google Cloud Platform. 4 | 5 | Amazon Web Services started in 2005 and now delivers more than 200 services. Due to this large number of services and the maturity of AWS, it is a good option for starting to learn cloud computing. 6 | 7 | If you have never heard about the Cloud before, do not worry! You will learn step by step what the Cloud is and how to use it. 8 | 9 | ## Notions of the module 10 | 11 | The module will be divided into two parts. In the first one, you will learn to use a tool called Terraform which will allow you to deploy/destroy cloud infrastructures. In the second part of the module, you will learn to use a software development kit (SDK) which will allow you to use Python in order to interact with your cloud. 12 | 13 | ## General rules 14 | 15 | * The exercises are ordered from the easiest to the hardest. 16 | * Your exercises are going to be evaluated by someone else, so make sure that your variable names and function names are appropriate and civil. 17 | * Your manual is the internet. 18 | * You can also ask any question in the dedicated channel in Slack: **[42ai slack](https://42-ai.slack.com)**. 19 | * If you find any issues or mistakes in the subject please create an issue on our dedicated repository on Github: **[Github issues](https://github.com/42-AI/bootcamp_data-engineering/issues)**. 20 | 21 | ## Foreword 22 | 23 | Cloud computing is the on-demand delivery of IT resources and applications via the Internet with pay-as-you-go pricing. In practice, a cloud server is located in a data center that could be anywhere in the world. 24 | 25 | Whether you run applications that share photos with millions of mobile users or deliver services that support the critical operations of your business, the cloud provides rapid access to flexible and low-cost IT resources. With cloud computing, you don't need to make large up-front investments in hardware and spend a lot of time managing that hardware. Instead, you can provision exactly the right type and size of computing resources you need to power your newest bright idea or operate your IT department. With cloud computing, you can access as many resources as you need, almost instantly, and only pay for what you use. 26 | 27 | In its simplest form, cloud computing provides an easy way to access servers, storage, databases, and a broad set of application services over the Internet. Cloud computing providers such as AWS own and maintain the network-connected hardware required for these application services, while you provision and use what you need for your workloads. 28 | 29 | As seen previously, cloud computing provides some real benefits: 30 | 31 | - **Variable expense**: You don't need to invest in huge data centers you may not use at full capacity. You pay for how much you consume! 32 | - **Available in minutes**: New IT resources can be accessed within minutes. 33 | - **Economies of scale**: A large number of users enables cloud providers to achieve higher economies of scale, translating into lower prices. 34 | - **Global in minutes**: Cloud architectures can be deployed really easily all around the world. 35 | 36 | Deployments using the cloud can be `all-in-cloud-based` (the entire infrastructure is in the cloud) or `hybrid` (using both on-premises and cloud resources). 37 | 38 | ## AWS global infrastructure 39 | 40 | Amazon Web Services (AWS) is a cloud service provider, also known as infrastructure-as-a-service (`IaaS`).
AWS is the clear market leader in this domain and offers many more services than its competitors. 41 | 42 | AWS has some interesting properties such as: 43 | 44 | - **High availability**: Your files and services can be accessed from anywhere, at any time. 45 | - **Fault tolerance**: In case an AWS server fails, you can still retrieve your files (the fault tolerance is due to redundancy). 46 | - **Scalability**: Possibility to add more servers when needed. 47 | - **Elasticity**: Possibility to grow or shrink the infrastructure. 48 | 49 | AWS provides a highly available technology infrastructure platform with multiple locations worldwide. These locations are composed of `regions` and `availability zones`. 50 | 51 | Each region represents a unique geographic area. Each region contains multiple, isolated locations known as availability zones. An availability zone is a physical data center geographically separated from other availability zones (with redundant power, networking, and connectivity). 52 | 53 | You can achieve high availability by deploying your application across multiple availability zones. 54 | 55 | ![AWS regions](../assets/aws_regions.png){width=400px} 56 | 57 | The `edge locations` you see on the picture are AWS endpoints used for caching content (a performance optimization mechanism in which data is delivered from the closest servers for optimal application performance). They typically consist of CloudFront (Amazon's content delivery network (CDN)) points of presence. 58 | 59 | ## Helper 60 | 61 | * Your best friends for the module: **[AWS documentation](https://docs.aws.amazon.com/index.html)** and **[Terraform documentation](https://www.terraform.io/docs/index.html)**. 62 | 63 | ### Exercise 00 - Setup 64 | ### Exercise 01 - Storage 65 | ### Exercise 02 - Compute 66 | ### Exercise 03 - Flask API - List & Delete 67 | ### Exercise 04 - Flask API - Download & Upload 68 | ### Exercise 05 - Client 69 | ### Exercise 06 - IAM role 70 | ### Exercise 07 - Security group 71 | ### Exercise 08 - Cloud API 72 | ### Exercise 09 - Network 73 | ### Exercise 10 - IGW - Route table 74 | ### Exercise 11 - Autoscaling 75 | ### Exercise 12 - Load balancer 76 | -------------------------------------------------------------------------------- /resources/appstore_games.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/resources/appstore_games.csv.zip --------------------------------------------------------------------------------