├── .github
└── ISSUE_TEMPLATE
│ ├── exercise.md
│ └── updates-post-testing.md
├── .gitignore
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
├── assets
└── 42ai_logo.png
├── module00.pdf
├── module00
├── assets
│ ├── client_server.png
│ └── tables.png
├── ex00
│ └── ex00.md
├── ex01
│ └── ex01.md
├── ex02
│ └── ex02.md
├── ex03
│ ├── ex03.md
│ └── psycopg2_basics.md
├── ex04
│ └── ex04.md
├── ex05
│ └── ex05.md
├── ex06
│ └── ex06.md
├── ex07
│ └── ex07.md
├── ex08
│ └── ex08.md
├── ex09
│ └── ex09.md
├── ex10
│ └── ex10.md
├── ex11
│ └── ex11.md
├── ex12
│ └── ex12.md
├── ex13
│ └── ex13.md
├── ex14
│ └── ex14.md
├── module00.md
└── resources
│ ├── Pipfile
│ ├── db
│ ├── Dockerfile
│ ├── init.sql
│ └── pg_hba.conf
│ ├── docker-compose.yml
│ ├── docker_install.sh
│ └── psycopg2_documentation.pdf
├── module01.pdf
├── module01
├── assets
│ └── dashboard.png
├── ex00
│ └── ex00.md
├── ex01
│ └── ex01.md
├── ex02
│ └── ex02.md
├── ex03
│ └── ex03.md
├── ex04
│ └── ex04.md
├── ex05
│ └── ex05.md
├── ex06
│ └── ex06.md
├── ex07
│ └── ex07.md
├── ex07bis
│ └── ex07bis.md
├── ex08
│ └── ex08.md
├── ex09
│ └── ex09.md
├── ex10
│ └── ex10.md
├── module01.md
└── resources
│ └── ingest-pipeline.conf
├── module02.pdf
├── module02
├── assets
│ ├── access_key.png
│ ├── aws_regions.png
│ ├── terraform_1.png
│ ├── terraform_2.png
│ ├── terraform_3.png
│ ├── terraform_4.png
│ ├── terraform_5.png
│ └── terraform_6.png
├── ex00
│ └── ex00.md
├── ex01
│ └── ex01.md
├── ex02
│ └── ex02.md
├── ex03
│ └── ex03.md
├── ex04
│ └── ex04.md
├── ex05
│ └── ex05.md
├── ex06
│ └── ex06.md
├── ex07
│ └── ex07.md
├── ex08
│ └── ex08.md
├── ex09
│ └── ex09.md
├── ex10
│ └── ex10.md
├── ex11
│ └── ex11.md
├── ex12
│ └── ex12.md
└── module02.md
└── resources
└── appstore_games.csv.zip
/.github/ISSUE_TEMPLATE/exercise.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Exercise
3 | about: Create a report to help us improve
4 | title: ''
5 | labels: fixme
6 | assignees: ''
7 |
8 | ---
9 |
10 | * Day: xx
11 | * Exercise: xx
12 |
13 | A clear and concise description of what the problem/misunderstanding is.
14 |
15 | **Examples**
16 | If applicable, add examples to help explain your problem.
17 |
18 | ```python
19 | print("Code example")
20 | ```
21 |
22 | **Screenshots**
23 | If applicable, add screenshots to help explain your problem.
24 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/updates-post-testing.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Updates post-testing
3 | about: Updates post-testing for a whole day
4 | title: ''
5 | labels: fixme
6 | assignees: ''
7 |
8 | ---
9 |
10 | ## Global notes:
11 |
12 | - [ ] 1.
13 | - [ ] 2.
14 |
15 | ### ex00:
16 | - [ ] 1.
17 | - [ ] 2.
18 |
19 | ### ex01:
20 | - [ ] 1.
21 | - [ ] 2.
22 |
23 | ### ex02:
24 | - [ ] 1.
25 | - [ ] 2.
26 |
27 | ### ex03:
28 | - [ ] 1.
29 | - [ ] 2.
30 |
31 | ### ex04:
32 | - [ ] 1.
33 | - [ ] 2.
34 |
35 | ### ex05:
36 | - [ ] 1.
37 | - [ ] 2.
38 |
39 | ### ex06:
40 | - [ ] 1.
41 | - [ ] 2.
42 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # SageMath parsed files
82 | *.sage.py
83 |
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 |
106 | # MACOS stuff
107 | *.DS_STORE
108 |
109 | # TMP files
110 | *.swp
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | In the interest of fostering an open and welcoming environment, we as
6 | contributors and maintainers pledge to making participation in our project and
7 | our community a harassment-free experience for everyone, regardless of age, body
8 | size, disability, ethnicity, sex characteristics, gender identity and expression,
9 | level of experience, education, socio-economic status, nationality, personal
10 | appearance, race, religion, or sexual identity and orientation.
11 |
12 | ## Our Standards
13 |
14 | Examples of behavior that contributes to creating a positive environment
15 | include:
16 |
17 | * Using welcoming and inclusive language
18 | * Being respectful of differing viewpoints and experiences
19 | * Gracefully accepting constructive criticism
20 | * Focusing on what is best for the community
21 | * Showing empathy towards other community members
22 |
23 | Examples of unacceptable behavior by participants include:
24 |
25 | * The use of sexualized language or imagery and unwelcome sexual attention or
26 | advances
27 | * Trolling, insulting/derogatory comments, and personal or political attacks
28 | * Public or private harassment
29 | * Publishing others' private information, such as a physical or electronic
30 | address, without explicit permission
31 | * Other conduct which could reasonably be considered inappropriate in a
32 | professional setting
33 |
34 | ## Our Responsibilities
35 |
36 | Project maintainers are responsible for clarifying the standards of acceptable
37 | behavior and are expected to take appropriate and fair corrective action in
38 | response to any instances of unacceptable behavior.
39 |
40 | Project maintainers have the right and responsibility to remove, edit, or
41 | reject comments, commits, code, wiki edits, issues, and other contributions
42 | that are not aligned to this Code of Conduct, or to ban temporarily or
43 | permanently any contributor for other behaviors that they deem inappropriate,
44 | threatening, offensive, or harmful.
45 |
46 | ## Scope
47 |
48 | This Code of Conduct applies both within project spaces and in public spaces
49 | when an individual is representing the project or its community. Examples of
50 | representing a project or community include using an official project e-mail
51 | address, posting via an official social media account, or acting as an appointed
52 | representative at an online or offline event. Representation of a project may be
53 | further defined and clarified by project maintainers.
54 |
55 | ## Enforcement
56 |
57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
58 | reported by contacting the project team at contact@42ai.fr. All
59 | complaints will be reviewed and investigated and will result in a response that
60 | is deemed necessary and appropriate to the circumstances. The project team is
61 | obligated to maintain confidentiality with regard to the reporter of an incident.
62 | Further details of specific enforcement policies may be posted separately.
63 |
64 | Project maintainers who do not follow or enforce the Code of Conduct in good
65 | faith may face temporary or permanent repercussions as determined by other
66 | members of the project's leadership.
67 |
68 | ## Attribution
69 |
70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
72 |
73 | [homepage]: https://www.contributor-covenant.org
74 |
75 | For answers to common questions about this code of conduct, see
76 | https://www.contributor-covenant.org/faq
77 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Attribution-NonCommercial-ShareAlike 4.0 International
2 |
3 | =======================================================================
4 |
5 | Creative Commons Corporation ("Creative Commons") is not a law firm and
6 | does not provide legal services or legal advice. Distribution of
7 | Creative Commons public licenses does not create a lawyer-client or
8 | other relationship. Creative Commons makes its licenses and related
9 | information available on an "as-is" basis. Creative Commons gives no
10 | warranties regarding its licenses, any material licensed under their
11 | terms and conditions, or any related information. Creative Commons
12 | disclaims all liability for damages resulting from their use to the
13 | fullest extent possible.
14 |
15 | Using Creative Commons Public Licenses
16 |
17 | Creative Commons public licenses provide a standard set of terms and
18 | conditions that creators and other rights holders may use to share
19 | original works of authorship and other material subject to copyright
20 | and certain other rights specified in the public license below. The
21 | following considerations are for informational purposes only, are not
22 | exhaustive, and do not form part of our licenses.
23 |
24 | Considerations for licensors: Our public licenses are
25 | intended for use by those authorized to give the public
26 | permission to use material in ways otherwise restricted by
27 | copyright and certain other rights. Our licenses are
28 | irrevocable. Licensors should read and understand the terms
29 | and conditions of the license they choose before applying it.
30 | Licensors should also secure all rights necessary before
31 | applying our licenses so that the public can reuse the
32 | material as expected. Licensors should clearly mark any
33 | material not subject to the license. This includes other CC-
34 | licensed material, or material used under an exception or
35 | limitation to copyright. More considerations for licensors:
36 | wiki.creativecommons.org/Considerations_for_licensors
37 |
38 | Considerations for the public: By using one of our public
39 | licenses, a licensor grants the public permission to use the
40 | licensed material under specified terms and conditions. If
41 | the licensor's permission is not necessary for any reason--for
42 | example, because of any applicable exception or limitation to
43 | copyright--then that use is not regulated by the license. Our
44 | licenses grant only permissions under copyright and certain
45 | other rights that a licensor has authority to grant. Use of
46 | the licensed material may still be restricted for other
47 | reasons, including because others have copyright or other
48 | rights in the material. A licensor may make special requests,
49 | such as asking that all changes be marked or described.
50 | Although not required by our licenses, you are encouraged to
51 | respect those requests where reasonable. More_considerations
52 | for the public:
53 | wiki.creativecommons.org/Considerations_for_licensees
54 |
55 | =======================================================================
56 |
57 | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
58 | Public License
59 |
60 | By exercising the Licensed Rights (defined below), You accept and agree
61 | to be bound by the terms and conditions of this Creative Commons
62 | Attribution-NonCommercial-ShareAlike 4.0 International Public License
63 | ("Public License"). To the extent this Public License may be
64 | interpreted as a contract, You are granted the Licensed Rights in
65 | consideration of Your acceptance of these terms and conditions, and the
66 | Licensor grants You such rights in consideration of benefits the
67 | Licensor receives from making the Licensed Material available under
68 | these terms and conditions.
69 |
70 |
71 | Section 1 -- Definitions.
72 |
73 | a. Adapted Material means material subject to Copyright and Similar
74 | Rights that is derived from or based upon the Licensed Material
75 | and in which the Licensed Material is translated, altered,
76 | arranged, transformed, or otherwise modified in a manner requiring
77 | permission under the Copyright and Similar Rights held by the
78 | Licensor. For purposes of this Public License, where the Licensed
79 | Material is a musical work, performance, or sound recording,
80 | Adapted Material is always produced where the Licensed Material is
81 | synched in timed relation with a moving image.
82 |
83 | b. Adapter's License means the license You apply to Your Copyright
84 | and Similar Rights in Your contributions to Adapted Material in
85 | accordance with the terms and conditions of this Public License.
86 |
87 | c. BY-NC-SA Compatible License means a license listed at
88 | creativecommons.org/compatiblelicenses, approved by Creative
89 | Commons as essentially the equivalent of this Public License.
90 |
91 | d. Copyright and Similar Rights means copyright and/or similar rights
92 | closely related to copyright including, without limitation,
93 | performance, broadcast, sound recording, and Sui Generis Database
94 | Rights, without regard to how the rights are labeled or
95 | categorized. For purposes of this Public License, the rights
96 | specified in Section 2(b)(1)-(2) are not Copyright and Similar
97 | Rights.
98 |
99 | e. Effective Technological Measures means those measures that, in the
100 | absence of proper authority, may not be circumvented under laws
101 | fulfilling obligations under Article 11 of the WIPO Copyright
102 | Treaty adopted on December 20, 1996, and/or similar international
103 | agreements.
104 |
105 | f. Exceptions and Limitations means fair use, fair dealing, and/or
106 | any other exception or limitation to Copyright and Similar Rights
107 | that applies to Your use of the Licensed Material.
108 |
109 | g. License Elements means the license attributes listed in the name
110 | of a Creative Commons Public License. The License Elements of this
111 | Public License are Attribution, NonCommercial, and ShareAlike.
112 |
113 | h. Licensed Material means the artistic or literary work, database,
114 | or other material to which the Licensor applied this Public
115 | License.
116 |
117 | i. Licensed Rights means the rights granted to You subject to the
118 | terms and conditions of this Public License, which are limited to
119 | all Copyright and Similar Rights that apply to Your use of the
120 | Licensed Material and that the Licensor has authority to license.
121 |
122 | j. Licensor means the individual(s) or entity(ies) granting rights
123 | under this Public License.
124 |
125 | k. NonCommercial means not primarily intended for or directed towards
126 | commercial advantage or monetary compensation. For purposes of
127 | this Public License, the exchange of the Licensed Material for
128 | other material subject to Copyright and Similar Rights by digital
129 | file-sharing or similar means is NonCommercial provided there is
130 | no payment of monetary compensation in connection with the
131 | exchange.
132 |
133 | l. Share means to provide material to the public by any means or
134 | process that requires permission under the Licensed Rights, such
135 | as reproduction, public display, public performance, distribution,
136 | dissemination, communication, or importation, and to make material
137 | available to the public including in ways that members of the
138 | public may access the material from a place and at a time
139 | individually chosen by them.
140 |
141 | m. Sui Generis Database Rights means rights other than copyright
142 | resulting from Directive 96/9/EC of the European Parliament and of
143 | the Council of 11 March 1996 on the legal protection of databases,
144 | as amended and/or succeeded, as well as other essentially
145 | equivalent rights anywhere in the world.
146 |
147 | n. You means the individual or entity exercising the Licensed Rights
148 | under this Public License. Your has a corresponding meaning.
149 |
150 |
151 | Section 2 -- Scope.
152 |
153 | a. License grant.
154 |
155 | 1. Subject to the terms and conditions of this Public License,
156 | the Licensor hereby grants You a worldwide, royalty-free,
157 | non-sublicensable, non-exclusive, irrevocable license to
158 | exercise the Licensed Rights in the Licensed Material to:
159 |
160 | a. reproduce and Share the Licensed Material, in whole or
161 | in part, for NonCommercial purposes only; and
162 |
163 | b. produce, reproduce, and Share Adapted Material for
164 | NonCommercial purposes only.
165 |
166 | 2. Exceptions and Limitations. For the avoidance of doubt, where
167 | Exceptions and Limitations apply to Your use, this Public
168 | License does not apply, and You do not need to comply with
169 | its terms and conditions.
170 |
171 | 3. Term. The term of this Public License is specified in Section
172 | 6(a).
173 |
174 | 4. Media and formats; technical modifications allowed. The
175 | Licensor authorizes You to exercise the Licensed Rights in
176 | all media and formats whether now known or hereafter created,
177 | and to make technical modifications necessary to do so. The
178 | Licensor waives and/or agrees not to assert any right or
179 | authority to forbid You from making technical modifications
180 | necessary to exercise the Licensed Rights, including
181 | technical modifications necessary to circumvent Effective
182 | Technological Measures. For purposes of this Public License,
183 | simply making modifications authorized by this Section 2(a)
184 | (4) never produces Adapted Material.
185 |
186 | 5. Downstream recipients.
187 |
188 | a. Offer from the Licensor -- Licensed Material. Every
189 | recipient of the Licensed Material automatically
190 | receives an offer from the Licensor to exercise the
191 | Licensed Rights under the terms and conditions of this
192 | Public License.
193 |
194 | b. Additional offer from the Licensor -- Adapted Material.
195 | Every recipient of Adapted Material from You
196 | automatically receives an offer from the Licensor to
197 | exercise the Licensed Rights in the Adapted Material
198 | under the conditions of the Adapter's License You apply.
199 |
200 | c. No downstream restrictions. You may not offer or impose
201 | any additional or different terms or conditions on, or
202 | apply any Effective Technological Measures to, the
203 | Licensed Material if doing so restricts exercise of the
204 | Licensed Rights by any recipient of the Licensed
205 | Material.
206 |
207 | 6. No endorsement. Nothing in this Public License constitutes or
208 | may be construed as permission to assert or imply that You
209 | are, or that Your use of the Licensed Material is, connected
210 | with, or sponsored, endorsed, or granted official status by,
211 | the Licensor or others designated to receive attribution as
212 | provided in Section 3(a)(1)(A)(i).
213 |
214 | b. Other rights.
215 |
216 | 1. Moral rights, such as the right of integrity, are not
217 | licensed under this Public License, nor are publicity,
218 | privacy, and/or other similar personality rights; however, to
219 | the extent possible, the Licensor waives and/or agrees not to
220 | assert any such rights held by the Licensor to the limited
221 | extent necessary to allow You to exercise the Licensed
222 | Rights, but not otherwise.
223 |
224 | 2. Patent and trademark rights are not licensed under this
225 | Public License.
226 |
227 | 3. To the extent possible, the Licensor waives any right to
228 | collect royalties from You for the exercise of the Licensed
229 | Rights, whether directly or through a collecting society
230 | under any voluntary or waivable statutory or compulsory
231 | licensing scheme. In all other cases the Licensor expressly
232 | reserves any right to collect such royalties, including when
233 | the Licensed Material is used other than for NonCommercial
234 | purposes.
235 |
236 |
237 | Section 3 -- License Conditions.
238 |
239 | Your exercise of the Licensed Rights is expressly made subject to the
240 | following conditions.
241 |
242 | a. Attribution.
243 |
244 | 1. If You Share the Licensed Material (including in modified
245 | form), You must:
246 |
247 | a. retain the following if it is supplied by the Licensor
248 | with the Licensed Material:
249 |
250 | i. identification of the creator(s) of the Licensed
251 | Material and any others designated to receive
252 | attribution, in any reasonable manner requested by
253 | the Licensor (including by pseudonym if
254 | designated);
255 |
256 | ii. a copyright notice;
257 |
258 | iii. a notice that refers to this Public License;
259 |
260 | iv. a notice that refers to the disclaimer of
261 | warranties;
262 |
263 | v. a URI or hyperlink to the Licensed Material to the
264 | extent reasonably practicable;
265 |
266 | b. indicate if You modified the Licensed Material and
267 | retain an indication of any previous modifications; and
268 |
269 | c. indicate the Licensed Material is licensed under this
270 | Public License, and include the text of, or the URI or
271 | hyperlink to, this Public License.
272 |
273 | 2. You may satisfy the conditions in Section 3(a)(1) in any
274 | reasonable manner based on the medium, means, and context in
275 | which You Share the Licensed Material. For example, it may be
276 | reasonable to satisfy the conditions by providing a URI or
277 | hyperlink to a resource that includes the required
278 | information.
279 | 3. If requested by the Licensor, You must remove any of the
280 | information required by Section 3(a)(1)(A) to the extent
281 | reasonably practicable.
282 |
283 | b. ShareAlike.
284 |
285 | In addition to the conditions in Section 3(a), if You Share
286 | Adapted Material You produce, the following conditions also apply.
287 |
288 | 1. The Adapter's License You apply must be a Creative Commons
289 | license with the same License Elements, this version or
290 | later, or a BY-NC-SA Compatible License.
291 |
292 | 2. You must include the text of, or the URI or hyperlink to, the
293 | Adapter's License You apply. You may satisfy this condition
294 | in any reasonable manner based on the medium, means, and
295 | context in which You Share Adapted Material.
296 |
297 | 3. You may not offer or impose any additional or different terms
298 | or conditions on, or apply any Effective Technological
299 | Measures to, Adapted Material that restrict exercise of the
300 | rights granted under the Adapter's License You apply.
301 |
302 |
303 | Section 4 -- Sui Generis Database Rights.
304 |
305 | Where the Licensed Rights include Sui Generis Database Rights that
306 | apply to Your use of the Licensed Material:
307 |
308 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right
309 | to extract, reuse, reproduce, and Share all or a substantial
310 | portion of the contents of the database for NonCommercial purposes
311 | only;
312 |
313 | b. if You include all or a substantial portion of the database
314 | contents in a database in which You have Sui Generis Database
315 | Rights, then the database in which You have Sui Generis Database
316 | Rights (but not its individual contents) is Adapted Material,
317 | including for purposes of Section 3(b); and
318 |
319 | c. You must comply with the conditions in Section 3(a) if You Share
320 | all or a substantial portion of the contents of the database.
321 |
322 | For the avoidance of doubt, this Section 4 supplements and does not
323 | replace Your obligations under this Public License where the Licensed
324 | Rights include other Copyright and Similar Rights.
325 |
326 |
327 | Section 5 -- Disclaimer of Warranties and Limitation of Liability.
328 |
329 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
330 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
331 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
332 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
333 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
334 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
335 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
336 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
337 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
338 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
339 |
340 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
341 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
342 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
343 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
344 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
345 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
346 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
347 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
348 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
349 |
350 | c. The disclaimer of warranties and limitation of liability provided
351 | above shall be interpreted in a manner that, to the extent
352 | possible, most closely approximates an absolute disclaimer and
353 | waiver of all liability.
354 |
355 |
356 | Section 6 -- Term and Termination.
357 |
358 | a. This Public License applies for the term of the Copyright and
359 | Similar Rights licensed here. However, if You fail to comply with
360 | this Public License, then Your rights under this Public License
361 | terminate automatically.
362 |
363 | b. Where Your right to use the Licensed Material has terminated under
364 | Section 6(a), it reinstates:
365 |
366 | 1. automatically as of the date the violation is cured, provided
367 | it is cured within 30 days of Your discovery of the
368 | violation; or
369 |
370 | 2. upon express reinstatement by the Licensor.
371 |
372 | For the avoidance of doubt, this Section 6(b) does not affect any
373 | right the Licensor may have to seek remedies for Your violations
374 | of this Public License.
375 |
376 | c. For the avoidance of doubt, the Licensor may also offer the
377 | Licensed Material under separate terms or conditions or stop
378 | distributing the Licensed Material at any time; however, doing so
379 | will not terminate this Public License.
380 |
381 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
382 | License.
383 |
384 |
385 | Section 7 -- Other Terms and Conditions.
386 |
387 | a. The Licensor shall not be bound by any additional or different
388 | terms or conditions communicated by You unless expressly agreed.
389 |
390 | b. Any arrangements, understandings, or agreements regarding the
391 | Licensed Material not stated herein are separate from and
392 | independent of the terms and conditions of this Public License.
393 |
394 |
395 | Section 8 -- Interpretation.
396 |
397 | a. For the avoidance of doubt, this Public License does not, and
398 | shall not be interpreted to, reduce, limit, restrict, or impose
399 | conditions on any use of the Licensed Material that could lawfully
400 | be made without permission under this Public License.
401 |
402 | b. To the extent possible, if any provision of this Public License is
403 | deemed unenforceable, it shall be automatically reformed to the
404 | minimum extent necessary to make it enforceable. If the provision
405 | cannot be reformed, it shall be severed from this Public License
406 | without affecting the enforceability of the remaining terms and
407 | conditions.
408 |
409 | c. No term or condition of this Public License will be waived and no
410 | failure to comply consented to unless expressly agreed to by the
411 | Licensor.
412 |
413 | d. Nothing in this Public License constitutes or may be interpreted
414 | as a limitation upon, or waiver of, any privileges and immunities
415 | that apply to the Licensor or You, including from the legal
416 | processes of any jurisdiction or authority.
417 |
418 | =======================================================================
419 |
420 | Creative Commons is not a party to its public
421 | licenses. Notwithstanding, Creative Commons may elect to apply one of
422 | its public licenses to material it publishes and in those instances
423 | will be considered the “Licensor.” The text of the Creative Commons
424 | public licenses is dedicated to the public domain under the CC0 Public
425 | Domain Dedication. Except for the limited purpose of indicating that
426 | material is shared under a Creative Commons public license or as
427 | otherwise permitted by the Creative Commons policies published at
428 | creativecommons.org/policies, Creative Commons does not authorize the
429 | use of the trademark "Creative Commons" or any other trademark or logo
430 | of Creative Commons without its prior written consent including,
431 | without limitation, in connection with any unauthorized modifications
432 | to any of its public licenses or any other arrangements,
433 | understandings, or agreements concerning use of licensed material. For
434 | the avoidance of doubt, this paragraph does not form part of the
435 | public licenses.
436 |
437 | Creative Commons may be contacted at creativecommons.org.
438 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 | # Bootcamp Data Engineering
7 |
8 |
9 | One week to learn Data Engineering :rocket:
10 |
11 |
12 |
13 |
14 | ### Table of Contents
15 |
16 | - [Curriculum](#curriculum)
17 | - [Module00 - PostgreSQL](#module00---postgresql)
18 | - [Module01 - Elasticsearch](#module01---elasticsearch)
19 | - [Module02 - AWS](#module02---aws)
20 | - [Module03 - Hadoop](#module03---hadoop)
21 | - [Module04 - Spark](#module04---spark)
22 | - [Acknowledgements](#acknowledgements)
23 | - [Contributors](#contributors)
24 |
25 | This project is a Data Engineering bootcamp created by [42 AI](http://www.42ai.fr).
26 |
27 | Prior Python programming experience is required (the Python bootcamp)! Your mission, should you choose to accept it, is to come and learn some of the essential knowledge for Data Engineering, in a single week. You will start with SQL and NoSQL languages and then get acquainted with some useful tools/frameworks for Data Engineering like Airflow, AWS and Spark.
28 |
29 | 42 Artificial Intelligence is a student organization of the Paris campus of the school 42. Our purpose is to foster discussion, learning, and interest in the field of artificial intelligence, by organizing various activities such as lectures and workshops.
30 |
31 |
32 | ## Curriculum
33 |
34 | ### Module00 - PostgreSQL
35 | **Let's get started with PostgreSQL!** :link:
36 | > Filter Data, Normalize Data, Populate tables, Data Analysis ...
37 |
38 | ### Module01 - Elasticsearch
39 | **Get acquainted with Elasticsearch** :mag_right:
40 | > Elasticsearch setup, Data Analysis, Aggregations, Kibana and Monitoring ...
41 |
42 | ### Module02 - AWS
43 | **Start exploring the cloud on AWS!** :cloud:
44 | > Discover AWS, Flask APIs and infrastructure provisioning with Terraform ...
45 |
46 | ### Module03 - Hadoop
47 | **Work in progress**
48 |
49 | ### Module04 - Spark
50 | **Work in progress**
51 |
52 | ## Acknowledgements
53 |
54 | ### Contributors
55 |
56 | * Francois-Xavier Babin (fbabin@student.42.fr)
57 | * Jeremy Jauzion (jjauzion@student.42.fr)
58 | * Myriam Benzarti (mybenzar@student.42.fr)
59 | * Mehdi Aissa Belaloui (mbelalou@student.42.fr)
60 | * Eren Ozdek (eozdek@student.42.fr)
61 |
--------------------------------------------------------------------------------
/assets/42ai_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/assets/42ai_logo.png
--------------------------------------------------------------------------------
/module00.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module00.pdf
--------------------------------------------------------------------------------
/module00/assets/client_server.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module00/assets/client_server.png
--------------------------------------------------------------------------------
/module00/assets/tables.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module00/assets/tables.png
--------------------------------------------------------------------------------
/module00/ex00/ex00.md:
--------------------------------------------------------------------------------
1 | # Exercise 00 - Setup
2 |
3 | | | |
4 | | --------------------: | ---- |
5 | | Turn-in directory : | ex00 |
6 | | Files to turn in : | None |
7 | | Forbidden functions : | None |
8 | | Remarks : | n/a |
9 |
10 | ## The client-server architecture
11 |
12 | PostgreSQL is an open-source database which follows a client-server architecture. It is divided into three components:
13 | - a **client**, a program on the user's machine which communicates the user's query to the server and receives the server's answers.
14 | - a **server**, a program running in the background that manages access to a specific resource, service or network. The server will understand the client's query and apply it to the database. Then it will send an answer to the client.
15 | - a **database system**, where the data is stored.
16 |
17 | {width=600px}
18 |
19 | nb: the client and the server can be located on the same machine
20 |
21 | In the case of PostgreSQL, we are going to use `psql` as the client and `pg_ctl` to control the server.
22 |
23 | ## PostgreSQL install
24 |
25 | The first thing we need to do is install PostgreSQL.
26 |
27 | ```bash
28 | brew install postgresql
29 | ```
30 |
31 | nb: if you notice any problem with brew, you can reinstall it with the following command.
32 |
33 | ```bash
34 | rm -rf $HOME/.brew && git clone --depth=1 https://github.com/Homebrew/brew $HOME/.brew && echo 'export PATH=$HOME/.brew/bin:$PATH' >> $HOME/.zshrc && source $HOME/.zshrc && brew update
35 | ```
36 |
37 | The next thing we need to do is export a variable `PGDATA`. We can add the following line to our `.zshrc` file.
38 |
39 | ```bash
40 | export PGDATA=$HOME/.brew/var/postgres
41 | ```
42 |
43 | and source the .zshrc.
44 |
45 | ```bash
46 | source ~/.zshrc
47 | ```
48 |
49 | Now we can start the PostgreSQL server, the background program that will manage access to our databases.
50 |
51 | We can start the server.
52 |
53 | ```bash
54 | $> pg_ctl start
55 | waiting for server to start....2019-12-08 15:58:21.171 CET [84406] LOG: starting PostgreSQL 12.1 on x86_64-apple-darwin18.6.0, compiled by Apple LLVM version 10.0.1 (clang-1001.0.46.4), 64-bit
56 | 2019-12-08 15:58:21.173 CET [84406] LOG: listening on IPv6 address "::1", port 5432
57 | 2019-12-08 15:58:21.173 CET [84406] LOG: listening on IPv4 address "127.0.0.1", port 5432
58 | 2019-12-08 15:58:21.174 CET [84406] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
59 | 2019-12-08 15:58:21.192 CET [84407] LOG: database system was shut down at 2019-12-08 15:49:49 CET
60 | 2019-12-08 15:58:21.201 CET [84406] LOG: database system is ready to accept connections
61 | done
62 | server started
63 | ```
64 |
65 | We notice that the PostgreSQL server listens on port `5432`.
66 |
67 | `pg_ctl stop` can stop the server.
68 |
69 | A server program is usually paired with a client; ours is called `psql`. At the beginning, only the default databases exist, so we connect to the `postgres` database to access the PostgreSQL console.
70 |
71 | ```bash
72 | $> psql -d postgres
73 | psql (12.1)
74 | Type "help" for help.
75 |
76 | postgres=#
77 | ```
78 |
79 | `\?` allows you to see all the possible commands in the PostgreSQL console.
80 | The first thing we can do is list the databases with `\l`.
81 |
82 | ```txt
83 | postgres=# \l
84 | List of databases
85 | Name | Owner | Encoding | Collate | Ctype | Access privileges
86 | -----------+--------+----------+---------+-------+-------------------
87 | postgres | fbabin | UTF8 | C | C |
88 | template0 | fbabin | UTF8 | C | C | =c/fbabin +
89 | | | | | | fbabin=CTc/fbabin
90 | template1 | fbabin | UTF8 | C | C | =c/fbabin +
91 | | | | | | fbabin=CTc/fbabin
92 | (3 rows)
93 | ```
94 |
95 | We are going to create a database for the day.
96 | ```bash
97 | postgres=# CREATE DATABASE appstore_games;
98 | ```
99 | Add a user with a very strong password!
100 | ```bash
101 | postgres=# CREATE USER postgres_user WITH PASSWORD '12345';
102 | ```
103 | We must alter the database (i.e. change its attributes) so that our new user becomes its owner.
104 | ```bash
105 | postgres=# ALTER DATABASE appstore_games OWNER TO postgres_user;
106 | ```
107 | The last thing we need to do is edit the `~/.brew/var/postgres/pg_hba.conf` file to modify the following line.
108 | ```
109 | host all all 127.0.0.1/32 trust
110 | ```
111 | to
112 | ```
113 | host all all 127.0.0.1/32 md5
114 | ```
115 | This modification will force the use of the password to connect to the database.
116 |
117 | We are ready to use Postgres!
118 |
119 | ## Pyenv install
120 |
121 | Dealing with Python is often painful when it comes to Python versions and library versions. This problem typically shows up when several people work on the same server with different library needs.
122 | Furthermore, you don't want to mess with the system Python. That's why virtual environments and a separate Python installation are the preferred solution.
123 |
124 | You can install pyenv with brew using the following command.
125 |
126 | ```txt
127 | brew install pyenv
128 | ```
129 |
130 | All the python candidates can then be listed.
131 |
132 | ```txt
133 | pyenv install --list | grep " 3\.[678]"
134 | ```
135 | ... and installed. For the day we are going to choose version `3.8.0`.
136 |
137 | ```txt
138 | pyenv install -v 3.8.0
139 | ```
140 |
141 | Finally the installed version can be activated through this command.
142 |
143 | ```txt
144 | pyenv global 3.8.0
145 | ```
146 |
147 | Don’t forget to add those lines to your .zshrc file in order to activate your python environment each time you open a terminal.
148 |
149 | ```txt
150 | export PATH="$HOME/.pyenv/bin:$PATH"
151 | eval "$(pyenv init -)"
152 | eval "$(pyenv virtualenv-init -)"
153 |
154 | pyenv global 3.8.0 #activate the python 3.8.0 as default python
155 | ```
156 |
157 | ## Pipenv install
158 |
159 | Pipenv is a tool to manage the package versions of an environment. Its `Pipfile` plays a role very similar to a `requirements.txt` file, with some extra metadata.
160 |
161 | Pipenv can be installed with this simple command.
162 |
163 | ```txt
164 | pip install pipenv
165 | ```
166 |
167 | You can find a TOML file for the day named `Pipfile` in the resources.
168 |
169 | ```txt
170 | [[source]]
171 | url = "https://pypi.python.org/simple"
172 | verify_ssl = true
173 | name = "pypi"
174 |
175 | [packages]
176 | jupyter = "*"
177 | numpy = "*"
178 | pandas = "*"
179 | psycopg2 = "*"
180 |
181 | [requires]
182 | python_version = "3.8.0"
183 | ```
184 |
185 | To set up your environment, just follow these two steps.
186 |
187 | ```txt
188 | pipenv install
189 | pipenv shell
190 | ```
191 |
192 | You now have PostgreSQL, a virtual Python environment, and the requirements installed and ready for the day!
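
As a final, optional check, here is a minimal sketch to verify that Python can reach the server. It assumes the `appstore_games` database, the `postgres_user` user and the password created above, and that you are inside the `pipenv shell`:

```python
import psycopg2

# Assumes the database, user and password created earlier in this exercise.
conn = psycopg2.connect(
    database="appstore_games",
    host="localhost",
    user="postgres_user",
    password="12345"
)
curr = conn.cursor()
curr.execute("SELECT version();")
print(curr.fetchone()[0])   # e.g. "PostgreSQL 12.1 ..."
conn.close()
```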
193 |
--------------------------------------------------------------------------------
/module00/ex01/ex01.md:
--------------------------------------------------------------------------------
1 | # Exercise 01 - Clean
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex01 |
6 | | Files to turn in : | clean.py |
7 | | Forbidden function : | None |
8 | | Remarks : | n/a |
9 |
10 |
11 | ## Objective
12 |
13 | You must clean the given CSV dataset to insert it into a PostgreSQL table.
14 |
15 | ## Instructions
16 |
17 | The `appstore_games.csv.zip` file is available in the resources, you can unzip it to use it.
18 |
19 | We are going to keep the following columns: `ID`, `Name`, `Average User Rating`, `User Rating Count`, `Price`, `Description`, `Developer`, `Age Rating`, `Languages`, `Size`, `Primary Genre`, `Genres`, `Original Release Date`, `Current Version Release Date`.
20 |
21 | 1) You need to implement the function `df_nan_filter`. It takes a pandas dataframe as input and applies the following replacements for NaN values (a possible sketch is given after the prototype below):
22 | * remove the row if `Size` is NaN.
23 | * set `Languages` as "EN" if NaN.
24 | * set `Price` as 0.0 if NaN.
25 | * set `Average User Rating` as the median of the column if NaN.
26 | * set `User Rating Count` as 1 if NaN.
27 |
28 | ```python
29 | def df_nan_filter(df):
30 | """Apply filters on NaN values
31 | Args:
32 | df: pandas dataframe.
33 | Returns:
34 | Filtered Dataframe.
35 | Raises:
36 | This function shouldn't raise any Exception.
37 | """
38 | ```
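
For reference, here is one possible approach, a sketch only, built on plain pandas `dropna`/`fillna`; the exact handling of each column is up to you:

```python
import pandas as pd

def df_nan_filter(df: pd.DataFrame) -> pd.DataFrame:
    """One possible sketch of the NaN filter described above."""
    df = df.dropna(subset=["Size"])                    # drop rows with no Size
    df["Languages"] = df["Languages"].fillna("EN")     # default language
    df["Price"] = df["Price"].fillna(0.0)              # free by default
    df["Average User Rating"] = df["Average User Rating"].fillna(
        df["Average User Rating"].median())            # median rating
    df["User Rating Count"] = df["User Rating Count"].fillna(1)
    return df
```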
39 |
40 | 2) Create the function `change_date_format` that will change the date format from `dd/mm/yyyy` to `yyyy-mm-dd` (see the sketch after the example calls below).
41 |
42 | ```python
43 | def change_date_format(date: str):
44 | """Change date format from dd/mm/yyyy to yyyy-mm-dd
45 | Args:
46 | date: a string representing the date.
47 | Returns:
48 | The date in the format yyyy-mm-dd.
49 | Raises:
50 | This function shouldn't raise any Exception.
51 | """
52 | ```
53 |
54 | Your function must work with the following commands.
55 |
56 | ```python
57 | df["Original Release Date"] = df["Original Release Date"].apply(lambda x: change_date_format(x))
58 | df["Current Version Release Date"] = df["Current Version Release Date"].apply(lambda x: change_date_format(x))
59 | ```
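
One way to implement it, a sketch assuming the input dates really are `dd/mm/yyyy` strings, is to go through `datetime`:

```python
from datetime import datetime

def change_date_format(date: str):
    """Change date format from dd/mm/yyyy to yyyy-mm-dd (sketch)."""
    try:
        return datetime.strptime(date, "%d/%m/%Y").strftime("%Y-%m-%d")
    except (ValueError, TypeError):
        # the docstring asks for no exception to be raised
        return date
```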
60 |
61 | 3) You need to apply the following function to the `Description` column.
62 |
63 | ```python
64 | import re
65 |
66 | def string_filter(s: str):
67 | """Apply filters in order to clean the string.
68 | Args:
69 | s: string.
70 | Returns:
71 | Filtered String.
72 | Raises:
73 | This function shouldn't raise any Exception.
74 | """
75 | # filter : \\t, \\n, \\U1a1b2c3d4, \\u1a2b, \\x1a
76 | # turn \' into '
77 | # replace remaining \\ with \
78 | # turn multiple spaces into one space
79 | s = re.sub(r'''\\+(t|n|U[a-z0-9]{8}|u[a-z0-9]{4}|x[a-z0-9]{2}|[\.]{2})''', ' ', s)
80 | s = s.replace('\\\'', '\'').replace('\\\\', '\\')
81 | s = re.sub(r' +', ' ', s)
82 | return (s)
83 | ```
84 |
85 | 4) Remove the `ID` duplicates.
86 |
87 | 5) Convert the data type of the columns `Age Rating`, `User Rating Count` and `Size` to int.
88 |
89 | 6) Remove the rows whose `Name` length is lower than 4 characters.
90 |
91 | You must apply these steps to create a script producing the file `appstore_games.cleaned.csv`.
92 |
93 | ## Examples
94 |
95 | The following example does not show the true dataset nor the exact values obtained after the filters.
96 |
97 | ```txt
98 | >>> df = pd.read_csv("appstore_games.csv")
99 | >>> df.head(1)
100 | Average User Rating User Rating Count Price Languages
101 | 1 NaN NaN NaN NaN
102 | >>> df = df_nan_filter(df)
103 | >>> df.head(1)
104 | Average User Rating User Rating Count Price Languages
105 | 4 1 15 EN
106 | ```
107 |
108 | ```python
109 | for e in df:
110 | print("'{}' :: {}".format(e, df.loc[0, e]))
111 | ```
112 |
113 | With the above code, you should obtain something similar to this output for the values of the first row. The output shape is (16809, 14).
114 |
115 | \clearpage
116 |
117 | ```txt
118 | 'ID' :: 284921427
119 | 'Name' :: Sudoku
120 | 'Average User Rating' :: 4.0
121 | 'User Rating Count' :: 3553
122 | 'Price' :: 2.99
123 | 'Description' :: Join over 21,000,000 of our fans and download one of our Sudoku games today! Makers of the Best Sudoku Game of 2008, Sudoku (Free), we offer you the best selling Sudoku game for iPhone with great features and 1000 unique puzzles! Sudoku will give you many hours of fun and puzzle solving. Enjoy the challenge of solving Sudoku puzzles whenever or wherever you are using your iPhone or iPod Touch. OPTIONS All options are on by default, but you can turn them off in the Options menu Show Incorrect :: Shows incorrect answers in red. Smart Buttons :: Disables the number button when that number is completed on the game board. Smart Notes :: Removes the number from the notes in the box, column, and row that contains the cell with your correct answer. FEATURES 1000 unique handcrafted puzzles ALL puzzles solvable WITHOUT guessing Four different skill levels Challenge a friend Multiple color schemes ALL notes: tap the All notes button on to show all the possible answers for each square. Tap the All notes button off to remove the notes. Hints: shows the answer for the selected square or a random square when one is not selected Pause the game at any time and resume where you left off Best times, progress statistics, and much more Do you want more? Try one of our other versions of sudoku which have all the same great features! * Try Color Sudoku for a fun twist to solving sudoku puzzles. * For advanced puzzle solving, try Expert Sudoku to challenge your sudoku solving skills.
124 | 'Developer' :: Mighty Mighty Good Games
125 | 'Age Rating' :: 4
126 | 'Languages' :: DA, NL, EN, FI, FR, DE, IT, JA, KO, NB, PL, PT, RU, ZH, ES, SV, ZH
127 | 'Size' :: 15853568
128 | 'Primary Genre' :: Games
129 | 'Genres' :: Games, Strategy, Puzzle
130 | 'Original Release Date' :: 2008-07-11
131 | 'Current Version Release Date' :: 2017-05-30
132 | ```
133 |
--------------------------------------------------------------------------------
/module00/ex02/ex02.md:
--------------------------------------------------------------------------------
1 | # Exercise 02 - Normalize
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex02 |
6 | | Files to turn in : | normalize.py |
7 | | Forbidden function : | None |
8 | | Remarks : | n/a |
9 |
10 |
11 | ## Objective
12 |
13 | You must normalize the given CSV dataset to insert it into a PostgreSQL table.
14 |
15 | ## Instructions
16 |
17 | We are going to use the previously cleaned dataset and apply the `1NF normalization` rule to it.
18 |
19 | ### 1NF normalization
20 | * Each column should contain atomic values (list entries like `x, y` violate this rule).
21 | * Each column should contain values of the same type.
22 | * Each column should have unique names.
23 | * Order in which data is saved does not matter.
24 |
25 | This rule normally applies to a database, but we apply it here because these dataframes will become database tables in the next exercises.
26 |
27 | The only rule that the dataset currently breaks concerns lists of values in columns. Not fixing it would complicate queries a lot (querying on a list is not convenient).
28 |
29 |
30 | The two columns that don't respect this rule are `Languages` and `Genres`. In order to respect the 1NF rule, you have to create 3 dataframes (that are going to become PostgreSQL tables):
31 |
32 | * **df** : `ID`, `Name`, `Average User Rating`,`User Rating Count`, `Price`, `Description`, `Developer`, `Age Rating`, `Size`, `Original Release Date`, `Current Version Release Date`
33 | * **df_genres** : `ID`, `Primary Genre`, `Genre`
34 | * **df_languages** : `ID`, `Language`
35 |
36 | We want to go from this form ...
37 |
38 | ```txt
39 | +----------+-----------+
40 | |ID |Language |
41 | +----------+-----------+
42 | |284921427 |DA, NL, EN |
43 | +----------+-----------+
44 | ```
45 |
46 | ... to this one.
47 |
48 | ```txt
49 | +----------+---------+
50 | |ID |Language |
51 | +----------+---------+
52 | |284921427 |DA |
53 | |284921427 |NL |
54 | |284921427 |EN |
55 | +----------+---------+
56 | ```
57 |
58 | To do that we can use the `explode` function of pandas. This function only works with lists so we have to convert the string `DA, NL, EN` to a list format like `[DA, NL, EN]`.
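
For instance, on a toy dataframe (a sketch only, not the full script; `explode` requires pandas >= 0.25):

```python
import pandas as pd

toy = pd.DataFrame({"ID": [284921427], "Languages": ["DA, NL, EN"]})
# turn the comma-separated string into a list, then explode to one row per value
toy["Language"] = toy["Languages"].str.split(", ")
df_languages = toy.explode("Language")[["ID", "Language"]].reset_index(drop=True)
print(df_languages)
```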
59 |
60 | 1) Create the 3 dataframes (with the corresponding columns)
61 |
62 | 2) Convert multiple-word genres to a single-word format (e.g. `Arcade & Aventure` to `Arcade_&_Aventure`)
63 |
64 | 3) Convert strings to a list format (for the columns containing lists) and remove the 'Games' genre from each list (it is redundant since it appears in every list)
65 |
66 | 4) Use the `explode` function of pandas (the index of the dataframes will be broken)
67 | 5) Reset the index of the dataframes (`reset_index` function)
68 |
69 | 6) Save the dataframes into the files :
70 | * `appstore_games.normalized.csv` (shape : (16809, 11))
71 | * `appstore_games_genres.normalized.csv` (shape : (44252, 3))
72 | * `appstore_games_languages.normalized.csv` (shape : (54695, 2))
73 |
74 | ## Examples
75 |
76 | ```txt
77 | +----------+---------+
78 | |ID |Language |
79 | +----------+---------+
80 | |284921427 |DA |
81 | |284921427 |NL |
82 | |284921427 |EN |
83 | |284921427 |FI |
84 | |284921427 |FR |
85 | |... |... |
86 | +----------+---------+
87 | Only showing 5 lines !
88 | ```
89 |
90 | ```txt
91 | +----------+--------------+--------------+
92 | |ID |Primary Genre |Genre |
93 | +----------+--------------+--------------+
94 | |284921427 |Games |Strategy |
95 | |284921427 |Games |Puzzle |
96 | |284926400 |Games |Strategy |
97 | |284926400 |Games |Board |
98 | |284946595 |Games |Board |
99 | |... |... |... |
100 | +----------+--------------+--------------+
101 | ```
--------------------------------------------------------------------------------
/module00/ex03/ex03.md:
--------------------------------------------------------------------------------
1 | # Exercise 03 - Populate
2 | | | |
3 | | -----------------------:| ------------------ |
4 | | Turn-in directory : | ex03 |
5 | | Files to turn in : | populate.py |
6 | | Forbidden function : | None |
7 | | Remarks : | n/a |
8 |
9 | ## Objective
10 |
11 | You must insert :
12 | * `appstore_games.normalized.csv`
13 | * `appstore_games_genres.normalized.csv`
14 | * `appstore_games_languages.normalized.csv`
15 |
16 | data into a PostgreSQL table.
17 |
18 | ## Instructions
19 |
20 | You can read the psycopg2_basics documentation (some included functions will help you with this exercise).
21 |
22 | 1) You first need to create 3 functions.
23 | - `create_appstore_games`
24 | - `create_appstore_games_genres`
25 | - `create_appstore_games_languages`
26 |
27 | ... to create the following tables :
28 |
29 | {width=450px}
30 |
31 | nb: Foreign keys are a reference to an existing column in another table.
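
For example, a table with a foreign key could be created like this. This is a sketch only: the authoritative column names and types are the ones in the figure above, so treat the ones below as assumptions, and note that the referenced table `appstore_games` (with `id` as its primary key) must exist first.

```python
import psycopg2

def create_appstore_games_genres():
    # Assumed column names/types; appstore_games(id) must already exist.
    conn = psycopg2.connect(database="appstore_games", host="localhost",
                            user="postgres_user", password="12345")
    curr = conn.cursor()
    curr.execute("""CREATE TABLE appstore_games_genres (
            id serial PRIMARY KEY,
            game_id bigint REFERENCES appstore_games (id),  -- foreign key
            primary_genre varchar,
            genre varchar
        )""")
    conn.commit()
    conn.close()
```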
32 |
33 | 2) You will have to create the 3 populate functions
34 |
35 | * `populate_appstore_games`
36 | * `populate_appstore_games_genres`
37 | * `populate_appstore_games_languages`
38 |
39 | ... to insert data into the different tables.
40 |
41 | Before you do anything, you must ensure the PostgreSQL server is running.
42 |
43 | ## Examples
44 |
45 | At the end, displaying your tables should produce the following output:
46 |
47 | * `appstore_games_genres`
48 |
49 | ```txt
50 | +---+----------+--------------+---------+
51 | |id |game_id |primary_genre |genre |
52 | +---+----------+--------------+---------+
53 | |0 |284921427 |Games |Strategy |
54 | |1 |284921427 |Games |Puzzle |
55 | |2 |284926400 |Games |Strategy |
56 | |3 |284926400 |Games |Board |
57 | |4 |284946595 |Games |Board |
58 | |5 |284946595 |Games |Strategy |
59 | |6 |285755462 |Games |Strategy |
60 | |7 |285755462 |Games |Puzzle |
61 | |8 |285831220 |Games |Strategy |
62 | |9 |285831220 |Games |Board |
63 | |.. |... |... |... |
64 | +---+----------+--------------+---------+
65 | ```
66 |
67 | * `appstore_games_languages`
68 |
69 | ```txt
70 | +---+----------+---------+
71 | |id |game_id |language |
72 | +---+----------+---------+
73 | |0 |284921427 |DA |
74 | |1 |284921427 |NL |
75 | |2 |284921427 |EN |
76 | |3 |284921427 |FI |
77 | |4 |284921427 |FR |
78 | |5 |284921427 |DE |
79 | |6 |284921427 |IT |
80 | |7 |284921427 |JA |
81 | |8 |284921427 |KO |
82 | |9 |284921427 |NB |
83 | |.. |... |... |
84 | +---+----------+---------+
85 | ```
--------------------------------------------------------------------------------
/module00/ex03/psycopg2_basics.md:
--------------------------------------------------------------------------------
1 | # Psycopg2 basics
2 |
3 | Psycopg is a very popular PostgreSQL database adapter for the Python programming language. Its full documentation can be seen **[here](https://pypi.org/project/psycopg2/)**.
4 |
5 | The function `connect()` creates a new database session and returns a new connection instance.
6 |
7 | ```python
8 | import psycopg2
9 |
10 | def get_connection():
11 | conn = psycopg2.connect(
12 | database="appstore_games",
13 | host="localhost",
14 | user="postgres_user",
15 | password="12345"
16 | )
17 | return (conn)
18 | ```
19 |
20 | Cursors allow Python code to execute PostgreSQL commands in a database session.
21 |
22 | ```python
23 | curr = conn.cursor()
24 | ```
25 |
26 | Tables can be created with the cursor.
27 |
28 | ```python
29 | curr.execute("""CREATE TABLE members (
30 | id serial PRIMARY KEY,
31 | firstname varchar(32),
32 | lastname varchar(32),
33 | birthdate date
34 | )
35 | """)
36 | ```
37 |
38 | It's also possible to remove a table.
39 |
40 | ```python
41 | curr.execute("DROP TABLE members")
42 | ```
43 |
44 | To make changes persistent in the database, we need to commit the transaction. Finally, we can close the connection.
45 |
46 | ```python
47 | conn.commit()
48 | conn.close()
49 | ```
50 |
51 | This gives the following full code.
52 |
53 | ```python
54 | import psycopg2
55 |
56 | def get_connection():
57 | conn = psycopg2.connect(
58 | database="appstore_games",
59 | host="localhost",
60 | user="postgres_user",
61 | password="12345"
62 | )
63 | return (conn)
64 |
65 | if __name__ == "__main__":
66 | conn = get_connection()
67 | curr = conn.cursor()
68 | curr.execute("""CREATE TABLE members (
69 | id serial PRIMARY KEY,
70 | firstname varchar(32),
71 | lastname varchar(32),
72 | birthdate date
73 | )
74 | """)
75 | conn.commit()
76 | conn.close()
77 | ```
78 |
79 | ## Inserting data
80 |
81 | Data can be inserted into a table with the following syntax.
82 |
83 | ```python
84 | curr.execute("""
85 | INSERT INTO members(firstname, lastname, birthdate) VALUES
86 | ('Eric', 'Clapton', '1945-03-30'),
87 | ('Joe', 'Bonamassa', '1977-05-08')
88 | """)
89 | ```
90 |
91 | ## Delete data
92 |
93 | Data can also be deleted.
94 |
95 | ```python
96 | curr.execute("""DELETE FROM members
97 | WHERE lastname LIKE 'Clapton'
98 | """)
99 | ```
100 |
101 | # Useful functions
102 |
103 | ## get connections
104 |
105 | ```python
106 | def get_connection():
107 | conn = psycopg2.connect(
108 | database="appstore_games",
109 | host="localhost",
110 | user="postgres_user",
111 | password="12345"
112 | )
113 | return (conn)
114 | ```
115 |
116 | ## Showing table content
117 |
118 | We must use the `fetchall` function to gather all the results in a list of tuples. The helpers below reuse the `get_connection` function defined above and assume `from psycopg2.extensions import AsIs` for interpolating table names.
119 |
120 | ```python
121 | def display_table(table: str):
122 |     conn = get_connection()
123 | curr = conn.cursor()
124 | curr.execute("""SELECT * FROM %(table)s
125 | LIMIT 10""", {"table": AsIs(table)})
126 | response = curr.fetchall()
127 | for row in response:
128 | print(row)
129 | conn.close()
130 | ```
131 |
132 | ## Create a table
133 |
134 | ```python
135 | def create_table():
136 | conn = get_connection()
137 | curr = conn.cursor()
138 | curr.execute("""CREATE TABLE test(
139 | FirstName varchar PRIMARY KEY,
140 | LastName varchar,
141 | Age int
142 | );""")
143 | conn.commit()
144 | conn.close()
145 | ```
146 |
147 | ## Drop table
148 |
149 | ```python
150 | def delete_table(table: str):
151 |     conn = get_connection()
152 | curr = conn.cursor()
153 | curr.execute("DROP TABLE IF EXISTS %(table)s;", {"table": AsIs(table)})
154 | conn.commit()
155 | conn.close()
156 | ```
157 |
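Both functions above interpolate the table name with `AsIs`, which performs no escaping at all. The `psycopg2.sql` module provides a safer way to interpolate identifiers such as table names. A minimal sketch of a hypothetical variant of `delete_table` using it:

```python
from psycopg2 import sql

def delete_table_safe(table: str):
    # Same behaviour as delete_table above, but the table name is quoted
    # as an identifier by psycopg2.sql instead of being pasted in with AsIs.
    conn = get_connection()
    curr = conn.cursor()
    curr.execute(sql.SQL("DROP TABLE IF EXISTS {};").format(sql.Identifier(table)))
    conn.commit()
    conn.close()
```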
158 | ## Inserting data into a table
159 |
160 | ```python
161 | def populate_table():
162 | conn = get_connection()
163 | curr = conn.cursor()
164 | curr.execute("""INSERT INTO test
165 | (FirstName,
166 | LastName,
167 | Age) VALUES
168 | (%s, %s, %s)""",
169 | ('Michelle',
170 | 'Dupont',
171 | '33'))
172 | conn.commit()
173 | conn.close()
174 | ```
--------------------------------------------------------------------------------
/module00/ex04/ex04.md:
--------------------------------------------------------------------------------
1 | # Exercise 04 - Top100
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex04 |
6 | | Files to turn in : | top100.py |
7 | | Forbidden functions : | None |
8 | | Remarks : | n/a |
9 |
10 | ## Objective
11 |
12 | You must show the Name of the top 100 games with the best user rating.
13 |
14 | ## Instructions
15 |
16 | You must create a program using the function `get_top_100`.
17 |
18 | This function must show the Name of the top 100 games, ordered by `Avg_user_rating` first and then by `Name`.
19 |
20 | Games whose names do not start with a letter must be ignored. Then, you must show the first 100 games whose names start with a letter.
21 |
22 | **You must only use PostgreSQL for your queries!**
23 |
24 | ## Example
25 |
26 | ```txt
27 | >> get_top_100()
28 | AFK Arena
29 | APORIA
30 | AbsoluteShell
31 | Action Craft Mini Blockheads Match 3 Skins Survival Game
32 | Adrift by Tack
33 | Agadmator Chess Clock
34 | Age Of Magic
35 | Age of Giants: Tribal Warlords
36 | Age of War Empires: Order Rise
37 | Alicia Quatermain 2 (Platinum)
38 | ...
39 | ```
40 |
41 | As you guessed, you should have 100 hits.
--------------------------------------------------------------------------------
/module00/ex05/ex05.md:
--------------------------------------------------------------------------------
1 | # Exercise 05 - Name_lang
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex05 |
6 | | Files to turn in : | name_lang.py |
7 | | Forbidden functions : | None |
8 | | Remarks : | n/a |
9 |
10 | ## Objective
11 |
12 | You must show the Name and Language of games priced strictly between 5 and 10 euros (both bounds excluded).
13 |
14 | ## Instructions
15 |
16 | You must create a program using the function `get_name_lang` that will show the Name and Language of games priced strictly between 5 and 10 euros.
17 |
18 | **You must only use PostgreSQL for your queries!**
19 |
20 |
21 | ## Example
22 |
23 | ```txt
24 | >> get_name_lang()
25 | Chess Genius, EN
26 | Chess Genius, FR
27 | Chess Genius, DE
28 | Chess Genius, IT
29 | Chess Genius, ES
30 | Chess - tChess Pro, EN
31 | Chess - tChess Pro, FR
32 | Chess - tChess Pro, DE
33 | Chess - tChess Pro, JA
34 | Chess - tChess Pro, KO
35 | ...
36 | ```
37 |
38 | You should have 634 hits.
--------------------------------------------------------------------------------
/module00/ex06/ex06.md:
--------------------------------------------------------------------------------
1 | # Exercise 06 - K-first
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex06 |
6 | | Files to turn in : | k_first.py |
7 | | Forbidden functions : | None |
8 | | Remarks : | n/a |
9 |
10 | ## Objective
11 |
12 | You must show the name of developers starting with 'K' and involved in casual games.
13 |
14 | ## Instructions
15 |
16 | You must create a program using the function `get_k_first` that shows the name of developers starting with 'K' (case sensitive) and involved in casual games.
17 |
18 | **You must only use PostgreSQL for your queries!**
19 |
20 |
21 | ## Example
22 |
23 | ```txt
24 | >> get_k_first()
25 | Koh Jing Yu
26 | Kyle Decot
27 | Kashif Tasneem
28 | Kristin Nutting
29 | Kok Leong Tan
30 | Key Player Publishing Limited
31 | KillerBytes
32 | KillerBytes
33 | Khoa Tran
34 | Kwai Ying Cindy Cheung
35 | KG2 Entertainment LLC
36 | Keehan Roberts
37 | ...
38 | ```
39 |
40 | You should have 40 hits.
--------------------------------------------------------------------------------
/module00/ex07/ex07.md:
--------------------------------------------------------------------------------
1 | # Exercise 07 - Seniors
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex07 |
6 | | Files to turn in : | seniors.py |
7 | | Forbidden functions : | None |
8 | | Remarks : | n/a |
9 |
10 | ## Objective
11 |
12 | You must show the Name of developers involved in games released on or before 01/08/2008 and updated on or after 01/01/2018.
13 |
14 | ## Instructions
15 |
16 | You must create a program using a function `get_seniors` that shows the Name of developers involved in games released on or before 01/08/2008 and updated on or after 01/01/2018.
17 |
18 | **You must only use PostgreSQL for your queries!**
19 |
20 |
21 | ## Example
22 |
23 | ```txt
24 | >> get_seniors()
25 | Kiss The Machine
26 | ...
27 | ```
28 |
29 | You should have 3 hits.
--------------------------------------------------------------------------------
/module00/ex08/ex08.md:
--------------------------------------------------------------------------------
1 | # Exercise 08 - Battle_royale
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex08 |
6 | | Files to turn in : | battle_royale.py |
7 | | Forbidden functions : | None |
8 | | Remarks : | n/a |
9 |
10 | ## Objective
11 |
12 | You must show the name of the games with "battle royale" in their description and with a URL that will redirect to `facebook.com`.
13 |
14 | ## Instructions
15 |
16 | You must create a program using a function `get_battle_royale` that shows the name of the games with "battle royale" (case insensitive) in their description and with a URL that will redirect to `facebook.com`.
17 |
18 | **You must only use PostgreSQL for your queries!**
19 |
20 |
21 | ## Example
22 |
23 | ```txt
24 | >> get_battle_royale()
25 | Lords Mobile: War Kingdom
26 | Crusaders of Light
27 | Blob io - Throw & split cells
28 | ...
29 | ```
30 |
31 | You should have 5 hits.
--------------------------------------------------------------------------------
/module00/ex09/ex09.md:
--------------------------------------------------------------------------------
1 | # Exercise 09 - Benefits
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex09 |
6 | | Files to turn in : | benefits.py |
7 | | Forbidden function : | None |
8 | | Remarks : | n/a |
9 |
10 | ## Objective
11 |
12 | Show the first 10 genres that generated the most benefits.
13 |
14 | ## Instructions
15 |
16 | You must create a program using the function `get_benefits` that will show the first 10 genres that generated the most "benefits".
17 |
18 | Benefits are calculated as the number of users who voted multiplied by the price of the game.
19 |
20 | **You must only use PostgreSQL for your queries!**
21 |
22 |
23 | ## Example
24 |
25 | ```txt
26 | >> get_benefits()
27 | Strategy
28 | Entertainment
29 | ...
30 | ```
31 |
32 | You should have 48 hits.
--------------------------------------------------------------------------------
/module00/ex10/ex10.md:
--------------------------------------------------------------------------------
1 | # Exercise 10 - Sweet spot
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex10 |
6 | | Files to turn in : | sweet_spot.py |
7 | | Forbidden function : | None |
8 | | Remarks : | n/a |
9 |
10 | ## Objective
11 |
12 | Find the month in which the largest number of games were released.
13 |
14 | ## Instructions
15 |
16 | Find the month in which the largest number of games were released.
17 |
18 | **You must only use PostgreSQL for your queries!**
19 |
20 |
21 | ## Example
22 |
23 | This answer may not be the right one.
24 |
25 | ```txt
26 | january
27 | ```
28 |
29 | You should have 1 hit.
--------------------------------------------------------------------------------
/module00/ex11/ex11.md:
--------------------------------------------------------------------------------
1 | # Exercise 11 - Price Analysis
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex11 |
6 | | Files to turn in : | price.py, price.png |
7 | | Forbidden function : | None |
8 | | Remarks : | n/a |
9 | | Allowed python libraries : | matplotlib, numpy |
10 |
11 | ## Objective
12 |
13 | Analyze the price distribution of games by plotting a histogram of the price distribution.
14 |
15 | ## Instructions
16 |
17 | First, you need to write the right query to output a table containing the distribution of prices, i.e. the number of games at each price.
18 |
19 | Then, you can use matplotlib to create a histogram. Your histogram will have to:
20 | - not show games with a price below 1.0
21 | - have bars with a 3-euro interval (bin width)
22 | - have the xlabel `Price`
23 | - have the ylabel `Frequency`
24 | - have the title `Appstore games price`
25 |
26 | You will have to save your histogram in a file named `price.png`.
27 |
28 | Finally, you have to use numpy to find the mean and the standard deviation of your data set.
29 |
30 | NB: you do not need to worry about the number of decimals printed.
31 |
32 | **You can use PostgreSQL and Python (for numpy, matplotlib, bins creation ...)**
33 |
34 |
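For the NumPy part, a minimal sketch, assuming the individual game prices have already been collected into a Python list (the `prices` values below are made up for illustration):

```python
import numpy as np

prices = [1.99, 2.99, 4.99, 1.99, 8.99]  # placeholder data, not the real dataset
print("mean price :", np.mean(prices))
print("std price :", np.std(prices))
```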
35 | ## Example
36 |
37 | This answer may not be the right one.
38 |
39 | ```txt
40 | $> python price.py
41 | mean price : 15.04098
42 | std price : 6.03456
43 | ```
--------------------------------------------------------------------------------
/module00/ex12/ex12.md:
--------------------------------------------------------------------------------
1 | # Exercise 12 - Worldwide
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex12 |
6 | | Files to turn in : | worldwide.py |
7 | | Forbidden function : | None |
8 | | Remarks : | n/a |
9 |
10 | ## Objective
11 |
12 | Give the top 5 most represented genres among games that have a number of distinct languages greater than or equal to 3.
13 |
14 | ## Instructions
15 |
16 | You must write a query that filters games according to the number of languages they have, filtering out the ones that have strictly fewer than 3 languages. Then you need to select the top 5 genres in which those games appear.
17 |
18 | **You must only use PostgreSQL for your queries!**
19 |
20 |
21 | ## Example
22 |
23 | ```txt
24 | $> python worldwide.py
25 | Strategy
26 | ...
27 | ```
28 |
29 | As you guessed, you should have 5 hits.
--------------------------------------------------------------------------------
/module00/ex13/ex13.md:
--------------------------------------------------------------------------------
1 | # Exercise 13 - Italian_market
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex13 |
6 | | Files to turn in: | italian_market.py |
7 | | Forbidden functions: | None |
8 | | Remarks: | n/a |
9 |
10 | ## Objective
11 |
12 | Create a script which lists the games supporting the Italian language first, and Spanish otherwise.
13 |
14 | ## Instructions
15 |
16 | You must write a script which lists the games supporting the Italian language first, and Spanish otherwise.
17 |
18 | Hint: you should have a look at window functions.
19 |
20 | **You must only use PostgreSQL for your queries!**
21 |
22 | ## Example
23 |
24 | ```txt
25 | $> python italian_market.py
26 | 100 Balls plus 20
27 | 1010 Block King Puzzle
28 | 1010 Fit for Blocks bricks
29 | 1024 - 2048 - 4096 - 8192
30 | ...
31 | ```
32 |
33 | You should have 2471 hits.
34 |
--------------------------------------------------------------------------------
/module00/ex14/ex14.md:
--------------------------------------------------------------------------------
1 | # Exercise 14 - Sample
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex14 |
6 | | Files to turn in: | sample.py |
7 | | Forbidden functions: | None |
8 | | Remarks: | n/a |
9 |
10 | ## Objective
11 |
12 | Create a statistically representative sample of your dataset.
13 |
14 | ## Instructions
15 |
16 | 1) We need to find a good sample size for our dataset. You must find out how the calculation of a representative sample size works.
17 |
18 | Find a sample size calculator online and compute the sample size using the given parameters:
19 | - A margin of error of 5%
20 | - A confidence level of 95%
21 | - The population size (the size of the appstore_games table)
22 |
23 | Then put the sample size in a variable.
24 |
25 | 2) Write a PostgreSQL `sample` function that will randomly select a given number of rows (sample_size parameter)
26 |
27 | 3) Use your `sample` function to randomly select a sample and save the result into a CSV file named `appstore_games.sample.csv`
28 |
29 | Hint: you can use `pd.read_sql_query` and `df.to_csv`!
30 |
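Regarding the hint above, a minimal sketch of how `pd.read_sql_query` and `df.to_csv` fit together; the query shown is only a placeholder (the real one must call your `sample` function), and `get_connection` is assumed to be the psycopg2 helper from the previous exercises:

```python
import pandas as pd

conn = get_connection()  # psycopg2 connection helper from the previous exercises
# Placeholder query: replace it with a call to your PostgreSQL `sample` function.
df = pd.read_sql_query("SELECT * FROM appstore_games LIMIT 10", conn)
df.to_csv("appstore_games.sample.csv", index=False)
conn.close()
```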
31 | **You must only use PostgreSQL for your queries!**
32 |
33 |
34 | ## Bonus
35 |
36 | Write a Python function `sample_size` with the following parameters:
37 | - `population_size`
38 | - `confidence_level` : default value `0.95`
39 | - `margin_error` : default value `0.05`
40 | - `standard_deviation` : default value `0.5`
41 |
42 | This function will compute the sample size needed for the given parameters, following the formula below:
43 |
44 | $$
45 | sample\_size = \frac{\frac{zscore^2 \times std(1 - std)}{margin\_error^2}}{1 + \frac{zscore^2 \times std(1 - std)}{margin\_error^2 \times Population\_size}}
46 | $$
47 |
48 | The z_score depends on the confidence level following this table:
49 |
50 | \clearpage
51 |
52 | |Confidence_level|Z_score|
53 | |---|---|
54 | |0.80|1.28|
55 | |0.85|1.44|
56 | |0.90|1.65|
57 | |0.95|1.96|
58 | |0.99|2.58|
59 |
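For reference, a minimal sketch of one possible translation of the formula and z-score table above into Python (the `Z_SCORES` dictionary and the intermediate variable names are assumptions, not requirements of the subject):

```python
# z-scores taken from the table above
Z_SCORES = {0.80: 1.28, 0.85: 1.44, 0.90: 1.65, 0.95: 1.96, 0.99: 2.58}

def sample_size(population_size, confidence_level=0.95,
                margin_error=0.05, standard_deviation=0.5):
    z = Z_SCORES[confidence_level]
    # z^2 * std * (1 - std) / margin_error^2
    numerator = (z ** 2) * standard_deviation * (1 - standard_deviation) / margin_error ** 2
    # finite population correction
    return numerator / (1 + numerator / population_size)
```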
--------------------------------------------------------------------------------
/module00/module00.md:
--------------------------------------------------------------------------------
1 | # Module00 - SQL with PostgreSQL
2 |
3 | In this module, you will learn how to use a SQL database: PostgreSQL.
4 |
5 | ## Notions of the module
6 |
7 | The purpose of the module is first to create, administer and normalize a PostgreSQL database. Then, we are going to analyse the data and visualize the content of the database. Finally, we will see advanced notions like caching, replication and backups.
8 |
9 | ## General rules
10 |
11 | * The version of Python to use is 3.7; you can check your Python version with the following command: `python -V`.
12 | * You will follow the **[PEP 8 standard](https://www.python.org/dev/peps/pep-0008/)**.
13 | * The exercises are ordered from the easiest to the hardest.
14 | * Your exercises are going to be evaluated by someone else, so make sure that variable and function names are appropriate.
15 | * Your manual is the internet.
16 | * You can also ask any question in the dedicated channel in Slack: **[42ai slack](https://42-ai.slack.com)**.
17 | * If you find any issues or mistakes in the subject, please create an issue on our dedicated repository on **[Github issues](https://github.com/42-AI/bootcamp_data-engineering/issues)**.
18 |
19 | ## Foreword
20 |
21 | Data Engineering implies many tasks, from organizing the data to putting data systems into production. Data organization is often a mess in companies and our job is to provide a common, well-organized data source. Historically, the organization of the data was used to analyze the business and determine future business decisions. Those data organizations are called [Data warehouses](https://www.tutorialspoint.com/dwh/index.htm) and are used by business intelligence teams (teams in charge of analyzing the business). This organization of the data follows a [star schema](https://www.tutorialspoint.com/dwh/dwh_schemas.htm) allowing fast analysis.
22 |
23 | Nowadays, we want to meet other use cases' needs, such as providing data to data science teams or other projects. To do so, we want to deliver a common data organization which won't be project-specific but which will be usable by anyone willing to use it (business intelligence, data scientists, ...). This
24 | new data organization is called a [Data Lake](https://medium.com/rock-your-data/getting-started-with-data-lake-4bb13643f9). It contains all the company data. The job of data engineering consists of organizing the data:
25 | - ingestion
26 | - storage
27 | - the associated catalog and search engine
28 |
29 | To do that, SQL is often used to filter, join and select the data. During the module, you will discover an open-source SQL database: PostgreSQL.
30 |
31 | ### Exercise 00 - Setup
32 | ### Exercise 01 - Clean
33 | ### Exercise 02 - Normalize
34 | ### Exercise 03 - Populate
35 | ### Exercise 04 - Top_100
36 | ### Exercise 05 - Name_lang
37 | ### Exercise 06 - K-first
38 | ### Exercise 07 - Seniors
39 | ### Exercise 08 - Battle_royale
40 | ### Exercise 09 - Benefits
41 | ### Exercise 10 - Sweet_spot
42 | ### Exercise 11 - Price_analysis
43 | ### Exercise 12 - Worldwide
44 | ### Exercise 13 - Italian_market
45 | ### Exercise 14 - Sample
46 |
--------------------------------------------------------------------------------
/module00/resources/Pipfile:
--------------------------------------------------------------------------------
1 | [[source]]
2 | url = "https://pypi.python.org/simple"
3 | verify_ssl = true
4 | name = "pypi"
5 |
6 | [packages]
7 | jupyter = "*"
8 | numpy = "*"
9 | pandas = "*"
10 | psycopg2 = "*"
11 |
12 | [requires]
13 | python_version = ">=3.7"
14 |
15 |
--------------------------------------------------------------------------------
/module00/resources/db/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM postgres:12.2-alpine
2 |
3 | # run init.sql
4 | ADD init.sql /docker-entrypoint-initdb.d
5 | ADD pg_hba.conf /var/lib/postgresql/data
--------------------------------------------------------------------------------
/module00/resources/db/init.sql:
--------------------------------------------------------------------------------
1 | CREATE DATABASE appstore_games;
2 | CREATE USER postgres_user WITH PASSWORD '12345';
3 | ALTER DATABASE appstore_games OWNER TO postgres_user;
--------------------------------------------------------------------------------
/module00/resources/db/pg_hba.conf:
--------------------------------------------------------------------------------
1 | # TYPE DATABASE USER ADDRESS METHOD
2 |
3 | # "local" is for Unix domain socket connections only
4 | local all all trust
5 | # IPv4 local connections:
6 | host all all 127.0.0.1/32 md5
7 | # IPv6 local connections:
8 | host all all ::1/128 trust
9 | # Allow replication connections from localhost, by a user with the
10 | # replication privilege.
11 | local replication all trust
12 | host replication all 127.0.0.1/32 trust
13 | host replication all ::1/128 trust
14 |
15 | host all all all md5
--------------------------------------------------------------------------------
/module00/resources/docker-compose.yml:
--------------------------------------------------------------------------------
1 | version: '3.4'
2 |
3 | services:
4 | db:
5 | build:
6 | context: ./db
7 | dockerfile: Dockerfile
8 | ports:
9 | - 54320:5432
10 | environment:
11 | - POSTGRES_USER=postgres
12 | - POSTGRES_PASSWORD=postgres
13 | - PGDATA=/var/lib/postgresql/data/pgdata
14 | volumes:
15 | - type: volume
16 | source: db-data
17 | target: /var/lib/postgresql/data
18 | volume:
19 | nocopy: true
20 |
21 | volumes:
22 | db-data:
--------------------------------------------------------------------------------
/module00/resources/docker_install.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | # **************************************************************************** #
3 | # #
4 | # ::: :::::::: #
5 | # init_docker.sh :+: :+: :+: #
6 | # +:+ +:+ +:+ #
7 | # By: aguiot-- +#+ +:+ +#+ #
8 | # +#+#+#+#+#+ +#+ #
9 | # Created: 2019/11/18 08:17:08 by aguiot-- #+# #+# #
10 | # Updated: 2020/02/20 14:00:32 by aguiot-- ### ########.fr #
11 | # Updated: 2020/02/20 14:34:42 by aguiot-- ### ########.fr #
12 | # #
13 | # **************************************************************************** #
14 |
15 | # https://github.com/alexandregv/42toolbox
16 |
17 | # Ensure USER variable is set
18 | [ -z "${USER}" ] && export USER=$(whoami)
19 |
20 | ################################################################################
21 |
22 | # Config
23 | docker_destination="/goinfre/$USER/docker" #=> Select docker destination (goinfre is a good choice)
24 |
25 | ################################################################################
26 |
27 | # Colors
28 | blue=$'\033[0;34m'
29 | cyan=$'\033[1;96m'
30 | reset=$'\033[0;39m'
31 |
32 | # Uninstall docker, docker-compose and docker-machine if they are installed with brew
33 | brew uninstall -f docker docker-compose docker-machine &>/dev/null ;:
34 |
35 | # Check if Docker is installed with MSC and open MSC if not
36 | if [ ! -d "/Applications/Docker.app" ] && [ ! -d "~/Applications/Docker.app" ]; then
37 | echo "${blue}Please install ${cyan}Docker for Mac ${blue}from the MSC (Managed Software Center)${reset}"
38 | open -a "Managed Software Center"
39 | read -n1 -p "${blue}Press RETURN when you have successfully installed ${cyan}Docker for Mac${blue}...${reset}"
40 | echo ""
41 | fi
42 |
43 | # Kill Docker if started, so it doesn't create files during the process
44 | pkill Docker
45 |
46 | # Ask to reset destination if it already exists
47 | if [ -d "$docker_destination" ]; then
48 | read -n1 -p "${blue}Folder ${cyan}$docker_destination${blue} already exists, do you want to reset it? [y/${cyan}N${blue}]${reset} " input
49 | echo ""
50 | if [ -n "$input" ] && [ "$input" = "y" ]; then
51 | rm -rf "$docker_destination"/{com.docker.{docker,helper},.docker} &>/dev/null ;:
52 | fi
53 | fi
54 |
55 | # Unlink all symlinks, if there are any
56 | unlink ~/Library/Containers/com.docker.docker &>/dev/null ;:
57 | unlink ~/Library/Containers/com.docker.helper &>/dev/null ;:
58 | unlink ~/.docker &>/dev/null ;:
59 |
60 | # Delete directories if they were not symlinks
61 | rm -rf ~/Library/Containers/com.docker.{docker,helper} ~/.docker &>/dev/null ;:
62 |
63 | # Create destination directories in case they don't already exist
64 | mkdir -p "$docker_destination"/{com.docker.{docker,helper},.docker}
65 |
66 | # Make symlinks
67 | ln -sf "$docker_destination"/com.docker.docker ~/Library/Containers/com.docker.docker
68 | ln -sf "$docker_destination"/com.docker.helper ~/Library/Containers/com.docker.helper
69 | ln -sf "$docker_destination"/.docker ~/.docker
70 |
71 | # Start Docker for Mac
72 | open -g -a Docker
73 |
74 | echo "${cyan}Docker${blue} is now starting! Please report any bug to: ${cyan}aguiot--${reset}"
75 |
--------------------------------------------------------------------------------
/module00/resources/psycopg2_documentation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module00/resources/psycopg2_documentation.pdf
--------------------------------------------------------------------------------
/module01.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module01.pdf
--------------------------------------------------------------------------------
/module01/assets/dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module01/assets/dashboard.png
--------------------------------------------------------------------------------
/module01/ex00/ex00.md:
--------------------------------------------------------------------------------
1 | # Exercise 00 - The setup.
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex00 |
6 | | Files to turn in : | |
7 | | Forbidden function : | None |
8 | | Remarks : | n/a |
9 |
10 | Let's start simple:
11 |
12 | * Download and install Elasticsearch.
13 | - Go to [Elasticsearch download](https://www.elastic.co/downloads/past-releases).
14 | - In the product filter select `Elasticsearch`.
15 | - Choose the version 7.5.2 and download the tar.gz file.
16 | * Unzip the file
17 | * You should have several directories:
18 |
19 | | Directory | Description |
20 | | --------:| ------------------------------------------------------------------------------------------------- |
21 | | `/bin` | Binary scripts including elasticsearch to start a node and elasticsearch-plugin to install plugins |
22 | | `/config` | Configuration files including elasticsearch.yml |
23 | | `/data` | The location of the data files of each index and shard allocated on the node |
24 | | `/jdk` | The bundled version of OpenJDK from the JDK maintainers (GPLv2+CE) |
25 | | `/lib` | The Java JAR files of Elasticsearch |
26 | | `/logs` | Elasticsearch log files location |
27 | | `/modules` | Contains various Elasticsearch modules |
28 | | `/plugins` | Plugin files location. Each plugin will be contained in a subdirectory |
29 |
30 | * Start your cluster by running `./elasticsearch` from the `/bin` folder and wait a few seconds for the node to start.
31 |
32 | Ok so now your cluster should be running and listening on `http://localhost:9200`.
33 | Elasticsearch works with a REST API, which means that to query your cluster you just have to send an HTTP request to the right endpoints (we will come to that).
34 |
35 | Check you can access the cluster:
36 |
37 | ```
38 | curl http://localhost:9200
39 | ```
40 | You can do the same in a web browser.
41 |
42 | You should see something like this:
43 |
44 | ```
45 | {
46 | "name" : "e3r4p23.42.fr",
47 | "cluster_name" : "elasticsearch",
48 | "cluster_uuid" : "SZdgmzxFSnW2IMVxvVj-9w",
49 | "version" : {
50 | "number" : "7.5.2",
51 | "build_flavor" : "default",
52 | "build_type" : "tar",
53 | "build_hash" : "e9ccaed468e2fac2275a3761849cbee64b39519f",
54 | "build_date" : "2019-11-26T01:06:52.518245Z",
55 | "build_snapshot" : false,
56 | "lucene_version" : "8.3.0",
57 | "minimum_wire_compatibility_version" : "6.8.0",
58 | "minimum_index_compatibility_version" : "6.0.0-beta1"
59 | },
60 | "tagline" : "You Know, for Search"
61 | }
62 | ```
63 |
64 | If not, feel free to look at the doc :) (or ask your neighbors, or google...) [Elasticsearch setup](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html).
65 |
66 | Now stop the cluster (ctrl-c). Change the configuration so that your cluster name is `"my-cluster"` and the node name is `"node1"`.
67 |
68 | Restart your cluster and check the new names with
69 |
70 | ```
71 | curl http://localhost:9200
72 | ```
73 |
--------------------------------------------------------------------------------
/module01/ex01/ex01.md:
--------------------------------------------------------------------------------
1 | # Exercise 01 - The CRUDité.
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex01 |
6 | | Files to turn in : | create-doc.sh; ex01-queries.txt |
7 | | Forbidden function : | None |
8 | | Remarks : | n/a |
9 |
10 | Now we are going to see how to perform basic CRUD operations on Elasticsearch.
11 |
12 | ## Create
13 |
14 | I'm gonna make it easy for you: Here is a curl request that creates a document with id=1 into an index named "twitter" and containing 3 fields:
15 |
16 | ```
17 | curl -X PUT "http://localhost:9200/twitter/_doc/1?pretty" -H 'Content-Type: application/json' -d'
18 | {
19 | "user" : "popol",
20 | "post_date" : "02 12 2015",
21 | "message" : "trying out elasticsearch"
22 | }
23 | '
24 | ```
25 |
26 | So, what do we have here?
27 |
28 | HTTP PUT method (remember, Elasticsearch uses a REST API) followed by:
29 |
30 | `ip_of_the_cluster:9200/index_name/_doc/id_of_the_document`, then a header specifying the content-type as JSON, and finally the JSON body.
31 | Every document in Elasticsearch is a JSON document, and every request to Elasticsearch is sent as JSON within an HTTP request.
32 |
33 | Try it out, you should get an answer from the server confirming the creation of the document.
34 |
35 | Let's see another way to create a document: Modify the above request to create a document in the twitter index but this time without specifying the id of the document. The document shall have the following fields:
36 |
37 | ```
38 | {
39 | "user" : "popol",
40 | "post_date" : "20 01 2019",
41 | "message" : "still trying out Elasticsearch"
42 | }
43 | ```
44 |
45 | **Hint**: try POST instead of PUT
46 |
47 | Run the following command and check you have two hits.
48 |
49 | ```
50 | curl -XGET "http://localhost:9200/twitter/_search"\?pretty
51 | ```
52 |
53 | Look at the `_id` of each document and try to understand why they have those values.
54 |
55 | Save your curl request to a file named `create-doc.sh`. The file shall be executable for the correction and it shall create the two documents above.
56 |
57 | Ok nice, you have just created your two first documents and your first index !!
58 | However, using curl is not very convenient right... Wouldn't it be awesome to have a nice dev tool to write out those requests... Kibana !!
59 | Kibana is the visualization tool of the Elastic Stack. What's the Elastic Stack? -> [ELK stack](https://www.elastic.co/what-is/elk-stack).
60 |
61 | ### Kibana Install
62 |
63 | Let's install Kibana!
64 |
65 | As you did for Elasticsearch, on the same link
66 | - download Kibana v7.5.2
67 | - Unzip the file with tar and run it.
68 | - Wait until Kibana is started.
69 |
70 | You should see something like :
71 | `[16:09:00.957] [info][server][Kibana][http] http server running at http://localhost:5601`
72 |
73 | - Open your browser and go to `http://localhost:5601`
74 | - Click on the dev tool icon on the navigation pane (3rd before last)
75 | Here you can write your queries to the cluster in a much nicer environment than curl. You should have a pre-made match_all query. Run it; in the result, among other things, you should see the documents you have created.
76 |
77 | Try to create the following two documents in Kibana, still in the twitter index:
78 |
79 | ```
80 | {
81 | "user" : "mimich",
82 | "post_date" : "31 12 2015",
83 | "message" : "Trying out Kibana"
84 | }
85 | ```
86 |
87 | and:
88 |
89 | ```
90 | {
91 | "user" : "jean mimich",
92 | "post_date" : "01 01 2016",
93 | "message" : "Trying something else"
94 | }
95 | ```
96 |
97 | Got it? Great! From now on, all queries shall be done in Kibana. Save every query you run in Kibana in the ex01-queries.txt file. You will be evaluated on this file.
98 |
99 | ## Read
100 |
101 | Now that we got the (very) basics of how to query Elasticsearch, I'm gonna let you search for the answers on your own.
102 |
103 | - Write a search query that returns all the documents contained in the 'twitter' index.
104 | You should get 4 hits
105 | - Write a search query that returns all the tweets from 'popol'.
106 | You should get 2 hits
107 | - Write a search query that returns all the tweets containing 'elasticsearch' in their message.
108 | You should get 2 hits
109 | - A little more complicated: write a search query that returns all the tweets from 'mimich' (and only this user!).
110 |
111 | You should get 1 hit.
112 |
113 | Save all the queries in `ex01-queries.txt`.
114 |
115 | ### Hints
116 |
117 | - look for the keyword field ;)
118 | - [strings are dead long live strings](https://www.elastic.co/fr/blog/strings-are-dead-long-live-strings)
119 |
120 | For help, please refer to the doc (or to your neighbors, or google) [query dsl](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)
121 |
122 | ## Update
123 |
124 | Update the document with id 1: change the value of the field "message" from "trying out elasticsearch" to "updating the document".
125 | If you did this correctly, when you update the document you should see "\_version": 2 in the answer from the cluster.
126 |
127 | Save the query in `ex01-queries.txt`.
128 |
129 | ## Delete
130 |
131 | - Run the following command:
132 |
133 | ```
134 | POST _bulk
135 | {"index": {"_index": "test_delete", "_id":1}}
136 | {"name": "clark kent", "aka": "superman"}
137 | {"index": {"_index": "test_delete"}}
138 | {"name": "louis XV", "aka": "le bien aimé"}
139 | ```
140 |
141 | This is a bulk indexing request: it allows you to index several documents in a single request.
142 | - Delete the document with id 1 of the test_delete index
143 | - Delete the whole test_delete index
144 |
145 | Save all the queries in `ex01-queries.txt`.
146 |
--------------------------------------------------------------------------------
/module01/ex02/ex02.md:
--------------------------------------------------------------------------------
1 | # Exercise 02 - Your first Index Your first Mapping.
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex02 |
6 | | Files to turn in: | ex02-queries.txt |
7 | | Forbidden functions: | None |
8 | | Remarks: | you have to put all the request you run in Kibana in ex02-queries.txt |
9 |
10 |
11 | At this point, you should have 4 documents in your twitter index from the previous exercise. You are now going to learn about the mapping.
12 |
13 | Using NoSQL doesn't mean you should not structure your data. If you want to optimize your cluster you must define the mapping of your index. We will see why. You may have noticed that in the previous exercise, every time you created a document, Elasticsearch automatically created the index for you. Well, it also generates a default mapping for this index.
14 |
15 | However, the default mapping is not ideal...
16 |
17 | - We would like to retrieve all the tweets posted in 2016 and beyond. Try the following search query:
18 |
19 | ```
20 | GET twitter/_search
21 | {
22 | "query": {
23 | "range": {
24 | "post_date": {
25 | "gte": "01 01 2016"
26 | }
27 | }
28 | }
29 | }
30 | ```
31 |
32 | Do you have good results? No... there is a mapping issue.
33 |
34 | - Your objective now is to create a new index called 'twitter_better_mapping' that contains the same 4 documents as the 'twitter' index but with a mapping that complies with these four requirements:
35 |
36 | The following query should only return the tweet posted in 2016 and beyond (2 hits):
37 |
38 | ```
39 | GET twitter_better_mapping/_search
40 | {
41 | "query": {
42 | "range": {
43 | "post_date": {
44 | "gte": "01 01 2016"
45 | }
46 | }
47 | }
48 | }
49 | ```
50 |
51 | The following query should return only 1 hit.
52 |
53 | ```
54 | GET twitter_better_mapping/_search
55 | {
56 | "query": {
57 | "match": {
58 | "user": "mimich"
59 | }
60 | }
61 | }
62 | ```
63 |
64 | The mapping must be strict (if you try to index a document with a field not defined in the mapping, you get an error).
65 |
66 | The size of the twitter_better_mapping index should be less than 5 kb (with four documents). What was the size of the original index?
67 |
68 | ### Hints
69 |
70 | - You can't modify the mapping of an existing index, so you have to define the mapping when you create the index, before indexing any document in the index.
71 | - The easiest way to write a mapping is to start from the default mapping Elasticsearch creates. Index a document sample into a temporary index, retrieve the default mapping of this index and copy and modify it to create a new index. Here you already have the twitter index with a default mapping. Write a request to get this mapping and start from here.
72 | - You will notice that by default ES creates two fields for every string field: my-field as "text" and my-field.keyword as "keyword" type. The "text" type takes computing power at indexing and costs storage space. The "keyword" type is light but does not offer all the search power of the "text" type. Some fields might need both, some might need just one... optimize your index!
73 | - Once you have created the new index with a better mapping, you can index the documents manually as you did in the previous exercise or you can use the reindex API (see Elastic Doc).
74 |
--------------------------------------------------------------------------------
/module01/ex03/ex03.md:
--------------------------------------------------------------------------------
1 | # Exercise 03 - Text Analyzer.
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex03 |
6 | | Files to turn in: | ex03-queries.txt |
7 | | Forbidden functions: | None |
8 | | Remarks: | |
9 |
10 |
11 | So by now you already know that mapping a field as a keyword or as a text makes a big difference. This is because Elasticsearch analyses all the text fields at ingestion so the text is easier to search.
12 |
13 | - Let's see an example. Ingest the two following documents in an index named school.
14 |
15 | ```
16 | POST school/_doc
17 | {
18 | "school": "42",
19 | "text" : "42 is a school where you write a lot of programs"
20 | }
21 |
22 | POST school/_doc
23 | {
24 | "school": "ICART",
25 | "text" : "The school of art management and art market management"
26 | }
27 | ```
28 |
29 | We created an index that contains random schools. Let's look for programming schools in it.
30 |
31 |
32 | - Try this request.
33 |
34 | ```
35 | GET school/_search
36 | {
37 | "query":
38 | {
39 | "match": {
40 | "text": "programming"
41 | }
42 | }
43 | }
44 | ```
45 |
46 | No results... and yet, you have probably noticed that there is a document talking about a famous programming school.
47 | It's a shame that we can't get it when we execute our request using the keyword `programming`.
48 |
49 |
50 | - Your mission is to rectify this! Modify the school index mapping to create a school_bis index that returns the right result for the following query:
51 |
52 | ```
53 | GET school_bis/_search
54 | {
55 | "query":
56 | {
57 | "match": {
58 | "text": "programming"
59 | }
60 | }
61 | }
62 | ```
63 |
64 | ### Hints
65 |
66 | - Look for the text analyzer section in the documentation.
67 | - There is a key notion to understand Elasticsearch: the **inverted index**. Take the time to understand how the analyzer creates **tokens** and how this works with the inverted index.
68 |
--------------------------------------------------------------------------------
/module01/ex04/ex04.md:
--------------------------------------------------------------------------------
1 | # Exercise 04 - Ingest dataset
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex04 |
6 | | Files to turn in: | |
7 | | Forbidden functions: | None |
8 | | Remarks: | n/a |
9 |
10 | Now that you know the basics of how Elasticsearch works, you are ready to work with a real dataset!!
11 | And to make this fun, you are going to use the same dataset as for the SQL module, so you can understand the differences between SQL and NoSQL.
12 |
13 | There are many ways you can ingest data into Elasticsearch. In the previous exercise, you've seen how to create a document manually.
14 | You could do this for every line of the CSV, with a Python script for instance that parses the CSV and creates a document for each line. There is an Elasticsearch client API for many languages that helps to connect to the cluster (to avoid writing raw HTTP requests in Python): [Elasticsearch client](https://www.elastic.co/guide/en/elasticsearch/client/index.html). A minimal sketch of such a script is shown below.
15 |
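For reference, a minimal sketch of such a script using the official `elasticsearch` Python client (the CSV path is an assumption, and the exact indexing call varies slightly between client versions):

```python
import csv

from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch("http://localhost:9200")

# One document per CSV line; the path below is a placeholder.
with open("appstore_games.csv", newline="") as f:
    for row in csv.DictReader(f):
        es.index(index="appstore_games_tmp", body=row)  # `document=row` on 8.x clients
```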
16 | But there is an easier way: Logstash. Logstash is the ETL (Extract Transform Load) tool of the Elasticsearch stack. We don't want you to spend too much time learning how to use Logstash, so we will guide you step by step:
17 |
18 | - Download [logstash](https://www.elastic.co/downloads/logstash)
19 |
20 | - Un-tar the file (still in your `/goinfre`).
21 |
22 | - Move the 'ingest-pipeline.conf' to the `config/` folder in the logstash directory (unzip appstore_games.csv.zip if you have not already).
23 |
24 | This file describes all the operations that logstash shall do to ingest the data. Let's take a look at the file:
25 |
26 | The file is split into three parts :
27 | - `input`: definition of the inputs.
28 | - `filter`: operation to perform on the inputs.
29 | - `output`: definition of the outputs.
30 |
31 | ```
32 | input {
33 | file {
34 | path => "/absolute/path/to/appstore_games.csv"
35 | start_position => "beginning"
36 | sincedb_path => "sincedb_file.txt"
37 | }
38 | }
39 | ```
40 |
41 | - `file`: our input will be a file, could be something else (stdin, data stream, ...).
42 | - `path`: location of the input file.
43 | - `start_position`: where to start reading the file.
44 | - `sincedb_path`: logstash stores its position in the input file, so if new lines are added, only new lines will be processed (i.e., if you want to re-run the ingest, delete the sincedb file).
45 |
46 | ```
47 | filter {
48 | csv {
49 | separator => ","
50 | columns => ["URL","ID","Name","Subtitle","Icon URL","Average User Rating","User Rating Count","Price","In-app Purchases","Description","Developer","Age Rating","Languages","Size","Primary Genre","Genres","Original Release Date","Current Version Release Date"]
51 | remove_field => ["message", "host", "path", "@timestamp"]
52 | skip_header => true
53 | }
54 | mutate {
55 | gsub => [ "Description", "\\n", "
56 | "]
57 | gsub => [ "Description", "\\t", " "]
58 | gsub => [ "Description", "\\u2022", "•"]
59 | gsub => [ "Description", "\\u2013", "–"]
60 | split => { "Genres" => "," }
61 | split => { "Languages" => "," }
62 | }
63 | }
64 | ```
65 |
66 | - `csv`: we use the csv plugin to parse the file.
67 | - `separator`: split each line on the comma.
68 | - `columns`: names of the columns (will create one field in the index mapping per column)
69 | - `remove_field`: here we remove 4 fields, those 4 fields are added by logstash to the raw data but we don't need them.
70 | - `skip_header`: skip the first line
71 | - `mutate`: When logstash parses the field it escapes any '\' it finds. This changes a '\\n', '\\t', '\\u2022' and '\\u2013' into a '\\\\n', '\\\\t', '\\\\u2022', '\\\\u2013' respectively, which is not what we want. The mutate plugin is used here to fix this.
72 | - `gsub`: substitute '\\n' by a new line and the '\\\\u20xx' by its unicode character.
73 | - `split`: split the "Genres" and "Languages" fields on the ",": instead of a single string like "FR, EN, KR" we will have ["FR", "EN", "KR"]
74 |
75 | ```
76 | output {
77 | elasticsearch {
78 | hosts => "http://localhost:9200"
79 | index => "appstore_games_tmp"
80 | }
81 | stdout {
82 | codec => "dots"
83 | }
84 | }
85 | ```
86 | - `elasticsearch`: we want to output to an Elasticsearch cluster.
87 | - `hosts`: IP of the cluster.
88 | - `index`: Name of the index where to put the data (the index will be created if it does not exist, otherwise data are added to it).
89 | - `stdout`: we also want an output on stdout to follow the ingestion process.
90 | - `codec => "dots"`: print one dot '.' for every document ingested.
91 |
92 | So, all we do here is create one document for each line of the csv. Then, for each line, we split on the comma and put each value in a field of the document with the name defined in 'columns'. Exactly what you would have done with Python, but in far fewer lines of code.
93 |
94 | Now, let's run Logstash:
95 | - To run logstash you will need to install a JDK or JRE. On a 42 computer, you can do this from the MSC.
96 | - Edit the ingest-pipeline.conf with the path to the appstore_games.csv
97 | - `./bin/logstash -f config/ingest-pipeline.conf`
98 |
99 | You should have 17007 documents in your index.
100 |
--------------------------------------------------------------------------------
/module01/ex05/ex05.md:
--------------------------------------------------------------------------------
1 | # Exercise 05 - Search - Developers
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex05 |
6 | | Files to turn in: | ex05-queries.txt |
7 | | Forbidden functions: | None |
8 | | Remarks : | |
9 |
10 | Let's start with some queries you already did for the SQL module.
11 |
12 | We are looking for developers involved in games released on or before 01/08/2008 and updated on or after 01/01/2018.
13 |
14 | - Write a query that returns the games matching this criterion.
15 |
16 | Your query shall also filter the "_source" to only return the following fields: `Developer`, `Original Release Date`, `Current Version Release Date`.
17 |
18 | You should get 3 hits.
19 |
20 |
21 | ### Hints
22 |
23 | - You might need to adjust the mapping of your index
24 | - Create a new index and use the reindex API to change the mapping rather than using logstash to re-ingest the CSV
25 | - The "bool" query will be useful ;)
26 |
--------------------------------------------------------------------------------
/module01/ex06/ex06.md:
--------------------------------------------------------------------------------
1 | # Exercise 06 - Search - Name_Lang
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex06 |
6 | | Files to turn in : | ex06-queries.txt |
7 | | Forbidden function : | None |
8 | | Remarks: | |
9 |
10 |
11 | We are looking for the Name and Language of games between 5 and 10 euros.
12 |
13 | - Write a query that returns the games matching this criterion.
14 |
15 | Your query shall filter the "_source" to return the following fields only: "Name", "Languages", "Price".
16 |
17 | You should get 192 hits.
18 |
19 | ### Hints
20 |
21 | - You might need to adjust the mapping of your index
22 | - Create a new index and use the reindex API to change the mapping rather than using logstash to re-ingest the CSV.
23 |
--------------------------------------------------------------------------------
/module01/ex07/ex07.md:
--------------------------------------------------------------------------------
1 | # Exercise 07 - Search - Game
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex07 |
6 | | Files to turn in: | ex07-queries.txt |
7 | | Forbidden functions: | None |
8 | | Remarks : | |
9 |
10 | Elasticsearch was initially designed for full-text search, so let's try it.
11 |
12 | - I'm looking for a game (and no other genre will be accepted). I'm a big fan of starcraft and I like real-time strategy. Can you write a query to find me a game?
13 |
14 | It's a good time to look at how Elasticsearch scores the documents so you can tune your query to increase the result relevance.
15 |
16 | There isn't one good answer to this exercise; many answers are possible. Just make sure the top hits you get are relevant to what I'm searching for!
17 |
--------------------------------------------------------------------------------
/module01/ex07bis/ex07bis.md:
--------------------------------------------------------------------------------
1 | # Exercise 07-bis - Search - Vibrant World
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex07bis |
6 | | Files to turn in : | ex07bis-queries.txt |
7 | | Forbidden function : | None |
8 | | Remarks : | |
9 |
10 | A little more of full-text search:
11 |
12 | - Two games speak of a Vibrant World and have a link to facebook.com (or fb.com) in their description. Find them!
13 |
14 | ### Hint
15 |
16 | The distance between the words "vibrant" and "world" may be longer than a few words...
17 |
18 | ## BONUS
19 |
20 | There are three games, special Kudos if you find the 3rd one!
21 |
--------------------------------------------------------------------------------
/module01/ex08/ex08.md:
--------------------------------------------------------------------------------
1 | # Exercise 08 - Aggregation
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex08 |
6 | | Files to turn in : | ex08-queries.txt |
7 | | Forbidden function : | None |
8 | | Remarks : | |
9 |
10 | Let's do some aggregation!
11 |
12 | - Write a query that returns the top 10 developers in terms of games produced.
13 |
14 | - Set the size to 0 so the query returns only the aggregation, not the hits.
--------------------------------------------------------------------------------
/module01/ex09/ex09.md:
--------------------------------------------------------------------------------
1 | # Exercise 09 - Aggregation in Aggregation
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory : | ex09 |
6 | | Files to turn in : | ex09-queries.txt |
7 | | Forbidden function : | None |
8 | | Remarks : | |
9 |
10 | We would like to know which game "Genre" values are the most represented (top 10) and, for each of those genres, the distribution of the "Average User Rating" with a bucket interval of one (i.e. for each Genre, the number of games with an Avg User Rating of 1, 2, 3, 4, and 5).
11 |
12 | - This must be done in a single query.
--------------------------------------------------------------------------------
/module01/ex10/ex10.md:
--------------------------------------------------------------------------------
1 | # Exercise 10 - Kibana
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex10 |
6 | | Files to turn in: | ex10-queries.txt |
7 | | Forbidden functions: | None |
8 | | Remarks: | |
9 |
10 |
11 | Time to explore Kibana a little bit more.
12 |
13 |
14 | - Your goal is to create a Dashboard with the following visualizations:
15 | - A plot showing the number of games released (Y axis) over time (X axis)
16 | - A histogram that counts the number of games released each year, and for each year the count of the "average user rating" by an interval of 1
17 | - A Pie Chart showing the distribution of Genres
18 | - A cloud of words showing the top developers
19 |
20 | - Once your dashboard has been created, explore the possibilities of Kibana (click on the top developer in the cloud of words for instance).
21 |
22 |
23 | Your Dashboard should look something like this:
24 |
25 |
26 | ![Dashboard](../assets/dashboard.png){width=550px}
27 |
28 | ### Hints
29 |
30 | - You need to create an index pattern first (Go in the management menu, then index pattern).
31 | - Create each visualization in the visualization tab and then create the dashboard in the dashboard tab.
32 |
--------------------------------------------------------------------------------
/module01/module01.md:
--------------------------------------------------------------------------------
1 | # Module01 - Elasticsearch, Logstash, Kibana
2 |
3 | In this module, you will learn how to use a NoSQL database: Elasticsearch.
4 | Wait... Elasticsearch is a database? Well, not exactly: it is more than that. It is defined as a search and analytics engine. But let's keep it simple for now and consider it as a database; we will see the rest later.
5 |
6 | ## Notions of the module
7 |
8 | Create an Elasticsearch cluster, create indexes and mappings, ingest documents, search & aggregate, create visuals with Kibana.
9 |
10 | In the first part of this module (ex00 to ex03) you will learn the basics of Elasticsearch. Then you will apply this to a real dataset.
11 |
12 | ## General rules
13 |
14 | * The exercises are ordered from the easiest to the hardest.
15 | * Your exercises are going to be evaluated by someone else, so make sure that your variable names and function names are appropriate and civil.
16 | * Your manual is the internet.
17 | * You can also ask any question in the dedicated channel in Slack: **[42ai slack](https://42-ai.slack.com)**.
18 | * If you find any issue or mistakes in the subject please create an issue on our dedicated repository on Github: **[Github issues](https://github.com/42-AI/bootcamp_data-engineering/issues)**.
19 |
20 | ## Foreword
21 |
22 | Did you know that Elasticsearch helps you find your soulmate?
23 |
24 | With more than 26 million swipes daily, Tinder connects more people one-on-one and in real-time than any other mobile app in the world. Behind the minimalist UI and elegant "swipe right, swipe left" that Tinder pioneered are tremendous challenges of data science, machine learning, and global scalability.
25 |
26 | Hear how Tinder relies on the Elastic Stack to analyze, visualize, and predict not only a) which people a user will swipe right on, or b) which people will swipe right on that user, but c) when there's a mutual swipe match. Tinder's VP of Engineering will describe how the service is growing into a global platform for social discovery in many facets of life.
27 |
28 | If you wanna know how the magic works: **[elastic tinder](https://www.elastic.co/elasticon/conf/2017/sf/tinder-using-the-elastic-stack-to-make-connections-around-the-world)**.
29 |
30 | ## Helper
31 |
32 | * Your best friend for the module: **[Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html)**.
33 | * We recommend using the `/goinfre` directory for this module as you will need ~3 GB for the Elasticsearch cluster. But you are free to do as you wish. Note that using `/sgoinfre` won't work.
34 | * Keep in mind that the `/goinfre` is a local & temporary directory. So if you change computers you will lose your work, and if you log out you might lose your work.
35 |
36 | ### Exercise 00 - The setup.
37 | ### Exercise 01 - CRUDité.
38 | ### Exercise 02 - Your first Index Your first Mapping.
39 | ### Exercise 03 - Text Analyzer
40 | ### Exercise 04 - Eat them all!
41 | ### Exercise 05 - Search - Developers
42 | ### Exercise 06 - Search - Name_lang
43 | ### Exercise 07 - Search - Game
44 | ### Exercise 07bis - Search - Vibrant World
45 | ### Exercise 08 - Aggregation
46 | ### Exercise 09 - Aggregation in Aggregation
47 | ### Exercise 10 - Kibana & Monitoring
48 |
--------------------------------------------------------------------------------
/module01/resources/ingest-pipeline.conf:
--------------------------------------------------------------------------------
1 | input
2 | {
3 | file
4 | {
5 | path => "/path/to/the/appstore_games.csv"
6 | start_position => "beginning"
7 | sincedb_path => "my-sincedb"
8 | }
9 | }
10 |
11 | filter
12 | {
13 | csv
14 | {
15 | separator => ","
16 | columns => ["URL","ID","Name","Subtitle","Icon URL","Average User Rating","User Rating Count","Price","In-app Purchases","Description","Developer","Age Rating","Languages","Size","Primary Genre","Genres","Original Release Date","Current Version Release Date"]
17 | remove_field => ["message", "host", "path", "@timestamp"]
18 | skip_header => true
19 | }
20 |
21 | mutate
22 | {
23 | gsub => [ "Description", "\\n", "
24 | "]
25 | gsub => [ "Description", "\\u2022", "•"]
26 | gsub => [ "Description", "\\u2013", "–"]
27 | gsub => [ "Description", "\\t", " "]
28 | split => { "Genres" => "," }
29 | split => { "Languages" => "," }
30 | }
31 | }
32 |
33 |
34 | output
35 | {
36 | elasticsearch
37 | {
38 | hosts => "http://localhost:9200"
39 | index => "appstore_games_tmp"
40 | }
41 |
42 | stdout
43 | {
44 | codec => "dots"
45 | }
46 | }
47 |
--------------------------------------------------------------------------------
/module02.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02.pdf
--------------------------------------------------------------------------------
/module02/assets/access_key.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/access_key.png
--------------------------------------------------------------------------------
/module02/assets/aws_regions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/aws_regions.png
--------------------------------------------------------------------------------
/module02/assets/terraform_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_1.png
--------------------------------------------------------------------------------
/module02/assets/terraform_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_2.png
--------------------------------------------------------------------------------
/module02/assets/terraform_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_3.png
--------------------------------------------------------------------------------
/module02/assets/terraform_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_4.png
--------------------------------------------------------------------------------
/module02/assets/terraform_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_5.png
--------------------------------------------------------------------------------
/module02/assets/terraform_6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/module02/assets/terraform_6.png
--------------------------------------------------------------------------------
/module02/ex00/ex00.md:
--------------------------------------------------------------------------------
1 | # Exercise 00 - Setup
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex00 |
6 | | Files to turn in: | |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 | In this exercise, we are going to set up our account to start working with a cloud provider.
11 |
12 | Don't worry! Even if you enter your card number, this module should not cost you anything. AWS has a free tier (check whether this is also the case for your cloud provider) that allows you to use a small amount of AWS resources for free. This will be sufficient for what you are going to do today. By the end of the day, you will have to destroy your infrastructure entirely (don't keep things running)!!!
13 |
14 | ## Exercise
15 |
16 | - Create an account on your cloud provider (all the exercises were made using AWS, but you can choose another cloud provider).
17 | - Set up a billing alarm linked to your email that will alert you if the cost of your infrastructure exceeds 1\$.
18 | - Create a new administrator user separated from your root account (you will need to use this user for all the exercises). Save the credentials linked to the administrator user into a file called `credentials.csv`.
19 |
20 | All the mechanisms we are creating now will ensure your access is secured and will allow you to be alerted quickly if you forget to destroy your infrastructure.
--------------------------------------------------------------------------------
/module02/ex01/ex01.md:
--------------------------------------------------------------------------------
1 | # Exercise 01 - Storage
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex01 |
6 | | Files to turn in: | presigned_url.sh |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 |
11 | ## AWS CLI
12 |
13 | We are going to use the AWS command-line interface (AWS CLI). The first thing we need to do is install it (follow the AWS CLI installation instructions for your platform).
14 |
15 | Once installed, you should be able to run `aws --version`.
16 |
17 | We can setup our AWS account for the CLI with the command `aws configure`. You will need to enter:
18 |
19 | - access key : in your `credentials.csv` file
20 | - secret access key : in your `credentials.csv` file
21 | - region : `eu-west-1` (Ireland)
22 | - default output format : `None`
23 |
24 | The AWS CLI is now ready!
25 |
26 | ## S3 bucket creation
27 |
28 | Amazon S3 provides developers and IT teams with secure, durable, and highly-scalable cloud storage. Amazon S3 is easy-to-use object storage with a simple web service interface that you can use to store and retrieve any amount of data from anywhere on the web.
29 |
30 | A bucket is a container (web folder) for objects (files) stored in Amazon S3. Every Amazon S3 object is contained in a bucket. Buckets form the top-level namespace for Amazon S3, and bucket names are global. This means that your bucket names must be unique globally (across all AWS accounts). The reason is that when you create a bucket, it gets a web address (e.g. `https://s3-eu-west-1.amazonaws.com/example`).
31 |
32 | Even though the namespace for Amazon S3 buckets is global, each Amazon S3 bucket is created in a specific region that you choose. This lets you control where your data is stored.
33 |
34 | With your free usage you can store up to 5 GB of data!
35 |
36 | ## Exercise
37 |
38 | In this exercise, you will learn to create an S3 bucket and use aws-cli.
39 |
40 | - Connect to the console of your administrator user
41 | - Create an S3 bucket whose name starts with the prefix `module02-` and finishes with whatever numbers you want.
42 | - Using aws-cli, copy the `appstore_games.csv` file to the bucket. You can check that the file was correctly copied using the AWS console.
43 | - Using aws-cli, create a presigned URL allowing you to download the file. Your presigned URL must have an expiration time of 10 minutes. Your AWS CLI command must be stored in the `presigned_url.sh` script (see the sketch below).
44 |
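45 | As a starting point, here is a minimal sketch of what the two CLI calls could look like (the bucket name and the local path of the CSV are placeholders to replace with your own):
46 |
47 | ```
48 | #!/bin/sh
49 | # Hypothetical bucket name; replace it with your own module02-... bucket.
50 | BUCKET=module02-424242
51 |
52 | # Copy the dataset to the bucket (adjust the path to your local CSV file).
53 | aws s3 cp appstore_games.csv "s3://$BUCKET/appstore_games.csv"
54 |
55 | # presigned_url.sh: generate a download URL valid for 10 minutes (600 seconds).
56 | aws s3 presign "s3://$BUCKET/appstore_games.csv" --expires-in 600
57 | ```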
--------------------------------------------------------------------------------
/module02/ex02/ex02.md:
--------------------------------------------------------------------------------
1 | # Exercise 02 - Compute
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex02 |
6 | | Files to turn in: | os_name.txt |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 | Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing us to quickly scale capacity (up or down) depending on our needs.
11 |
12 | Amazon EC2 allows you to acquire compute through the launching of virtual servers called instances. When you launch an instance, you can make use of the compute as you wish, just as you would with an on-premises server (local servers). Because you are paying for the computing power of the instance, you are charged per hour while the instance is running. When you stop the instance, you are no longer charged.
13 |
14 | Two concepts are key to launching instances on AWS:
15 | - **instance type** : the amount of virtual hardware dedicated to the instance.
16 | - **AMI (Amazon Machine Image)** : the software loaded on the instance (Linux, MacOS, Debian, ...).
17 |
18 | The instance type defines the virtual hardware supporting an Amazon EC2 instance. There are dozens of instance types available, varying in the following dimensions:
19 |
20 | - Virtual CPUs (vCPUs)
21 | - Memory
22 | - Storage (size and type)
23 | - Network performance
24 |
25 | Instance types are grouped into families based on the ratio of these values to each other. Today we are going to use t2.micro instances (they are included in the free usage)!
26 |
27 | One of the impressive features of EC2 is autoscaling. If you have a website with 100 users, you can have it running on a small instance. If the next day you have 10000 users, your deployment can scale up by adding new EC2 instances to handle this new load!
28 |
29 | ## Exercise
30 |
31 | In this exercise, you will learn how to create and connect to an EC2 instance. If you are on another cloud provider, aim for Linux-based instances of a very small size (free tier if possible).
32 |
33 | Follow these steps for the exercise:
34 | - launch an ec2 instance with the AMI : `Amazon Linux 2 AMI`.
35 | - choose `t2.micro` as instance type.
36 | - create a key pair.
37 | - connect in ssh to your instance using your key pair.
38 | - get and save the os name of your instance in the `os_name.txt` file.
39 | - terminate your instance.
40 |
41 | Within minutes we have created a server and we can work on it!
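42 |
43 | If you are wondering what the connection step looks like, here is a minimal sketch (the key file name and public IP are placeholders from your own instance; `ec2-user` is the default user of the Amazon Linux 2 AMI):
44 |
45 | ```
46 | # Restrict the key permissions, otherwise ssh refuses to use it.
47 | chmod 400 my-key-pair.pem
48 |
49 | # Connect to the instance using its public IP.
50 | ssh -i my-key-pair.pem ec2-user@<public-ip>
51 |
52 | # Once connected, the OS name is in /etc/os-release (see PRETTY_NAME).
53 | cat /etc/os-release
54 | ```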
--------------------------------------------------------------------------------
/module02/ex03/ex03.md:
--------------------------------------------------------------------------------
1 | # Exercise 03 - Flask API - List & Delete
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex03 |
6 | | Files to turn in: | app.py, \*.py |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 | Before getting into AWS infrastructure, we are going to discover how to interact with AWS resources using a Python SDK (Software Development Kit) called boto3. We are going to work with a Python micro-framework called Flask to create an API (a programmatic interface) to interact with your s3 bucket. For now, the API will be built locally to ease the development.
11 |
12 | NB: to simplify the following exercises, we are going to use Flask directly as a development server. If we wanted a more production-ready application, we would add a web server like Nginx/Apache combined with Gunicorn/uWSGI.
13 |
14 | ## Exercise
15 |
16 | Create a Flask application `app.py` with three routes:
17 |
18 | - **`/`**
19 | - **status** : `200`
20 | - **message** : `Successfully connected to module02 upload/download API`
21 | - **`/list_files`** :
22 | - **status** : `200`
23 | - **message** : `Successfully listed files on s3 bucket ''`
24 | - **content** : list of files within the s3 bucket
25 | - **`/delete/`** :
26 | - **status** : `200`
27 | - **message** : `Successfully deleted file '' on s3 bucket ''`
28 |
29 | The content you return with your Flask API must be json formatted. You should use boto3 to interact with the s3 bucket you previously created (`module02-...`).
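30 |
31 | A minimal sketch of what such an `app.py` could look like (the bucket name is a hypothetical placeholder and error handling is left out on purpose):
32 |
33 | ```python
34 | import boto3
35 | from flask import Flask, jsonify
36 |
37 | app = Flask(__name__)
38 | s3 = boto3.client("s3")
39 | BUCKET_NAME = "module02-424242"  # hypothetical bucket name, use your own
40 |
41 |
42 | @app.route("/")
43 | def index():
44 |     return jsonify(status=200,
45 |                    message="Successfully connected to module02 upload/download API")
46 |
47 |
48 | @app.route("/list_files")
49 | def list_files():
50 |     # list_objects_v2 returns up to 1000 keys, more than enough here.
51 |     response = s3.list_objects_v2(Bucket=BUCKET_NAME)
52 |     files = [obj["Key"] for obj in response.get("Contents", [])]
53 |     return jsonify(status=200,
54 |                    message=f"Successfully listed files on s3 bucket '{BUCKET_NAME}'",
55 |                    content=files)
56 |
57 |
58 | @app.route("/delete/<filename>")
59 | def delete(filename):
60 |     s3.delete_object(Bucket=BUCKET_NAME, Key=filename)
61 |     return jsonify(status=200,
62 |                    message=f"Successfully deleted file '{filename}' on s3 bucket '{BUCKET_NAME}'")
63 |
64 |
65 | if __name__ == "__main__":
66 |     app.run(host="0.0.0.0", port=5000)
67 | ```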
--------------------------------------------------------------------------------
/module02/ex04/ex04.md:
--------------------------------------------------------------------------------
1 | # Exercise 04 - Flask API - Download & Upload
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex04 |
6 | | Files to turn in: | app.py, \*.py |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 |
11 | We will continue to work on our Flask API and add new functionalities. This time, we will work on file download and upload. In order to upload and download files, we are going to use something we have already used: presigned URLs!
12 |
13 | ## Exercise
14 |
15 | Create a Flask application `app.py` with two more routes:
16 |
17 | - **`/download/`** :
18 | - **status** : `200`
19 | - **message** : `Successfully downloaded file '' on s3 bucket ''`
20 | - **content** : presigned url to download file
21 | - **`/upload/`** :
22 | - **status** : `200`
23 | - **message** : `Successfully uploaded file '' on s3 bucket ''`
24 | - **content** : presigned url to upload file
25 |
26 | The content you return with your Flask API has to be json formatted. You should use boto3 to interact with the s3 bucket you previously created (`module02-...`).
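27 |
28 | A minimal sketch of the two additional routes, assuming the same `app`, `s3` client and `BUCKET_NAME` as in the previous exercise's sketch (the 10-minute expiration is an arbitrary choice; `generate_presigned_url` is the boto3 call doing the work in both cases):
29 |
30 | ```python
31 | @app.route("/download/<filename>")
32 | def download(filename):
33 |     # Presigned GET: anyone holding this URL can download the object until it expires.
34 |     url = s3.generate_presigned_url("get_object",
35 |                                     Params={"Bucket": BUCKET_NAME, "Key": filename},
36 |                                     ExpiresIn=600)
37 |     return jsonify(status=200,
38 |                    message=f"Successfully downloaded file '{filename}' on s3 bucket '{BUCKET_NAME}'",
39 |                    content=url)
40 |
41 |
42 | @app.route("/upload/<filename>")
43 | def upload(filename):
44 |     # Presigned PUT: the file must be sent with an HTTP PUT request to this URL.
45 |     url = s3.generate_presigned_url("put_object",
46 |                                     Params={"Bucket": BUCKET_NAME, "Key": filename},
47 |                                     ExpiresIn=600)
48 |     return jsonify(status=200,
49 |                    message=f"Successfully uploaded file '{filename}' on s3 bucket '{BUCKET_NAME}'",
50 |                    content=url)
51 | ```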
--------------------------------------------------------------------------------
/module02/ex05/ex05.md:
--------------------------------------------------------------------------------
1 | # Exercise 05 - Client
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex05 |
6 | | Files to turn in: | client.py, app.py, \*.py |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 |
11 | Our API is finished, but it is not very convenient to query! To ease the use of this API, we are going to create a client that will allow us to interact with it more easily.
12 |
13 | ## Exercise
14 |
15 | Create a client `client.py` that will call the API you are creating and show results in a more human readable way. The client will have two parameters:
16 |
17 | - **`--ip`**: IP address of the API (the default IP must be defined as `0.0.0.0`)
18 | - **`--filename`**: file name to delete, download or upload.
19 |
20 | ... and the following options:
21 |
22 | - **`ping`**: call the route `/` of the API and print the message.
23 | - **`list`**: call the route `/list_files` of the API and show the files on the bucket.
24 | - **`delete`**: call the route `/delete/` of the API and delete a file on the bucket.
25 | - **`download`**: call the route `/download/` of the API and download a file from the bucket.
26 | - **`upload`**: call the route `/upload/` of the API and upload a file on the bucket.
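27 |
28 | A minimal sketch of such a client using `argparse` and `requests` (the positional `command` argument and the printing style are design assumptions, not requirements):
29 |
30 | ```python
31 | import argparse
32 |
33 | import requests
34 |
35 | ROUTES = {
36 |     "ping": "/",
37 |     "list": "/list_files",
38 |     "delete": "/delete/{filename}",
39 |     "download": "/download/{filename}",
40 |     "upload": "/upload/{filename}",
41 | }
42 |
43 |
44 | def main():
45 |     parser = argparse.ArgumentParser(description="Client for the module02 S3 API")
46 |     parser.add_argument("command", choices=ROUTES)
47 |     parser.add_argument("--ip", default="0.0.0.0", help="IP address of the API")
48 |     parser.add_argument("--filename", default="", help="file to delete, download or upload")
49 |     args = parser.parse_args()
50 |
51 |     # Build the URL of the requested route and print the API answer.
52 |     route = ROUTES[args.command].format(filename=args.filename)
53 |     response = requests.get(f"http://{args.ip}:5000{route}").json()
54 |     print(response["message"])
55 |     if "content" in response:
56 |         print(response["content"])
57 |
58 |
59 | if __name__ == "__main__":
60 |     main()
61 | ```
62 |
63 | For instance, `python client.py list --ip 34.245.1.2` would print the message returned by `/list_files` followed by the list of files on the bucket.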
--------------------------------------------------------------------------------
/module02/ex06/ex06.md:
--------------------------------------------------------------------------------
1 | # Exercise 06 - IAM role
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex06 |
6 | | Files to turn in: | 00_variables.tf, 01_networking.tf, 07_iam.tf, 10_terraform.auto.tfvars |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 | Terraform is a tool to deploy infrastructure as code. It can be used for multiple cloud providers (AWS, Azure, GCP, ...). We are going to use it to deploy our new API!
11 |
12 | As you already know, we are using our AWS free tier. However, if you let your server run for weeks you will have to pay. We want to avoid this possibility. That's why we are going to use a tool to automatically deploy and destroy our infrastructure, Terraform.
13 |
14 | Potentially critical data **MUST NOT** be deployed using infrastructure as code like Terraform. If it is, it may be destroyed accidentally, and you never want that to happen!
15 |
16 | ## Terraform install
17 |
18 | First, install the Terraform software. On macOS, you can use Homebrew:
19 |
20 | ```
21 | brew install terraform
22 | ```
23 |
24 | You can now run `terraform --version`. Terraform is ready!
25 |
26 | Terraform is composed of three kinds of files:
27 | - `.tfvars` : terraform variables.
28 | - `.tf` : terraform infrastructure description.
29 | - `.tfstate` : describes all the parameters of the stack you applied (updated after each apply)
30 |
31 | You can run `terraform destroy` to delete your stack.
32 |
33 | No further talking, let's deep dive into Terraform!
34 |
35 | **For all the following exercises**, all the resources that can be tagged must use the `project_name` variable with the following tags structure:
36 | - `Name`: `-`
37 | - `Project_name`: ``
38 |
39 | Variables must be specified in variable files!
40 |
41 | ## Exercise
42 |
43 | For this first exercise, you will have to use the default VPC (Virtual Private Cloud). A VPC emulates a network within the AWS infrastructure. This default VPC eases the use of AWS services like EC2 (you do not need to know anything about network setup). You will have to work in the Ireland region (this region can be changed depending on the cloud provider and your location).
44 |
45 | The main objective is to create an IAM role for an EC2 instance allowing it to use all actions on s3 buckets (list, copy, ...). In order to create a role in terraform you will have to create:
46 | - a role called `module02_s3FullAccessRole`
47 | - a profile called `module02_s3FullAccessProfile`
48 | - a policy called `module02_s3FullAccessPolicy`
49 |
50 | To test your role, you can create an EC2 instance and link your newly created role to it; if the AWS CLI works, the exercise is done. You must be able to destroy your stack entirely.
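51 |
52 | A minimal sketch of what `07_iam.tf` could contain (the resource names follow the subject; the assume-role and policy JSON documents are plain examples granting full S3 access to EC2, not the only valid way to write them):
53 |
54 | ```
55 | resource "aws_iam_role" "module02_s3FullAccessRole" {
56 |   name = "module02_s3FullAccessRole"
57 |
58 |   # Allow EC2 instances to assume this role.
59 |   assume_role_policy = jsonencode({
60 |     Version = "2012-10-17"
61 |     Statement = [{
62 |       Effect    = "Allow"
63 |       Action    = "sts:AssumeRole"
64 |       Principal = { Service = "ec2.amazonaws.com" }
65 |     }]
66 |   })
67 | }
68 |
69 | resource "aws_iam_role_policy" "module02_s3FullAccessPolicy" {
70 |   name = "module02_s3FullAccessPolicy"
71 |   role = aws_iam_role.module02_s3FullAccessRole.id
72 |
73 |   # Full access to every S3 action on every bucket.
74 |   policy = jsonencode({
75 |     Version = "2012-10-17"
76 |     Statement = [{
77 |       Effect   = "Allow"
78 |       Action   = "s3:*"
79 |       Resource = "*"
80 |     }]
81 |   })
82 | }
83 |
84 | resource "aws_iam_instance_profile" "module02_s3FullAccessProfile" {
85 |   name = "module02_s3FullAccessProfile"
86 |   role = aws_iam_role.module02_s3FullAccessRole.name
87 | }
88 | ```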
--------------------------------------------------------------------------------
/module02/ex07/ex07.md:
--------------------------------------------------------------------------------
1 | # Exercise 07 - Security groups
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex07 |
6 | | Files to turn in: | 08_security_groups.tf, \*.tf, \*.auto.tfvars |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 | As you already noticed, Flask uses port 5000. In EC2, all incoming and outgoing traffic is blocked by default (for security reasons). If we want to interact with our API, we will have to allow the traffic. In AWS, we can define traffic rules using security groups. The security group will then be associated with an EC2 instance.
11 |
12 | ## Exercise
13 |
14 | Create a security group that will allow:
15 | - `ssh` incoming traffic (we will use it in the next exercise)
16 | - `tcp` incoming traffic on port `5000` (to interact with our Flask API)
17 | - outgoing traffic to the whole internet
18 |
19 | To test the security group, you can associate it with a newly created EC2 instance (you will need to use an existing key pair).
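20 |
21 | A minimal sketch of `08_security_groups.tf` (the resource name, the wide-open `0.0.0.0/0` CIDR blocks and the `project_name` variable are assumptions; tighten the ingress CIDRs to your own IP if you prefer):
22 |
23 | ```
24 | resource "aws_security_group" "module02_flask" {
25 |   name        = "${var.project_name}-sg"
26 |   description = "ssh and Flask (5000) in, everything out"
27 |
28 |   ingress {
29 |     description = "ssh"
30 |     from_port   = 22
31 |     to_port     = 22
32 |     protocol    = "tcp"
33 |     cidr_blocks = ["0.0.0.0/0"]
34 |   }
35 |
36 |   ingress {
37 |     description = "Flask API"
38 |     from_port   = 5000
39 |     to_port     = 5000
40 |     protocol    = "tcp"
41 |     cidr_blocks = ["0.0.0.0/0"]
42 |   }
43 |
44 |   egress {
45 |     from_port   = 0
46 |     to_port     = 0
47 |     protocol    = "-1"
48 |     cidr_blocks = ["0.0.0.0/0"]
49 |   }
50 |
51 |   tags = {
52 |     Name         = "${var.project_name}-sg"
53 |     Project_name = var.project_name
54 |   }
55 | }
56 | ```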
--------------------------------------------------------------------------------
/module02/ex08/ex08.md:
--------------------------------------------------------------------------------
1 | # Exercise 08 - Cloud API
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex08 |
6 | | Files to turn in: | 02_ec2.tf, \*.tf, \*.auto.tfvars |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 | As you may have noticed, building a whole infrastructure requires a lot of steps. To ease the deployment, it was split into two parts: an intermediate deployment and the final implementation. The first part consists in deploying one EC2 instance with your working API; this way, we make sure everything we did in Terraform works well. I have good news: by the end of this exercise, you will have completed the intermediate deployment!
11 |
12 | ## Exercise
13 |
14 | To finalize our intermediate infrastructure we are going to add two components.
15 |
16 | First, you will need a key-pair file, which we will call `module02.pem`, provisioned through Terraform. The key pair must use the RSA algorithm and have the appropriate permissions to be functional.
17 |
18 | You must provision an EC2 resource which will use:
19 | - the default vpc
20 | - the role you created
21 | - the security group you created
22 | - the key pair you just created
23 | - a public ip address
24 | - an instance type `t2.micro` with a Linux AMI
25 |
26 | You must create an output that will show the public ip of the instance you created.
27 |
28 | At this point you should be able to ssh into the EC2 and use aws cli on s3 buckets. However, our API is still not working!
29 |
30 | First, upload the files of your API onto your s3 bucket (you don't need to upload the client). Those files must stay on the bucket, since they are used to provision your API. This is not the cleanest solution, but it will be sufficient for the purpose of this module.
31 |
32 | Create a bootstrap script that will:
33 | - install the necessary libraries
34 | - download the files of the API from the s3 bucket
35 | - start the API in background
36 |
37 | The exercise will be considered valid only if the API is working after a `terraform apply`. You should be able to use your client on the output ip.
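38 |
39 | A minimal sketch of the key-pair and instance parts of `02_ec2.tf` (the AMI id, bucket name and bootstrap commands are placeholders; it reuses the instance profile and security group from the two previous exercises' sketches, and `tls_private_key` + `local_file` is just one possible way to obtain a usable `module02.pem`):
40 |
41 | ```
42 | resource "tls_private_key" "module02" {
43 |   algorithm = "RSA"
44 |   rsa_bits  = 4096
45 | }
46 |
47 | resource "aws_key_pair" "module02" {
48 |   key_name   = "module02"
49 |   public_key = tls_private_key.module02.public_key_openssh
50 | }
51 |
52 | resource "local_file" "module02_pem" {
53 |   filename        = "module02.pem"
54 |   content         = tls_private_key.module02.private_key_pem
55 |   file_permission = "0400"
56 | }
57 |
58 | resource "aws_instance" "api" {
59 |   ami                         = "ami-xxxxxxxxxxxxxxxxx" # an Amazon Linux 2 AMI id
60 |   instance_type               = "t2.micro"
61 |   key_name                    = aws_key_pair.module02.key_name
62 |   iam_instance_profile        = "module02_s3FullAccessProfile"
63 |   vpc_security_group_ids      = [aws_security_group.module02_flask.id]
64 |   associate_public_ip_address = true
65 |
66 |   # Bootstrap: install dependencies, fetch the API files from S3, start Flask.
67 |   user_data = <<-EOF
68 |               #!/bin/bash
69 |               yum install -y python3 python3-pip
70 |               pip3 install flask boto3
71 |               aws s3 cp s3://module02-424242/app.py /home/ec2-user/app.py
72 |               nohup python3 /home/ec2-user/app.py &
73 |               EOF
74 |
75 |   tags = {
76 |     Name         = "${var.project_name}-api"
77 |     Project_name = var.project_name
78 |   }
79 | }
80 |
81 | output "instance_public_ip" {
82 |   value = aws_instance.api.public_ip
83 | }
84 | ```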
--------------------------------------------------------------------------------
/module02/ex09/ex09.md:
--------------------------------------------------------------------------------
1 | # Exercise 09 - Network
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex09 |
6 | | Files to turn in: | 00_variables.tf, 01_networking.tf, 10_terraform.auto.tfvars |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 | AWS and I lied to you! You thought deploying a server was that simple? A huge part of the required stack for the deployment is hidden! This hidden layer uses a wizard configuration (default configuration suitable for most users). The default configuration includes:
11 | - network (VPC, subnets, CIDR blocks)
12 | - network components (routing table, Internet gateway, NAT gateway)
13 | - security (NACLS, security groups)
14 |
15 | ## Exercise
16 |
17 | For this new implementation, we are going to recode more parts of our architecture, such as the network. We are going to use our own VPC rather than relying on the default one.
18 |
19 | Create a VPC using terraform. You have to respect the following constraints:
20 |
21 | - your vpc is deployed in Ireland (specified as the variable `region`).
22 | - your vpc uses a `10.0.0.0/16` CIDR block.
23 | - your vpc must enable DNS hostnames (this will be useful for the next exercises)
24 |
25 | On your AWS console, you can go in the VPC section to check if your VPC was correctly created.
26 |
27 | Within our newly created VPC, we want to divide the network's IPs into subnets. This can be useful for many different purposes and helps isolate groups of hosts together and deal with them easily. In AWS, subnets are often associated with different availability zones, which guarantees high availability even if an AWS data center is destroyed.
28 |
29 | {width=300px}
30 |
31 | Within our previously created VPC, add 2 subnets with the following characteristics:
32 | - they depend on the creation of the VPC (this has to be specified in Terraform)
33 | - your subnets will use `10.0.1.0/24` and `10.0.2.0/24` CIDR blocks.
34 | - your subnets will use `eu-west-1a` and `eu-west-1b` availability zones.
35 | - they must map public ip on launch.
36 |
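37 | A minimal sketch of the VPC and subnet resources for `01_networking.tf` (the `project_name` variable and the resource names are assumptions; `count` is just one way to create the two subnets):
38 |
39 | ```
40 | resource "aws_vpc" "module02" {
41 |   cidr_block           = "10.0.0.0/16"
42 |   enable_dns_hostnames = true
43 |
44 |   tags = {
45 |     Name         = "${var.project_name}-vpc"
46 |     Project_name = var.project_name
47 |   }
48 | }
49 |
50 | resource "aws_subnet" "public" {
51 |   count = 2
52 |
53 |   depends_on              = [aws_vpc.module02]
54 |   vpc_id                  = aws_vpc.module02.id
55 |   cidr_block              = "10.0.${count.index + 1}.0/24"
56 |   availability_zone       = element(["eu-west-1a", "eu-west-1b"], count.index)
57 |   map_public_ip_on_launch = true
58 |
59 |   tags = {
60 |     Name         = "${var.project_name}-subnet-${count.index}"
61 |     Project_name = var.project_name
62 |   }
63 | }
64 | ```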
--------------------------------------------------------------------------------
/module02/ex10/ex10.md:
--------------------------------------------------------------------------------
1 | # Exercise 10 - IGW - Route table
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex10 |
6 | | Files to turn in: | 00_variables.tf, 01_networking.tf, 10_terraform.auto.tfvars |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 | Let's continue our infrastructure! We created a network with a VPC and divided it into subnets across two different availability zones. However, our network is still not accessible from the internet (or from your IP). First, we need to create an internet gateway and link it with our VPC. This first step will allow us to interact with the internet.
11 |
12 | The subnets we created will be used to host our EC2 instances, but they are currently disconnected from the internet and from other IPs within the VPC. To fix this problem, we will create a route table (it acts like a combination of a switch, for IPs inside your VPC, and a router, for external IPs).
13 |
14 | {width=300px}
15 |
16 | ## Exercise
17 |
18 | Create an Internet gateway (IGW) which depends on the VPC you created. Your IGW will need the tags:
19 | - `project_name` with the value `module02`
20 | - `Name` with the value `module02-igw`
21 |
22 | Create a route table that depends on the VPC and the IGW. Your route table will have to implement a route linking the `0.0.0.0/0` CIDR block to the IGW. The `0.0.0.0/0` route is really important in the IP lookup process: it means that if the IP you are looking for cannot be found within the VPC, the traffic is sent to other networks (through the IGW). Your route table will need the following tags:
23 | - `project_name` with the value `module02`
24 | - `Name` with the value `module02-rt`
25 |
26 | You thought you were finished? We now need to associate our subnets with the route table! Create route table associations for both of your subnets. They will depend on the route table and the concerned subnet.
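27 |
28 | A minimal sketch of the IGW, route table and associations (it assumes the `aws_vpc.module02` and `aws_subnet.public` resources sketched in the previous exercise):
29 |
30 | ```
31 | resource "aws_internet_gateway" "module02" {
32 |   depends_on = [aws_vpc.module02]
33 |   vpc_id     = aws_vpc.module02.id
34 |
35 |   tags = {
36 |     Name         = "module02-igw"
37 |     project_name = "module02"
38 |   }
39 | }
40 |
41 | resource "aws_route_table" "module02" {
42 |   depends_on = [aws_vpc.module02, aws_internet_gateway.module02]
43 |   vpc_id     = aws_vpc.module02.id
44 |
45 |   # Anything that does not match a VPC-internal route goes out through the IGW.
46 |   route {
47 |     cidr_block = "0.0.0.0/0"
48 |     gateway_id = aws_internet_gateway.module02.id
49 |   }
50 |
51 |   tags = {
52 |     Name         = "module02-rt"
53 |     project_name = "module02"
54 |   }
55 | }
56 |
57 | resource "aws_route_table_association" "public" {
58 |   count          = 2
59 |   subnet_id      = aws_subnet.public[count.index].id
60 |   route_table_id = aws_route_table.module02.id
61 | }
62 | ```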
--------------------------------------------------------------------------------
/module02/ex11/ex11.md:
--------------------------------------------------------------------------------
1 | # Exercise 11 - Autoscaling
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex11 |
6 | | Files to turn in: | 02_asg.tf, \*.tf, \*.tfvars |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 |
11 | Any cloud provider is based on a pay-as-you-go system, which lets us pay according to the traffic we actually serve. If we have 10 users, our t2.micro EC2 may be sufficient for our Flask application, but if tomorrow 1000000 users want to try our super API, we have to find a way to scale our infrastructure!
12 |
13 | This can be done through autoscaling groups. The more traffic we have, the more EC2 instances will spawn to handle the growing traffic. Of course, those new instances will be terminated if the number of users goes down.
14 |
15 | {width=300px}
16 |
17 | ## Exercise
18 |
19 | Transform your `02_ec2.tf` terraform file into `02_asg.tf`. You will have to transform your code into an autoscaling group.
20 |
21 | You need to implement a launch configuration with your EC2 parameters and add a create before destroy lifecycle.
22 |
23 | Create an autoscaling group with :
24 | - dependency on the launch configuration
25 | - a link to the subnets of our vpc
26 | - the launch configuration you previously created
27 | - a minimum and maximum size of 2 for your autoscaling group (this will allow us to always keep 2 instances up even if one terminates)
28 | - a tag with:
29 | - `Autoscaling Flask` for a key
30 | - `flask-asg` for the value
31 | - the propagate at launch option
32 |
33 | You should see 2 EC2 instances created within your AWS console.
34 |
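35 | A minimal sketch of `02_asg.tf` (the AMI id and the `bootstrap.sh` user-data script are placeholders; the key pair, instance profile, security group and subnets come from the previous exercises' sketches, and in the custom-VPC setup the security group must be attached to your VPC):
36 |
37 | ```
38 | resource "aws_launch_configuration" "flask" {
39 |   name_prefix                 = "module02-flask-"
40 |   image_id                    = "ami-xxxxxxxxxxxxxxxxx" # an Amazon Linux 2 AMI id
41 |   instance_type               = "t2.micro"
42 |   key_name                    = aws_key_pair.module02.key_name
43 |   iam_instance_profile        = "module02_s3FullAccessProfile"
44 |   security_groups             = [aws_security_group.module02_flask.id]
45 |   associate_public_ip_address = true
46 |   user_data                   = file("bootstrap.sh") # hypothetical bootstrap script
47 |
48 |   # Replace instances before destroying the old ones when the config changes.
49 |   lifecycle {
50 |     create_before_destroy = true
51 |   }
52 | }
53 |
54 | resource "aws_autoscaling_group" "flask" {
55 |   depends_on           = [aws_launch_configuration.flask]
56 |   launch_configuration = aws_launch_configuration.flask.name
57 |   vpc_zone_identifier  = aws_subnet.public[*].id
58 |   min_size             = 2
59 |   max_size             = 2
60 |
61 |   tag {
62 |     key                 = "Autoscaling Flask"
63 |     value               = "flask-asg"
64 |     propagate_at_launch = true
65 |   }
66 | }
67 | ```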
--------------------------------------------------------------------------------
/module02/ex12/ex12.md:
--------------------------------------------------------------------------------
1 | # Exercise 12 - Load balancer
2 |
3 | | | |
4 | | -----------------------:| ------------------ |
5 | | Turn-in directory: | ex12 |
6 | | Files to turn in: | 03_elb.tf, \*.tf, \*.tfvars |
7 | | Forbidden function: | None |
8 | | Remarks: | n/a |
9 |
10 |
11 | Let's finish our infrastructure! With our autoscaling group, we now have 2 instances but we still need to go to our AWS console to search the IP of each EC2 instance which is not convenient!
12 |
13 | A solution is to create a load balancer. A load balancer as its name indicates will balance the traffic between EC2 instances (of our autoscaling group here).
14 |
15 | {width=300px}
16 |
17 | ## Exercise
18 |
19 | Create a security group for your load balancer. It must:
20 | - depend on the vpc you created
21 | - allow incoming traffic on port 5000
22 |
23 | Create a load balancer with:
24 | - a health check on port 5000 every 30 sec and a healthy threshold at 2
25 | - a listener on the port 5000
26 | - the cross-zone load balancing option
27 |
28 | In your autoscaling group, add your load balancer and a health check of type `ELB`.
29 |
30 | Create a terraform output that will display the DNS name of your load balancer (this output will replace the output ip of the EC2 we had).
31 |
32 | You should now be able to use the DNS name of your load balancer to call the API (yes, this should work with the `--ip` option of your client without any other modification)!
33 |
34 | After `terraform apply` finishes, you will probably have to wait 30 seconds to 1 minute before the API is working.
35 |
36 | **Do not forget to `terraform destroy` at the end of the module!**
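37 |
38 | To get you started, a minimal sketch of `03_elb.tf` using the classic `aws_elb` resource (the resource names and the wide-open CIDR block are assumptions; the VPC and subnets come from the previous sketches):
39 |
40 | ```
41 | resource "aws_security_group" "elb" {
42 |   name   = "module02-elb-sg"
43 |   vpc_id = aws_vpc.module02.id
44 |
45 |   ingress {
46 |     from_port   = 5000
47 |     to_port     = 5000
48 |     protocol    = "tcp"
49 |     cidr_blocks = ["0.0.0.0/0"]
50 |   }
51 |
52 |   egress {
53 |     from_port   = 0
54 |     to_port     = 0
55 |     protocol    = "-1"
56 |     cidr_blocks = ["0.0.0.0/0"]
57 |   }
58 | }
59 |
60 | resource "aws_elb" "flask" {
61 |   name                      = "module02-elb"
62 |   subnets                   = aws_subnet.public[*].id
63 |   security_groups           = [aws_security_group.elb.id]
64 |   cross_zone_load_balancing = true
65 |
66 |   listener {
67 |     lb_port           = 5000
68 |     lb_protocol       = "http"
69 |     instance_port     = 5000
70 |     instance_protocol = "http"
71 |   }
72 |
73 |   # Health check on port 5000 every 30 seconds, healthy after 2 successes.
74 |   health_check {
75 |     target              = "HTTP:5000/"
76 |     interval            = 30
77 |     healthy_threshold   = 2
78 |     unhealthy_threshold = 2
79 |     timeout             = 5
80 |   }
81 | }
82 |
83 | # In the autoscaling group, add: load_balancers = [aws_elb.flask.name]
84 | # and health_check_type = "ELB".
85 |
86 | output "elb_dns_name" {
87 |   value = aws_elb.flask.dns_name
88 | }
89 | ```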
--------------------------------------------------------------------------------
/module02/module02.md:
--------------------------------------------------------------------------------
1 | # Module02 - Cloud Storage API
2 |
3 | In this module, you will learn how to use a cloud provider. For all the exercises, I took Amazon Web Services (AWS) as an example, but **you are totally free to use any cloud provider you want that is compatible with Terraform** (we advise you to use AWS if you don't already have one). AWS has become the most popular cloud service provider in the world, followed by Microsoft Azure and Google Cloud Platform.
4 |
5 | Amazon Web Services launched in 2006 and now delivers over 200 services. Due to this large number of services and the maturity of AWS, it is a good option for starting to learn cloud computing.
6 |
7 | If you have never heard about the Cloud before, do not worry! You will learn step by step what the Cloud is and how to use it.
8 |
9 | ## Notions of the module
10 |
11 | The module is divided into two parts. In the first one, you will learn to use a software development kit (SDK) which will allow you to interact with your cloud using Python. In the second part of the module, you will learn to use a tool called Terraform which will allow you to deploy/destroy cloud infrastructures.
12 |
13 | ## General rules
14 |
15 | * The exercises are ordered from the easiest to the hardest.
16 | * Your exercises are going to be evaluated by someone else, so make sure that your variable names and function names are appropriate and civil.
17 | * Your manual is the internet.
18 | * You can also ask any question in the dedicated channel in Slack: **[42ai slack](https://42-ai.slack.com)**.
19 | * If you find any issue or mistakes in the subject please create an issue on our dedicated repository on Github: **[Github issues](https://github.com/42-AI/bootcamp_data-engineering/issues)**.
20 |
21 | ## Foreword
22 |
23 | Cloud computing is the on-demand delivery of IT resources and applications via the Internet with pay-as-you-go pricing. In fact, a cloud server is located in a data center that could be anywhere in the world.
24 |
25 | Whether you run applications that share photos to millions of mobile users or deliver services that support the critical operations of your business, the cloud provides rapid access to flexible and low-cost IT resources. With cloud computing, you don’t need to make large up-front investments in hardware and spend a lot of time managing that hardware. Instead, you can provision exactly the right type and size of computing resources you need to power your newest bright idea or operate your IT department. With cloud computing, you can access as many resources as you need, almost instantly, and only pay for what you use.
26 |
27 | In its simplest form, cloud computing provides an easy way to access servers, storage, databases, and a broad set of application services over the Internet. Cloud computing providers such as AWS own and maintain the network-connected hardware required for these application services, while you provision and use what you need for your workloads.
28 |
29 | As seen previously, Cloud computing provides some real benefits :
30 |
31 | - **Variable expense**: You don't need to invest in huge data centers you may not use at full capacity. You pay for how much you consume!
32 | - **Available in minutes**: New IT resources can be accessed within minutes.
33 | - **Economies of scale**: A large number of users enables Cloud providers to achieve higher economies of scale, which translates into lower prices.
34 | - **Global in minutes**: Cloud architectures can be deployed really easily all around the world.
35 |
36 | Deployments using the cloud can be `all-in-cloud-based` (the entire infrastructure is in the cloud) or `hybrid` (using on-premise and cloud).
37 |
38 | ## AWS global infrastructure
39 |
40 | Amazon Web Services (AWS) is a cloud service provider, also known as infrastructure-as-a-service (`IaaS`). AWS is the clear market leader in this domain and offers much more services compared to its competitors.
41 |
42 | AWS has some interesting properties such as:
43 |
44 | - **High availability**: Any file can be accessed from anywhere.
45 | - **Fault tolerance**: In case an AWS server fails, you can still retrieve the files (the fault tolerance is due to redundancy).
46 | - **Scalability**: Possibility to add more servers when needed.
47 | - **Elasticity**: Possibility to grow or shrink infrastructure.
48 |
49 | AWS provides a highly available technology infrastructure platform with multiple locations worldwide. These locations are composed of `regions` and `availability zones`.
50 |
51 | Each region represents a unique geographic area. Each region contains multiple, isolated locations known as availability zones. An availability zone is a physical data center geographically separated from other availability zones (redundant power, networking, and connectivity).
52 |
53 | You can achieve high availability by deploying your application across multiple availability zones.
54 |
55 | {width=400px}
56 |
57 | The `edge locations` you see on the picture are AWS endpoints used for caching content (a performance optimization mechanism in which data is delivered from the closest servers for optimal application performance). They typically consist of CloudFront, Amazon's content delivery network (CDN).
58 |
59 | ## Helper
60 |
61 | * Your best friends for the module: **[AWS documentation](https://docs.aws.amazon.com/index.html)** and **[Terraform documentation](https://www.terraform.io/docs/index.html)**.
62 |
63 | ### Exercise 00 - Setup
64 | ### Exercise 01 - Storage
65 | ### Exercise 02 - Compute
66 | ### Exercise 03 - Flask API - List & Delete
67 | ### Exercise 04 - Flask API - Download & Upload
68 | ### Exercise 05 - Client
69 | ### Exercise 06 - IAM role
70 | ### Exercise 07 - Security group
71 | ### Exercise 08 - Cloud API
72 | ### Exercise 09 - Network
73 | ### Exercise 10 - IGW - Route table
74 | ### Exercise 11 - Autoscaling
75 | ### Exercise 12 - Load balancer
76 |
--------------------------------------------------------------------------------
/resources/appstore_games.csv.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/42-AI/bootcamp_data-engineering/0b745562307d5fa6b395ba9b745c0df6e76ed17d/resources/appstore_games.csv.zip
--------------------------------------------------------------------------------