├── .gitignore
├── img
    ├── author_image.png
    └── shield_image.png
├── course.yml
├── README.md
└── chapter1.md


/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_STORE
2 | .cache
3 | .ipynb_checkpoints
4 | .spyderproject
5 | 


--------------------------------------------------------------------------------
/img/author_image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/community-courses-jhu-genomics-labs/master/img/author_image.png


--------------------------------------------------------------------------------
/img/shield_image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/community-courses-jhu-genomics-labs/master/img/shield_image.png


--------------------------------------------------------------------------------
/course.yml:
--------------------------------------------------------------------------------
1 | title                : Genomic Data Science Labs
2 | instructors          :
3 |  - weston@datacamp.com
4 | description          : This is a demo for the Python for Genomic Data Science course of the Genomic Data Science Specialization developed by Johns Hopkins.
5 | university           : DataCamp
6 | difficulty_level     : 2
7 | time_needed          : 1 hour
8 | programming_language : python
9 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # DataCamp Template Course
 2 | <a href=https://www.datacamp.com//teach/repositories/68167528/go target="_blank"><img src="https://s3.amazonaws.com/assets.datacamp.com/img/github/content-engineering-repos/course_button.png" width="150"></a>
 3 | <a href=https://www.datacamp.com//teach/repositories target="_blank"><img src="https://s3.amazonaws.com/assets.datacamp.com/img/github/content-engineering-repos/dashboard_button.png" width="150"></a>
 4 | 
 5 | This is an automatically generated <a href=https://www.datacamp.com target="_blank">DataCamp</a> course. You can start from these template files to create your own course.
 6 | 
 7 | Changes you make to this GitHub repository are automatically reflected in the linked DataCamp course. This means that you can enjoy all the advantages of GitHub: version control, collaboration, issue handling, and more.
 8 | 
 9 | ## Workflow
10 | 
11 | 1. Edit the markdown and yml files in this repository. You can use GitHub's online editor or use <a href=https://git-scm.com/ target="_blank">git</a> locally and push your changes.
12 | 2. Check out your build attempts on the <a href=https://www.datacamp.com//teach/repositories target="_blank">Dashboard</a>.
13 | 3. Check out your automatically updated <a href=https://www.datacamp.com/teach/repositories/68167528/go target="_blank">course on DataCamp</a>
14 | 
15 | ## Getting Started
16 | 
17 | A DataCamp course consists of two types of files:
18 | 
19 | - `course.yml`, a <a href=http://docs.ansible.com/ansible/YAMLSyntax.html target="_blank">YAML-formatted file</a> that's prepopulated with some general course information.
20 | - `chapterX.md`, a markdown file with:
21 |    - a YAML header containing chapter information.
22 |    - markdown chunks representing DataCamp Exercises.
23 | 
24 | To learn more about the structure of a DataCamp course, check out the <a href=https://www.datacamp.com//teach/documentation#tab_course_structure target="_blank">documentation</a>.
25 | 
26 | Every DataCamp exercise consists of different parts; read up about them <a href=https://www.datacamp.com//teach/documentation#tab_code_exercises target="_blank">here</a>. A very important part of DataCamp exercises is providing automated, personalized feedback to students. In R, these so-called Submission Correctness Tests (SCTs) are written with the <a href=https://github.com/datacamp/testwhat target="_blank">`testwhat`</a> package. SCTs for Python exercises are coded up with <a href=https://github.com/datacamp/pythonwhat target="_blank">`pythonwhat`</a>. Check out the GitHub repositories' wiki pages for more information and examples.
27 | 
28 | Want to learn more? Check out the <a href=https://www.datacamp.com//teach/documentation target="_blank">documentation</a> on teaching at DataCamp.
29 | 
30 | *Happy teaching!*
31 | 


--------------------------------------------------------------------------------
/chapter1.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | title       : Genomic Data Science Specialization - Final Quiz Demo
  3 | description : This class provides an introduction to the Python programming language and the iPython notebook. This is the third course in the Genomic Big Data Science Specialization from Johns Hopkins University
  4 | attachments :
  5 | slides_link :
  6 | 
  7 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:781f7226ca
  8 | ## Question 1
  9 | 
 10 | Write a Python program that takes as input a multi-FASTA file with DNA sequences, and computes the answers to the following questions. You can choose to write one program with multiple functions to answer these questions, or you can write several programs to address them. There will be a final quiz at the end of the Python for Genomic Data Science course that will require you to run this program on a provided multi-FASTA file.
 11 | 
 12 | How many records are in the file? A record in a FASTA file is defined as a single-line header, followed by lines of sequence data. The header of the record is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description. There should be no space between the ">" and the first letter of the identifier.
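
The record definition above can be sketched as a small parser. This is only an illustrative sketch, not the course's reference solution: the function name `parse_fasta` and the toy records are assumptions made for this example, and it assumes every header has a non-empty identifier directly after the ">".

```python
def parse_fasta(lines):
    """Return a dict mapping each record identifier to its full sequence."""
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after ">"; the rest is the description.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            # Sequence data may span several lines; collect the pieces.
            records[current_id].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Toy input (hypothetical data, not the quiz file).
example = """>seq1 first toy record
ATGCCC
TAG
>seq2 second toy record
ATGAAAAAA
""".splitlines()

sequences = parse_fasta(example)
print(len(sequences))                                        # number of records
print({name: len(seq) for name, seq in sequences.items()})   # length of each sequence
```

For a real file, the same function could be called on an open file handle, since iterating over a file yields its lines.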
 13 | 
 14 | *** =instructions
 15 | - 14
 16 | - 740
 17 | - 1240
 18 | - 20
 19 | - 18
 20 | - 19
 21 | - 760
 22 | 
 23 | *** =hint
 24 | Hint Hint Hint
 25 | 
 26 | *** =pre_exercise_code
 27 | ```{python}
 28 | # The pre exercise code runs code to initialize the user's workspace.
 29 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
 30 | 
 31 | import pandas as pd
 32 | import matplotlib.pyplot as plt
 33 | 
 34 | ```
 35 | 
 36 | *** =sct
 37 | ```{python}
 38 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
 39 | 
 40 | msg_bad = "That is not correct!"
 41 | msg_success = "Exactly! Keep up the great work."
 42 | test_mc(4, [msg_bad, msg_bad, msg_bad, msg_success, msg_bad, msg_bad, msg_bad])
 43 | ```
 44 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:7187d6e84e
 45 | ## Question 2
 46 | 
 47 | What are the lengths of the sequences in the file? What is the longest sequence and what is the shortest sequence? Is there more than one longest or shortest sequence? What are their identifiers?
 48 | 
 49 | What is the length of the longest sequence in the file?
 50 | 
 51 | *** =instructions
 52 | - 6007
 53 | - 3245
 54 | - 1247
 55 | - 10523
 56 | - 4510
 57 | - 4200
 58 | 
 59 | *** =hint
 60 | Hint Hint Hint
 61 | 
 62 | *** =pre_exercise_code
 63 | ```{python}
 64 | # The pre exercise code runs code to initialize the user's workspace.
 65 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
 66 | 
 67 | import pandas as pd
 68 | import matplotlib.pyplot as plt
 69 | 
 70 | ```
 71 | 
 72 | *** =sct
 73 | ```{python}
 74 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
 75 | 
 76 | msg_bad = "That is not correct!"
 77 | msg_success = "Great job! You are awesome."
 78 | test_mc(5, [msg_bad, msg_bad, msg_bad, msg_bad, msg_success, msg_bad])
 79 | ```
 80 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:8f07414a65
 81 | ## Question 3
 82 | 
 83 | In molecular biology, a reading frame is a way of dividing the DNA sequence of nucleotides into a set of consecutive, non-overlapping triplets. Depending on what nucleotide in the first triplet of the DNA sequence we start the reading frame, there are six possible reading frames - three in the forward (5' to 3') direction and three in the reverse (3' to 5'). For instance, the three possible forward reading frames for the sequence AGGTGACACCGCAAGCCTTATATTAGC are:
 84 | ```
 85 | AGG·TGA·CAC·CGC·AAG·CCT·TAT·ATT·AGC
 86 | A·GGT·GAC·ACC·GCA·AGC·CTT·ATA·TTA·GC
 87 | AG·GTG·ACA·CCG·CAA·GCC·TTA·TAT·TAG·C
 88 | ```
 89 | 
 90 | These are called reading frames 1, 2, and 3 respectively. An open reading frame (ORF) is the part of a reading frame that has the potential to code for a protein or peptide. It starts with a start codon (ATG), and ends with a stop codon (TAA, TAG or TGA). For instance, ATGAAATAG is an ORF of length 9. Given an input forward reading frame (1, 2, or 3) your program should be able to identify all ORFs present in each sequence of the FASTA file, and answer the following questions: what is the length of the longest ORF in the file? What is the identifier of the sequence containing the longest ORF? What is the starting position of the longest ORF in the sequence that contains it? The position should indicate the character number in the sequence. For instance, the following ORF in reading frame 1:
 91 | >sequence1
 92 | ```ATGCCCTAG```
 93 | <p>starts at position 1.
 94 | <p>Note that because the following sequence:
 95 | >sequence2
 96 | ```ATGAAAAAA```
 97 | doesn't have any stop codon in reading frame 1, we will not consider it to be an ORF in reading frame 1.
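
The ORF definition above can be sketched as a scan over the codons of one forward reading frame. This is a minimal sketch under stated assumptions, not the course's reference solution: the name `find_orfs` is hypothetical, and the sketch assumes that an ORF begins at the first ATG seen after the previous stop codon, so a later in-frame ATG inside an open ORF does not start a new one.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, frame):
    """Return (start_position, length) for each ORF in a forward reading frame.

    frame is 1, 2, or 3; start_position is 1-based, as in the text above.
    """
    sequence = sequence.upper()
    orfs = []
    start = None  # 0-based index of the currently open start codon, if any
    for i in range(frame - 1, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            # The length includes both the start codon and the stop codon.
            orfs.append((start + 1, i + 3 - start))
            start = None
    return orfs


print(find_orfs("ATGCCCTAG", 1))  # sequence1 from the text
print(find_orfs("ATGAAAAAA", 1))  # sequence2 from the text: no stop codon
```

The longest ORF in a frame could then be picked with `max(orfs, key=lambda orf: orf[1])` once the list is known to be non-empty.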
 98 | 
 99 | What is the length of the shortest sequence in the file?
100 | 
101 | 
102 | *** =instructions
103 | - 1081
104 | - 257
105 | - 475
106 | - 132
107 | - 834
108 | - 1721
109 | 
110 | *** =hint
111 | Hint Hint Hint
112 | 
113 | *** =pre_exercise_code
114 | ```{python}
115 | # The pre exercise code runs code to initialize the user's workspace.
116 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
117 | 
118 | import pandas as pd
119 | import matplotlib.pyplot as plt
120 | 
121 | ```
122 | 
123 | *** =sct
124 | ```{python}
125 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
126 | 
127 | msg_bad = "That is not correct!"
128 | msg_success = "Great job! You are awesome."
129 | test_mc(3, [msg_bad, msg_bad, msg_success, msg_bad, msg_bad, msg_bad])
130 | ```
131 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:642d18896c
132 | ## Question 4
133 | A repeat is a substring of the DNA sequence that occurs in multiple copies (more than one) throughout the sequence. Although repeats can occur on both the forward and reverse strands of the DNA sequence, we will only consider repeats on the forward strand here.
134 | 
135 | Also, we will assume that the repeats can overlap. For instance, the sequence ACACA contains two repeats of length 3: ACA and CAC. CAC occurs only once in the sequence, at position 2 (index 1 in Python), while ACA occurs twice: once at position 1 and once at position 3.
136 | 
137 | Given a length n, your program should be able to identify all the repeats of length n in all sequences in the FASTA file. Your program should also find out how many times each repeat occurs in the file, and which is the most frequent repeat of a given length.
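
Counting overlapping substrings of length n, as described above, can be sketched with a sliding window. This is an illustrative sketch only: the name `count_substrings` is an assumption, and whether to keep substrings that occur just once (like CAC in the example) is a choice left to the reader, shown here as an optional filter.

```python
from collections import Counter


def count_substrings(sequences, n):
    """Count every overlapping substring of length n across all sequences."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of width n one character at a time, so overlaps count.
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


counts = count_substrings(["ACACA"], 3)
print(counts)

# Optional filter: keep only substrings that occur more than once.
repeats = {sub: c for sub, c in counts.items() if c > 1}
print(repeats)

# The most frequent substring of this length and its count.
print(counts.most_common(1))
```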
138 | 
139 | What is the length of the longest ORF appearing in reading frame 2 of any of the sequences?
140 | 
141 | 
142 | *** =instructions
143 | - 1488
144 | - 1134
145 | - 1518
146 | - 834
147 | - 1271
148 | - 1272
149 | 
150 | *** =hint
151 | Hint Hint Hint
152 | 
153 | *** =pre_exercise_code
154 | ```{python}
155 | # The pre exercise code runs code to initialize the user's workspace.
156 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
157 | 
158 | import pandas as pd
159 | import matplotlib.pyplot as plt
160 | 
161 | ```
162 | 
163 | *** =sct
164 | ```{python}
165 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
166 | 
167 | msg_bad = "That is not correct!"
168 | msg_success = "Great job! You are awesome."
169 | test_mc(2, [msg_bad, msg_success, msg_bad, msg_bad, msg_bad, msg_bad])
170 | ```
171 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:60978bf3e8
172 | ## Question 5
173 | 
174 | At what position in the sequence does the longest ORF in reading frame 3 start? The position should indicate the character number in the sequence. For instance, the following ORF:
175 | 
176 | > sequence1
177 | ATGCCCTAG
178 | starts at position 1.
179 | 
180 | 
181 | *** =instructions
182 | - 237
183 | - 1045
184 | - 1818
185 | - 39
186 | - 575
187 | - 2100
188 | 
189 | *** =hint
190 | Hint Hint Hint
191 | 
192 | *** =pre_exercise_code
193 | ```{python}
194 | # The pre exercise code runs code to initialize the user's workspace.
195 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
196 | 
197 | import pandas as pd
198 | import matplotlib.pyplot as plt
199 | 
200 | ```
201 | 
202 | *** =sct
203 | ```{python}
204 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
205 | 
206 | msg_bad = "That is not correct!"
207 | msg_success = "Great job! You are awesome."
208 | test_mc(3, [msg_bad, msg_bad, msg_success, msg_bad, msg_bad, msg_bad])
209 | ```
210 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:24d0be4fc4
211 | ## Question 6
212 | 
213 | What is the length of the longest ORF appearing in any sequence and in any forward reading frame?
214 | 
215 | 
216 | *** =instructions
217 | - 1518
218 | - 1488
219 | - 1095
220 | - 834
221 | - 1272
222 | 
223 | *** =hint
224 | Hint Hint Hint
225 | 
226 | *** =pre_exercise_code
227 | ```{python}
228 | # The pre exercise code runs code to initialize the user's workspace.
229 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
230 | 
231 | import pandas as pd
232 | import matplotlib.pyplot as plt
233 | 
234 | ```
235 | 
236 | *** =sct
237 | ```{python}
238 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
239 | 
240 | msg_bad = "That is not correct!"
241 | msg_success = "Great job! You are awesome."
242 | test_mc(1, [msg_success, msg_bad, msg_bad, msg_bad, msg_bad])
243 | ```
244 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:53437f0648
245 | ## Question 7
246 | 
247 | What is the length of the longest ORF that appears in the sequence with the identifier  
248 | > gi|142022655|gb|EQ086233.1|97?
249 | 
250 | 
251 | *** =instructions
252 | - 2067  
253 | - 1551
254 | - 1358
255 | - 1272
256 | - 1080
257 | - 783
258 | - 588
259 | 
260 | *** =hint
261 | Hint Hint Hint
262 | 
263 | *** =pre_exercise_code
264 | ```{python}
265 | # The pre exercise code runs code to initialize the user's workspace.
266 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
267 | 
268 | import pandas as pd
269 | import matplotlib.pyplot as plt
270 | 
271 | ```
272 | 
273 | *** =sct
274 | ```{python}
275 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
276 | 
277 | msg_bad = "That is not correct!"
278 | msg_success = "Great job! You are awesome."
279 | test_mc(4, [msg_bad, msg_bad, msg_bad, msg_success, msg_bad, msg_bad, msg_bad])
280 | ```
281 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:ea1198f108
282 | ## Question 8
283 | 
284 | How many times does the most frequent repeat of length 6 occur in all sequences?
285 | 
286 | *** =instructions
287 | - 153
288 | - 158
289 | - 72
290 | - 541
291 | - 98
292 | - 356
293 | 
294 | *** =hint
295 | Hint Hint Hint
296 | 
297 | *** =pre_exercise_code
298 | ```{python}
299 | # The pre exercise code runs code to initialize the user's workspace.
300 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
301 | 
302 | import pandas as pd
303 | import matplotlib.pyplot as plt
304 | 
305 | ```
306 | 
307 | *** =sct
308 | ```{python}
309 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
310 | 
311 | msg_bad = "That is not correct!"
312 | msg_success = "Great job! You are awesome."
313 | test_mc(2, [msg_bad, msg_success, msg_bad, msg_bad, msg_bad, msg_bad])
314 | ```
315 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:ed76e2e48e
316 | ## Question 9
317 | 
318 | How many repeats of length 12 have the largest number of occurrences in all the sequences?
319 | 
320 | *** =instructions
321 | - 3
322 | - 1
323 | - 7
324 | - 24
325 | - 9
326 | 
327 | *** =hint
328 | Hint Hint Hint
329 | 
330 | *** =pre_exercise_code
331 | ```{python}
332 | # The pre exercise code runs code to initialize the user's workspace.
333 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
334 | 
335 | import pandas as pd
336 | import matplotlib.pyplot as plt
337 | 
338 | ```
339 | 
340 | *** =sct
341 | ```{python}
342 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
343 | 
344 | msg_bad = "That is not correct!"
345 | msg_success = "Great job! You are awesome."
346 | test_mc(1, [msg_success, msg_bad, msg_bad, msg_bad, msg_bad])
347 | ```
348 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:9a3bc45efb
349 | ## Question 10
350 | 
351 | Which one of the following repeats of length 7 has the maximum number of occurrences?
352 | 
353 | 
354 | *** =instructions
355 | - CGGCGGC
356 | - CGGCGCT
357 | - TGGTGGC
358 | - GCCGCCG
359 | - GCGGCGC
360 | - TCGGCGC
361 | 
362 | *** =hint
363 | Hint Hint Hint
364 | 
365 | *** =pre_exercise_code
366 | ```{python}
367 | # The pre exercise code runs code to initialize the user's workspace.
368 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
369 | 
370 | import pandas as pd
371 | import matplotlib.pyplot as plt
372 | 
373 | ```
374 | 
375 | *** =sct
376 | ```{python}
377 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
378 | 
379 | msg_bad = "That is not correct!"
380 | msg_success = "Great job! You are awesome."
381 | test_mc(1, [msg_success, msg_bad, msg_bad, msg_bad, msg_bad, msg_bad])
382 | ```
383 | 
384 | --- type:NormalExercise lang:python xp:100 skills:1 key:9d1e104368
385 | ## Regular Exercise
386 | 
387 | Do you remember the plot of the last exercise? Let's make an even cooler plot!
388 | 
389 | A dataset of movies, `movies`, is available in the workspace.
390 | 
391 | *** =instructions
392 | - The first line of code uses the `unique()` function of the `numpy` package (`np.unique()`) to get integer values for the movie genres. You don't have to change this code, just have a look!
393 | - Import `pyplot` in the `matplotlib` package. Set an alias for this import: `plt`.
394 | - Use `plt.scatter()` to plot `movies.runtime` onto the x-axis, `movies.rating` onto the y-axis and use `ints` for the color of the dots. You should use the first and second positional argument, and the `c` keyword.
395 | - Show the plot using `plt.show()`.
396 | 
397 | *** =hint
398 | - You don't have to program anything for the first instruction, just take a look at the first line of code.
399 | - Use `import ___ as ___` to import `matplotlib.pyplot` as `plt`.
400 | - Use `plt.scatter(___, ___, c = ___)` for the third instruction.
401 | - You'll always have to type in `plt.show()` to show the plot you created.
402 | 
403 | *** =pre_exercise_code
404 | ```{python}
405 | import pandas as pd
406 | movies = pd.read_csv("http://s3.amazonaws.com/assets.datacamp.com/course/introduction_to_r/movies.csv")
407 | 
408 | import numpy as np
409 | ```
410 | 
411 | *** =sample_code
412 | ```{python}
413 | # Get integer values for genres
414 | _, ints = np.unique(movies.genre, return_inverse = True)
415 | 
416 | # Import matplotlib.pyplot
417 | 
418 | 
419 | # Make a scatter plot: runtime on  x-axis, rating on y-axis and set c to ints
420 | 
421 | 
422 | # Show the plot
423 | 
424 | ```
425 | 
426 | *** =solution
427 | ```{python}
428 | # Get integer values for genres
429 | _, ints = np.unique(movies.genre, return_inverse = True)
430 | 
431 | # Import matplotlib.pyplot
432 | import matplotlib.pyplot as plt
433 | 
434 | # Make a scatter plot: runtime on  x-axis, rating on y-axis and set c to ints
435 | plt.scatter(movies.runtime, movies.rating, c=ints)
436 | 
437 | # Show the plot
438 | plt.show()
439 | ```
440 | 
441 | *** =sct
442 | ```{python}
443 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
444 | 
445 | test_function("numpy.unique",
446 |               not_called_msg = "Don't remove the call of `np.unique` to define `ints`.",
447 |               incorrect_msg = "Don't change the call of `np.unique` to define `ints`.")
448 | 
449 | test_object("ints",
450 |             undefined_msg = "Don't remove the definition of the predefined `ints` object.",
451 |             incorrect_msg = "Don't change the definition of the predefined `ints` object.")
452 | 
453 | test_import("matplotlib.pyplot", same_as = True)
454 | 
455 | test_function("matplotlib.pyplot.scatter",
456 |               incorrect_msg = "You didn't use `plt.scatter()` correctly, have another look at the instructions.")
457 | 
458 | test_function("matplotlib.pyplot.show")
459 | 
460 | success_msg("Great work!")
461 | ```
462 | 


--------------------------------------------------------------------------------