├── .gitignore
├── img
│   ├── author_image.png
│   └── shield_image.png
├── course.yml
├── README.md
└── chapter1.md
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_STORE
2 | .cache
3 | .ipynb_checkpoints
4 | .spyderproject
5 |
--------------------------------------------------------------------------------
/img/author_image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/community-courses-jhu-genomics-labs/master/img/author_image.png
--------------------------------------------------------------------------------
/img/shield_image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacamp/community-courses-jhu-genomics-labs/master/img/shield_image.png
--------------------------------------------------------------------------------
/course.yml:
--------------------------------------------------------------------------------
1 | title : Genomic Data Science Labs
2 | instructors :
3 | - weston@datacamp.com
4 | description : This is a demo for the Python for Genomic Data Science course of the Genomic Data Science Specialization developed by Johns Hopkins.
5 | university : DataCamp
6 | difficulty_level : 2
7 | time_needed : 1 hour
8 | programming_language : python
9 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DataCamp Template Course
2 |
3 |
4 |
5 | This is an automatically generated DataCamp course. You can start from these template files to create your own course.
6 |
7 | Changes you make to this GitHub repository are automatically reflected in the linked DataCamp course. This means that you can enjoy all the advantages of GitHub: version control, collaboration, issue handling, and more.
8 |
9 | ## Workflow
10 |
11 | 1. Edit the markdown and yml files in this repository. You can use GitHub's online editor or use git locally and push your changes.
12 | 2. Check out your build attempts on the Dashboard.
13 | 3. Check out your automatically updated course on DataCamp.
14 |
15 | ## Getting Started
16 |
17 | A DataCamp course consists of two types of files:
18 |
19 | - `course.yml`, a YAML-formatted file that's prepopulated with some general course information.
20 | - `chapterX.md`, a markdown file with:
21 |     - a YAML header containing chapter information.
22 |     - markdown chunks representing DataCamp Exercises.
23 |
24 | To learn more about the structure of a DataCamp course, check out the documentation.
25 |
26 | Every DataCamp exercise consists of different parts; read up about them here. A very important part of a DataCamp exercise is providing automated, personalized feedback to students. In R, these so-called Submission Correctness Tests (SCTs) are written with the `testwhat` package. SCTs for Python exercises are written with `pythonwhat`. Check out the GitHub repositories' wiki pages for more information and examples.
27 |
28 | Want to learn more? Check out the documentation on teaching at DataCamp.
29 |
30 | *Happy teaching!*
31 |
--------------------------------------------------------------------------------
/chapter1.md:
--------------------------------------------------------------------------------
1 | ---
2 | title : Genomic Data Science Specialization - Final Quiz Demo
3 | description : This class provides an introduction to the Python programming language and the iPython notebook. This is the third course in the Genomic Big Data Science Specialization from Johns Hopkins University.
4 | attachments :
5 | slides_link :
6 |
7 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:781f7226ca
8 | ## Question 1
9 |
10 | Write a Python program that takes as input a multi-FASTA file with DNA sequences, and computes the answers to the following questions. You can choose to write one program with multiple functions to answer these questions, or you can write several programs to address them. There will be a final quiz at the end of the Python for Genomic Data course that will require you to run this program on a provided multi-FASTA file.
11 |
12 | How many records are in the file? A record in a FASTA file is defined as a single-line header, followed by lines of sequence data. The header of the record is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description. There should be no space between the ">" and the first letter of the identifier.
13 |
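The record definition above can be sketched in Python as follows. This is a minimal illustration, not the course's reference solution: the function name `read_fasta` and the toy records are assumptions made for the example, and the real quiz file is not used.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of identifier -> sequence."""
    chunks = {}
    identifier = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after ">"; the rest is the description.
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            chunks[identifier] = []
        elif identifier is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks[identifier].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Hypothetical two-record example, used here instead of the real quiz file.
example = [
    ">seq1 first toy record",
    "ACGT",
    "AC",
    ">seq2 second toy record",
    "GGGTTT",
]
records = read_fasta(example)
print(len(records))  # number of records in the toy input
```

With a real file, the same function could be called on an open file handle, since iterating over a handle yields lines. Note that this sketch assumes identifiers are unique; a repeated identifier would overwrite the earlier record.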
14 | *** =instructions
15 | - 14
16 | - 740
17 | - 1240
18 | - 20
19 | - 18
20 | - 19
21 | - 760
22 |
23 | *** =hint
24 | Hint Hint Hint
25 |
26 | *** =pre_exercise_code
27 | ```{r}
28 | # The pre exercise code runs code to initialize the user's workspace.
29 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
30 |
31 | import pandas as pd
32 | import matplotlib.pyplot as plt
33 |
34 | ```
35 |
36 | *** =sct
37 | ```{r}
38 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
39 |
40 | msg_bad = "That is not correct!"
41 | msg_success = "Exactly! Keep up the great work."
42 | test_mc(4, [msg_bad, msg_bad, msg_bad, msg_success, msg_bad, msg_bad, msg_bad])
43 | ```
44 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:7187d6e84e
45 | ## Question 2
46 |
47 | What are the lengths of the sequences in the file? What is the longest sequence and what is the shortest sequence? Is there more than one longest or shortest sequence? What are their identifiers?
48 |
49 | What is the length of the longest sequence in the file?
50 |
51 | *** =instructions
52 | - 6007
53 | - 3245
54 | - 1247
55 | - 10523
56 | - 4510
57 | - 4200
58 |
59 | *** =hint
60 | Hint Hint Hint
61 |
62 | *** =pre_exercise_code
63 | ```{r}
64 | # The pre exercise code runs code to initialize the user's workspace.
65 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
66 |
67 | import pandas as pd
68 | import matplotlib.pyplot as plt
69 |
70 | ```
71 |
72 | *** =sct
73 | ```{r}
74 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
75 |
76 | msg_bad = "That is not correct!"
77 | msg_success = "Great job! You are awesome."
78 | test_mc(5, [msg_bad, msg_bad, msg_bad, msg_bad, msg_success, msg_bad])
79 | ```
80 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:8f07414a65
81 | ## Question 3
82 |
83 | In molecular biology, a reading frame is a way of dividing the DNA sequence of nucleotides into a set of consecutive, non-overlapping triplets. Depending on which nucleotide of the first triplet the reading frame starts at, there are six possible reading frames: three in the forward (5' to 3') direction and three in the reverse (3' to 5') direction. For instance, the three possible forward reading frames for the sequence
84 | >``` AGGTGACACCGCAAGCCTTATATTAGC
85 | are:
86 | AGG·TGA·CAC·CGC·AAG·CCT·TAT·ATT·AGC
87 | A·GGT·GAC·ACC·GCA·AGC·CTT·ATA·TTA·GC
88 | AG·GTG·ACA·CCG·CAA·GCC·TTA·TAT·TAG·C ```
89 |
90 | These are called reading frames 1, 2, and 3 respectively. An open reading frame (ORF) is the part of a reading frame that has the potential to code for a protein or peptide. It starts with a start codon (ATG), and ends with a stop codon (TAA, TAG or TGA). For instance, ATGAAATAG is an ORF of length 9. Given an input forward reading frame (1, 2, or 3), your program should be able to identify all ORFs present in each sequence of the FASTA file, and answer the following questions: what is the length of the longest ORF in the file? What is the identifier of the sequence containing the longest ORF? What is the starting position of the longest ORF in the sequence that contains it? The position should indicate the character number in the sequence. For instance, the following ORF in reading frame 1:
91 | >sequence1
92 | ```ATGCCCTAG```
93 | starts at position 1.
94 | Note that because the following sequence:
95 | >sequence2
96 | ```ATGAAAAAA```
97 | doesn't have any stop codon in reading frame 1, we will not consider it to be an ORF in reading frame 1.
98 |
99 | What is the length of the shortest sequence in the file?
100 |
101 |
102 | *** =instructions
103 | - 1081
104 | - 257
105 | - 475
106 | - 132
107 | - 834
108 | - 1721
109 |
110 | *** =hint
111 | Hint Hint Hint
112 |
113 | *** =pre_exercise_code
114 | ```{r}
115 | # The pre exercise code runs code to initialize the user's workspace.
116 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
117 |
118 | import pandas as pd
119 | import matplotlib.pyplot as plt
120 |
121 | ```
122 |
123 | *** =sct
124 | ```{r}
125 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
126 |
127 | msg_bad = "That is not correct!"
128 | msg_success = "Great job! You are awesome."
129 | test_mc(3, [msg_bad, msg_bad, msg_success, msg_bad, msg_bad, msg_bad])
130 | ```
131 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:642d18896c
132 | ## Question 4
133 | A repeat is a substring of the DNA sequence that occurs in multiple copies (more than one) throughout the sequence. Although repeats can occur on both the forward and reverse strands of the DNA sequence, we will only consider repeats on the forward strand here.
134 |
135 | We will also assume that repeats can overlap. For instance, the sequence ACACA contains two repeats of length 3: ACA and CAC. CAC occurs once in the sequence, at position 2 (index 1 in Python), while ACA occurs twice: once at position 1 and once at position 3.
136 |
137 | Given a length n, your program should be able to identify all the repeats of length n in all sequences in the FASTA file. Your program should also find out how many times each repeat occurs in the file, and which is the most frequent repeat of a given length.
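The ORF definition from Question 3 and the overlapping-repeat counting described above can be sketched together as follows. This is an illustration under stated assumptions, not the course's official solution: the names `find_orfs` and `count_substrings` are hypothetical, each ORF is taken to run from the first ATG after the previous stop codon to the next in-frame stop codon (so ATGs nested inside an ORF are not reported separately), and lengths include the stop codon, matching the ATGAAATAG example of length 9.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, frame):
    """Return (start_position, length) pairs for ORFs in forward frame 1, 2, or 3.

    Positions are 1-based and lengths include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    start = None
    # Walk the sequence codon by codon, starting at the chosen frame offset.
    for i in range(frame - 1, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            orfs.append((start + 1, i + 3 - start))
            start = None
    # An ATG that never reaches a stop codon is not reported as an ORF.
    return orfs


def count_substrings(seq, n):
    """Count every substring of length n, allowing overlapping occurrences."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


print(find_orfs("ATGAAATAG", 1))
print(find_orfs("ATGAAAAAA", 1))
print(count_substrings("ACACA", 3))
```

To cover a whole file, the per-sequence `Counter` objects can be added together, and `Counter.most_common()` then lists the most frequent substrings of the chosen length.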
138 |
139 | What is the length of the longest ORF appearing in reading frame 2 of any of the sequences?
140 |
141 |
142 | *** =instructions
143 | - 1488
144 | - 1134
145 | - 1518
146 | - 834
147 | - 1271
148 | - 1272
149 |
150 | *** =hint
151 | Hint Hint Hint
152 |
153 | *** =pre_exercise_code
154 | ```{r}
155 | # The pre exercise code runs code to initialize the user's workspace.
156 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
157 |
158 | import pandas as pd
159 | import matplotlib.pyplot as plt
160 |
161 | ```
162 |
163 | *** =sct
164 | ```{r}
165 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
166 |
167 | msg_bad = "That is not correct!"
168 | msg_success = "Great job! You are awesome."
169 | test_mc(2, [msg_bad, msg_success, msg_bad, msg_bad, msg_bad, msg_bad])
170 | ```
171 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:60978bf3e8
172 | ## Question 5
173 |
174 | At what position in the sequence starts the longest ORF in reading frame 3? The position should indicate the character number in the sequence. For instance, the following ORF:
175 |
176 | > sequence1
177 | ATGCCCTAG
178 | starts at position 1.
179 |
180 |
181 | *** =instructions
182 | - 237
183 | - 1045
184 | - 1818
185 | - 39
186 | - 575
187 | - 2100
188 |
189 | *** =hint
190 | Hint Hint Hint
191 |
192 | *** =pre_exercise_code
193 | ```{r}
194 | # The pre exercise code runs code to initialize the user's workspace.
195 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
196 |
197 | import pandas as pd
198 | import matplotlib.pyplot as plt
199 |
200 | ```
201 |
202 | *** =sct
203 | ```{r}
204 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
205 |
206 | msg_bad = "That is not correct!"
207 | msg_success = "Great job! You are awesome."
208 | test_mc(3, [msg_bad, msg_bad, msg_success, msg_bad, msg_bad, msg_bad])
209 | ```
210 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:24d0be4fc4
211 | ## Question 6
212 |
213 | What is the length of the longest ORF appearing in any sequence and in any forward reading frame?
214 |
215 |
216 | *** =instructions
217 | - 1518
218 | - 1488
219 | - 1095
220 | - 834
221 | - 1272
222 |
223 | *** =hint
224 | Hint Hint Hint
225 |
226 | *** =pre_exercise_code
227 | ```{r}
228 | # The pre exercise code runs code to initialize the user's workspace.
229 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
230 |
231 | import pandas as pd
232 | import matplotlib.pyplot as plt
233 |
234 | ```
235 |
236 | *** =sct
237 | ```{r}
238 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
239 |
240 | msg_bad = "That is not correct!"
241 | msg_success = "Great job! You are awesome."
242 | test_mc(1, [msg_success, msg_bad, msg_bad, msg_bad, msg_bad])
243 | ```
244 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:53437f0648
245 | ## Question 7
246 |
247 | What is the length of the longest ORF that appears in the sequence with the identifier
248 | > gi|142022655|gb|EQ086233.1|97?
249 |
250 |
251 | *** =instructions
252 | - 2067
253 | - 1551
254 | - 1358
255 | - 1272
256 | - 1080
257 | - 783
258 | - 588
259 |
260 | *** =hint
261 | Hint Hint Hint
262 |
263 | *** =pre_exercise_code
264 | ```{r}
265 | # The pre exercise code runs code to initialize the user's workspace.
266 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
267 |
268 | import pandas as pd
269 | import matplotlib.pyplot as plt
270 |
271 | ```
272 |
273 | *** =sct
274 | ```{r}
275 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
276 |
277 | msg_bad = "That is not correct!"
278 | msg_success = "Great job! You are awesome."
279 | test_mc(4, [msg_bad, msg_bad, msg_bad, msg_success, msg_bad, msg_bad, msg_bad])
280 | ```
281 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:ea1198f108
282 | ## Question 8
283 |
284 | How many times does the most frequent repeat of length 6 occur in all sequences?
285 |
286 | *** =instructions
287 | - 153
288 | - 158
289 | - 72
290 | - 541
291 | - 98
292 | - 356
293 |
294 | *** =hint
295 | Hint Hint Hint
296 |
297 | *** =pre_exercise_code
298 | ```{r}
299 | # The pre exercise code runs code to initialize the user's workspace.
300 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
301 |
302 | import pandas as pd
303 | import matplotlib.pyplot as plt
304 |
305 | ```
306 |
307 | *** =sct
308 | ```{r}
309 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
310 |
311 | msg_bad = "That is not correct!"
312 | msg_success = "Great job! You are awesome."
313 | test_mc(2, [msg_bad, msg_success, msg_bad, msg_bad, msg_bad, msg_bad])
314 | ```
315 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:ed76e2e48e
316 | ## Question 9
317 |
318 | How many repeats of length 12 have the largest number of occurrences in all the sequences?
319 |
320 | *** =instructions
321 | - 3
322 | - 1
323 | - 7
324 | - 24
325 | - 9
326 |
327 | *** =hint
328 | Hint Hint Hint
329 |
330 | *** =pre_exercise_code
331 | ```{r}
332 | # The pre exercise code runs code to initialize the user's workspace.
333 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
334 |
335 | import pandas as pd
336 | import matplotlib.pyplot as plt
337 |
338 | ```
339 |
340 | *** =sct
341 | ```{r}
342 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
343 |
344 | msg_bad = "That is not correct!"
345 | msg_success = "Great job! You are awesome."
346 | test_mc(1, [msg_success, msg_bad, msg_bad, msg_bad, msg_bad])
347 | ```
348 | --- type:MultipleChoiceExercise lang:python xp:50 skills:1 key:9a3bc45efb
349 | ## Question 10
350 |
351 | Which one of the following repeats of length 7 has the maximum number of occurrences?
352 |
353 |
354 | *** =instructions
355 | - CGGCGGC
356 | - CGGCGCT
357 | - TGGTGGC
358 | - GCCGCCG
359 | - GCGGCGC
360 | - TCGGCGC
361 |
362 | *** =hint
363 | Hint Hint Hint
364 |
365 | *** =pre_exercise_code
366 | ```{r}
367 | # The pre exercise code runs code to initialize the user's workspace.
368 | # You can use it to load packages, initialize datasets and draw a plot in the viewer
369 |
370 | import pandas as pd
371 | import matplotlib.pyplot as plt
372 |
373 | ```
374 |
375 | *** =sct
376 | ```{r}
377 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
378 |
379 | msg_bad = "That is not correct!"
380 | msg_success = "Great job! You are awesome."
381 | test_mc(1, [msg_success, msg_bad, msg_bad, msg_bad, msg_bad, msg_bad])
382 | ```
383 |
384 | --- type:NormalExercise lang:python xp:100 skills:1 key:9d1e104368
385 | ## Regular Exercise
386 |
387 | Do you remember the plot of the last exercise? Let's make an even cooler plot!
388 |
389 | A dataset of movies, `movies`, is available in the workspace.
390 |
391 | *** =instructions
392 | - The first function, `np.unique()`, uses the `unique()` function of the `numpy` package to get integer values for the movie genres. You don't have to change this code, just have a look!
393 | - Import `pyplot` in the `matplotlib` package. Set an alias for this import: `plt`.
394 | - Use `plt.scatter()` to plot `movies.runtime` onto the x-axis, `movies.rating` onto the y-axis and use `ints` for the color of the dots. You should use the first and second positional argument, and the `c` keyword.
395 | - Show the plot using `plt.show()`.
396 |
397 | *** =hint
398 | - You don't have to program anything for the first instruction, just take a look at the first line of code.
399 | - Use `import ___ as ___` to import `matplotlib.pyplot` as `plt`.
400 | - Use `plt.scatter(___, ___, c = ___)` for the third instruction.
401 | - You'll always have to type in `plt.show()` to show the plot you created.
402 |
403 | *** =pre_exercise_code
404 | ```{python}
405 | import pandas as pd
406 | movies = pd.read_csv("http://s3.amazonaws.com/assets.datacamp.com/course/introduction_to_r/movies.csv")
407 |
408 | import numpy as np
409 | ```
410 |
411 | *** =sample_code
412 | ```{python}
413 | # Get integer values for genres
414 | _, ints = np.unique(movies.genre, return_inverse = True)
415 |
416 | # Import matplotlib.pyplot
417 |
418 |
419 | # Make a scatter plot: runtime on x-axis, rating on y-axis and set c to ints
420 |
421 |
422 | # Show the plot
423 |
424 | ```
425 |
426 | *** =solution
427 | ```{python}
428 | # Get integer values for genres
429 | _, ints = np.unique(movies.genre, return_inverse = True)
430 |
431 | # Import matplotlib.pyplot
432 | import matplotlib.pyplot as plt
433 |
434 | # Make a scatter plot: runtime on x-axis, rating on y-axis and set c to ints
435 | plt.scatter(movies.runtime, movies.rating, c=ints)
436 |
437 | # Show the plot
438 | plt.show()
439 | ```
440 |
441 | *** =sct
442 | ```{python}
443 | # SCT written with pythonwhat: https://github.com/datacamp/pythonwhat/wiki
444 |
445 | test_function("numpy.unique",
446 | not_called_msg = "Don't remove the call of `np.unique` to define `ints`.",
447 | incorrect_msg = "Don't change the call of `np.unique` to define `ints`.")
448 |
449 | test_object("ints",
450 | undefined_msg = "Don't remove the definition of the predefined `ints` object.",
451 | incorrect_msg = "Don't change the definition of the predefined `ints` object.")
452 |
453 | test_import("matplotlib.pyplot", same_as = True)
454 |
455 | test_function("matplotlib.pyplot.scatter",
456 | incorrect_msg = "You didn't use `plt.scatter()` correctly, have another look at the instructions.")
457 |
458 | test_function("matplotlib.pyplot.show")
459 |
460 | success_msg("Great work!")
461 | ```
462 |
--------------------------------------------------------------------------------