├── .gitignore
├── README.rst
└── ucd-farm-intro.md

/.gitignore:
--------------------------------------------------------------------------------
*~

--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
Advanced Beginner/Intermediate Shell
====================================

Learning goals:

* expose you to a bunch of syntax around shell use & scripting.
* show you the proximal possibilities of shell use & scripting.
* give you some useful tricks!
* provide fodder for discussion, so please ask questions!

Points to make:

* almost everything we'll do today has existed since the 70s or 80s
  - so pre-Python & R.
* I use almost everything below on a ~weekly basis.
* I learn new things every year ('set -e', for example).
* For me, anything more complicated than what is below => Python
  (easier to test, handle errors, etc.).

Also note! "Data Therapy" sessions 3-5pm Wed, in the Center for
Companion Animal Health (Bennett Room, 2nd floor, CCAH).
Keurig will be provided.

-----

We'll start at the end of `the shell genomics lesson
`__.

Make sure you have the test data! Download and unpack::

   https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/shell-data.zip

and set your current working directory to be the top level dir, 'data/'.

We'll be posting code snippets to::

   https://public.etherpad-mozilla.org/p/2017-jan-adv-beginner-shell

Lisa Cohen will also be talking about HPC qsub scripts:

https://github.com/ngs-docs/2016-adv-begin-shell-genomics/blob/master/ucd-farm-intro.md

----

Exploring directory structures
------------------------------

So, if we do 'ls', we see a bunch of stuff. We didn't create this folder.
How do we figure out what's in it?

Here, 'find' is your first friend::

   find . -type d

This walks systematically (recursively) through everything underneath '.',
finds all directories (type d), and prints them (printing is the default
when no other action is given).

We'll come back to 'find' later, when we use it for finding files.

----

Renaming a bunch of files
-------------------------

Let's go into the MiSeq directory::

   cd MiSeq

and take a look with `ls`.

For our first task, let's pretend that we want to rename all of the fastq
files to be .fq files instead. Here, we get to use two of my favorite
commands - 'for' and 'basename'.

'for' lets you do something to every file in a list. To see it in action::

   for i in *.fastq
   do
      echo $i
   done

This is running the command 'echo' for every value of the variable 'i', which
is set (one by one) to all the values in the expression '*.fastq'.

If we want to get rid of the extension '.fastq', we can use the 'basename'
command::

   for i in *.fastq
   do
      basename $i .fastq
   done

Now, this doesn't actually rename the files - it just prints out the name,
with the suffix '.fastq' removed. To rename the files, we need to capture
the new name in a variable::

   for i in *.fastq
   do
      newname=$(basename $i .fastq).fq
      echo $newname
   done

What ``$( ... )`` does is run the command in the middle, and then replace the
``$( )`` with the output of that command.

Now we have the old name ($i) and the new name ($newname) and we're ready
to write the rename command -- ::

   for i in *.fastq
   do
      newname=$(basename $i .fastq).fq
      echo mv $i $newname
   done

Q: why did I use 'echo' here?
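
One way to convince yourself of what the 'echo' buys you is to try the same loop somewhere harmless. The sketch below is not part of the lesson data: the scratch directory and the `sampleA`/`sampleB` file names are made up for illustration. It runs the "echo mv" version of the loop, saves the printed commands, and then shows that nothing was actually renamed.

```shell
# Scratch directory and file names below are made up for illustration.
scratch=$(mktemp -d)
touch "$scratch/sampleA_001.fastq" "$scratch/sampleB_001.fastq"

(
   # Work inside a subshell so your own working directory is unchanged.
   cd "$scratch" || exit 1
   for i in *.fastq
   do
      newname=$(basename "$i" .fastq).fq
      echo mv "$i" "$newname"      # prints the command; does not run it
   done > planned.txt
)

cat "$scratch/planned.txt"   # the mv commands that *would* run
ls "$scratch"                # the .fastq files are still there, untouched
```

Reading the planned commands before running them is a cheap way to catch a wrong suffix or a typo in the variable name.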

Now that we're pretty sure it all looks good, let's run it for realz::

   for i in *.fastq
   do
      newname=$(basename $i .fastq).fq
      mv $i $newname
   done

and voila, we have renamed all the files!

Side note: you may see backquotes used instead of ``$(...)``. Same thing.

----

Let's also get rid of the annoying '_001' that's at the end of the
files. basename is all fine and good with the end of files, but what
do we do about things in the middle? Now we get to use another one of
my favorite commands -- 'cut'.

What 'cut' does is slice and dice strings. So, for example, ::

   echo hello, world | cut -c5-

will give you 'o, world'.

But this is kind of a strange construction! What's going on?

Well, 'cut' expects to take a bunch of lines of input from a file. If
you don't give it a file name (or you specify '-' as the file name), it
is happy to take them in from stdin ("standard input"), so you can give
it some input via a pipe, which is what we're doing with echo:

We're taking the output of 'echo hello, world' and sending it to the
input of cut with the ``|`` operator ('pipe').

You've probably already seen this with head or tail, but many UNIX
commands read from stdin and write to stdout.

Let's construct the cut command we want to use. If we look at the names of
the files, and we want to remove '_001' only, we can see that each filename
has a bunch of fields separated by '_'.
So we can ask 'cut' to pay attention
to the first four fields, and omit the fifth, around the separator (or
delimiter) '_'::

   echo F3D141_S207_L001_R1_001.fq | cut -d_ -f1-4

That looks about right -- let's put it into a for loop::

   for i in *.fq
   do
      echo $i | cut -d_ -f1-4
   done

Good - now assign it to a variable and append an ending::

   for i in *.fq
   do
      newname=$(echo $i | cut -d_ -f1-4).fq
      echo $newname
   done

and now construct the 'mv' command::

   for i in *.fq
   do
      newname=$(echo $i | cut -d_ -f1-4).fq
      echo mv $i $newname
   done

and if that looks right, run it::

   for i in *.fq
   do
      newname=$(echo $i | cut -d_ -f1-4).fq
      mv $i $newname
   done

Ta-da! You've renamed all your files.

----

Let's do something quite useful - subset a bunch of FASTQ files.

If you look at one of the FASTQ files with head, ::

   head F3D0_S188_L001_R1.fq

you'll see that it's full of FASTQ sequencing records. Often I want
to run a bioinformatics pipeline on some small set of records first,
before running it on the full set, just to make sure all the commands work.
So I'd like to subset all of these files without modifying the originals.

First, let's make sure the originals are read-only::

   chmod u-w *.fq

Now, let's make a 'subset' directory::

   mkdir subset

Now, to subset each file, we want to run a 'head' with an argument
that is the total number of lines we want to take. In this case, it
should be a multiple of 4, because FASTQ records have 4 lines each.
So let's plan to take the first 100 records of each file by using 'head
-400'.
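
The "multiple of 4" arithmetic can be left to the shell. This is a small sketch under assumptions: the variable names (`nrecords`, `nlines`, `scratch`) and the fake 150-record file are invented here and are not part of the lesson data. It uses `head -n N`, which does the same thing as the `head -N` form used in this lesson.

```shell
# Hypothetical numbers and file names, for illustration only.
nrecords=100
nlines=$(( nrecords * 4 ))      # 4 lines per FASTQ record
echo "will run: head -n $nlines"

# Build a fake 150-record FASTQ file in a scratch directory.
scratch=$(mktemp -d)
for n in $(seq 1 150)
do
   printf '@read%s\nACGT\n+\nIIII\n' "$n"
done > "$scratch/fake.fq"

# Take the subset and count the lines that came out.
head -n "$nlines" "$scratch/fake.fq" > "$scratch/fake.subset.fq"
wc -l < "$scratch/fake.subset.fq"
```

Computing the line count from the record count means you only have to change one number if you later want, say, 1000 records, and the subset always ends on a complete record.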

The for loop will now look something like::

   for i in *.fq
   do
      echo "head -400 $i > subset/$i"
   done

If that command looks right, run it for realz::

   for i in *.fq
   do
      head -400 $i > subset/$i
   done

and voila, you have your subsets!

----

Challenge exercise: can you rename all of your files in subset/ to
have 'subset.fq' at the end?

(Work in small groups; start from working code; there are several ways
to do it, all that matters is getting there.)

Some backtracking
-----------------

Variables:

You can use either $varname or ${varname}. The latter is useful
when you want to construct a new filename, e.g.::

   MY${varname}SUBSET

would expand ${varname} and then put MY .. SUBSET on either end, while ::

   MY$varnameSUBSET

would try to put MY in front of $varnameSUBSET which won't work.

(Unknown/uncreated variables give nothing.)

----

We used "$varname" above - what happens if we use ''?

(Variables are interpreted inside of "", and not inside of ''.)

----

Pipes and redirection:

To redirect stdin and stdout, you can use::

   >  - send stdout to a file
   <  - take stdin from a file
   |  - take stdout from first command and make it stdin for second command
   >> - appends stdout to a previously-existing file

stderr (errors) can be redirected::

   2> - send stderr to a file

and you can also say::

   >& - to send all output to a file

Editing on the command line:

Most prompts support 'readline'-style editing. This uses emacs control
keys.

Type something out; then type CTRL-a. Now type CTRL-e. Beginning and end!

Up arrows to recall previous command, left/right arrows, etc.

----

Another useful command along with 'basename' is 'dirname'. Any idea what
it does?

-----

Working with collections of files; conditionals
-----------------------------------------------

Let's go back to the 'data' directory and play around with loops some more. ::

   cd ..

'if' acts on things conditionally::

   for i in *
   do
      if [ -f $i ]; then
         echo $i is a file
      elif [ -d $i ]; then
         echo $i is a directory
      fi
   done

but what the heck is this ``[ ]`` notation? That's actually running
the 'test' command; try 'help test | less' to see the docs. This is a
weird syntax that lets you do all sorts of useful things with files --
I usually use it to get rid of empty files::

   touch emptyfile.txt

to create an empty file, and then::

   for i in *
   do
      if [ \! -s $i ]; then
         echo rm $i
      fi
   done

...and as you can see here, I'm using '!' to say 'not' ('-s' is true
when a file exists and has a size greater than zero).

Executing things conditionally based on exit status
---------------------------------------------------

Let's create two scripts (you can use 'nano' here if you want) -- in
'success.sh', put::

   #! /bin/bash
   echo mesucceed
   exit 0

and in 'fail.sh', put::

   #! /bin/bash
   echo mefail
   exit 1

You can do this with 'heredocs' -- ::

   cat > success.sh < fail.sh <