├── README.md
├── animals.txt
├── test-trimmed.fastq
└── test.fastq


/README.md:
--------------------------------------------------------------------------------
# A Quick bioawk tutorial

There was some interest in bioawk, a useful awk fork for handling
bioinformatics formats, at the UC Davis Software Carpentry course, so
here is a quick tutorial.

## Concepts

Don't write your own FASTA/FASTQ parsers! FASTA is much easier to
parse than FASTQ, but *code reuse* is important either way. FASTQ is a
very hard format to parse *safely* and *quickly*. See
[this post](http://www.biostars.org/p/10353/#11256) for more info on
how tricky it can be to parse FASTQ.


Heng Li (author of samtools, bwa) has written a nice set of parsers
for different languages called
[readfq](https://github.com/lh3/readfq). bioawk is also one of Heng
Li's tools.

## Installing

For this quick tutorial, let's just clone bioawk and install it to
your `/usr/local/bin/`:

    git clone https://github.com/lh3/bioawk.git && cd bioawk && make && mv awk bioawk && sudo cp bioawk /usr/local/bin/


## What's awk?

awk is an old programming language that is useful for processing
*streams* of data. Very quickly, an awk program has three parts:
BEGIN, middle, and END. BEGIN runs before any data is read, the middle
runs on each line, and END runs after all lines have been processed.
For example, we could do:

    $ cat animals.txt
    2011-04-22 21:06 Grizzly 36
    2011-04-23 14:12 Elk 25
    2011-04-23 10:24 Elk 26
    2011-04-23 20:08 Wolverine 31
    2011-04-23 18:46 Muskox 20

    $ awk '{ if ($4 > 26) print $3 }' animals.txt
    Grizzly
    Wolverine

awk splits each line into columns and maps them to numbered fields,
like `$1` and `$3`. `$0` is the whole line. Delimiters (field
separators) can be set with `-F`.
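
The example above only uses the middle block. Here is a minimal sketch that uses all three parts and then `-F`. It assumes the columns of `animals.txt` are separated by single spaces and recreates the data in a temporary file; the variable names `tmp`, `summary`, and `late` are only for illustration.

```shell
# Recreate the animals.txt sample in a temporary file (space-separated, an assumption).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
2011-04-22 21:06 Grizzly 36
2011-04-23 14:12 Elk 25
2011-04-23 10:24 Elk 26
2011-04-23 20:08 Wolverine 31
2011-04-23 18:46 Muskox 20
EOF

# BEGIN runs once before any input, the middle block runs on every line,
# and END runs once after the last line. NR holds the number of lines read.
summary=$(awk 'BEGIN { total = 0 } { total += $4 } END { print NR, total }' "$tmp")
echo "$summary"    # 5 138

# -F sets the field separator. Splitting on a space or a colon makes the
# hour field $2 and the animal name field $4.
late=$(awk -F'[ :]' '$2 >= 20 { print $4 }' "$tmp")
echo "$late"       # Grizzly, then Wolverine

rm -f "$tmp"
```

Initialising `total` in BEGIN is optional, since awk treats unset variables as zero, but it makes the three-part structure visible.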

## Bioawk

Bioawk is just like awk, but instead of mapping columns to numbered
variables, it maps the fields of bioinformatics formats to named
variables (like the FASTA/FASTQ name and sequence).

You can count sequences very effectively with bioawk, because awk
updates the built-in variable `NR` (number of records):

    bioawk -cfastx 'END{print NR}' test.fastq

That is just the beginning. To make a tab-delimited table of names
and sequence lengths, you could do:

    bioawk -cfastx '{print $name, length($seq)}' test-trimmed.fastq

Or maybe you want to see how many of the trimmed sequences are now
shorter than 80 bp:

    bioawk -cfastx 'BEGIN{ shorter = 0} {if (length($seq) < 80) shorter += 1} END {print "shorter sequences", shorter}' test-trimmed.fastq

bioawk can also take other input formats:

    bed:
        1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
    sam:
        1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
    vcf:
        1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
    gff:
        1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
    fastx:
        1:name 2:seq 3:qual 4:comment


--------------------------------------------------------------------------------
/animals.txt:
--------------------------------------------------------------------------------
2011-04-22 21:06 Grizzly 36
2011-04-23 14:12 Elk 25
2011-04-23 10:24 Elk 26
2011-04-23 20:08 Wolverine 31
2011-04-23 18:46 Muskox 20

--------------------------------------------------------------------------------