├── README.md
├── animals.txt
├── test-trimmed.fastq
└── test.fastq


/README.md:
--------------------------------------------------------------------------------
 1 | # A Quick bioawk tutorial
 2 | 
 3 | There was some interest in bioawk, a useful awk fork for handling
 4 | bioinformatics formats at the UC Davis Software Carpentry course, so
 5 | here is a quick tutorial.
 6 | 
 7 | ## Concepts
 8 | 
 9 | Don't write your own FASTA/FASTQ parsers! FASTA is much easier to
10 | parse than FASTQ, but *code reuse* still matters. FASTQ is a very hard
11 | format to parse *safely* and *quickly*. See
12 | [this post](http://www.biostars.org/p/10353/#11256) for more info on
13 | how tricky it can be to parse FASTQ.
14 | 
15 | 
16 | Heng Li (author of samtools, bwa) has written a nice set of parsers
17 | for different languages called
18 | [readfq](https://github.com/lh3/readfq). bioawk is another of Heng
19 | Li's tools.
20 | 
21 | ## Installing 
22 | 
23 | For this quick tutorial, let's just clone bioawk and install it to
24 | your `/usr/local/bin/`:
25 | 
26 |     git clone https://github.com/lh3/bioawk.git && cd bioawk && make && mv awk bioawk && sudo cp bioawk /usr/local/bin/
27 | 
28 | 
29 | ## What's awk?
30 | 
31 | awk is an old programming language that is useful for processing
32 | *streams* of data. Briefly, a typical awk program has three parts:
33 | BEGIN, middle, and END. The BEGIN block runs before any data is read,
34 | the middle runs on each line, and the END block runs after all lines
35 | have been processed. For example, we could do:
36 | 
37 |     $ cat animals.txt
38 |     2011-04-22 21:06 Grizzly 36
39 |     2011-04-23 14:12 Elk 25
40 |     2011-04-23 10:24 Elk 26
41 |     2011-04-23 20:08 Wolverine 31
42 |     2011-04-23 18:46 Muskox 20
43 | 
44 |     $ awk '{ if ($4 > 26) print $3 }' animals.txt
45 |     Grizzly
46 |     Wolverine
47 | 
48 | awk splits each line into fields and numbers them: `$1`, `$3`, and so
49 | on. `$0` is the whole line. The field separator can be set with `-F`.
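
To make the three parts concrete, here is a minimal sketch that uses all of them on the sample data. The helper names `sample_animals` and `sum_counts` are made up for this example, and the data is printed from a here-document so the snippet runs without any files:

```shell
# Print the contents of animals.txt so the example is self-contained.
sample_animals() {
cat <<'EOF'
2011-04-22 21:06 Grizzly 36
2011-04-23 14:12 Elk 25
2011-04-23 10:24 Elk 26
2011-04-23 20:08 Wolverine 31
2011-04-23 18:46 Muskox 20
EOF
}

# BEGIN runs once before any input, the unlabelled block runs on every
# line, and END runs once after the last line has been read.
sum_counts() {
    awk 'BEGIN { total = 0 }
         { total += $4 }
         END { print NR " lines, total " total }'
}

sample_animals | sum_counts
```

This should print `5 lines, total 138`. awk treats unset variables as zero, so the BEGIN block here is only for clarity.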
50 | 
51 | ## Bioawk
52 | 
53 | Bioawk is just like awk, but instead of only mapping columns to
54 | numbered fields, it also maps bioinformatics format fields to named
55 | variables (like `$name` and `$seq` for FASTA/FASTQ).
56 | 
57 | You can count sequences easily with bioawk, because each sequence is
58 | one record and awk updates the built-in variable `NR` (number of records):
59 | 
60 |     bioawk -cfastx 'END{print NR}' test.fastq
61 | 
62 | But this is just the beginning. If you wanted to make a tab-delimited
63 | table of names and sequence lengths, you could do:
64 | 
65 |     bioawk -cfastx '{print $name, length($seq)}' test-trimmed.fastq
66 | 
67 | Or maybe you want to see how many sequences are now shorter than
68 | 80bp?
69 | 
70 |     bioawk -cfastx 'BEGIN{ shorter = 0} {if (length($seq) < 80) shorter += 1} END {print "shorter sequences", shorter}' test-trimmed.fastq
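
The same count can also be written with awk's pattern-action form instead of an explicit `if`. The sketch below uses plain awk on a made-up name/length table (the kind of output the earlier `print $name, length($seq)` command produces), so it runs without bioawk; the function name `count_shorter` and the read names are hypothetical. In bioawk itself the pattern would be `length($seq) < 80`.

```shell
# Count rows whose second column (sequence length) is below 80.
count_shorter() {
    # Pattern-action form: the block runs only on lines where the pattern
    # is true. "+ 0" makes END print 0 instead of an empty string when
    # nothing matched.
    awk -F'\t' '$2 < 80 { shorter++ }
                END { print "shorter sequences", shorter + 0 }'
}

# Hypothetical name/length table: two of the four reads are under 80bp.
printf 'read1\t100\nread2\t64\nread3\t79\nread4\t80\n' | count_shorter
```

With the four sample rows this should print `shorter sequences 2`.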
71 | 
72 | bioawk can also take other input formats (run `bioawk -c help` to list them):
73 | 
74 |     bed:
75 |         1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
76 |     sam:
77 |         1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
78 |     vcf:
79 |         1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
80 |     gff:
81 |         1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
82 |     fastx:
83 |         1:name 2:seq 3:qual 4:comment
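
To see what the named fields buy you, here is roughly what you would write in plain awk to get readable names for the first three BED columns. This is only a sketch of the idea, not how bioawk is implemented; the function name `bed_lengths` and the two sample records are made up. With bioawk the assignments disappear and the same thing becomes `bioawk -c bed '{ print $chrom, $end - $start }'`.

```shell
# Print each feature's chromosome and length from tab-separated BED input.
bed_lengths() {
    awk -F'\t' -v OFS='\t' '{
        # Name the columns by hand; bioawk -c bed provides these as
        # $chrom, $start and $end.
        chrom = $1; start = $2; stop = $3
        print chrom, stop - start
    }'
}

# Two hypothetical BED records: chr1:100-250 and chr2:0-75.
printf 'chr1\t100\t250\nchr2\t0\t75\n' | bed_lengths
```

For the two sample records this should print `chr1` and `150` on the first line and `chr2` and `75` on the second, separated by tabs.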
84 | 
85 | 


--------------------------------------------------------------------------------
/animals.txt:
--------------------------------------------------------------------------------
1 | 2011-04-22 21:06 Grizzly 36
2 | 2011-04-23 14:12 Elk 25
3 | 2011-04-23 10:24 Elk 26
4 | 2011-04-23 20:08 Wolverine 31
5 | 2011-04-23 18:46 Muskox 20
6 | 


--------------------------------------------------------------------------------