sample.vcf.gz
30 | ```
31 | The map file should contain "`old_name new_name`" pairs separated by whitespace, one pair per line.
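
For illustration, a hypothetical map file with two entries:

```
old_name_1 new_name_1
old_name_2 new_name_2
```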
32 |
--------------------------------------------------------------------------------
/misc/git.md:
--------------------------------------------------------------------------------
1 | # Git tips
2 |
3 | - [Pro git book](https://git-scm.com/book/en/v2)
4 | - https://github.com/Kunena/Kunena-Forum/wiki/Create-a-new-branch-with-git-and-manage-branches
5 | - https://nvie.com/posts/a-successful-git-branching-model/
6 | - http://sandofsky.com/blog/git-workflow.html
7 | - https://blog.izs.me/2012/12/git-rebase
8 | - https://benmarshall.me/git-rebase/
9 | - [find big files in history](https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history)
10 | - [remove a big file from history](https://www.czettner.com/2015/07/16/deleting-big-files-from-git-history.html)
11 | - [git-tips](https://github.com/git-tips/tips)
12 |
13 |
14 | # merge master branch into (empty) main and delete master
15 | ```
16 | module load git
17 | git fetch origin main
18 | git branch -a
19 | git checkout main
20 | git merge master --allow-unrelated-histories
21 | git add -A .
22 | git commit
23 | git push
24 |
25 | git branch -d master
26 | git push origin :master
27 | ```
28 |
29 | # Add remote upstream
30 | ```bash
31 | git remote -v
32 | git remote add upstream https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git
33 | ```
34 |
35 | # Create a tag in the upstream
36 | ```bash
37 | git fetch upstream
38 | git checkout master
39 | git reset --hard upstream/master
40 | git tag -a -m "project tag (date)" vx.y.z
41 | git push upstream vx.y.z
42 | git push origin vx.y.z
43 | ```
44 |
45 | # Sync with upstream/main, discarding all commits in origin/main
46 | ```
47 | git fetch upstream
48 | git checkout main
49 | git reset --hard upstream/main
50 | git push --force
51 | ```
52 |
53 | # Sync with upstream/main
54 | ```
55 | git fetch upstream
56 | git checkout main
57 | git merge upstream/main
58 | ```
59 | # big feature workflow - rebase - squash
60 | ```
61 | # sync master with upstream first
62 | # create new branch and switch to it
63 | git checkout -b feature1
64 | # create many commits with meaningful messages
65 | git add -A .
66 | git commit
67 | # upstream accumulated some commits
68 | git fetch upstream
69 | # rebasing the branch not the master
70 | # to PR from the branch later not from the master
71 | # automatic rebase - replay all commits on top of master
72 | git rebase upstream/master
73 |
74 | # alternative - interactive rebase
75 | # 1. see latest commits from HEAD down to the start of feature1
76 | # on top of upstream
77 | # git log --oneline --decorate --all --graph
78 | # 2. interactive rebase for the last 13 commits (including head)
79 | # git rebase -i HEAD~13
80 | # set s (squash) in the interactive editor for all commits except for the top one
81 | # alter commit message
82 |
83 | # force push, since origin still has the 13 separate commits
84 | git push --force --set-upstream origin feature1
85 | # PR from feature1 branch to upstream/master
86 | ```
87 |
88 | # 2 Feature workflow
89 | ```
90 | git checkout -b feature1
91 | git add -A .
92 | git commit
93 | git push --set-upstream origin feature1
94 | # pull request1
95 | git checkout master
96 | git checkout -b feature2
97 | git add -A .
98 | git commit
99 | git push --set-upstream origin feature2
100 | # pull request 2
101 | ```
102 |
103 | # Feature workflow w squash
104 | ```
105 | git checkout -b feature_branch
106 | # 1 .. N
107 | git add -A .
108 | git commit -m "sync"
109 |
110 | git checkout master
111 | git merge --squash feature_branch
112 | git commit -v
113 | git push
114 | # pull request to upstream
115 | # code review
116 | # request merged
117 | git branch -d feature_branch
118 | git push origin :feature_branch
119 | ```
120 |
121 | # get commits from maintainers in a pull request and push back
122 | ```
123 | git fetch upstream pull/[PR_Number]/head:new_branch
124 | git checkout new_branch
125 | git add -A .
126 | git commit
127 | git push --set-upstream origin new_branch
128 | ```
129 |
130 | # ~/.ssh/config
131 | ```
132 | Host github.com
133 | HostName github.com
134 | PreferredAuthentications publickey
135 | IdentityFile ~/.ssh/id_rsa_git
136 | User git
137 | ```
138 |
139 | # Migrating github.com repos to [code.harvard.edu](https://code.harvard.edu/)
140 |
141 | See [this page](https://gist.github.com/niksumeiko/8972566) for good general guidance
142 |
143 | 1. Set up your ssh keys. You can use your old keys (if you remember your passphrase) by going to `Settings --> SSH and GPG keys --> New SSH key`
144 | 2. Create your repo in code.harvard.edu. Copy the 'Clone with SSH' link: `git@code.harvard.edu:HSPH/repo_name.git` (*NOTE: some of us have had trouble with the HTTPS link*)
145 | 3. Go to your local repo that you would like to migrate. Enter the directory.
146 |
147 | ```
148 | # this will add a second remote location
149 | git remote add harvard git@code.harvard.edu:HSPH/repo_name.git
150 |
151 | # this will push all branches to the new harvard remote
152 | git push -u harvard --all
153 | ```
154 |
155 | 4. You should see the contents of your local repo in Enterprise. Now go to 'Settings' for the repo and 'Collaborators and Teams'. Here you will need to add Bioinformatics Core and give 'Admin' privileges.
156 |
157 |
158 | > **NOTE:** If you decide to compile all your old repos into one giant repo (i.e. [hbc_mistrm_reports_legacy](https://code.harvard.edu/HSPH/hbc_mistrm_reports_legacy)), make sure that you remove all `.git` folders from each of them before committing. Otherwise you will not be able to see the contents of each folder on Enterprise.
159 |
160 | # Remove sensitive information from the file and from the history
161 | ```
162 | # Make a backup
163 | # cd ~/backup
164 | # git clone git@github.com:hbc/knowledgebase.git
165 | cd ~/work
166 | git clone git@github.com:hbc/knowledgebase.git
167 | git filter-branch --tree-filter 'rm -f admin/download_data.md' HEAD
168 | git push --force-with-lease origin master
169 | # commit saved copy of download_data.md without secrets
170 | ```
171 |
--------------------------------------------------------------------------------
/misc/miRNA.md:
--------------------------------------------------------------------------------
1 | * https://github.com/lpantano/bcbioSmallRna
2 |
--------------------------------------------------------------------------------
/misc/mounting_o2_mac.md:
--------------------------------------------------------------------------------
1 | ## For OSX
2 |
3 | To have O2 accessible on your laptop/desktop as a folder, you need to use something called [`sshfs`](https://en.wikipedia.org/wiki/SSHFS) (ssh filesystem). This is a command that is not native to OSX, and you need to go through several steps in order to get it. Once you have `sshfs`, you then need to set up ssh keys to connect O2 to your laptop without having to type in a password.
4 |
5 | ### 1. Installing sshfs on OSX
6 |
7 | Download macFUSE from [https://github.com/osxfuse/osxfuse/releases](https://github.com/osxfuse/osxfuse/releases/download/macfuse-4.6.0/macfuse-4.6.0.dmg), and install it.
8 |
9 | NOTE: In order to install macFUSE, you may need to first enable system extensions, following [this guideline from Apple](https://support.apple.com/guide/mac-help/change-security-settings-startup-disk-a-mac-mchl768f7291/mac), which will require restarting your computer.
10 |
11 | Download sshfs from [https://github.com/osxfuse/sshfs/releases](https://github.com/osxfuse/sshfs/releases/download/osxfuse-sshfs-2.5.0/sshfs-2.5.0.pkg), and install it.
12 |
13 | > #### Use this only if the above option fails!
14 | >
15 | > Step 1. Install [Xcode](https://developer.apple.com/xcode/)
16 | > ```bash
17 | > $ xcode-select --install
18 | > ```
19 | >
20 | > Step 2. Install Homebrew using ruby (from Xcode)
21 | > ```bash
22 | > $ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
23 | >
24 | > # Uninstall Homebrew
25 | > # /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/uninstall)"
26 | > ```
27 | >
28 | > Step 2.1. Check to make sure that Homebrew is working properly
29 | > ```bash
30 | > $ brew doctor
31 | > ```
32 | >
33 | > Step 3. Install Cask from Homebrew's caskroom
34 | > ```bash
35 | > $ brew tap caskroom/cask
36 | > ```
37 | >
38 | > Step 4. Install OSXfuse using Cask
39 | > ```bash
40 | > $ brew cask install osxfuse
41 | > ```
42 | >
43 | > Step 5. Install sshfs from fuse
44 | > ```bash
45 | > $ brew install sshfs
46 | > ```
47 |
48 | ### 2. Set up "ssh keys"
49 |
50 | Once `sshfs` is installed, the next step is to connect O2 (or a remote server) to our laptops. To make this process seamless, first set up ssh keys which can be used to connect to the server without having to type in a password every time.
51 |
52 | First, generate the keys on your local computer with the commands below. Then log into O2, use `vim` to open `~/.ssh/authorized_keys`, paste in the contents of `id_rsa.pub` (copied with the `pbcopy` command further down), and save the file. NOTE: make sure to replace `ecommonsID` with your actual username!
53 |
54 | ```bash
55 | # set up ssh keys
56 | $ ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -C "ecommonsID"
57 | $ ssh-add -K ~/.ssh/id_rsa
58 | ```
59 |
60 | Arguments for `ssh-keygen`:
61 | * `-t` = Specifies the type of key to create. The possible values are "rsa1" for protocol version 1 and "rsa" or "dsa" for protocol version 2. *We want rsa.*
62 | * `-b` = Specifies the number of bits in the key to create. For RSA keys, the minimum size is 768 bits and the default is 2048 bits. *We want 4096*
63 | * `-f` = name of output "keyfile"
64 | * `-C` = Provides a new comment
65 |
66 | Arguments for `ssh-add`:
67 | * `-K` = Store passphrases in your keychain
68 |
69 | ```bash
70 | # copy the contents of `id_rsa.pub` to ~/.ssh/authorized_keys on O2
71 | $ cat ~/.ssh/id_rsa.pub | pbcopy
72 | ```
73 |
74 | > `pbcopy` puts the output of `cat` into the clipboard (in other words, it is equivalent to copying with ctrl + c) so you can just paste it as usual with ctrl + v.
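
If you prefer not to copy and paste by hand, a one-liner run from your laptop can append the key for you (a sketch; replace `ecommonsID` with your username):

```bash
# append your public key to authorized_keys on O2 (asks for your password once)
cat ~/.ssh/id_rsa.pub | ssh ecommonsID@o2.hms.harvard.edu 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
```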
75 |
76 | ### 3. Mount O2 using sshfs
77 |
78 | Now, let's set up for running `sshfs` on our laptops (local machines) by creating a folder with an intuitive name, into which your home directory on the cluster will be mounted.
79 |
80 | ```bash
81 | $ mkdir ~/O2_mount
82 | ```
83 |
84 | Finally, let's run the `sshfs` command to have O2 mount as a folder in the above space. Again, replace `ecommonsID` with your username.
85 | ```bash
86 | $ sshfs ecommonsID@transfer.rc.hms.harvard.edu:. ~/O2_mount -o volname="O2" -o compression=no -o Cipher=arcfour -o follow_symlinks
87 | ```
88 |
89 | Now we can browse through our home directory on O2 as though it was a folder on our laptop.
90 |
91 | > If you want to access your lab's directory in `/groups/` or your directory in `/n/scratch2`, you will need to create sym links to those in your home directory and you will be able to access those as well.
92 |
93 | Once you are finished using O2 in its mounted form, you can cancel the connection using `umount` and the name of the folder.
94 |
95 | ```bash
96 | $ umount ~/O2_mount
97 | ```
98 |
99 | ### 4. Set up alias (optional)
100 |
101 | Optionally, you can set up shorter commands with `alias` for establishing and canceling the `sshfs` connection. Use `vim` to create or open `~/.bashrc`, paste in the following `alias` commands, and save the file.
102 |
103 | ```bash
104 | $ alias mounto2='sshfs ecommonsID@transfer.rc.hms.harvard.edu:. ~/O2_mount -o volname="O2" -o follow_symlinks'
105 | $ alias umounto2='umount ~/O2_mount'
106 | ```
107 |
108 | > If your default shell is `zsh` instead of `bash`, use `vim` to create or open `~/.zshrc` and paste the `alias` commands.
109 |
110 | Update changes in `.bashrc`
111 |
112 | ```bash
113 | $ source .bashrc
114 | ```
115 | Now we can type `mounto2` and `umounto2` to mount and unmount O2.
116 |
117 | ***
118 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
119 |
--------------------------------------------------------------------------------
/misc/mtDNA_variants.md:
--------------------------------------------------------------------------------
1 | # SNV and indels
2 | - when starting from WGS or WES, subset MT chromosome
3 | - estimate coverage, callable >=100X: https://github.com/naumenko-sa/bioscripts/blob/master/scripts/bam.coverage.bamstats05.sh
4 | - use template for bcbio: https://github.com/bcbio/bcbio-nextgen/pull/3059
5 |
6 | # Large deletions
7 | - MitoDel: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657046/
8 | - eKLIPse: https://www.ncbi.nlm.nih.gov/pubmed/30393377
9 |
10 | # Databases
11 | - https://www.mitomap.org/foswiki/bin/view/MITOMAP/WebHome
12 | - https://www.mitomap.org/foswiki/bin/view/MITOMAP/TopVariants
13 | - mvTool V2: https://mseqdr.org/mv.php
14 | - MSeqDR, ClinVar, ICGC, COSMIC
15 |
16 |
--------------------------------------------------------------------------------
/misc/multiomics_factor_analysis.md:
--------------------------------------------------------------------------------
1 | # Uploading a program called MOFA2
2 | ## It is used for finding factors from multiomics datasets.
3 |
4 | The author presented an example usage in scRNAseq & scATACseq as well as other datasets (e.g. bulkRNAseq with proteomics).
5 |
6 | I think it will be nice to look over.
7 |
8 | https://biofam.github.io/MOFA2/
9 |
10 | Nice Review article of integrating multiomics, by one of the presenters.
11 |
12 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7034308/
13 |
14 | iOmicsPASS - co-expression-based data integration using networks
15 |
16 | https://www.nature.com/articles/s41540-019-0099-y
17 |
--------------------------------------------------------------------------------
/misc/new_to_remote_github_CLI_start_here.md:
--------------------------------------------------------------------------------
1 | # Putting local git repos into the HBC Github organization remotely via command line interface
2 |
3 | Heather Wick
4 |
5 | Has your experience with github primarily been through the browser? This document has the basics to begin turning your working directories into github repositories which can be pushed to the HBC location remotely via command line.
6 |
7 | ### Wait, back up, what do you mean by push?
8 |
9 | Confused by push, pull, branch, main, commit? If you're not sure, it's worthwhile to familiarize yourself with the basics of git/github. There are some great resources and tutorials to learn from out there. Here's an interactive one (I could only get it to work in safari, not chrome):
10 | https://learngitbranching.js.org
11 |
12 | This won't teach you how to put your things into the HBC Github organization though.
13 |
14 | ## Set up/configuration
15 |
16 | You only need to do these once!
17 |
18 | ### 1. Configure git locally to link to your github account
19 | Open up a terminal and type
20 |
21 | ```bash
22 | git config --global user.email EMAIL_YOU_USE_TO_SIGN_IN_TO_GITHUB
23 | git config --global user.name YOUR_GITHUB_USERNAME
24 | ```
25 |
26 | ### 2. Make personal access token
27 | Configuring your local git isn't enough, as github is moving away from passwords. You will need to make a personal access token through your github account. Follow the instructions here:
28 | https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
29 |
30 | **Copy this personal access token and save it somewhere or keep the window open for now. You will be prompted to enter your personal access token the first time you type `git push -u origin main`. You will not be able to access this token again!**
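
Optionally, to avoid re-typing the token on every push, you can let git cache it in memory for a while (a sketch; the timeout value is just an example):

```bash
# cache HTTPS credentials (including the personal access token) for one hour
git config --global credential.helper 'cache --timeout=3600'
```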
31 |
32 | ## Creating your git repo
33 |
34 | ### 1. Initialize git repo on the HBC Github via web browser
35 |
36 | I have yet to find a way to do this remotely via CLI, but as far as I can tell this step is a necessary pre-requisite to pushing a local repo to the HBC github. Will update to add CLI if possible.
37 |
38 | Go to https://github.com/hbctraining and click the green "New Repository" button. Initialize a new, empty repository.
39 |
40 | Once you do this, there will be some basic code you can copy under `Quick Setup`, including the `https` location of your repo, which can be used below.
41 |
42 | ### 2. Create a local git repo and push it to the HBC Github via CLI
43 |
44 | In your terminal, navigate to the folder you would like to turn into a github repo and type the following:
45 |
46 | ```bash
47 | echo "# text to add to readme" >> README.md
48 | git init
49 | git add README.md
50 | git commit -m "first commit"
51 | git branch -M main
52 | git remote add origin https://github.com/hbctraining/NAME_OF_REPO.git
53 | git push -u origin main
54 | ```
55 | **You will be prompted to enter your personal access token the first time you type `git push -u origin main`.**
56 |
57 | ## Useful tips/tricks
58 |
59 | If you are doing this in a directory with folders/data/files you don't necessarily want to put on github, you will want to pick/choose what you upload. Here are some tips and notes:
60 |
61 | ### Add all, but exclude some
62 |
63 | **Note: this will not exclude files that were already pushed to the HBC repo! It only unstages them before the commit!**
64 |
65 | The best time to implement this is when you are making your first upload.
66 |
67 | ```bash
68 | git add .
69 | git reset -- path/to/thing/to/exclude
70 | git reset -- path/to/more/things/to/exclude
71 | git commit -m "NAME_OF_COMMIT"
72 | git push -u origin main
73 | ```
74 |
75 | ### Add specific files/folders:
76 |
77 | ```bash
78 | git add path/to/files*
79 | git commit -m "NAME_OF_COMMIT"
80 | git push -u origin main
81 | ```
82 |
83 | ### Add all files/folder except this file/folder:
84 |
85 | ```bash
86 | git add -- . ':!THING_TO_EXCLUDE'
87 | git commit -m "NAME_OF_COMMIT"
88 | git push -u origin main
89 | ```
90 |
91 | ### Remove a file you already pushed to Github
92 |
93 | You might be tempted to just do this in the browser, but be warned! It will break your local repo until you pull from the HBC location. This could be a problem if that was important data you want to continue to store locally. Fortunately, you can "delete" files on the HBC Github without deleting them locally. Here's how:
94 |
95 | ```bash
96 | git rm --cached NAME_OF_FILE
97 | ```
98 |
99 | Or for a folder:
100 | ```bash
101 | git rm -r --cached NAME_OF_FOLDER
102 | ```
103 | **Side effects of resorting to `git rm -r --cached REALLY_IMPORTANT_DATA_DIRECTORY` include anxiety, sweating, heart palpitations, and appeals to spiritual beings**
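
Note that `git rm --cached` only stages the removal; for it to take effect on the HBC Github you still need to commit and push, e.g.:

```bash
git commit -m "stop tracking NAME_OF_FILE"
git push
```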
104 |
105 | ### Check what changes have been made to the current commit
106 |
107 | Very useful to see what will actually be added, removed, etc or if everything is up to date.
108 | ```bash
109 | git status
110 | ```
111 |
112 | ### .gitignore: coming soon
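
Until then, a minimal hypothetical sketch of a `.gitignore` for an analysis directory (all patterns are just examples):

```
# large raw/intermediate data you don't want on Github
*.fastq.gz
*.bam
results/
# R/RStudio clutter
.Rhistory
.RData
```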
113 |
114 |
--------------------------------------------------------------------------------
/misc/organized_papers.md:
--------------------------------------------------------------------------------
1 | # Human genome reference T2T - CHR13 - 2022
2 | - https://www.genome.gov/about-genomics/telomere-to-telomere
3 | - https://www.science.org/doi/pdf/10.1126/science.abj6987
4 | - https://www.science.org/doi/10.1126/science.abl3533
5 | - https://www.science.org/doi/epdf/10.1126/science.abl3533
6 |
7 | # GWAS
8 | - https://nature.com/articles/nrg1521
9 | - https://nature.com/articles/nrg1916
10 | - https://nature.com/articles/nrg2344
11 | - https://nature.com/articles/nrg2544
12 | - https://nature.com/articles/nrg2796
13 | - https://nature.com/articles/nrg2813
14 | - https://nature.com/articles/nrg.2016.142
15 | - https://nature.com/articles/s4157
16 |
17 | # bulk-RNA-seq
18 | - Systematic evaluation of splicing calling tools - 2019: https://academic.oup.com/bib/article/21/6/2052/5648232
19 | - [RPKM/TPM misuse](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7373998/)
20 |
--------------------------------------------------------------------------------
/misc/orphan_improvements.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Improvements for the analysis
3 | description: List of things to try.
4 | category: research
5 | subcategory: orphans
6 | tags: [hbc]
7 | ---
8 |
9 |
10 | 1. Try out Alevin from Salmon for a more principled single-cell quantification (https://www.biorxiv.org/content/early/2018/06/01/335000)
11 | 2. Add retained intron analysis with IRFinder to bcbio-nextgen
12 | 3. See if adding support for grolar to convert pizzly output to something more parseable makes sense. It's an R script and hasn't really been worked on so might not be useable: https://github.com/MattBashton/grolar
13 | 4. Add automatic loading/QC of bcbioSingleCell data from bcbio
14 | 5. Convert bcbio-nextgen singlecell matrices to HDF5 format in bcbio
15 | 6. Swap bcbioSingleCell to read the already-combined matrices for speed purposes
16 | 7. Add bcbioRNASeq template to do DTU usage using DRIMseq (https://f1000research.com/articles/7-952/v1)
17 | 8. Update installed genomes to use newest Ensembl build for RNA-seq for bcbio-supported genomes.
18 |
--------------------------------------------------------------------------------
/misc/power_calc_simulations.md:
--------------------------------------------------------------------------------
1 | Code and sample metadata and data for running simulation based power calculations can be found [here](https://github.com/hbc/power_calc_simulations).
2 |
3 |
4 | This code will use the mean and variance of the data set to derive simulated datasets with a defined number of values with defined fold changes. The simulated data will then be tested to determine precision and recall estimates for the comparison.
5 |
6 |
7 |
8 | *original code written by Lorena Pantano and adapted by John Hutchinson*
9 |
--------------------------------------------------------------------------------
/misc/snakemake-example-pipeline:
--------------------------------------------------------------------------------
1 | ---
2 | title: Example of snakemake pipeline
3 | description: An example of snakemake file to run a pipeline applied to a bunch of files
4 | category: research
5 | subcategory: general_ngs
6 | tags: [snakemake]
7 | ---
8 |
9 |
10 | This file shows how to run a pipeline with snakemake for a bunch of files defined in `SAMPLES` variables.
11 |
12 | It shows how to put together different steps and how they are related to each other.
13 |
14 | The tricky part is to always have a `rule all` step that lists all of the output filenames you want to generate. If you miss
15 | some files that you want to generate and they are not the input of any other step, then that step will not run.
16 |
17 | ```
18 | from os.path import join
19 |
20 | # Globals ---------------------------------------------------------------------
21 |
22 | # Full path to the directory holding the reference FASTA files.
23 | GENOME_DIR = '../reference'
24 |
25 | # Full path to a folder that holds all of your FASTQ files.
26 | FASTQ_DIR = '../rawdata'
27 |
28 | # A Snakemake regular expression matching the forward mate FASTQ files.
29 | SAMPLES, = glob_wildcards(join(FASTQ_DIR, '{sample,[^/]+}_R1_001.fastq.gz'))
30 |
31 | # Patterns for the 1st mate and the 2nd mate using the 'sample' wildcard.
32 | PATTERN_R1 = '{sample}_R1_001.fastq.gz'
33 | PATTERN_R2 = '{sample}_R2_001.fastq.gz'
34 | PATTERN_GENOME = '{sample}.fa'
35 |
36 |
37 | rule all:
38 | input:
39 | index = expand(join(GENOME_DIR, '{sample}.fa.bwt'), sample = SAMPLES),
40 | vcf = expand(join('vcf', '{sample}.vcf'), sample = SAMPLES),
41 | vcfpileup = expand(join('pileup', '{sample}.vcf'), sample = SAMPLES),
42 | sam = expand(join('stats', '{sample}.txt'), sample = SAMPLES)
43 |
44 | rule index:
45 | input:
46 | join(GENOME_DIR, '{sample}.fa')
47 | output:
48 | join(GENOME_DIR, '{sample}.fa.bwt')
49 | shell:
50 | 'bwa index {input}'
51 |
52 | rule map:
53 | input:
54 | genome = join(GENOME_DIR, PATTERN_GENOME),
55 | index = join(GENOME_DIR, '{sample}.fa.bwt'),
56 | r1 = join(FASTQ_DIR, PATTERN_R1),
57 | r2 = join(FASTQ_DIR, PATTERN_R2)
58 | output:
59 | 'bam/{sample}.bam'
60 | shell:
61 | 'bwa mem -c 250 -M -t 6 -v 1 {input.genome} {input.r1} {input.r2} | samtools sort - > {output}'
62 |
63 | rule stats:
64 | input:
65 | bam = 'bam/{sample}.bam'
66 | output:
67 | 'stats/{sample}.txt'
68 | shell:
69 | 'samtools stats {input} > {output}'
70 |
71 | rule pileup:
72 | input:
73 | bam = 'bam/{sample}.bam',
74 | genome = join(GENOME_DIR, PATTERN_GENOME)
75 | output:
76 | 'pileup/{sample}.mp'
77 | shell:
78 | 'samtools mpileup -f {input.genome} -t DP -t AD -d 10000 -u -g {input.bam} > {output}'
79 |
80 |
81 | rule mpconvert:
82 | input:
83 | 'pileup/{sample}.mp',
84 | output:
85 | 'pileup/{sample}.vcf'
86 | shell:
87 | 'bcftools convert -O v {input} > {output}'
88 |
89 |
90 | rule bcf:
91 | input:
92 | 'pileup/{sample}.mp',
93 | output:
94 | 'vcf/{sample}.vcf'
95 | shell:
96 | 'bcftools call -v -m {input} > {output}'
97 |
98 | ```
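
To actually run the pipeline (assuming the rules above are saved as `Snakefile` in the working directory), something like this should work:

```
# dry run: print the jobs that would be executed
snakemake -n
# run for real with up to 8 parallel jobs
snakemake --cores 8
```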
99 |
--------------------------------------------------------------------------------
/python/conda.md:
--------------------------------------------------------------------------------
1 | ## Conda
2 |
3 | Every system has a Python installation, but you don't necessarily want to use that. Why not? That version is typically outdated and configured to support system functions. Most tools require specific versions of Python and dependencies, so you need more flexibility.
4 |
5 | **Solution?**
6 |
7 | Set up a full-stack scientific Python deployment **using a Python distribution** (Anaconda or Miniconda). It is an installation of Python with a set of curated packages which are guaranteed to work together.
8 |
9 |
10 | ## Setting up Python distribution on O2
11 |
12 | You can install it to your home directory, though this is not needed, as O2 has a miniconda module available.
13 |
14 | By default, miniconda and conda envs are installed under user home space.
15 |
16 | ### Conda Environments
17 | Environments allow you to create isolated, reproducible environments where you have fine-tuned control over the Python version, all packages, and configuration. _This is always recommended over using the default environment._
18 |
19 | To create an environment using Python 3.9 and the numpy package:
20 |
21 | ```bash
22 | $ conda create --name my_environment python=3.9 numpy
23 | ```
24 |
25 | Now that you have created it, you need to activate it. Once activated, any tools you install (using `conda install`) are specific to that environment, making it a configured space where you can run analyses reproducibly.
26 |
27 | ```bash
28 | $ conda activate my_environment
29 | ```
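
For example, to install an additional package into the active environment (the package here is just an illustration):

```bash
conda install pandas
```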
30 |
31 | When you are done you can deactivate the environment or close it:
32 |
33 | ```bash
34 | conda deactivate
35 | ```
36 |
37 | The environments and associated libraries are located under `~/miniconda3/envs` by default. Both miniconda and the created environments can occupy a lot of space and max out your home directory!
38 |
39 | **Solution: Create the conda env in another space**
40 |
41 | For conda envs, you can use the full path outside of home when creating env:
42 |
43 | ```bash
44 | module purge
45 | module load miniconda3/23.1.0
46 | conda create -p /path/to/somewhere/not/home/myEnv python=3.9 numpy
47 | ```
48 |
49 | > **NOTE:** It's common that installing packages using Conda is slow or fails because Conda is unable to resolve dependencies. To get around this, we suggest the use of Mamba.
50 |
51 | **Installing lots of dependency packages?**
52 |
53 | You can do this easily by creating a yaml file, for example `environment.yaml` below was used to install Pytables:
54 |
55 | ```yaml
56 | name: pytables
57 | channels:
58 | - defaults
59 | dependencies:
60 | - python=3.9*
61 | - numpy >= 1.19.0
62 | - zlib
63 | - cython >= 0.29.32
64 | - hdf5=1.14.0
65 | - numexpr >= 2.6.2
66 | - packaging
67 | - py-cpuinfo
68 | - python-blosc2 >= 2.3.0
69 | ```
70 |
71 | Now to create the environment we reference the file in the command:
72 |
73 | ```bash
74 | conda env create -f environment.yaml
75 | ```
76 |
77 | ### Channels
78 |
79 | Where do conda packages come from? The packages are hosted on conda “channels”. From the conda pages:
80 |
81 | _"Conda channels are the locations where packages are stored. They serve as the base for hosting and managing packages. Conda packages are downloaded from remote channels, which are URLs to directories containing conda packages. The conda command searches a set of channels."_
82 |
83 | Using `-c` you can specify which channels you want conda to search in for packages.
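
For example, many bioinformatics tools live on the bioconda channel (the package here is just an illustration):

```bash
conda install -c conda-forge -c bioconda samtools
```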
84 |
85 | > Adapted from [An Introduction to Earth and Environmental Data Science](https://earth-env-data-science.github.io/lectures/environment/python_environments.html)
86 |
--------------------------------------------------------------------------------
/r/.Rprofile:
--------------------------------------------------------------------------------
1 | version <- paste0(R.Version()$major,".",R.Version()$minor)
2 | if (version == "3.6.1") {
3 | .libPaths("~/R-3.6.1/library")
4 | } else if (version == "3.5.1") {
5 | .libPaths("~/R-3.5.1/library")
6 | }
7 |
8 | #Add this to your home folder, and make modifications to the version numbers if you need to.
9 | #This will let you load the correct library folders for the different versions you have on O2.
10 | #R-3.6.1 and R-3.5.1 will load their corresponding library path for mine, but feel free to modify them to fit your needs.
11 | # This has been created as ChIPQC was problematic in R-3.6.1 and I had to load R-3.5.1 to generate a html report. (Joon)
12 |
--------------------------------------------------------------------------------
/r/R-tips-and-tricks.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: R tips
3 | description: This code helps with regular data improving efficiency
4 | category: computing
5 | subcategory: tips_tricks
6 | tags: [R, visualization]
7 | ---
8 |
9 | # Import/Export of files
10 | Stop using write.csv, write.table and use the [rio](https://cran.r-project.org/web/packages/rio/index.html) library instead. All rio needs is the file extension to figure out what file type you're dealing with. Easy import and export to Excel files for clients.
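
A minimal sketch (file names are made up):

```r
library(rio)
dat <- import("results.csv")    # format inferred from the extension
export(dat, "results.xlsx")     # write an Excel copy for a client
```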
11 |
12 | # Parsing in R using Tidyverse
13 | This is a link to a nice tutorial from Ista Zahn from IQSS using stringr and tidyverse for parsing files in R. It is from the Computefest 2017 workshop:
14 | http://tutorials-live.iq.harvard.edu:8000/user/zwD2ioESyGbS/notebooks/workshops/R/RProgramming/Rprogramming.ipynb
15 |
16 | # Better clean default ggplot
17 | install cowplot (https://cran.r-project.org/web/packages/cowplot/index.html)
18 | ```r
19 | library(cowplot)
20 | ```
21 |
22 | # Nice looking log scales
23 | Example for x-axis
24 | ```r
25 | library(scales)
26 | p + scale_x_log10(
27 | breaks = scales::trans_breaks("log10", function(x) 10^x),
28 | labels = scales::trans_format("log10", scales::math_format(10^.x))) +
29 | annotation_logticks(sides='b')
30 | ```
31 |
32 | # Read a bunch of files into one dataframe
33 | ```r
34 | library(tidyverse)
35 | read_files = function(files) {
36 | data_frame(filename = files) %>%
37 | mutate(contents = map(filename, ~ read_tsv(.))) %>%
38 | unnest()
39 | }
40 | ```
41 |
42 | # remove a layer from a ggplot2 object with ggedit
43 | ```
44 | p <- plotGeneSaturation(bcb, interestingGroups=NULL) +
45 | ggrepel::geom_text_repel(aes(label=description, color=NULL))
46 | p %>%
47 | ggedit::remove_geom('point', 1) +
48 | geom_point(aes(color=NULL))
49 | ```
50 |
51 | # [Link to information about count normalization methods](https://github.com/hbc/knowledgebase/wiki/Count-normalization-methods)
52 | The images currently break, but I will update when the course materials are in a more permanent state.
53 |
54 | # .Rprofile usefulness
55 | ```R
56 | ## don't ask for CRAN repository
57 | options("repos" = c(CRAN = "http://cran.rstudio.com/"))
58 | ## for the love of god don't open up tcl/tk ever
59 | options(menu.graphics=FALSE)
60 | ## set seed for reproducibility
61 | set.seed(123456)
62 | ## don't print out more than 100 lines at once
63 | options(max.print=100)
64 | ## helps with debugging Bioconductor/S4 code
65 | options(showErrorCalls = TRUE, showWarnCalls = TRUE)
66 |
67 | ## Create a new invisible environment for all the functions to go in
68 | ## so it doesn't clutter your workspace.
69 | .env <- new.env()
70 |
71 | ## ht==headtail, i.e., show the first and last 10 items of an object
72 | .env$ht <- function(d, n=10) rbind(head(d, n), tail(d, n))
73 |
74 | ## copy from clipboard
75 | .env$pbcopy = function(x) {
76 | capture.output(x, file=pipe("pbcopy"))
77 | }
78 |
79 | ## update your local bcbioRNASeq and bcbioSingleCell installations
80 | .env$update_bcbio = function(x) {
81 | devtools::install_github("steinbaugh/basejump")
82 | devtools::install_github("hbc/bcbioBase")
83 | devtools::install_github("hbc/bcbioRNASeq")
84 | devtools::install_github("hbc/bcbioSingleCell")
85 | }
86 |
87 | attach(.env)
88 | ```
89 |
90 | # Make density plot without underline
91 | ```R
92 | ggplot(colData(sce) %>%
93 | as.data.frame(), aes(log10GenesPerUMI)) +
94 | stat_density(geom="line") +
95 | facet_wrap(~period + intervention)
96 | ```
97 |
98 | # Archive a file to Dropbox with a link to it
99 | ```R
100 | ```{r results='asis'}
101 | dropbox_dir = "HSPH/eggan/hbc02067"
102 | archive_data_with_link = function(data, filename, description, dropbox_dir) {
103 | readr::write_csv(data, filename)
104 | links = bcbioBase::copyToDropbox(filename, dropbox_dir)
105 | link = gsub("dl=0", "dl=1", links[[1]]$url)
106 | basejump::markdownLink(filename, link, paste0(" ", description))
107 | }
108 | archive_data_with_link(als, "dexseq-all.csv", "All DEXSeq results", dropbox_dir)
109 | archive_data_with_link(als %>%
110 | filter(padj < 0.1), "dexseq-sig.csv",
111 | "All significant DEXSeq results", dropbox_dir)
112 | ```
113 |
114 | # Novel operators from magrittr
115 | The “%<>%” operator lets you pipe an object to a function and then back into the same object.
116 | So:
117 | `foo <- foo %>% bar()`
118 | is the same as
119 | `foo %<>% bar()`
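
A tiny runnable illustration (the data frame is made up):

```R
library(magrittr)
foo <- data.frame(x = 1:5)
foo %<>% subset(x > 2)   # equivalent to: foo <- foo %>% subset(x > 2)
```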
120 |
121 | # gghelp: Converts a natural language query into a 'ggplot2' command
122 | This [package](https://rdrr.io/github/brandmaier/ggx/) allows users to issue natural language commands
123 | related to theme-related styling of plots (colors, font size and such), which then are translated into
124 | valid 'ggplot2' commands.
125 |
126 | ### Examples:
127 | ```R
128 | gghelp("rotate x-axis labels by 90 degrees")
129 | gghelp("increase font size on x-axis label")
130 | gghelp("set x-axis label to 'Length of Sepal'")
131 | ```
132 |
--------------------------------------------------------------------------------
/r/Shiny_images/Added_tabs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Added_tabs.png
--------------------------------------------------------------------------------
/r/Shiny_images/Adding_panels.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Adding_panels.png
--------------------------------------------------------------------------------
/r/Shiny_images/Adding_theme.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Adding_theme.png
--------------------------------------------------------------------------------
/r/Shiny_images/Altered_action_button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Altered_action_button.png
--------------------------------------------------------------------------------
/r/Shiny_images/Check_boxes_with_action_button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Check_boxes_with_action_button.png
--------------------------------------------------------------------------------
/r/Shiny_images/R_Shiny_hello_world.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/R_Shiny_hello_world.gif
--------------------------------------------------------------------------------
/r/Shiny_images/R_shiny_req_after.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/R_shiny_req_after.gif
--------------------------------------------------------------------------------
/r/Shiny_images/R_shiny_req_initial.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/R_shiny_req_initial.gif
--------------------------------------------------------------------------------
/r/Shiny_images/Return_table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Return_table.png
--------------------------------------------------------------------------------
/r/Shiny_images/Return_text_app_blank.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Return_text_app_blank.png
--------------------------------------------------------------------------------
/r/Shiny_images/Return_text_app_hello.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Return_text_app_hello.png
--------------------------------------------------------------------------------
/r/Shiny_images/Sample_size_hist_100.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Sample_size_hist_100.png
--------------------------------------------------------------------------------
/r/Shiny_images/Sample_size_hist_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Sample_size_hist_5.png
--------------------------------------------------------------------------------
/r/Shiny_images/Shiny_UI_server.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Shiny_UI_server.png
--------------------------------------------------------------------------------
/r/Shiny_images/Shiny_process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Shiny_process.png
--------------------------------------------------------------------------------
/r/Shiny_images/Squaring_number_app.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Squaring_number_app.png
--------------------------------------------------------------------------------
/r/Shiny_images/mtcars_table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/mtcars_table.png
--------------------------------------------------------------------------------
/r/htmlwidgets:
--------------------------------------------------------------------------------
1 | (See http://gallery.htmlwidgets.org/ for more awesome widgets.)
2 |
3 | Using some basic R libraries, you can set up some interactive visualizations without using R Shiny.
4 |
5 | Here is some example code illustrating what I am thinking about, using the iris dataset from R
6 |
7 | `library(crosstalk)`
8 | `library(lineupjs)`
9 | `library(d3scatter)`
10 |
11 | `shared_iris = SharedData$new(iris)`
12 | `d3scatter(shared_iris, ~Petal.Length, ~Petal.Width, ~Species, width="100%")`
13 | `lineup(shared_iris, width="100%")`
14 |
15 | Similarly, the morpheus.js html widget makes for fantastic, interactive heatmaps.
16 | `library(morpheus)`
17 |
18 | `rowAnnotations <- data.frame(annotation1=1:32, annotation2=sample(LETTERS[1:3], nrow(mtcars), replace = TRUE))`
19 | `morpheus(mtcars, colorScheme=list(scalingMode="fixed", colors=heat.colors(3)), rowAnnotations=rowAnnotations, overrideRowDefaults=FALSE, rows=list(list(field='annotation2', highlightMatchingValues=TRUE, display=list('color'))))`
20 |
--------------------------------------------------------------------------------
/rc/O2-tips.md:
--------------------------------------------------------------------------------
1 | # O2 tips
2 |
3 | ## Making conda not slow down your login
4 | If you have a complex base environment that gets loaded on login, you can end up having freezes of 30 seconds or more when
5 | logging into O2. It is ultra annoying. You can fix this by not running the `_conda_setup` script in your .bashrc, like this:
6 |
7 | ```bash
8 | # >>> conda initialize >>>
9 | # !! Contents within this block are managed by 'conda init' !!
10 | #__conda_setup="$('/home/rdk4/local/share/bcbio/anaconda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
11 | #if [ $? -eq 0 ]; then
12 | # eval "$__conda_setup"
13 | #else
14 | if [ -f "/home/rdk4/local/share/bcbio/anaconda/etc/profile.d/conda.sh" ]; then
15 | . "/home/rdk4/local/share/bcbio/anaconda/etc/profile.d/conda.sh"
16 | else
17 | export PATH="/home/rdk4/local/share/bcbio/anaconda/bin:$PATH"
18 | fi
19 | #fi
20 | #unset __conda_setup
21 | # <<< conda initialize <<<
22 | ```
23 |
24 | ## Interactive function to request memory and hours
25 |
26 | Can be added to .bashrc (or if you don't want to clutter it, put it in .o2_aliases and then source it from .bashrc)
27 |
28 | Defaults: 4G mem, 8 hours.
29 | ```
30 | function interactive() {
31 | mem=${1:-4}
32 | hours=${2:-8}
33 |
34 | srun --pty -p interactive --mem ${mem}G -t 0-${hours}:00 /bin/bash
35 | }
36 | ```
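
Usage (the values are just examples):

```bash
interactive        # defaults: 4G for 8 hours
interactive 8 12   # 8G for 12 hours
```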
37 |
38 |
--------------------------------------------------------------------------------
/rc/O2_portal_errors.md:
--------------------------------------------------------------------------------
1 | # O2 Portal - R
2 |
3 | These are common errors found when running R on the O2 portal and ways to fix them.
4 |
5 | ## How to launch Rstudio
6 |
7 | - Besides your private R library, there is now a platform-wide shared R library; add this path to the *Shared R Personal Library* section:
8 | "/n/data1/cores/bcbio/R/library/4.3.1" (RNAseq and scRNAseq)
9 | - Minimum modules to load for R4.3.*: `cmake/3.22.2 gcc/9.2.0 R/4.3.1 `
10 | - Minimum modules to load for 4.2.1 single cell analyses (some might be specific to trajectory analysis):
11 | `gcc/9.2.0 imageMagick/7.1.0 geos/3.10.2 cmake/3.22.2 R/4.2.1 fftw/3.3.10 gdal/3.1.4 udunits/2.2.28`
12 | - Sometimes specific nodes work better: under "Slurm Custom Arguments": `-x compute-f-17-[09-25]`
13 |
14 | # Issues
15 |
16 | ## Issue 1 - You can make a session and open Rstudio on O2 but cannot actually type.
17 |
18 | Potential solution: Make a new session and put the following under "Slurm Custom Arguments":
19 | ```
20 | -x compute-f-17-[09-25]
21 | ```
22 |
23 | ## Issue 2 - Everything was fine but then you lost connection.
24 |
25 | When you attempt to reload you see:
26 |
27 |
28 |
29 |
30 |
31 | Potential solutions: Refresh your interactive sessions page first then refresh your R page.
32 | If that doesn't work close your R session and re-open from the interactive sessions page.
33 | If that doesn't work wait 5-10 min then repeat.
34 |
35 | ## Issue 3 - You made a session but cannot connect
36 |
37 | When you attempt to connect you see:
38 |
39 |
40 |
41 |
42 |
43 | Potential solutions: This error indicates that either you did not load a gcc module or you loaded the incorrect one for the version of R you are running.
44 | Kill the current session and start a new one with the correct gcc loaded in the modules to be loaded tab.
45 |
46 | ## Issue 4 - When you finally refresh your environment is gone (THE WORST)
47 |
48 | What happened is you ran out of memory and R restarted itself behind the scenes. You will NOT get an error message for this of any kind. The best thing to do is quit your session and restart a new one with more memory.
49 |
50 | ## Issue 5 - Crashing
51 |
52 | There have also been previous issues with O2 Portal RStudio crashing: "the compute-f architecture is not good enough and this part of the process fails because (maybe) it was built/installed on a newer node".
53 | Solution: add a flag when you start the session to exclude those nodes: `-x compute-f-17-[09-25]`
54 |
55 | ## Issue 6 - commands using cores fail
56 |
57 | ```
58 | Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
59 | In addition: Warning message:
60 | In mclapply(X, function(...) { :
61 | scheduled cores 1, 2 did not deliver results, all values of the jobs will be affected
62 | ```
63 |
--------------------------------------------------------------------------------
/rc/arrays_in_slurm.md:
--------------------------------------------------------------------------------
1 |
2 | # Arrays in Slurm
3 |
4 | When I am working on large data sets my mind often drifts back to an old Simpsons episode. Bart is in France and being taught to pick grapes. They show him a detailed technique and he does it successfully. Then they say:
5 |
6 | *We've all been here*
13 |
14 |
15 | A pipeline or process may seem easy or fast when you have 1-3 samples but totally daunting when you have 50. When scaling up you need to consider file overwriting, computational resources, and time.
16 |
17 | One easy way to scale up is to use the array feature in slurm.
18 |
19 | ## What is a job array?
20 |
21 | Atlassian says this about job arrays on O2: "Job arrays can be leveraged to quickly submit a number of similar jobs. For example, you can use job arrays to start multiple instances of the same program on different input files, or with different input parameters. A job array is technically one job, but with multiple tasks." [link](https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Job-Arrays).
22 |
23 | Array jobs run simultaneously rather than one at a time which means they are very fast! Additionally, running a job array is very simple!
24 |
25 | ```bash
26 | sbatch --array=1-10 my_script.sh
27 | ```
28 |
29 | This will run my_script.sh 10 times with the job IDs 1,2,3,4,5,6,7,8,9,10
30 |
31 | We can also put this directly into the bash script itself (although we will continue with the command line version here).
32 | ```bash
33 | #SBATCH --array=1-10
34 | ```
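
Written out as a full submission script header, it might look like the sketch below (partition, time, and memory are only examples). The `%A`/`%a` patterns in the log filename expand to the parent job ID and the task ID, so each array task gets its own log file:

```bash
#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-02:00
#SBATCH --mem 4G
#SBATCH --array=1-10
#SBATCH -o slurm-%A_%a.out

# each of the 10 tasks sees its own value of SLURM_ARRAY_TASK_ID
echo "This is task ${SLURM_ARRAY_TASK_ID}"
```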
35 |
36 | We can specify any job IDs we want.
37 |
38 | ```bash
39 | sbatch --array=1,7,12 my_script.sh
40 | ```
41 | This will run my_script.sh 3 times with the job IDs 1,7,12
42 |
43 | Of course, we don't want to run the same job on the same input files over and over; that would be pointless. We can use the job IDs within our script to specify different input or output files. In bash, the job ID is available in the special variable `${SLURM_ARRAY_TASK_ID}`.
44 |
45 |
46 | ## How can I use ${SLURM_ARRAY_TASK_ID}?
47 |
48 | The value of `${SLURM_ARRAY_TASK_ID}` is simply the job ID. If I run
49 |
50 | ```bash
51 | sbatch --array=1,7 my_script.sh
52 | ```
53 | This will start two jobs, one where `${SLURM_ARRAY_TASK_ID}` is 1 and one where it is 7
54 |
55 | There are several ways we can use this. If we plan ahead and name our files with these numbers (e.g., sample_1.fastq, sample_2.fastq) we can directly refer to these files in our script: `sample_${SLURM_ARRAY_TASK_ID}.fastq` However, using the ID for input files is often not a great idea as it means you need to strip away most of the information that you might put in these names.
56 |
57 | Instead we can keep our sample names in a separate file and use [awk](awk.md) to pull the file names.
58 |
59 | here is our complete list of long sample names which is found in our file `samples.txt`:
60 |
61 | ```
62 | DMSO_control_day1_rep1
63 | DMSO_control_day1_rep2
64 | DMSO_control_day2_rep1
65 | DMSO_control_day2_rep2
66 | DMSO_KO_day1_rep1
67 | DMSO_KO_day1_rep2
68 | DMSO_KO_day2_rep1
69 | DMSO_KO_day2_rep2
70 | Drug_control_day1_rep1
71 | Drug_control_day1_rep2
72 | Drug_control_day2_rep1
73 | Drug_control_day2_rep2
74 | Drug_KO_day1_rep1
75 | Drug_KO_day1_rep2
76 | Drug_KO_day2_rep1
77 | Drug_KO_day2_rep2
78 | ```
79 |
80 | If we renamed all of these to 1-16, we would lose a lot of information that may be helpful to have on hand. If these are all SAM files and we want to convert them to BAM files, our script could look like this:
81 |
82 | ```bash
83 |
84 | file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)
85 |
86 | samtools view -S -b ${file}.sam > ${file}.bam
87 |
88 | ```
89 |
90 | Since we have sixteen samples we would run this as
91 |
92 | ```bash
93 | sbatch --array=1-16 my_script.sh
94 | ```
95 |
96 | So what is this script doing? `file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)` pulls the line of `samples.txt` that matches the job ID. We then assign that line to a variable called `${file}` and use it to run our command.
97 |
98 | Job IDs can also be helpful for output files or folders. We saw above how we used the job ID to help name our output bam file. But creating and naming folders is helpful in some instances as well.
99 |
100 | ```bash
101 |
102 | file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)
103 |
104 | PREFIX="Folder_${SLURM_ARRAY_TASK_ID}"
105 | mkdir $PREFIX
106 | cd $PREFIX
107 |
108 | samtools view -S -b ../${file}.sam > ${file}.bam
109 |
110 | ```
111 |
112 | This script differs from our previous one in that it makes a folder named with the job ID (Folder_1 for job ID 1) and then moves inside it to execute the command. Instead of all 16 of our BAM files being output in a single folder, each of them will be in its own folder, labeled Folder_1 to Folder_16.
113 |
114 | **NOTE:** we define `${file}` BEFORE we move into our new folder, as `samples.txt` is only present in the main directory.
115 |
116 |
117 |
118 |
--------------------------------------------------------------------------------
/rc/connection-to-hpc.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Connecting to hpc from local
3 | description: This code helps with connecting to hpc computers
4 | category: computing
5 | subcategory: tips_tricks
6 | tags: [ssh, hpc]
7 | ---
8 |
9 |
10 | # osx
11 |
12 | Use [Homebrew](http://brew.sh/) to get linux-like functionality on OSX
13 |
14 | Use [XQuartz](https://www.xquartz.org/) for X11 window functionality in OSX.
15 |
16 | # Odyssey with 2FA
17 | Enter one time password into the current window (https://github.com/jwm/os-x-otp-token-paster)
18 |
19 | # Fix 'Warning: No xauth data; using fake authentication data for X11 forwarding'
20 | Add this to your ~/.ssh/config on your OSX machine:
21 |
22 | ```
23 | Host *
24 | XAuthLocation /opt/X11/bin/xauth
25 | ```
26 |
27 | # Use ssh keys on remote server
28 | This will add your key to the OSX keychain, here your private key is assumed to be named "id_rsa":
29 |
30 | ```
31 | ssh-add -K ~/.ssh/id_rsa
32 | ```
33 |
34 | Now tell ssh to use the keychain. Add this to the ~/.ssh/config on your OSX machine:
35 |
36 | ```
37 | Host *
38 | AddKeysToAgent yes
39 | UseKeychain yes
40 | IdentityFile ~/.ssh/id_rsa
41 | XAuthLocation /opt/X11/bin/xauth
42 | ```
43 |
--------------------------------------------------------------------------------
/rc/ipython-notebook-on-O2.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: IPython notebook on O2
3 | description: How to open up an ipython notebook running on O2
4 | category: computing
5 | subcategory: tips_tricks
6 | tags: [python, ipython, singlecell]
7 | ---
8 |
9 | 1. First connect to O2 and open up an interactive session with all of the cores and memory you want to use. Here I'm connecting to the short queue so I can get more cores to use.
10 |
11 | ```bash
12 | srun -n 8 --pty -p short --mem 64G -t 0-12:00 --x11 /bin/bash
13 | ```
14 |
15 | 2. Note the name of the compute node you are on:
16 |
17 | ```bash
18 | uname --nodename
19 | ```
20 |
21 | 3. Start a jupyter notebook server on a specific port:
22 |
23 | ```bash
24 | jupyter notebook --no-browser --port=1234
25 | ```
26 |
27 | This command will open up a notebook server on port 1234. You might have to pick
28 | a different port if 1234 is being used. Note the token it provides for you, you
29 | will need this token to use your notebook server.
30 |
31 | 4. Create an auto-closing SSH tunnel from your local machine to the jupyter notebook:
32 |
33 | On your local machine do:
34 |
35 | ```bash
36 | ssh -f -L 9999:localhost:9999 o2 -t 'ssh -f -L 9999:localhost:1234 compute-a-16-49 "sleep 60"'
37 | ```
38 |
39 | This sets up two SSH tunnels. The first one connects port 9999 on your laptop to port 9999 on the O2 login node. The second connects port 9999 on the login node to port 1234 on `compute-a-16-49` (the compute node from step 2). This will auto-close the tunnel if you don't connect to it within 60 seconds, and will also close the tunnel when your session is closed.
40 |
41 | 5. Open a web browser and put `localhost:9999` as the address.
42 |
43 | This should now connect you to the jupyter notebook server. It will ask you for
44 | the token. If you put the token in, you can now log in and will be in your
45 | home directory on O2.
46 |
47 | You are now running a notebook server. This is running on just a single core right now; we want to hook up the computing that we reserved. We asked for 8 cores, so we'll set up a cluster with 8 cores. Click on "IPython Clusters", set the number of engines on
48 | "default" to 8, and you will have your notebook connected to the 8 cores.
49 |
50 | 6. Start working!
51 |
52 | You can open up a terminal by going to the Files tab and clicking on new and opening
53 | the terminal. You can start a new notebook by going to the Files tab, clicking on
54 | new and opening a python notebook.
55 |
--------------------------------------------------------------------------------
/rc/jupyter_notebooks.md:
--------------------------------------------------------------------------------
1 | # Jupyter notebooks
2 |
3 | This post is for those who are interested in running notebooks seamlessly on O2. There is well-written documentation about running Jupyter notebooks on O2; you can find it here: https://wiki.rc.hms.harvard.edu/display/O2/Jupyter+on+O2. However, this involves multiple steps, opening a bunch of terminals at times, and, importantly, finding an unused port every time. I found it quite cumbersome and annoying, so I spent some time solving it. It took me a while to nail it down with the help of FAC RC, but they suggested a simpler solution. If you wish to run a Jupyter/R notebook on O2 (where your data sits),
4 | here is what you need to do:
5 |
6 | Install https://github.com/aaronkollasch/jupyter-o2 by `pip install jupyter-o2` on your local terminal.
7 |
8 | Run `jupyter-o2 --generate-config` on the command line.
9 | This will generate the configuration file and tell you where it is located. Comment out the fields that are not needed. Since the configuration file is the key, a template is attached for use; you will need to change the credentials to your own, though.
10 |
11 | You are all set to run notebooks on O2 from your local machine now, without logging into the server.
12 | At your local terminal, run `jupyter-o2 notebook` for Python notebooks. Alternatively, you can run `jupyter-o2 lab` for R/Python.
13 | This will ask you for a passphrase; enter your eCommons password.
14 | Boom!!! You are good to go! Happy Pythoning :):)
15 | If you wish to run R notebooks on O2, refer to this: https://docs.anaconda.com/anaconda/navigator/tutorials/r-lang/
16 |
17 |
18 | # Example code
19 |
20 | Just to add: the HMS-RC documentation suggests using any port over 50000. To give examples of logging into a Jupyter notebook session, I have provided the code below.
21 |
22 | ## Creating a Jupyter notebook
23 |
24 | Log onto a login node
25 |
26 | ```
27 | # Log onto O2 using a specific port - I used '50000' in this instance - you can choose a different port and just replace the 50000 with the number of your specific port
28 | ssh -Y -L 50000:127.0.0.1:50000 ecommons_id@o2.hms.harvard.edu
29 | ```
30 |
31 | Once on the login node, you can start an interactive session specifying the port with `--tunnel`
32 |
33 | ```
34 | # Create interactive session
35 | srun --pty -p interactive -t 0-12:00 --x11 --mem 128G --tunnel 50000:50000 /bin/bash
36 | ```
37 |
38 | Load the modules that you will need
39 |
40 | ```
41 | # Load modules
42 | module load gcc/9.2.0 python/3.8.12
43 | ```
44 |
45 | Create environment for running analysis (example here is for velocity)
46 |
47 | ```
48 | # Create virtual environment (only do this once)
49 | virtualenv velocyto --system-site-packages
50 | ```
51 |
52 | Activate virtual environment
53 |
54 | ```
55 | # Activate virtual environment
56 | source velocyto/bin/activate
57 | ```
58 |
59 | Install Jupyter notebook and any other libraries (only need to do this once)
60 |
61 | ```
62 | # Install jupyter notebook
63 | pip3 install jupyter
64 |
65 | # Install any other libraries needed for analysis (this is for velocity)
66 | pip3 install numpy scipy cython numba matplotlib scikit-learn h5py click
67 | pip3 install velocyto
68 | pip3 install scvelo
69 | ```
70 |
71 | To create a Jupyter notebook run the following (again instead of 50000, use your port #):
72 |
73 | ```
74 | # Start jupyter notebook
75 | jupyter notebook --port=50000 --no-browser
76 | ```
77 |
78 | ## Logging onto an existing notebook
79 |
80 | ```
81 | # Log onto O2 using a specific port - I used '50000' in this instance - you can choose a different port and just replace the 50000 with the number of your specific port
82 | ssh -Y -L 50000:127.0.0.1:50000 ecommons_id@o2.hms.harvard.edu
83 |
84 | # Create interactive session
85 | srun --pty -p interactive -t 0-12:00 --x11 --mem 128G --tunnel 50000:50000 /bin/bash
86 |
87 | # Load modules
88 | module load gcc/9.2.0 python/3.8.12
89 |
90 | # Activate virtual environment
91 | source velocyto/bin/activate
92 |
93 | # Open existing notebook
94 | jupyter notebook name_of_notebook.ipynb --port=50000 --no-browser
95 | ```
96 |
97 | ## Sharing your notebook
98 | To share the contents of your notebook, you can either upload the notebook directly to Github and add your client as a collaborator on the repo, or export the report as a markdown or PDF.
99 |
100 | To export as a PDF, you need to have additional modules loaded and python packages installed:
101 |
102 | ```
103 | module load texlive/2007
104 |
105 | pip3 install Pyppeteer
106 | pip3 install nbconvert
107 | ```
108 |
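A minimal sketch of the conversion itself (the notebook name is a placeholder; `--to pdf` goes through LaTeX and uses the texlive module above, while `--to webpdf` renders through a headless browser via Pyppeteer):

```
# Export a notebook as PDF (LaTeX route), PDF (headless-browser route), or markdown
jupyter nbconvert --to pdf name_of_notebook.ipynb
jupyter nbconvert --to webpdf name_of_notebook.ipynb
jupyter nbconvert --to markdown name_of_notebook.ipynb
```
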
--------------------------------------------------------------------------------
/rc/keepalive.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Transfer files inside cluster
3 | description: This code helps with transferring files inside the cluster
4 | category: computing
5 | subcategory: tips_tricks
6 | tags: [ssh, hpc]
7 | ---
8 |
9 | Useful for file transfers on O2's new transfer cluster (transfer.rc.hms.harvard.edu).
10 |
11 | The nohup command can be prepended to a bash command so that the command keeps running after you log out (or have your connection interrupted).
12 |
13 | From HMS RC:
14 |
15 | `From one of the file transfer systems under transfer.rc.hms.harvard.edu , you can prefix your command with "nohup" to put it in the background and be able to log out without interrupting the process.`
16 |
17 | `For example, after logging in to e.g. the transfer01 host, run your command:`
18 |
19 | `nohup rsync -av /dir1 /dir2`
20 |
21 | `and then log out. rsync will keep running.`
22 |
23 | `To check in on the process later, just remember which machine you ran rsync and you can directly re-login to that system if you like.`
24 |
25 | `For example:`
26 |
27 | `1. ssh transfer.rc.hms.harvard.edu (let's say you land on transfer03), and then:`
28 | `2. ssh transfer01`
29 | `-- from there you can run the "ps" command or however you like to monitor the process.`
30 |
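For example, a quick check on a transfer you started earlier might look like this (assuming the transfer was an rsync started on transfer01):

```
ssh transfer01                  # log back into the machine where the transfer was started
ps -u $USER -f | grep rsync     # confirm the rsync process is still running
```
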
31 |
32 | ## Another option from John
33 | If you run tmux from the login node before you ssh to the transfer node to xfer files, you can drop your connection and then re-attach to your tmux session later. It should still be running your transfer.
34 |
35 | **General steps**
36 | 1) Login to O2
37 | 2) Write down which login node you are on (usually something like login0#)
38 | *at login node*
39 | 3) Start a new tmux session
40 | `tmux new -s myname`
41 | 4) SSH to the transfer node
42 | `ssh user@transfer.rc.hms.harvard.edu`
43 | *on transfer node*
44 | 5) start transfer with rsync, scp etc.
45 | 6) close terminal window without logging out
46 | *time passes*
47 | 7) Login to O2 again
48 | 8) ssh to the login node you wrote down above
49 | `ssh user@login0#`
50 | 9) Reattach to your tmux session
51 | `tmux a -t myname`
52 | 10) Profit
53 |
54 | You can get around having to remember which node you logged into by always logging into the same node. For example, you can add this to your .bash_profile on OSX:
55 | `alias ssho2='ssh -XY -l user login05.o2.rc.hms.harvard.edu'`
56 |
57 |
--------------------------------------------------------------------------------
/rc/manage-files.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Managing files
3 | description: This code helps with managing file names
4 | category: computing
5 | subcategory: tips_tricks
6 | tags: [bash, osx, linux]
7 | ---
8 |
9 | ## How to remove all files except the ones you want:
10 |
11 | One option is to expand the shell's globbing so that `rm` can exclude files (see the sketch below):
12 | `shopt -s extglob`
13 |
14 | Another option is to use find, which removes every regular file except the one you name:
15 | `find . ! -name 'file.txt' -type f -exec rm -f {} +`
16 |
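A minimal sketch of the extglob route (assuming you want to keep only `file.txt` in the current directory):

```
shopt -s extglob        # enable extended globbing patterns such as !(...)
rm -f !(file.txt)       # remove everything in the current directory except file.txt
```
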
17 |
18 | ## Rename files
19 |
20 | The `rename` utility has a lot of good options, but one pretty useful combination removes whitespace and sets everything to lowercase:
21 |
22 | `rename -c --nows `
23 |
24 | ## Use umask to restrict default permissions for users outside of the group
25 |
26 | Set umask 007 in your .bashrc. Then newly created directories will have 770 (rwxrwx---) permissions,
27 | and files will have 660 (rw-rw----).
28 |
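A minimal sketch of the corresponding lines for `~/.bashrc`:

```
# Default permissions: full access for user and group, nothing for others
umask 007
```
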
--------------------------------------------------------------------------------
/rc/openondemand.md:
--------------------------------------------------------------------------------
1 | ## My initial notes on using the FAS-RC Open on Demand virtual desktop system
2 |
3 | ### Steps to get going
4 | - Download and install Cisco connect VPN
5 |
6 | - Log in to the VPN using your FAS-RC credentials and two-factor authentication code
7 |
8 | - Navigate to https://vdi.rc.fas.harvard.edu/pun/sys/dashboard
9 |
10 | - Log in to the page using your FAS-RC user ID and password
11 |
12 | - Click on the Interactive Apps pulldown and select “RStudio Server” under the Server heading. DO NOT select “RStudio Server (bioconductor + tidyverse)”
13 |
14 | Here most of the settings are self-explanatory.
15 | - I have tried out multiple cores (12) and, while I am unsure it is using the full 48 cores parallel::detectCores finds, my simulation did run substantially faster (4 cores = 5% done when “12 cores” was at 75%). I would be interested to hear other people’s experiences and how transparently/well it works with R.
16 |
17 | - Maximum memory allocation is supposed to be 120GB. I haven’t tried asking for more.
18 |
19 | - I’ve been loading the R/4.02-fasrc01 Core R versions. I tried the R/4.02-fasrc Core & gcc 9.3.0 first and ran into package compilation issues.
20 |
21 | - You can set the R_LIBS_USER folder to use which will contain your library packages. Using this approach, I was able to install packages in session, delete the server and come back to the installed packages in a new session. You could theoretically also switch between R versions using this and the version selector.
22 |
23 | - I haven’t tried executing a script before starting RStudio, but theoretically I could see using this to launch a conda environment.
24 |
25 | - I don’t know about reservations but they sound interesting for getting a high mem machine.
26 |
--------------------------------------------------------------------------------
/rc/scheduler.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Alias for cluster jobs stats
3 | description: This code helps with commands related to job submission
4 | category: computing
5 | subcategory: tips_tricks
6 | tags: [bash, hpc]
7 | ---
8 |
9 | # SLURM
10 |
11 | * Useful aliases
12 |
13 | ```bash
14 | alias bjobs='sacct -u ${USER} --format="JobID,JobName%25,NodeList,State,ncpus,start,elapsed" -s PD,R'
15 | alias bjobs_all='sacct -u ${USER} --format="JobID,JobName%25,NodeList,State,ncpus,AveCPU,AveRSS,MaxRSS,MaxRSSTask,start,elapsed"'
16 | ```
17 |
--------------------------------------------------------------------------------
/rc/tmux.md:
--------------------------------------------------------------------------------
1 | Tmux is a great way to work on the server as it allows you to:
2 | 1) keep your session alive.
3 | 2) have multiple named sessions open to firewall tasks/projects.
4 | 3) run multiple windows/command lines from a single login (O2 allows a maximum of 2-3 logins to their system).
5 | 4) quickly spin off windows to do small commands (see 3).
6 |
7 | ### Useful resources
8 |
9 | #### [Tmux cheat sheet](https://tmuxcheatsheet.com/)
10 |
11 |
12 | #### Tmux configuration
13 | Tmux works great but can have some issues upon first use that make it challenging to use for those of us used to a GUI:
14 | a) the default command key is not great.
15 | b) it doesn't work well with a mouse.
16 | c) it doesn't let you copy text easily.
17 | d) it doesn't scroll your window easily.
18 | e) resizing the windows can be challenging.
19 |
20 | The configuration file code below (for `~/.tmux.conf`) should make some of these issues easier:
21 |
22 | set -g default-terminal "screen-256color"
23 | set -g status-bg red
24 | set -g status-fg black
25 |
26 | # Use C-a instead of the default C-b as the Tmux prefix
27 | set-option -g prefix C-a
28 | unbind-key C-b
29 | bind-key C-a send-prefix
30 |
31 |
32 | # Options enable mouse support in Tmux
33 | #set -g terminal-overrides 'xterm*:smcup@:rmcup@'
34 | # For Tmux >= 2.1
35 | #set -g mouse on
36 | # For Tmux <2.1
37 | # Make mouse useful in copy mode
38 | setw -g mode-mouse on
39 | #
40 | # # Allow mouse to select which pane to use
41 | set -g mouse-select-pane on
42 | #
43 | # # Allow mouse dragging to resize panes
44 | set -g mouse-resize-pane on
45 | #
46 | # # Allow mouse to select windows
47 | set -g mouse-select-window on
48 |
49 |
50 | # set colors for the active window
51 | # START:activewindowstatuscolor
52 | setw -g window-status-current-fg white
53 | setw -g window-status-current-bg red
54 | setw -g window-status-current-attr bright
55 | # END:activewindowstatuscolor
56 |
57 |
58 | ## Optional- act more like vim:
59 | #set-window-option -g mode-keys vi
60 | #bind h select-pane -L
61 | #bind j select-pane -D
62 | #bind k select-pane -U
63 | #bind l select-pane -R
64 | #unbind p
65 | #bind p paste-buffer
66 | #bind -t vi-copy v begin-selection
67 | #bind -t vi-copy y copy-selection
68 |
69 |
70 | # moving between panes
71 | # START:paneselect
72 | bind h select-pane -L
73 | bind j select-pane -D
74 | bind k select-pane -U
75 | bind l select-pane -R
76 | # END:paneselect
77 |
78 |
79 | # START:panecolors
80 | set -g pane-border-fg green
81 | set -g pane-border-bg black
82 | set -g pane-active-border-fg white
83 | set -g pane-active-border-bg yellow
84 | # END:panecolors
85 |
86 | # Command / message line
87 | # START:cmdlinecolors
88 | set -g message-fg white
89 | set -g message-bg black
90 | set -g message-attr bright
91 | # END:cmdlinecolors
92 |
93 |
94 | bind-key C-a last-window
95 | setw -g aggressive-resize on
96 |
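# Note: on newer tmux (>= 2.9) the separate -fg/-bg/-attr options above were merged
# into single "-style" options; rough, untested equivalents of the color settings:
# setw -g window-status-current-style fg=white,bg=red,bright
# set -g pane-border-style fg=green,bg=black
# set -g pane-active-border-style fg=white,bg=yellow
# set -g message-style fg=white,bg=black,bright
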
97 | #### [Restoring your tmux session after reboot](https://andrewjamesjohnson.com/restoring-tmux-sessions/)
98 |
--------------------------------------------------------------------------------
/rnaseq/RepEnrich2_guide.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to run repeat enrichment analysis
3 | description: This guide shows how to run RepEnrich2
4 | category: research
5 | subcategory: rnaseq
6 | tags: [annotation]
7 | ---
8 |
9 | RepEnrich2 tries to look at something that standard RNA-seq pipelines miss, the
10 | enrichment of repeats in NGS data. It is extremely slow and is a pain to get
11 | going. Below is a guide to getting it working, with links to a fork of
12 | RepEnrich2 I made that makes it friendlier to use.
13 |
14 | I have not actually validated the RepEnrich2 output, so caveat emptor.
15 |
16 | # Preparing RepEnrich2
17 |
18 | ## Create isolated conda environment
19 |
20 | ```bash
21 | conda create -c bioconda -n repenrich2 python=2.7 biopython bedtools samtools bowtie2 bcbio-nextgen
22 | ```
23 |
24 | ## Download my fork of RepEnrich2
25 | This has quality of life fixes such as memoization of outputs so if it fails you don't
26 | have to redo steps.
27 |
28 | ```bash
29 | git clone git@github.com:nerettilab/RepEnrich2.git
30 | ```
31 |
32 | ## Download a pre-created index
33 | You can make your own, for example I made
34 | [hg38](https://www.dropbox.com/s/lefkk38q6bbj76b/Repenrich2_setup_hg38.tar.gz?dl=1)
35 | and the RepEnrich2 folks have mm9 and hg19
36 | [here](https://drive.google.com/drive/folders/0B8_2gE04f4QWNmdpWlhaWEYwaHM). But the RepeatMasker
37 | file it uses needs to be cleaned first and I'm not sure how they cleaned it. They had a hg38 one cleaned
38 | already from RepEnrich so I just used that.
39 |
40 | ## Download bcbio_RepEnrich2
41 | Download [bcbio_RepEnrich2](https://github.com/roryk/bcbio_RepEnrich2). This will need modification if you
42 | want to use it, but it is simple, I just didn't bother as I don't anticipate us running this again.
43 |
44 | # Running RepEnrich2
45 | `bcbio_RepEnrich2` is all you need to run it, the help should give you enough information to go on.
46 | annotation here is the file from RepeatMasker that was used to generate the RepEnrich setup. The
47 | bowtie index is a bowtie2 index of the genome you aligned to. Running RepEnrich2 takes FOREVER, so
48 | be sure to run it on the long queue.
49 |
50 | Example command:
51 |
52 | ```bash
53 | python bcbio_RepEnrich2.py --threads 16 ../human-dsrna/config/human-dsrna.yaml /n/app/bcbio/biodata/genomes/Hsapiens/hg38/bowtie2/hg38 metadata/hg38_repeatmasker_clean.txt metadata/RepEnrich2_setup_hg38/
54 | ```
55 |
56 | # RepEnrich2 outputs
57 |
58 | You will get three files for each sample, for example:
59 |
60 | ```
61 | P1722_class_fraction_counts.txt
62 | P1722_family_fraction_counts.txt
63 | P1722_fraction_counts.txt
64 | ```
65 |
66 | The `class` and `family` files are the counts in the `samplename_fraction_counts.txt` file aggregated by family or
67 | class. Those could be used for aggregate analyses, but the `fraction_counts` file looks at the different repeat
68 | types individually, so it is more what folks are probably looking for.
69 |
--------------------------------------------------------------------------------
/rnaseq/Volcano_Plots.md:
--------------------------------------------------------------------------------
1 | ## [Enhanced Volcano](https://bioconductor.org/packages/devel/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html) is a great and flexible way to create volcano plots.
2 |
3 | Input is a dataframe of test statistics. It works well with the output of `lfcShrink()`.
4 | Below is an example call and output:
5 |
6 | ```
7 | library(EnhancedVolcano)
8 |
9 | EnhancedVolcano(shrunken_res_treatment,
10 | lab= NA,
11 | x = 'log2FoldChange',
12 | y = 'pvalue', title="Volcano Plot for Treatment", subtitle = "")
13 | ```
14 |
15 |
16 |
17 |
18 |
19 |
20 | Almost every aspect is flexible and changeable.
21 |
--------------------------------------------------------------------------------
/rnaseq/ase.md:
--------------------------------------------------------------------------------
1 | # ASE = allele specific expression, allelic imbalance
2 |
3 | - https://stephanecastel.wordpress.com/2017/02/15/how-to-generate-ase-data-with-phaser/
4 |
5 |
6 | ## Installation of Phaser:
7 | ```
8 | git clone https://github.com/secastel/phaser.git
9 | module load gcc/6.2.0
10 | module load python/2.7.12
11 | module load cython/0.25.1
12 | pip install intervaltree --user
13 | cd phaser/phaser
14 | python setup.py build_ext --inplace
15 | ```
16 |
17 | ASE QC:
18 | - https://github.com/gimelbrantlab/Qllelic
19 |
--------------------------------------------------------------------------------
/rnaseq/bibliography.md:
--------------------------------------------------------------------------------
1 | # Normalization
2 | * [Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0778-7). Paper that compares different normalization methods on RNASeq data.
5 |
6 | # Power
7 | * [Power in pairs: assessing the statistical value of paired samples in tests for differential expression](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302489/). Paper that looks at effect of paired-design on power in RNA-seq.
8 |
9 | # Functional analysis
10 | https://yulab-smu.github.io/clusterProfiler-book/index.html
11 |
12 |
13 | [RNA-seq qc](https://seqqc.wordpress.com/)
14 |
--------------------------------------------------------------------------------
/rnaseq/failure_types:
--------------------------------------------------------------------------------
1 | Different ways that RNAseq can fail with examples.
2 |
3 | https://docs.google.com/presentation/d/1d5hyuTJMei0myG_vr7YR3vFewajwF9I9loS3I--kpnw/edit?usp=sharing
4 |
--------------------------------------------------------------------------------
/rnaseq/img/test:
--------------------------------------------------------------------------------
1 | d
2 |
--------------------------------------------------------------------------------
/rnaseq/img/volcano.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/rnaseq/img/volcano.png
--------------------------------------------------------------------------------
/rnaseq/running_IRFinder.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to run intron retention analysis
3 | description: This code helps to run IRFinder in the cluster.
4 | category: research
5 | subcategory: rnaseq
6 | tags: [hpc, intron_retention]
7 | ---
8 |
9 | To run any of these commands, need to activate the bioconda IRFinder environment prior to running script.
10 |
11 | 1. First script creates reference build required for IRFinder
12 |
13 | ```bash
14 | #SBATCH -t 24:00:00              # Runtime (HH:MM:SS)
15 | #SBATCH -n 4
16 | #SBATCH -p medium # Partition (queue) to submit to
17 | #SBATCH --mem=128G               # 128 GB memory needed (total for the job)
18 | #SBATCH -o %j.out # Standard out goes to this file
19 | #SBATCH -e %j.err # Standard err goes to this file
20 | #SBATCH --mail-type=END # Mail when the job ends
21 |
22 | IRFinder -m BuildRefProcess -r reference_data/
23 | ```
24 |
25 | >**NOTE:** The files in the `reference_data` folder are sym links to the bcbio ref files and need to be named specifically `genome.fa` and `transcripts.gtf`:
26 | >
27 | >`genome.fa -> /n/app/bcbio/biodata/genomes/Hsapiens/hg19/seq/hg19.fa`
28 | >
29 | >`transcripts.gtf -> /n/app/bcbio/biodata/genomes/Hsapiens/hg19/rnaseq/ref-transcripts.gtf`
30 |
31 | 2. Second script (.sh) runs IRFinder and STAR on input file
32 |
33 | ```bash
34 | #!/bin/bash
35 |
36 | module load star/2.5.4a
37 |
38 | IRFinder -r /path/to/irfinder/reference_data \
39 | -t 4 -d results \
40 | $1
41 | ```
42 |
43 | 3. Third script (.sh) runs a batch job for each input file in directory
44 |
45 | ```bash
46 | #!/bin/bash
47 |
48 | for fq in /path/to/*fastq
49 | do
50 |
51 | sbatch -p medium -t 0-48:00 -n 4 --job-name irfinder --mem=128G -o %j.out -e %j.err --wrap="sh /path/to/irfinder/irfinder_input_file.sh $fq"
52 | sleep 1 # wait 1 second between each job submission
53 |
54 | done
55 | ```
56 |
57 | 4. Fourth script takes output (IRFinder-IR-dir.txt) and uses the replicates to determine differential expression using the Audic and Claverie test (# replicates < 4). analysisWithLowReplicates.pl script comes with the IRFinder github repo clone, so I cloned the repo at https://github.com/williamritchie/IRFinder/. Notes on the Audic and Claverie test can be found at: https://github.com/williamritchie/IRFinder/wiki/Small-Amounts-of-Replicates-via-Audic-and-Claverie-Test.
58 |
59 | ```bash
60 | #!/bin/bash
61 |
62 | #SBATCH -t 24:00:00 # Runtime in minutes
63 | #SBATCH -n 4
64 | #SBATCH -p medium # Partition (queue) to submit to
65 | #SBATCH --mem=128G               # 128 GB memory needed (total for the job)
66 | #SBATCH -o %j.out # Standard out goes to this file
67 | #SBATCH -e %j.err # Standard err goes to this file
68 | #SBATCH --mail-type=END # Mail when the job ends
69 |
70 | analysisWithLowReplicates.pl \
71 | -A A_ctrl/Pooled/IRFinder-IR-dir.txt A_ctrl/AJ_1/IRFinder-IR-dir.txt A_ctrl/AJ_2/IRFinder-IR-dir.txt A_ctrl/AJ_3/IRFinder-IR-dir.txt \
72 | -B B_nrde2/Pooled/IRFinder-IR-dir.txt B_nrde2/AJ_4/IRFinder-IR-dir.txt B_nrde2/AJ_5/IRFinder-IR-dir.txt B_nrde2/AJ_6/IRFinder-IR-dir.txt \
73 | > KD_ctrl-v-nrde2.tab
74 | ```
75 |
76 | 5. Output `KD_ctrl-v-nrde2.tab` file can be read directly into R for filtering and results exploration.
77 |
78 | 6. Rmarkdown workflow (included in report): IRFinder_report.md
79 |
--------------------------------------------------------------------------------
/rnaseq/running_leafviz.md:
--------------------------------------------------------------------------------
1 | # Leafviz - Visualize Leafcutter Results
2 |
3 |
4 | **All scripts found in `/HBC Team Folder (1)/Resources/LeafCutter_2023`**
5 |
6 | ### Step 1 - Get annotation files
7 |
8 | Annotation files already prepared for hg38 are in the folder listed above in a folder named new_hg38. To make annotation files for another organism run
9 |
10 | ```bash
11 | ./gtf2leafcutter.pl -o /full/path/to/directory/annotation_directory_name/annotation_file_prefix \
12 | /full/path/to/gencode/annotation.gtf
13 | ```
14 |
15 | ### Step 2 - Make RData files from your results
16 |
17 | Use the `prepare_results.R` script in the linked folder. The one on the leafcutter github has not been updated.
18 | Your leafcutter should have output `leafcutter_ds_cluster_significance_XXXXX.txt`, `leafcutter_ds_effect_sizes_XXXXX.txt`, and `XXXX_perind_numers.counts.gz`.
19 | You will need all three of these files and the groups file you made. An example groups file for my PD comparison is below:
20 |
21 | ```bash
22 | NCIH1568_RB1_KO_DMSO_Replicate_3-ready.bam DMSO
23 | NCIH1568_RB1_KO_DMSO_Replicate_2-ready.bam DMSO
24 | NCIH1568_RB1_KO_DMSO_Replicate_1-ready.bam DMSO
25 | NCIH1568_RB1_KO_PD_Replicate_1-ready.bam PD
26 | NCIH1568_RB1_KO_PD_Replicate_3-ready.bam PD
27 | NCIH1568_RB1_KO_PD_Replicate_2-ready.bam PD
28 | ```
29 |
30 | Below is an example to make the RData from the PD comparison. This code outputs `PD_new.RData`.
31 | Note that the annotation has the file path and the annotation file prefix.
32 |
33 | ```bash
34 | ./prepare_results.R -m PD_groups.txt NCIH_PD_perind_numers.counts.gz \
35 | leafcutter_ds_cluster_significance_PD.txt leafcutter_ds_effect_sizes_PD.txt \
36 | new_hg38/new_hg38 -o PD_new.RData
37 | ```
38 |
39 | ### Step 3 - Visualize
40 |
41 | Once you have RDatas made for all of your contrasts it is time to actually run leafviz.
42 | The critical scripts to run leafviz are `run_leafviz.R`, `ui.R`, and `server.R`.
43 | Before you can run it you **must** change the path on line 41 of `run_leafviz.R`.
44 | This path needs to reflect the location of the `ui.R` and `server.R` files.
45 | To run leafviz with my PD data set I give
46 |
47 | ```bash
48 | ./run_leafviz.R PD_new.RData
49 | ```
50 |
51 | This will open a new tab in your browser with your results!
52 |
53 | For more on what is being shown check the leafviz [documentation](http://davidaknowles.github.io/leafcutter/articles/Visualization.html)
54 |
--------------------------------------------------------------------------------
/rnaseq/running_rMATS.md:
--------------------------------------------------------------------------------
1 | ## rMATS for differential splicing analysis
2 |
3 | * Event-based analysis of splicing (e.g. skipped exon, retained intron, alternative 5' and 3' splice site)
4 | * rMATS handles replicate RNA-Seq data from both paired and unpaired study designs
5 | * the statistical model of rMATS calculates the P-value and false discovery rate that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold
6 |
7 | Software: https://rnaseq-mats.sourceforge.net/
8 |
9 | Paper: https://www.pnas.org/doi/full/10.1073/pnas.1419161111
10 |
11 | GitHub: https://github.com/Xinglab/rmats-turbo
12 |
13 | ### Installation
14 |
15 | There were issues with the conda build installation provided on the GitHub page (`./build_rmats --conda`): it failed with a "loading shared libraries" error.
16 |
17 | Instead install from bioconda. Reference: https://groups.google.com/g/rmats-user-group/c/S1GFEqB9TE8/m/YV9R27CoCwAJ?pli=1
18 |
19 | ```bash
20 |
21 | # Need specific python version
22 | conda create -n "rMATS_python3.7" python=3.7
23 |
24 | conda activate rMATS_python3.7
25 |
26 | conda install -c conda-forge -c bioconda rmats=4.1.0
27 |
28 | ```
29 |
30 | If you are running as a script on O2:
31 |
32 | ```bash
33 | #! /bin/bash
34 |
35 | #SBATCH -t 0-24:00 # Runtime
36 | #SBATCH -p medium # Partition (queue)
37 | #SBATCH -J rmats # Job name
38 | #SBATCH -o rmats_frag.out # Standard out
39 | #SBATCH -e rmats_frag.err # Standard error
40 | #SBATCH --mem=50G # Memory needed per core
41 | #SBATCH -c 6
42 |
43 |
44 | # USAGE: For paired-end BAM files;run rMATS
45 |
46 | # Define the project path
47 | path=/n/data1/cores/bcbio/PIs/peter_sicinski/sicinski_inhibition_RNAseq_human_hbc04676
48 |
49 | # Change directories
50 | cd ${path}/rMATS
51 |
52 | # Activate conda env for rmats
53 | source ~/miniconda3/bin/activate
54 | conda init bash
55 | source ~/.bashrc
56 | conda activate rMATS_python3.7
57 |
58 | ```
59 |
--------------------------------------------------------------------------------
/rnaseq/strandedness.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Stranded RNA-seq libraries.
3 | description: Explains strandedness and where to find info in bcbio.
4 | category: research
5 | subcategory: rnaseq
6 | ---
7 |
8 | Bulk RNA-seq libraries retaining strand information (stranded) are useful to quantify expression with higher accuracy for opposite
9 | strand transcripts which overlap or have overlapping UTRs.
10 | https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1876-7.
11 |
12 | Bcbio RNA-seq pipeline has a 'strandedness' parameter: [unstranded|firststrand|secondstrand]
13 | https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html?highlight=strand#configuration. <- link not working*
14 |
15 | The terminology was inherited from Tophat, see the detailed description in the Salmon doc.
16 | https://salmon.readthedocs.io/en/latest/library_type.html
17 | Note that firststrand = ISR for PE and SR for SE.
18 |
19 | If the strandedness is unknown, run a small subset of reads with 'unstranded' in bcbio and check out what Salmon reports in
20 | `bcbio_project/final/sample/salmon/lib_format_counts.json`:
21 | ```
22 | {
23 | "read_files": [
24 | "/dev/fd/63",
25 | "/dev/fd/62"
26 | ],
27 | "expected_format": "IU",
28 | "compatible_fragment_ratio": 1.0,
29 | "num_compatible_fragments": 721856,
30 | "num_assigned_fragments": 721856,
31 | "num_frags_with_concordant_consistent_mappings": 692049,
32 | "num_frags_with_inconsistent_or_orphan_mappings": 47441,
33 | "strand_mapping_bias": 0.9477291347866986,
34 | "MSF": 0,
35 | "OSF": 0,
36 | "ISF": 36174,
37 | "MSR": 0,
38 | "OSR": 0,
39 | "ISR": 655875,
40 | "SF": 37676,
41 | "SR": 9765,
42 | "MU": 0,
43 | "OU": 0,
44 | "IU": 0,
45 | "U": 0
46 | }
47 | ```
48 | Here the majority of reads are ISR.
49 |
50 | Another way to check strand bias is
51 | `bcbio_project/final/sample/qc/qualimap_rnaseq/rnaseq_qc_results.txt`.
52 | It has `SSP estimation (fwd/rev) = 0.04 / 0.96` meaning strand bias (ISR, firststrand).
53 |
54 | Yet another way to confirm strand bias is RSeQC.
55 | http://rseqc.sourceforge.net/#infer-experiment-py.
56 | It uses a small subset of the input bam file:
57 | `infer_experiment.py -r /bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.bed -i test.bam`
58 |
59 | ```
60 | This is PairEnd Data
61 | Fraction of reads failed to determine: 0.1461
62 | Fraction of reads explained by "1++,1--,2+-,2-+": 0.0177
63 | Fraction of reads explained by "1+-,1-+,2++,2--": 0.8362
64 | ```
65 |
--------------------------------------------------------------------------------
/rnaseq/tools.md:
--------------------------------------------------------------------------------
1 | - [IsoformSwitchAnalyzer](https://bioconductor.org/packages/release/bioc/vignettes/IsoformSwitchAnalyzeR/inst/doc/IsoformSwitchAnalyzeR.html)
2 | - LP/VB?, 2019/02?
3 | - version#?
4 | - helps to detect alternative splicing
5 | - output very nice figures
6 | - what requirements are needed (e.g. R-3.5.1, etc.)?
7 | - no tutorials available
8 | - not incorporated into bcbio
9 |   - I tried it and an example of a consult is here: https://code.harvard.edu/HSPH/hbc_RNAseq_christiani_RNAediting_on_lung_in_humna_hbc02307. This package has very nice figures: https://www.dropbox.com/work/HBC%20Team%20Folder%20(1)/consults/david_christiani/RNAseq_christiani_RNAediting_on_lung_in_humna?preview=dtu.html (see the end of the report).
10 |
11 | - [DEXseq](https://bioconductor.riken.jp/packages/3.0/bioc/html/DEXSeq.html)
12 | - LP/VB/RK?, date?
13 | - version#?
14 | - used to call isoform switching
15 | - not recommended - use DTU tool instead
16 | - what requirements are needed (e.g. R-3.5.1, etc.)?
17 | - no tutorials available
18 | - yes, DEXseq is incorporated in bcbio
19 |   - Following this paper from MLove et al: https://f1000research.com/articles/7-952/v3 I used salmon and DEXseq to call isoform switching. This consult has an example: https://code.harvard.edu/HSPH/hbc_RNAseq_christiani_RNAediting_on_lung_in_humna_hbc02307. I found that normally one isoform changes a lot and another very little, but I found some examples where the switching is more evident.
20 |
21 | - [clusterProfiler](https://yulab-smu.github.io/clusterProfiler-book/index.html)
22 |
--------------------------------------------------------------------------------
/scrnaseq/10XVisium.md:
--------------------------------------------------------------------------------
1 | ## Analysis of 10X Visium data
2 |
3 | > Download and install spaceranger: https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/installation
4 |
5 | Helpful resource: https://lmweber.org/OSTA-book/
6 |
7 | ### Analysis software/packages
8 | * [scanpy](https://scanpy-tutorials.readthedocs.io/en/latest/spatial/basic-analysis.html)
9 | * [Spatial transcriptomics with Seurat](https://yu-tong-wang.github.io/talk/sc_st_data_analysis_R.html)
10 | * [Spatial single-cell quantification with alevin-fry](https://combine-lab.github.io/alevin-fry-tutorials/2021/af-spatial/)
11 |
12 |
13 | ### 1. BCL to FASTQ
14 |
15 | `spaceranger mkfastq` can be used here. Input is the flow cell directory.
16 |
17 | Note that **if your SampleSheet is formatted for BCL Convert**, which is Illumina's new demultiplexing software that is soon going to replace bcl2fastq, you will get an error.
18 |
19 | You will need to change the formatting slightly.
20 |
21 | https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/mkfastq#simple_csv
22 |
23 | If you are creating the simple csv samplesheet with specific oligo sequences for each sample index, you may need to make edits. The I2 auto-orientation detector cannot be activated when supplying a simple csv. For example, the NextSeq instrument needs the index 2 in reverse complement. So if you had:
24 |
25 | ```
26 | Lane,Sample,Index,Index2
27 | *,CD_3-GEX_03,GCGGGTAAGT,TAGCACTAAG
28 | ```
29 |
30 | You can either:
31 |
32 | 1. Specify the reverse complement Index2 oligo
33 |
34 | ```
35 | *,CD_3-GEX_03,GCGGGTAAGT,CTTAGTGCTA
36 | ```
37 |
38 | 2. Use the 10x index names, e.g. SI-TT-G6
39 |
40 | ```
41 | *,CD_3-GEX_03,SI-TT-G6
42 | ```
43 |
44 | You can find more information [linked here on the 10X website](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/bcl2fastq-direct#sample-sheet)
45 |
46 | Additionally, if you find you have `AdapterRead1` or `AdapterRead2` under "Setting" in this file (example below), you will want to remove that. For any 10x library, regardless of how the demultiplexing is being done, we do not recommend trimming adapters in the Illumina software -- **this will cause problems with reads in downstream analyses**.
47 |
48 |
49 | **To create FASTQ files:**
50 |
51 | ```bash
52 | spaceranger mkfastq --run data/11-7-2022-DeVries-10x-GEX-Visium/Files \
53 |                     --simple-csv samplesheets/11-7-2022-DeVries-10x-GEX-Visium_Samplesheet.csv \
54 |                     --output-dir fastq/11-7-2022-DeVries-10x-GEX-Visium
55 |
56 | ```
57 |
58 | ### 2. Image files
59 |
60 | Each slide has 4 capture areas and therefore for a single slide you should have 4 image files.
61 |
62 | More on image types [from 10X docs here](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/image-recommendations)
63 |
64 | * Check what type of image you have (you will need to specify in `spaceranger` with the correct flag)
65 | * Open up the image to make sure you have the fiducial border. It's probably done for you. If there are issues with the fiducial alignment (i.e. too tall, too wide) given to you, you may need to manually align using the Loupe browser
66 |
67 |
68 | ### 3. Counting expression data
69 |
70 | The next step is to quantify expression for each capture area. To do this we will use [`spaceranger count`](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/tutorials/count-ff-tutorial). This command will need to be run for each capture area. Below is the command for a single capture area (in this case Slide 1, capture area A1). You may find your files have not been named with A-D, so map them accordingly.
71 |
72 | A few things to note if you have samples that were run on multiple flow cells:
73 |
74 | * include the `--sample` argument to specify the samplename which corresponds to the capture area
75 | * for the `--fastqs` you can add multiple paths to the different flow cell folders and separate them by a comma
76 |
77 | ```bash
78 |
79 | spaceranger count --id="CD_Visium_01" \
80 | --sample=CD_Visium_01 \
81 | --description="Slide1_CaptureArea1" \
82 | --transcriptome=refdata-gex-mm10-2020-A \
83 | --fastqs=mkfastq/11-10-2022-DeVries-10x-GEX-Visium/AAAW33YHV/,mkfastq/11-4-2022-Devries-10x-GEX-Visium/AAAW3N3HV/,mkfastq/11-7-2022-DeVries-10x-GEX-Visium/AAAW352HV/,mkfastq/11-8-2022-DeVries-10x-GEX-Visium/AAAW3FCHV/,mkfastq/11-9-2022-DeVries-10x-GEX-Visium/AAAW3F3HV/ \
84 | --image=images/100622_Walker_Slides1_and_2/V11S14-092_20221006_01_Field1.tif \
85 | --slide=V11S14-092 \
86 | --area=A1 \
87 | --localcores=6 \
88 | --localmem=20
89 |
90 | ```
91 |
--------------------------------------------------------------------------------
/scrnaseq/CellRanger.md:
--------------------------------------------------------------------------------
1 | # When running Cell Ranger on O2
2 | ## Shared by Victor
3 | > 10x documentation is the one he uses:
4 |
5 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger
6 |
7 | > For the custom genome:
8 |
9 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr
10 |
11 | > For GFP:
12 |
13 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr#marker
14 |
15 | ## Build custom genome
16 | > High level steps (based on the 10x tutorial):
17 | 1. Download the gtf and fasta files for the species of interest;
18 | 2. Filter gtf with `cellranger mkgtf` command;
19 | 3. Create the fasta file for the additional gene (for example, GFP);
20 | 4. Create the corresponding gtf file for the additional gene;
21 | 5. Append fasta file of the additional gene to the end of the fasta file for the genome;
22 | 6. Append gtf file of the additional gene to the end of the gtf file for the genome;
23 | 7. Make custom genome with `cellranger mkref` command (see the sketch below);
24 |
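A rough, untested sketch of steps 2-7 for a GFP transgene (file names such as `GFP.fa`/`GFP.gtf` and the biotype filter are placeholders; see the 10x tutorial for the exact GTF record format):

```bash
# 2. Filter the gtf to the biotypes you want to keep
cellranger mkgtf genes.gtf genes.filtered.gtf --attribute=gene_biotype:protein_coding

# 3-6. Append the custom sequence and its gtf record to the genome files
cat genome.fa GFP.fa > genome_plus_GFP.fa
cat genes.filtered.gtf GFP.gtf > genes_plus_GFP.gtf

# 7. Build the custom reference
cellranger mkref --genome=custom_GFP_ref \
                 --fasta=genome_plus_GFP.fa \
                 --genes=genes_plus_GFP.gtf
```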
25 |
--------------------------------------------------------------------------------
/scrnaseq/Demuxafy_HowTo.md:
--------------------------------------------------------------------------------
1 | # How to run Demuxafy on O2
2 |
3 | For detailed instructions and updates on `demuxafy`, see the comprehensive [Read the Docs](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/index.html#)
4 |
5 |
6 | ## Installation
7 |
8 | I originally downloaded the `Demuxafy.sif` singularity image for use on O2 as instructed [here](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/Installation.html). However, **this singularity image did not pass O2's security checks**. The folks at HMS-RC were kind enough to amend the image for me so that it would pass the security checks. The working image is found at: `/n/app/singularity/containers/Demuxafy.sif` allowing anyone to use it.
9 |
10 | Of note, this singularity image includes a bunch of software, including popscle, demuxlet, freemuxlet, souporcell and other demultiplexing as well as doublet detection tools, so very useful to have installed!
11 |
12 |
13 | ## Input data
14 |
15 | Each tool included in `demuxafy` requires slightly different input (see [Read the Docs](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/index.html#)).
16 |
17 | For the demultiplexing tools, in most cases, you will need:
18 |
19 | - A common SNP genotypes VCF file (pre-processed VCF files can be downloaded [here](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/DataPrep.html), which is what I did after repeatedly failing to re-generate my own VCF file from the 1000 genome dataset following the provided instructions...)
20 | - A Barcode file (`outs/raw_feature_bc_matrix/barcodes.tsv.gz` from a typical `cellranger count` run)
21 | - A BAM file of aligned single-cell reads (`outs/possorted_genome_bam.bam` from a typical `cellranger count` run)
22 | - Knowledge of the number of samples in the pool you're trying to demultiplex
23 | - Potentially, a FASTA file of the genome your sample was aligned to
24 |
25 | _NOTE_: When working from a multiplexed dataset (e.g. cell hashing experiment), you may have to re-run `cellranger count` instead of `cellranger multi` to generate the proper barcodes and BAM files. In addition, it may be necessary to use the `barcodes.tsv.gz` file from the `filtered_feature_bc_matrix` (instead of raw) in such cases (see for example this [issue](https://github.com/wheaton5/souporcell/issues/128) when running `souporcell`).
26 |
27 |
28 | ## Pre-processing steps
29 |
30 | Once you've collated those files, you need to make sure your VCF and BAM files are sorted in the same way. This can be achieved by running the following command after sourcing `sort_vcf_same_as_bam.sh` from the Aerts' Lab popscle helper tool GitHub repo (available [here](https://github.com/aertslab/popscle_helper_tools/blob/master/sort_vcf_same_as_bam.sh)):
31 |
32 | ```
33 | # Sort VCF file in same order as BAM file
34 | sort_vcf_same_as_bam.sh $BAM $VCF > demuxafy/data/GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding_sorted.vcf
35 | ```
36 |
37 | ### dsc pileup
38 |
39 | If you wish to run `freemuxlet` (and possibly other tools I haven't piloted), you will also need to run `dsc-pileup` (available within the singularity image) ahead of `freemuxlet` itself. For larger samples (>30k cells), it also helps (= significantly speeds up computational time, from several days to a couple of hours) to pre-filter the BAM file using another of the Aerts' Lab popscle helper tool scripts: `filter_bam_file_for_popscle_dsc_pileup.sh` (available [here](https://github.com/aertslab/popscle_helper_tools/blob/master/filter_bam_file_for_popscle_dsc_pileup.sh))
40 |
41 | ```
42 | # [OPTIONAL but recommended]
43 | module load gcc/9.2.0 samtools/1.14
44 | scripts/filter_bam_file_for_popscle_dsc_pileup.sh $BAM $BARCODES demuxafy/data/GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding_sorted.vcf demuxafy/data/possorted_genome_bam_filtered.bam
45 |
46 | # Run popscle pileup ahead of freemuxlet
47 | singularity exec $DEMUXAFY popscle dsc-pileup --sam demuxafy/data/possorted_genome_bam_filtered.bam --vcf demuxafy/data/GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding_sorted.vcf --group-list $BARCODES --out $FREEMUXLET_OUTDIR/pileup
48 | ```
49 |
50 | _NOTE_: When running the dsc-pileup step on O2, at some point the job might get stalled despite no error message being issued. From my experience, this usually means that the requested memory needs to be increased (I used 48G-56G for most samples I processed, and encountered issues when lowering down to 32G). After filtering the BAM file and with the appropriate amount of memory available, the dsc-pileup step usually completes within 2-3 hours.
51 |
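For completeness, a hedged sketch of the freemuxlet call that would follow the pileup above (the output prefix and `--nsample` value are hypothetical; check the Demuxafy docs for the exact options used in practice):

```
# Run freemuxlet on the dsc-pileup output; --nsample is the number of pooled individuals
singularity exec $DEMUXAFY popscle freemuxlet --plp $FREEMUXLET_OUTDIR/pileup --out $FREEMUXLET_OUTDIR/freemuxlet --nsample 4
```
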
52 |
53 | ## Workflow
54 |
55 | After that, you should be set to run whichever demultiplexing tool you want! See sample scripts for a simple case (small 10X study) in the following [GitHub repo](https://github.com/hbc/neuhausser_scRNA-seq_human_embryo_hbc04528/tree/main/pilot_scRNA-seq/demuxafy/scripts); and for a more complex case (large study, multiplexed 10X data using cell hashing) [here](https://github.com/hbc/hbc_10xCITESeq_Pregizer-Visterra-_hbc04485/tree/main/demuxafy/scripts)
56 |
57 | You also have the option to generate combined results files to contrast results from different software more easily, as described [here](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/CombineResults.html), and as implemented in the `combine_results.sbatch` script in the first GitHub repo linked above.
58 |
--------------------------------------------------------------------------------
/scrnaseq/MDS_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/scrnaseq/MDS_plot.png
--------------------------------------------------------------------------------
/scrnaseq/README.md:
--------------------------------------------------------------------------------
1 | # scRNA-seq
2 |
3 | * **[Tools for scRNA-seq analysis](tools.md):** This document lists the various tools that are currently being used for scRNA-seq analysis (and who has used/tested them) in addition to new tools that we are interested in but have yet to be tested.
4 |
5 | * **Tutorials for scRNA-seq analysis:** These documents are tutorials to help you with various types of scRNA-seq analysis.
6 |
7 | - **[Single-Cell-conda.md](https://github.com/hbc/knowledgebase/blob/master/research/scrnaseq/Single-Cell-conda.md):** installing tools for scRNA-seq analysis with conda.
8 | - **[Single-Cell.md](https://github.com/hbc/knowledgebase/blob/master/research/scrnaseq/Single-Cell.md):** installing tools and setting up docker for single cell rnaseq
9 | - **[rstudio_sc_docker.md](https://github.com/hbc/knowledgebase/blob/master/research/scrnaseq/rstudio_sc_docker.md):** This docker image contains an rstudio installation with some helpful packages for singlecell analysis. It also includes a conda environment to deal with necessary python packages (like umap-learn).
10 | - **[Single-cell analysis workflow](https://github.com/hbc/tutorials/tree/master/scRNAseq/scRNAseq_analysis_tutorial):** tutorials walking through the steps in a single-cell RNA-seq analysis, including differential expression analysis, power analysis, and creating a SPRING interface
11 |
12 | * **[Bibliography](bibliography.md):** This document lists relevant papers pertaining to scRNA-seq analysis
13 |
--------------------------------------------------------------------------------
/scrnaseq/SNP_demultiplex.md:
--------------------------------------------------------------------------------
1 | # Demultiplexing SC data using SNP information
2 |
3 | ## Overview
4 |
5 | ## Methods:
6 |
7 | * scSplit:
8 |
9 |
10 | **References**:
11 |
12 | * [Paper](https://doi.org/10.1186/s13059-019-1852-7)
13 |
14 | * [Repo](https://github.com/jon-xu/scSplit)
15 |
16 | * Demuxlet/Freemuxlet/Popscle:
17 |
18 | Demuxlet is the first iteration of the software. Popscle is a suite that includes an improved version of demuxlet and also freemuxlet. It is recommended
19 | by the authors to use popscle.
20 |
21 | **Running it:**
22 |
23 | Installing demuxlet is not straightforward and has very particular instructions. A similar situation might happen with popscle, which is not published yet.
24 | I recommend using Docker. The repo contains the [Dockerfile](https://github.com/statgen/popscle/blob/master/Dockerfile). You can use it to create your own
25 | docker image. One available is [here](https://hub.docker.com/repository/docker/vbarrerab/popscle). This image can also be used to create a singularity container
26 | on O2.
27 |
28 | _Running on O2 with singularity (replace the capitalized placeholders with your own paths, image name, and file names):_
29 |
30 |
31 |
32 | singularity exec -B BAM_DIR:/bam_files,VCF_DIR:/vcf_files,RESULTS_DIR:/results \
33 | /n/app/singularity/containers/POPSCLE_IMAGE.sif popscle dsc-pileup --sam /bam_files/SAMPLE.bam --vcf /vcf_files/SNPS.vcf \
34 | --out /results/OUT_PREFIX
35 |
36 | **Recommendations:**
37 |
38 | It is highly recommended to reduce the number of reads and SNPs before running.
39 |
40 |
41 |
42 | **References**:
43 |
44 | _Demuxlet_
45 |
46 | * [Paper - Demuxlet](https://www.nature.com/articles/nbt.4042)
47 |
48 | * [Repo](https://github.com/statgen/demuxlet)
49 |
50 | _Popscle (Demuxlet/Freemuxlet)_
51 |
52 | * [Repo](https://github.com/statgen/popscle)
53 |
54 | _popscle helper tools_
55 |
56 | * [Repo](https://github.com/aertslab/popscle_helper_tools)
57 |
--------------------------------------------------------------------------------
/scrnaseq/Single-Cell-conda.md:
--------------------------------------------------------------------------------
1 | *Single cell analyses require a lot of memory and often fail on laptops.
2 | Having R + Seurat installed in a conda environment + interactive session or batch jobs with 50-100G RAM helps.*
3 |
4 | # 1. Use conda from bcbio
5 | ```
6 | which conda
7 | /n/app/bcbio/dev/anaconda/bin/conda
8 | conda --version
9 | conda 4.6.14
10 | ```
11 |
12 | # 2. Create and setup r conda environment
13 | ```
14 | conda create -n r r-essentials r-base zlib pandoc
15 | conda init bash
16 | conda config --set auto_activate_base false
17 | . ~/.bashrc
18 | ```
19 |
20 | # 3. Activate conda env
21 | ```
22 | conda activate r
23 | which R
24 |
25 | ```
26 |
27 | # 4. Install packages from within R
28 |
29 | ## 4.1 Install Seurat
30 | ```
31 | R
32 | install.packages("Seurat")
33 | library(Seurat)
34 | q()
35 | ```
36 |
37 | ## 4.2 Install Monocle
38 | ```
39 | R
40 | install.packages(c("BiocManager", "remotes"))
41 | BiocManager::install("monocle")
42 | q()
43 | ```
44 |
45 | ## 4.3 Install liger
46 | ```
47 | R
48 | install.packages("devtools")
49 | library(devtools)
50 | install_github("MacoskoLab/liger")
51 | library(liger)
52 | q()
53 | ```
54 |
55 | # 5. Install umap-learn for UMAP clustering
56 | ```
57 | pip install umap-learn
58 | ```
59 |
60 | # 6. Deactivate conda
61 | ```
62 | conda deactivate
63 | ```
64 |
65 | # 7. (Troubleshooting)
66 | - It may ask you for a GitHub token - too many packages installed from GitHub.
67 | I generated a token on my laptop and placed it in ~/.Renviron
68 | - BiocManager::install("slingshot") - I failed to install it due to gsl issues.
69 | - when running a batch job, use `source activate r` / `source deactivate`
70 | - if conda is trying to write to the bcbio cache, check and set the cache priority; your home cache should be first:
71 | `conda info`,
72 | ~/.condarc
73 | ```
74 | pkgs_dirs:
75 | - /home/[UID]/.conda/pkgs
76 | - /n/app/bcbio/dev/anaconda/pkgs
77 | ```
78 |
--------------------------------------------------------------------------------
/scrnaseq/bcbio_indrops3.md:
--------------------------------------------------------------------------------
1 | # Counting cells with bcbio for inDrops3 data - proto-SOP
2 |
3 | ## Last use - 2020-03-17
4 |
5 | ## 1. Check reference genome and transcriptome - is it a mouse project?
6 | - mm10 reference genome: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10
7 | - transcriptome_fasta: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.fa
8 | - transcriptome_gtf: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf
9 |
10 | ## 2. Create bcbio project structure in /scratch
11 | ```
12 | mkdir sc_mouse
13 | cd sc_mouse
14 | mkdir config input final work
15 | ```
16 |
17 | ## 3. Prepare fastq input in sc_mouse/input
18 | - some FC come in 1..4 lanes, merge lanes for every read:
19 | ```
20 | cat lane1_r1.fq.gz lane2_r1.fq.gz > project_1.fq.gz
21 | cat lane1_r2.fq.gz lane2_r2.fq.gz > project_2.fq.gz
22 | ```
23 | - cat'ing gzip files sounds ridiculous, but works for the most part, for purists:
24 | ```
25 | zcat KM_lane1_R1.fastq KM_lane2_R1.fastq.gz | gzip > KM_1.fq.gz
26 | ```
27 |
28 | - some cores send bz2 files not gz
29 | ```
30 | bunzip2 *.bz2
31 | cat *R1.fastq | gzip > sample_1.fq.gz
32 | ```
33 |
34 | - some cores produce R1,R2,R3,R4, others R1,R2,I1,I2, rename them
35 | ```
36 | bcbio_R1 = R1 = 86 or 64 bp transcript read
37 | bcbio_R2 = I1 = 8 bp part 1 of cell barcode
38 | bcbio_R3 = I2 = 8 bp sample (library) barcode
39 | bcbio_R4 = R2 = 14 bp = 8 bp part 2 of cell barcode + 6 bp of transcript UMI
40 | ```
41 | - files in sc_mouse/input should be (KM here is project name):
42 | ```
43 | KM_1.fq.gz
44 | KM_2.fq.gz
45 | KM_3.fq.gz
46 | KM_4.fq.gz
47 | ```
48 |
49 | ## 4. Create `sc_mouse/config/sample_barcodes.csv`
50 | Check out if the sample barcodes provided match the actual barcodes in the data.
51 | ```
52 | gunzip -c FC_X_3.fq.gz | awk '{if(NR%4 == 2) print $0}' | head -n 400000 | sort | uniq -c | sort -k1,1rn | awk '{print $2","$1}' | head
53 |
54 | AGGCTTAG,112303
55 | ATTAGACG,95212
56 | TACTCCTT,94906
57 | CGGAGAGA,62461
58 | CGGAGATA,1116
59 | CGGATAGA,944
60 | GGGGGGGG,852
61 | ATTAGACC,848
62 | ATTAGCCG,840
63 | ATTATACG,699
64 | ```
65 |
66 | Sometimes you need to reverse complement sample barcodes:
67 | ```
68 | cat barcodes_original.csv | awk -F ',' '{print $1}' | tr ACGTacgt TGCAtgca | rev
69 | ```
70 |
71 | sample_barcodes.csv
72 | ```
73 | TCTCTCCG,S01
74 | GCGTAAGA,S02
75 | CCTAGAGT,S03
76 | TCGACTAG,S04
77 | TTCTAGAG,S05
78 | ```
79 |
80 | ## 5. Create `sc_mouse/config/sc-mouse.yaml`
81 | ```
82 | details:
83 | - algorithm:
84 | cellular_barcode_correction: 1
85 | minimum_barcode_depth: 1000
86 | sample_barcodes: /full/path/sc_mouse/config/sample_barcodes.csv
87 | transcriptome_fasta: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.fa
88 | transcriptome_gtf: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf
89 | umi_type: harvard-indrop-v3
90 | analysis: scRNA-seq
91 | description: PI_name
92 | files:
93 | - /full/path/sc_mouse/input/KM_1.fq.gz
94 | - /full/path/sc_mouse/input/KM_2.fq.gz
95 | - /full/path/sc_mouse/input/KM_3.fq.gz
96 | - /full/path/sc_mouse/input/KM_4.fq.gz
97 | genome_build: mm10
98 | metadata: {}
99 | fc_name: sc-mouse
100 | upload:
101 | dir: /full/path/sc_mouse/final
102 | ```
103 | Use `cd sc_mouse/input; readlink -f *` to grab full path to each file and paste into yaml.
104 |
105 | ## 6. Create `sc_mouse/config/bcbio.sh`
106 | ```
107 | #!/bin/bash
108 |
109 | # https://slurm.schedmd.com/sbatch.html
110 |
111 | #SBATCH --partition=priority # Partition (queue)
112 | #SBATCH --time=10-00:00 # Runtime in D-HH:MM format
113 | #SBATCH --job-name=km # Job name
114 | #SBATCH -c 20
115 | #SBATCH --mem-per-cpu=5G # Memory needed per CPU
116 | #SBATCH --output=project_%j.out # File to which STDOUT will be written, including job ID
117 | #SBATCH --error=project_%j.err # File to which STDERR will be written, including job ID
118 | #SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL)
119 |
120 | bcbio_nextgen.py ../config/sc-mouse.yaml -n 20
121 | ```
122 | - most projects take < 5 days, but some large 4-lane projects could take more, like 7-8 days
123 |
124 | ## 7. Run bcbio
125 | ```
126 | cd sc_mouse/work
127 | sbatch ../config/bcbio.sh
128 | ```
129 |
130 | ## 1a. (Optional).
131 | If you care, download fresh transcriptome annotation from Gencode (https://www.gencodegenes.org/mouse/)
132 | (it has chrom names with chr matching mm10 assembly).
133 | ```
134 | cd sc_mouse/input
135 | wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz
136 | gunzip gencode.vM23.annotation.gtf.gz
137 | gffread -g /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/seq/mm10.fa gencode.vM23.annotation.gtf -x gencode.vM23.annotation.cds.fa
138 | ```
139 | update sc_mouse/config/sc_mouse.yaml:
140 | ```
141 | transcriptome_fasta: gencode.vM23.annotation.cds.fa
142 | transcriptome_gtf: gencode.vM23.annotation.gtf
143 | ```
144 | ## References
145 | - indrops3 library structure: https://singlecellcore.hms.harvard.edu/resources
146 | - [Even shorter guide](https://github.com/bcbio/bcbio-nextgen/blob/master/config/templates/indrop-singlecell.yaml)
147 | - [Much more comprehensive guide](https://github.com/hbc/tutorials/blob/master/scRNAseq/scRNAseq_analysis_tutorial/lessons/01_bcbio_run.md)
148 |
--------------------------------------------------------------------------------
/scrnaseq/bibliography.md:
--------------------------------------------------------------------------------
1 | # Integration
2 | - [Seurat](https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8)
3 | - [Harmony](https://www.biorxiv.org/content/10.1101/461954v2)
4 |
5 | # More references
6 | 1. [A collection of resources from seandavi](https://github.com/seandavi/awesome-single-cell)
7 | 2. https://scrnaseq-course.cog.sanger.ac.uk/website/index.html
8 | 3. https://broadinstitute.github.io/2019_scWorkshop/index.html
9 | 4. https://github.com/SingleCellTranscriptomics/ISMB2018_SingleCellTranscriptomeTutorial
10 | 5. [Bibliography in bib](bcbio_sc.bib)
11 |
--------------------------------------------------------------------------------
/scrnaseq/cite_seq.md:
--------------------------------------------------------------------------------
1 |
2 | - https://cite-seq.com
3 | - https://en.wikipedia.org/wiki/CITE-Seq
4 | - https://github.com/Hoohm/CITE-seq-Count
5 | - https://sites.google.com/site/fredsoftwares/products/cite-seq-counter
6 |
--------------------------------------------------------------------------------
/scrnaseq/doublets.md:
--------------------------------------------------------------------------------
1 | # Doublet identification
2 |
3 | - gene number filter is not effective in identifying doublets (Scrublet2019 article).
4 | - there is no good unsupervised doublets detection method for now
5 | - DoubletIdentification works for a group of cells we suspect might be doublets (a cluster or a group of clusters) - if we see a mixed marker signature and we know that these cells are not in a transitional state, i.e. expert review of clusters is needed before doublet deconvolution
6 | - dump counts for the suspected cells from Seurat
7 | - identify doublets with Scrublet
8 | - get back to Seurat
9 |
10 | R based DoubletFinder and DoubletDecon have issues
11 | - https://github.com/chris-mcginnis-ucsf/DoubletFinder/issues/64
12 | - https://github.com/EDePasquale/DoubletDecon/issues/21
13 |
14 |
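A minimal sketch of the Seurat → Scrublet → Seurat round trip described above (object names, cluster ids, and the Scrublet output file are assumptions; Scrublet itself runs in Python):

```r
library(Seurat)
library(DropletUtils)

# dump raw counts for the suspected clusters in 10X format for Scrublet
suspect <- subset(seurat_obj, idents = c(3, 7))   # hypothetical cluster ids
write10xCounts(path = "suspect_counts", x = GetAssayData(suspect, slot = "counts"))

# ... run Scrublet on suspect_counts in Python, saving per-cell calls to scrublet_calls.csv ...

# read the calls back and attach them to the Seurat object as metadata
calls <- read.csv("scrublet_calls.csv", row.names = 1)   # rows named by cell barcode
seurat_obj <- AddMetaData(seurat_obj, metadata = calls)
```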
--------------------------------------------------------------------------------
/scrnaseq/pub_quality_umaps.md:
--------------------------------------------------------------------------------
1 | # Here is a collection of code for nice-looking UMAPs from people in the core. Please add!
2 |
3 |
4 | ## Zhu's pretty white boxes
5 |
6 |
7 |
8 | ### Code
9 |
10 | **Note: Zhu says "The gist is to add cluster numbers to the ggplot data, then use LabelClusters to plot."**
11 |
12 | ```R
13 | Idents(seurat_stroma_SCT) <- "celltype"
14 |
15 | p1 <- DimPlot(object = seurat_stroma_SCT,
16 | reduction = "umap",
17 | label = FALSE,
18 | label.size = 4,
19 | repel = TRUE) + xlab("UMAP 1") + ylab("UMAP 2") + labs(title="UMAP")
20 |
21 | # add a new column of clusterNo to ggplot data
22 | p1$data$clusterNo <- as.factor(sapply(strsplit(as.character(p1$data$ident), " "), "[", 1))
23 |
24 | LabelClusters(plot = p1, id = "clusterNo", box = T, repel = F, fill = "white")
25 | ```
26 |
27 | ## Noor's embedded labels
28 |
29 |
30 |
31 | ### Code
32 |
33 |
34 | ```R
35 | LabelClusters(p, id = "ident", fontface = "bold", size = 3, bg.colour = "white", bg.r = .2, force = 0)
36 | ```
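
For context, `p` above is assumed to be a standard Seurat `DimPlot` along the lines of this minimal sketch (the object name is a placeholder):

```R
p <- DimPlot(seurat_obj, reduction = "umap", label = FALSE) +
  xlab("UMAP 1") + ylab("UMAP 2")
```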
37 |
38 |
39 |
--------------------------------------------------------------------------------
/scrnaseq/rstudio_sc_docker.md:
--------------------------------------------------------------------------------
1 | # Docker image with rstudio for single cell analysis
2 |
3 | ## Description
4 |
5 | Docker images for single cell analysis.
6 |
7 | All docker images contain an RStudio installation with some helpful packages for single-cell analysis. They also include a conda environment to deal with necessary python packages (like umap-learn).
8 |
9 | Docker Rstudio images are obtained from [rocker/rstudio](https://hub.docker.com/r/rocker/rstudio).
10 |
11 | ## R version and Bioconductor
12 |
13 | The R and Bioconductor versions are specified in the image name (along with the OS version):
14 |
15 | Example:
16 | `singlecell-base:R.4.0.3-BioC.3.11-ubuntu_20.04`
17 |
18 | ## Use
19 |
20 | `docker run -d -p 8787:8787 --name CONTAINER_NAME -e USER='rstudio' -e PASSWORD='rstudioSC' -e ROOT=TRUE -v HOST_FOLDER:/home/rstudio/projects vbarrerab/IMAGE_NAME` (CONTAINER_NAME, HOST_FOLDER and IMAGE_NAME are placeholders; see the list of available images below)
21 |
22 | `-e DISABLE_AUTH=true` option can be added to avoid Rstudio login prompt. Only use on local machine.
23 |
24 | This instruction will download and launch a container using the singlecell image. Once launched, it can be accessed through a web browser at localhost:8787.
25 |
26 | ### Important parameters
27 |
28 | * -v mounts a folder from the host in the container. This allows for data transfer between the host and the container. **This can only be done when creating the container!**
29 |
30 | * --name assigns a name to the container. Helpful to keep things tidy.
31 | * -e ROOT=TRUE provides root access, in case more tweaking inside the container is necessary.
32 | * -p 8787: change the local port used to access the container. **This can only be done when creating the container!**
33 | * FYI: by default the working directory will be set to /home/rstudio, not /home/rstudio/projects.
34 |
35 | ## Resources
36 |
37 | The dockerfile and other configuration files can be found on:
38 |
39 | https://github.com/vbarrera/docker_configuration
40 |
41 | The docker images:
42 |
43 | vbarrerab/singlecell-base
44 |
45 | ## Available images:
46 |
47 | - R.4.0.2-BioC.3.11-ubuntu_20.04
48 | - R.4.0.3-BioC.3.11-ubuntu_20.04
49 |
50 | **Important:**
51 |
52 | Docker changed its policies to only keep images that have been modified in the last 6 months. This means that previous images will eventually disappear. For previous versions, check availability with @vbarrera.
53 |
54 | # Bibliography
55 |
56 | Inspired by:
57 |
58 | https://www.r-bloggers.com/running-your-r-script-in-docker/
59 |
60 | # Other resources
61 | Using Singularity Containers on the Cluster: https://docs.rc.fas.harvard.edu/kb/singularity-on-the-cluster/
62 |
--------------------------------------------------------------------------------
/scrnaseq/running_MAST.md:
--------------------------------------------------------------------------------
1 | # Running MAST
2 |
3 | [MAST](https://github.com/RGLab/MAST) analyzes differential expression using the cell as the unit of replication rather than the sample (as is done for pseudobulk).
4 |
5 | **NOTES**
6 |
7 | - MAST uses a hurdle model designed for zero-heavy data.
8 | - MAST "expects" log-transformed count data.
9 | - A [recent paper](https://doi.org/10.1038/s41467-021-21038-1) advises using sample id as a random factor to prevent pseudoreplication.
10 | - Most MAST models include the total number of genes expressed in the cell.
11 |
12 | ## Where to run MAST?
13 |
14 | MAST can be run directly in Seurat (via [FindMarkers](https://satijalab.org/seurat/reference/findmarkers)) or by itself.
15 |
16 | While MAST is easier to run through Seurat, there are two big downsides (a minimal example of the Seurat route is shown after this list):
17 |
18 | 1. Seurat does not log transform the data
19 | 2. You cannot edit the model with Seurat, meaning that you cannot add sample ID or the number of genes expressed.
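
For reference, the Seurat route looks roughly like this (a minimal sketch; the object name and idents are assumptions), with the limitations above in mind:

```r
# MAST via Seurat's FindMarkers: a simple two-group test,
# with no random effect for sample id and no ngeneson covariate
markers <- FindMarkers(seurat_obj, ident.1 = "treated", ident.2 = "control", test.use = "MAST")
```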
20 |
21 | ## Running MAST (1) - from Seurat to an SCA object
22 |
23 |
24 | ```r
25 | # Seurat to SCE
26 | sce <- as.SingleCellExperiment(seurat_obj)
27 |
28 | # Add log counts
29 | assay(sce, "log") = log2(counts(sce) + 1)
30 |
31 | # Create new sce object (only 'log' count data)
32 | sce.1 = SingleCellExperiment(assays = list(log = assay(sce, "log")))
33 | colData(sce.1) = colData(sce)
34 |
35 | # Change to SCA
36 | sca = SceToSingleCellAssay(sce.1)
37 |
38 | ```
39 |
40 | ## Running MAST (2) - Filter SCA object
41 |
42 | Here we only keep genes expressed in at least 10% of cells, but this cutoff can be altered and other filters can be added.
43 |
44 | ```r
45 | expressed_genes <- freq(sca) > 0.1
46 | sca_filtered <- sca[expressed_genes, ]
47 |
48 | ```
49 |
50 | ## Format SCA metadata
51 |
52 | We add the total number of genes expressed per cell, set factors as factors, and scale all continuous variables, as suggested by MAST.
53 |
54 | ```r
55 | cdr2 <- colSums(SummarizedExperiment::assay(sca_filtered)>0)
56 |
57 | SummarizedExperiment::colData(sca_filtered)$ngeneson <- scale(cdr2)
58 | SummarizedExperiment::colData(sca_filtered)$orig.ident <- factor(SummarizedExperiment::colData(sca_filtered)$orig.ident)
59 | SummarizedExperiment::colData(sca_filtered)$Gestational_age_scaled <- scale(SummarizedExperiment::colData(sca_filtered)$Gestational_age)
60 | ```
61 |
62 | ## Run MAST
63 |
64 | This is the most computationally intensive step and takes the longest.
65 | Here our model includes the number of genes expressed (ngeneson), sample id as a random effect ((1 | orig.ident)), Gender, and scaled gestational age.
66 |
67 | We extract the results from our model for our factor of interest (Gestational_age_scaled).
68 |
69 |
70 | ```r
71 | zlmCond <- suppressMessages(MAST::zlm(~ ngeneson + Gestational_age_scaled + Gender + (1 | orig.ident), sca_filtered, method='glmer',ebayes = F,strictConvergence = FALSE))
72 | summaryCond <- suppressMessages(MAST::summary(zlmCond,doLRT='Gestational_age_scaled'))
73 | ```
74 |
75 | ## Format Results
76 |
77 | MAST results look quite different from DESeq2 results, so we need to apply a bit of formatting to make them readable.
78 |
79 | After formatting, the outputs can be written directly to CSV files (see the example after the code block).
80 | ```r
81 | summaryDt <- summaryCond$datatable
82 |
83 | # Create readable results table for all genes tested
84 | fcHurdle <- merge(summaryDt[contrast == "Gestational_age_scaled"
85 | & component == 'H', .(primerid, `Pr(>Chisq)`)], # This extracts hurdle p-values
86 | summaryDt[contrast == "Gestational_age_scaled" & component == 'logFC',
87 | .(primerid, coef, ci.hi, ci.lo)],
88 | by = 'primerid') # This extract LogFC data
89 |
90 | fcHurdle <- stats::na.omit(as.data.frame(fcHurdle))
91 |
92 | fcHurdle$fdr <- p.adjust(fcHurdle$`Pr(>Chisq)`, 'fdr')
93 |
94 |
95 | # Create readable results table for significant genes
96 | fcHurdleSig <- merge(fcHurdle[fcHurdle$fdr < .05,],
97 | as.data.table(mcols(sca_filtered)), by = 'primerid')
98 | setorder(fcHurdleSig, fdr)
99 |
100 | ```
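
For example, to write both tables to CSV (file names are arbitrary):

```r
write.csv(fcHurdle, "mast_results_all_genes.csv", row.names = FALSE)
write.csv(fcHurdleSig, "mast_results_significant_genes.csv", row.names = FALSE)
```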
101 |
--------------------------------------------------------------------------------
/scrnaseq/running_doubletfinder.md:
--------------------------------------------------------------------------------
1 | # Running DoubletFinder
2 |
3 |
4 | [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) is one of the most popular doublet finding methods with over 1200 citations since 2019 (as of Sept 2023).
5 |
6 | ## Preparing to run DoubletFinder
7 |
8 | The key notes for running DoubletFinder are:
9 | - Each sample MUST be run separately
10 | - Various parameters can be tweaked in the run (see the DoubletFinder website for details), but the most critical is the prior value for the percentage of doublets.
11 | - DoubletFinder is not fast, so it is best to run it on O2 and save the output as an RDS file.
12 |
13 |
14 | ## Step 1 - Generate subsets
15 |
16 | Starting with your post-QC Seurat object, separate out each sample. Then make a list of these new objects and a vector of object names.
17 |
18 | ```r
19 | sR01 <- subset(x = seurat_qc, subset = orig.ident %in% c("R01"))
20 | sW01 <- subset(x = seurat_qc, subset = orig.ident %in% c("W01"))
21 | s3N00 <- subset(x = seurat_qc, subset = orig.ident %in% c("3N00"))
22 | subsets = list(sR01,sW01,s3N00)
23 | names = c('sR01','sW01',"s3N00")
24 | ```
25 |
26 | ## Step 2 - Run loop
27 |
28 | This is the most computationally intensive step. Here we will loop through the list we created and run doublet finder.
29 |
30 | ```r
31 | for (i in seq_along(subsets)) {
32 |
33 | # SCT Transform and Run UMAP
34 | obj <- subsets[[i]]
35 | obj <- SCTransform(obj)
36 | obj <- RunPCA(obj)
37 | obj <- RunUMAP(obj, dims = 1:10)
38 |
39 | #Run doublet Finder
40 | sweep.res.list_obj <- paramSweep_v3(obj, PCs = 1:10, sct = TRUE)
41 | sweep.stats_obj <- summarizeSweep(sweep.res.list_obj, GT = FALSE)
42 | bcmvn_obj <- find.pK(sweep.stats_obj)
43 | nExp_poi <- round(0.09*nrow(obj@meta.data)) ## Assuming 9% doublet formation rate can be changed.
44 | obj <- doubletFinder_v3(obj, PCs = 1:10, pN = 0.25, pK = 0.1, nExp = nExp_poi, reuse.pANN = FALSE, sct = TRUE)
45 |
46 | # Rename output columns from doublet finder to be consistent across samples for easy merging
47 | colnames(obj@meta.data)[c(22,23)] <- c("pANN","doublet_class") ## change coordinates based on your own metadata size
48 | assign(paste0(names[i]), obj)
49 | }
50 | ```
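
The hard-coded column indices (22, 23) above depend on how many metadata columns your object already has. A more robust alternative (a sketch relying on DoubletFinder's `pANN_*` / `DF.classifications_*` column-name prefixes) is to rename by pattern inside the loop:

```r
# rename DoubletFinder output columns by name instead of position
colnames(obj@meta.data)[grepl("^pANN_", colnames(obj@meta.data))] <- "pANN"
colnames(obj@meta.data)[grepl("^DF.classifications_", colnames(obj@meta.data))] <- "doublet_class"
```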
51 |
52 | ## Step 3 - Merge DoubletFinder output and save
53 |
54 | After DoubletFinder is run, merge the per-sample output into a single Seurat object and save it as an RDS.
55 |
56 | ```r
57 | # merge the objects produced by the loop (sR01, sW01, s3N00), not the original pre-DoubletFinder subsets
58 | seurat_doublet <- merge(x = sR01, y = list(sW01, s3N00))
59 | saveRDS(seurat_doublet, file = "seurat_postQC_doubletFinder.rds")
60 | ```
61 |
62 | ## Step 4 (optional) - Add doublet info to a pre-existing Seurat object for plotting
63 |
64 | If you have gone ahead and run most of the Seurat pipeline before running DoubletFinder, you can add the doublet information to any object for plotting on a UMAP.
65 |
66 | ```r
67 | doublet_info <- seurat_doublet@meta.data$doublet_class
68 | names(doublet_info) <- colnames(x = seurat_doublet)
69 | seurat_norm <- AddMetaData(seurat_norm, metadata=doublet_info, col.name="doublet")
70 | ```
71 |
72 | ## Step 5 - Remove doublets
73 |
74 | You can remove doublets from any Seurat object that has the doublet info.
75 |
76 | ```r
77 | seurat_qc_nodub <- subset(x = seurat_doublet, subset = doublet == "Singlet")
78 | saveRDS(seurat_qc_nodub, file = "seurat_qc_nodoublets.rds")
79 | ```
80 |
81 |
--------------------------------------------------------------------------------
/scrnaseq/saturation_qc.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Single Cell Quality control with saturation
3 | category: Single Cell
4 | ---
5 |
6 | People often ask how many cells they need to sequence in their next experiment.
7 | Saturation analysis helps to answer that question by looking at the current experiment:
8 | would adding more coverage to the current experiment result in more transcripts and
9 | genes, or just in more duplicated reads?
10 |
11 | First, use [from bcbio to single cell script](https://github.com/hbc/hbcABC/blob/master/inst/rmarkdown/Rscripts/singlecell/from_bcbio_to_singlecell.R)
12 | to load data from bcbio into a SingleCellExperiment object.
13 |
14 | Second, use [this Rmd template](https://github.com/hbc/hbcABC/blob/master/inst/rmarkdown/templates/simple_qc_single_cell/skeleton/skeleton.rmd)
15 | to create the report.
16 |
--------------------------------------------------------------------------------
/scrnaseq/seurat_markers.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Seurat Markers
3 | description: This code is for finding Seurat markers
4 | category: research
5 | subcategory: scrnaseq
6 | tags: [differential_analysis]
7 | ---
8 |
9 | ```bash
10 | ssh -XY username@o2.hms.harvard.edu
11 |
12 | srun --pty -p interactive -t 0-12:00 --x11 --mem 128G /bin/bash
13 |
14 | module load gcc/6.2.0 R/3.4.1 hdf5/1.10.1
15 |
16 | R
17 | ```
18 |
19 | ```r
20 | library(Seurat)
21 | library(tidyverse)
22 |
23 | set.seed(1454944673L)
24 | data_dir <- "data"
25 | seurat <- readRDS(file.path(data_dir, "seurat_tsne_all_res0.6.rds"))
26 | ```
27 |
28 | Make sure the TSNEPlot looks as expected
29 |
30 | ```r
31 | TSNEPlot(seurat)
32 | ```
33 |
34 | Check markers for any particular cluster against all others
35 |
36 | ```r
37 | cluster14_markers <- FindMarkers(object = seurat, ident.1 = 14, min.pct = 0.25)
38 | ```
39 |
40 | Or look for markers of every cluster against all others
41 |
42 | ```r
43 | seurat_markers <- FindAllMarkers(object = seurat, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)
44 | ```
45 |
46 | >**NOTE:** The `seurat_markers` object will be a dataframe with the row names as Ensembl IDs; however, since row names need to be unique, if a gene is a marker for more than one cluster, then Seurat will add a number to the end of the Ensembl ID. Therefore, do not use the row names as the gene identifiers; use the `gene` column.
47 |
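For example, to pull the top 10 markers per cluster using the `cluster` and `gene` columns (a minimal sketch; the fold-change column may be `avg_logFC` or `avg_log2FC` depending on the Seurat version):

```r
top10_markers <- seurat_markers %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_logFC)
```
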
48 | Save the markers for report generation
49 |
50 | ```r
51 | saveRDS(seurat_markers, "data/seurat_markers_all_res0.6.rds")
52 | ```
53 |
--------------------------------------------------------------------------------
/scrnaseq/tinyatlas.md:
--------------------------------------------------------------------------------
1 | - [Mouse brain markers](https://www.brainrnaseq.org/)
2 | - [Mouse and human markers](https://panglaodb.se)
3 | - https://bioconductor.org/packages/release/bioc/html/SingleR.html
4 | - [tinyatlas](https://github.com/hbc/tinyatlas)
5 | - https://www.flyrnai.org/tools/biolitmine/web/
6 | - [human blood](http://scrna.sklehabc.com/), [article](https://academic.oup.com/nsr/article/8/3/nwaa180/5896476)
7 | - [human skin](https://www.nature.com/articles/s42003-020-0922-4/figures/1)
8 | - [A big atlas from Hemberg's lab](https://scfind.sanger.ac.uk/)
9 |
--------------------------------------------------------------------------------
/scrnaseq/tutorials.md:
--------------------------------------------------------------------------------
1 | - [HSPH](https://github.com/hbc/tutorials/blob/master/scRNAseq/scRNAseq_analysis_tutorial/README.md)
2 | - [HBC's tutorials for scRNA-seq analysis workflows](https://github.com/hbc/tutorials/tree/master/scRNAseq/scRNAseq_analysis_tutorial)
3 | - [Seurat, Satija lab](https://satijalab.org/seurat/vignettes.html)
4 | - [Hemberg lab, Cambridge](https://scrnaseq-course.cog.sanger.ac.uk/website/index.html)
5 | - [Broad](https://broadinstitute.github.io/2019_scWorkshop/)
6 | - [DS pseudobulk edgeR](http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/page/muscWorkshop__vignette/)
7 | - [MIG2019](https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/index.html)
8 | - [OSCA, Bioconductor](https://bioconductor.org/books/release/OSCA/)
9 |
--------------------------------------------------------------------------------
/scrnaseq/velocity.md:
--------------------------------------------------------------------------------
1 | RNA velocity analysis is a trajectory analysis based on the ratio of spliced to unspliced RNA.
2 |
3 | It is quite popular (https://www.nature.com/articles/s41586-018-0414-6);
4 | however, the original pipeline is not well supported:
5 | https://github.com/velocyto-team/velocyto.R/issues
6 |
7 | There is a new one from kallisto team:
8 | https://bustools.github.io/BUS_notebooks_R/velocity.html
9 |
10 | # 1. Install R4.0 (development version) on O2
11 | - module load gcc/6.2.0
12 | - installed R-devel: https://www.r-bloggers.com/r-devel-in-parallel-to-regular-r-installation/
13 | because one of the packages wanted R4.0
14 | - configure R with `./configure --enable-R-shlib` for rstudio
15 | - remove conda from PATH to avoid using its libcurl
16 | - module load boost/1.62.0
17 | - module load hdf5/1.10.1
18 | - installing velocyto.R: https://github.com/velocyto-team/velocyto.R/issues/86
19 |
20 | # 2. Install velocyto.R with R3.6.3 (Fedora 30 example)
21 | bash:
22 | ```
23 | sudo dnf update R
24 | sudo dnf install boost boost-devel hdf5 hdf5-devel
25 | git clone https://github.com/velocyto-team/velocyto.R
26 | ```
27 | rstudio/R:
28 | ```
29 | BiocManager::install("pcaMethods")
30 | setwd("/where/you/cloned/velocyto.R")
31 | devtools::install_local("velocyto.R")
32 | ```
33 |
34 | # 3. Generate reference files
35 | - `Rscriptdev` [01_get_velocity_files.R](https://github.com/naumenko-sa/crt/blob/master/velocity/01_get_velocity_files.R)
36 | - output:
37 | ```
38 | cDNA_introns.fa
39 | cDNA_tx_to_capture.txt
40 | introns_tx_to_capture.txt
41 | tr2g.tsv
42 | ```
43 |
44 | # 4. Index reference
45 | This step takes ~1-2h and 100G of RAM:
46 | `sbatch `[02_kallisto_index.sh](https://github.com/naumenko-sa/crt/blob/master/velocity/02_kallisto_index.sh)
47 |
48 | - inDrops3 support: https://github.com/BUStools/bustools/issues/4
49 |
50 | # 5. Split reads by sample with barcode_splitter
51 |
52 | - merge reads from multiple flowcells first
53 | - https://pypi.org/project/barcode-splitter/
54 | ```
55 | barcode_splitter --bcfile samples.tsv Undetermined_S0_L001_R1.fastq Undetermined_S0_L001_R2.fastq Undetermined_S0_L001_R3.fastq Undetermined_S0_L001_R4.fastq --idxread 3 --suffix .fq
56 | ```
57 |
58 | The kallisto bus counting procedure works on a per-sample basis, so we need to split samples into separate fastq files and merge reads for each sample across lanes.
59 |
60 | - [split_barcodes.sh](https://github.com/naumenko-sa/crt/blob/master/velocity/03_split_barcodes.sh)
61 |
62 | # 6. Count spliced and unspliced transcripts
63 | - [kallisto_count](https://github.com/naumenko-sa/crt/blob/master/velocity/04_kallisto_count.sh)
64 | - output:
65 | ```
66 | spliced.barcodes.txt
67 | spliced.genes.txt
68 | spliced.mtx
69 | unspliced.barcodes.txt
70 | unspliced.genes.txt
71 | unspliced.mtx
72 | ```
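
A minimal sketch of reading these matrices into R (the linked Rmd below does this in full; paths assume the standard bustools output with barcodes as matrix rows):

```r
library(Matrix)

# read one bustools matrix and label it with gene and barcode names
read_bus_mtx <- function(prefix) {
  m <- t(readMM(paste0(prefix, ".mtx")))                  # transpose to genes x cells
  rownames(m) <- readLines(paste0(prefix, ".genes.txt"))
  colnames(m) <- readLines(paste0(prefix, ".barcodes.txt"))
  m
}

spliced <- read_bus_mtx("spliced")
unspliced <- read_bus_mtx("unspliced")
```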
73 |
74 | # 7. Create Seurat objects for every sample
75 | - [create_seurat_sample.Rmd](https://github.com/naumenko-sa/crt/blob/master/velocity/05.create_seurat_sample.Rmd)
76 | - also removes empty droplets
77 |
78 | # 8. Merge seurat objects
79 | - [merge_seurats](https://github.com/naumenko-sa/crt/blob/master/velocity/06.merge_seurats.Rmd)
80 |
81 | # 9. Velocity analysis
82 | - [velocity_analysis](https://github.com/naumenko-sa/crt/blob/master/velocity/07.velocity_analysis.Rmd)
83 |
84 | # 10. Plot velocity picture
85 | - [plot_velocity](https://github.com/naumenko-sa/crt/blob/master/velocity/08.plot_velocity.Rmd)
86 |
87 | # 11. Repeat marker analysis
88 | - [velocity_markers](https://github.com/naumenko-sa/crt/blob/master/velocity/09.velocity_markers.Rmd)
89 |
90 | # 12. References
91 | - https://www.kallistobus.tools/tutorials
92 | - https://github.com/satijalab/seurat-wrappers/blob/master/docs/velocity.md
93 | - [preprocessing influences velociy analysis](https://www.biorxiv.org/content/10.1101/2020.03.13.990069v1)
94 |
95 | # Velocity analysis in Python:
96 | - http://velocyto.org/
97 | - https://github.com/pachterlab/MBGBLHGP_2019/blob/master/Figure_3_Supplementary_Figure_8/supp_figure_9.ipynb
98 |
--------------------------------------------------------------------------------
/scrnaseq/write10Xcounts.md:
--------------------------------------------------------------------------------
1 | ## Need to go from counts in a Seurat object to the 10X format?
2 |
3 | We recently found that some tools, like Scrublet (doublet detection), require scRNA-seq counts to be in 10X format:
4 | * barcodes.tsv
5 | * genes.tsv
6 | * matrix.mtx
7 |
8 | How to do this easily?
9 |
10 | ```r
11 | library(Seurat)
12 | library(tidyverse)
13 | library(rio)
14 | library(DropletUtils)
15 |
16 | # Read in Seurat object
17 | seurat_stroma <- readRDS("./seurat_stroma_replicatePaper.rds")
18 |
19 | # Output data
20 | write10xCounts(x = seurat_stroma@assays$RNA@counts, path = "./cell_ranger_data_format_test")
21 |
22 | ```
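
If the direct slot access above doesn't match your Seurat version, `GetAssayData` is a more portable way to pull the raw counts (a sketch using the same object):

```r
counts <- GetAssayData(seurat_stroma, assay = "RNA", slot = "counts")
write10xCounts(x = counts, path = "./cell_ranger_data_format_test")
```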
23 |
--------------------------------------------------------------------------------
/scrnaseq/zinbwaver.md:
--------------------------------------------------------------------------------
1 | # When asked why don't you use zinbwave
2 |
3 | - [zinbwave-deseq2-comparison.Rmd](https://github.com/roryk/zinbwave-deseq2-indrop/blob/master/zinbwave-deseq2-comparison.Rmd)
4 | - [Soneson2018](https://experiments.springernature.com/articles/10.1038/nmeth.4612)
5 | - [Original ZINB](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1406-4)
6 | - [Crowell2020](https://www.biorxiv.org/content/10.1101/713412v2)
7 | - [Droplet scRNA-seq is not zero-inflated](https://www.nature.com/articles/s41587-019-0379-5)
8 |
9 |
--------------------------------------------------------------------------------
/training/mkdocs.md:
--------------------------------------------------------------------------------
1 | # MkDocs
2 |
3 | Basic prerequisites:
4 | - Python
5 | - GitHub CLI
6 | - VS Code
7 |
8 |
9 | Download Visual Studio Code:
10 |
11 | https://code.visualstudio.com/download
12 |
13 | 1. Open a new terminal within VS Code.
14 | 2. Navigate to where you want to host the GitHub repo.
15 | 3. Create a new GitHub repo you want to work on, or clone an existing one: `git clone link_to_repo.git`
16 | 4. `cd` to the GitHub repo folder.
17 | 5. Make a virtual python environment: `python -m venv venv`
18 | 6. Source the python environment you just created: `source venv/bin/activate`
19 | 7. Check that pip is installed: `pip --version`
20 | 8. Install mkdocs: `pip install mkdocs-material`
21 | 9. Open VS Code from here: `code .`
22 | 10. Open a terminal within VS Code.
23 | 11. To open up the website: `mkdocs serve`
24 | 12. To change to the "material" theme, open the mkdocs.yml file and below `site_name: My Docs` type:
25 |     site_name: My Docs
26 |     theme:
27 |       name: material
28 | 13. Save it.
29 | 14. To deploy: type `mkdocs serve` in the terminal. It will restart on the same host.
30 | 15. We can change the appearance of the site by editing and adding plugins to mkdocs.yml (google for common settings).
31 | 16. To add another page, go inside docs and add another page, e.g. page2.md.
32 | 17. Add:
33 |     # Page 2
34 |
35 |     ## sub heading
36 |
37 |     Text inside
38 | 18. Save it.
39 | 19. In the terminal type: `git add .`
40 | 20. Commit and push: `git commit -m 'updating instructor'`
41 |     git push origin main
42 |     git config http.postBuffer 524288000 # (if an error related to http comes up)
43 |     git pull && git push
44 |
--------------------------------------------------------------------------------
/variants/clonal_evolution.md:
--------------------------------------------------------------------------------
1 | # Variant analysis
2 |
3 | - [SciClone](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003665)
4 | - [FishPlot](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3195-z)
5 | - [PhylogicNDT](https://github.com/broadinstitute/PhylogicNDT)
6 |
--------------------------------------------------------------------------------
/wgs/crispr-offtarget.md:
--------------------------------------------------------------------------------
1 | # Overview
2 | This guide is how to call offtarget edits in a CRISPR edited genome. This is
3 | pretty easy to and only takes a few steps. First, we need to figure out what is
4 | different between the CRISPR edited samples and the (hopefully they gave you
5 | these) control samples. Then we need to find a set of predicted off-target
6 | CRISPR sites. Finally, once we know what is different, we need to overlap the
7 | differences with predicted off-target sites, allowing some mismatches. Then we
8 | can report the overall differences and differences that could be due to
9 | offtarget edits.
10 |
11 | # Call CRISPR-edited specific variants
12 | You want to call edits that are in the CRISPRed sample but not the unedited
13 | sample. You can do that by plugging into the tumor-normal calling part of bcbio
14 | and pretending the CRISPR-edited sample is a tumor sample and the non-edited
15 | sample is a normal sample.
16 |
17 | To get tumor-normal calling to work you need to use a variant caller that
18 | can handle that; I recommend mutect2.
19 |
20 | To tell bcbio that a pair of samples is a tumor-normal pair you need to
21 |
22 | 1. Put the tumor and normal sample in the same **batch** by setting **batch** in the metadata to the same batch.
23 | 2. Set **phenotype** of the CRISPR-edited sample to **tumor**.
24 | 3. Set the **phenotype** of the non-edited sample to **normal**.
25 |
26 | And kick off the **variant2** pipeline, the normal whole genome sequencing pipeline. An example YAML template is below:
27 |
28 | ```yaml
29 | ---
30 | details:
31 | - analysis: variant2
32 | genome_build: hg38
33 | algorithm:
34 | aligner: bwa
35 | variantcaller: mutect2
36 | tools_on: [gemini]
37 | ```
38 |
39 | And an example metadata file:
40 |
41 | ```csv
42 | samplename,description,batch,phenotype,sex,cas9,gRNA
43 | Hs27_HSV1.cram,Hs27_HSV1,noCas9_nogRNA,normal,male,no,yes
44 | Hs27_HSV1_Cas9.cram,Hs27_HSV1_Cas9,noCas9,normal,male,yes,no
45 | Hs27_HSV1_UL30_5.cram,Hs27_HSV1_UL30_5,noCas9_nogRNA,tumor,male,yes,yes
46 | Hs27_HSV1_UL30_5_repeat.cram,Hs27_HSV1_UL30_5_repeat,noCas9,tumor,male,yes,yes
47 | ```
48 |
49 | # Find predicted off-target sites
50 | There are several tools to do this; a common one folks use is cas-offinder, so
51 | that is what we will use. There is a [web app](http://www.rgenome.net/cas-offinder/) but it will only return 1,000 events per
52 | class. Usually this is fine, but if you allow bulges you can get a lot more offtarget sites so you might bump into this limit.
53 |
54 | First, install cas-offinder; there is a conda package, so this is easy:
55 |
56 | ```bash
57 | conda create -n crispr -c bioconda cas-offinder
58 | ```
59 |
60 | There is a companion python wrapper cas-offinder-bulge that can also predict
61 | offtarget sites, taking bulges into account. You can download it
62 | [here](https://raw.githubusercontent.com/hyugel/cas-offinder-bulge/master/cas-offinder-bulge) if
63 | you need to do that.
64 |
65 | You'll need to know the sequence of one or more guides you want to check. You will also need to know
66 | the PAM sequence for the endonuclease that is being used.
67 |
68 | You can run cas-offinder like this:
69 |
70 | ```bash
71 | cas-offinder input.txt C output.txt
72 | ```
73 |
74 | where input.txt has this format:
75 |
76 | ```
77 | hg38.fa
78 | NNNNNNNNNNNNNNNNNNNNNNNGRRT
79 | ACACGTGAAAGACGGTGACGGNNGRRT 6
80 | ```
81 |
82 | `hg38.fa` is the path to the FASTA file of the hg38 genome. NNNNNNNNNNNNNNNNNNNNNNNGRRT gives the length of the guide sequence you are interested in with the PAM sequence tacked on the end.
83 | ACACGTGAAAGACGGTGACGGNNGRRT is the guide sequence with the PAM sequence tacked on the end. 6 is the number of mismatches you are allowing here; it will look for sites with that many
84 | or fewer mismatches.
85 |
86 | If you want to look for bulges, use cas-offinder-bulge with this format:
87 |
88 | ```
89 | hg38.fa
90 | NNNNNNNNNNNNNNNNNNNNNNNGRRT 2 1
91 | ACACGTGAAAGACGGTGACGGNNGRRT 6
92 | ```
93 |
94 | Where the 2 says to look for a DNA bulge and the 1 an RNA bulge. You can do one or the other, neither, or both.
95 |
96 | After you run cas-offinder you can convert the output to a sorted BED file for intersecting with your variants:
97 |
98 | ```bash
99 | cat output.txt | sed 1d | awk '{printf("%s\t%s\t%s\n",$4, $5-10,$5+10)}' | sort -V -k 1,1 -k2,2n > output.bed
100 | ```
101 |
102 | # Overlap variants
103 | Finally, use the BED file of predicted off-target sites to pull out possible off-target variant calls:
104 |
105 | ```bash
106 | bedtools intersect -header -u -a noCas9_nogRNA-mutect2-annotated.vcf.gz -b output.bed
107 | ```
108 |
109 | And you are done!
110 |
111 | # More tools
112 | [CRISPResso2](https://github.com/pinellolab/CRISPResso2)
113 |
--------------------------------------------------------------------------------
/wgs/pacbio_genome_assembly.md:
--------------------------------------------------------------------------------
1 | ---
2 | tags:
3 | title: Genome Assembly Using PacBio Reads Only
4 | author: Zhu Zhuo
5 | created: '2019-09-13'
6 | ---
7 |
8 | # Genome Assembly Using PacBio Reads Only
9 |
10 | This tutorial is based on a bacterial genome assembly project using PacBio sequencing reads only, but it can be followed for genome assembly of other species or using other types of long reads with little or no modification.
11 |
12 | ## Demultiplex
13 |
14 | If the sequencing core hasn't demultiplexed the data, [`lima`](https://github.com/PacificBiosciences/barcoding) can be used for demultiplexing.
15 |
16 | ## Convert `bam` file to `fastq` file
17 |
18 | `subreads.bam` files contain the subreads data, and we will convert them from `bam` format to `fastq` format, as most assemblers take `fastq` as input.
19 |
20 | [A note on the output from PacBio:](https://pacbiofileformats.readthedocs.io/en/5.1/Primer.html)
21 | > Unaligned BAM files representing the subreads will be produced natively by the PacBio instrument. The subreads BAM will be the starting point for secondary analysis. In addition, the scraps arising from cutting out adapter and barcode sequences will be retained in a `scraps.bam` file, to enable reconstruction of HQ regions of the ZMW reads, in case the customer needs to rerun barcode finding with a different option.
22 |
23 | Below is an example of a slurm script using `bedtools bamtofastq` to convert `bam` to `fastq`.
24 |
25 | ```
26 | #!/bin/sh
27 | #SBATCH -p medium
28 | #SBATCH -J bam2fq
29 | #SBATCH -o %x_%j.o
30 | #SBATCH -e %x_%j.e
31 | #SBATCH -t 00-23:59:00
32 | #SBATCH -c 2
33 | #SBATCH --mem=4G
34 | #SBATCH --array=1-n%5 #change n to the number of subreads.bam files
35 |
36 | module load bedtools/2.27.1
37 |
38 | files=(/path/to/subreads.bam)
39 | file=${files[$SLURM_ARRAY_TASK_ID-1]}
40 | sample=`basename $file .bam`
41 |
42 | echo $file
43 | echo $sample
44 |
45 | bedtools bamtofastq -i $file -fq $sample".fq"
46 | ```
47 |
48 | The PacBio Sequel sequencer reports all base qualities as PHRED 0 (ASCII `!`), so the quality scores for Sequel data are all `!` in the generated `fastq` file.
49 |
50 | ## Genome assembly
51 |
52 | ### Using Canu for genome assembly
53 |
54 | [Canu](https://github.com/marbl/canu) does correction, trimming and assembly in a single command. Follow its github page to install the software to a location of preference.
55 |
56 | An example of a slurm script for a single sample:
57 | ```
58 | #!/bin/sh
59 | #SBATCH -p priority
60 | #SBATCH -J canu
61 | #SBATCH -o %x_%j.o
62 | #SBATCH -e %x_%j.e
63 | #SBATCH -t 0-23:59:00
64 | #SBATCH -c 1
65 | #SBATCH --mem=1G
66 |
67 | module load java/jdk-1.8u112
68 |
69 | export PATH=/path/to/canu:$PATH
70 |
71 | canu -p sampleName -d sampleName genomeSize=7m \
72 | gridOptions="--time=1-23:59:00 --partition=medium" \
73 | -pacbio-raw /path/to/converted.fq
74 | ```
75 | This is for the 'master' job, so 1 CPU and 1 GB of memory should be sufficient. Canu will evaluate the available resources and automatically submit jobs to the queue specified in `gridOptions`.
76 |
77 | _Note_: An alternative is to install the bioconda recipe, but the conda version is not up-to-date and some additional parameters may need to be specified in the command.
78 |
79 | ### Using Unicycler for bacterial genome assembly
80 |
81 | Follow [Unicycler](https://github.com/rrwick/Unicycler#method-long-read-only-assembly) instructions. [Racon](https://github.com/isovic/racon) is also required and should be installed before running Unicycler.
82 |
83 | ```
84 | module load gcc/6.2.0 python/3.6.0
85 | git clone https://github.com/rrwick/Unicycler.git
86 | cd Unicycler
87 | python3 setup.py install --user
88 | ```
89 | Example of a slurm script
90 | ```
91 | #!/bin/sh
92 | #SBATCH -p priority
93 | #SBATCH -J unicycler
94 | #SBATCH -o %x_%j.o
95 | #SBATCH -e %x_%j.e
96 | #SBATCH -t 29-23:59:00
97 | #SBATCH -c 20
98 | #SBATCH --mem=200G
99 |
100 | module load gcc/6.2.0 python/3.6.0 bowtie2/2.3.4.3 samtools/1.9 blast/2.6.0+
101 | export PATH=/path/to/racon:$PATH
102 |
103 | /path/to/unicycler-runner.py -l /path/to/fastq -t 20 -o sample
104 | ```
105 | ## Assembly quality
106 |
107 | ### Basic assembly metrics
108 |
109 | Download [Quast](http://bioinf.spbau.ru/quast) for basic assembly metrics, such as total length, number of contigs and N50.
110 | `/path/to/quast.py -o output_folder -t 6 assembly.fa`
111 |
112 | ### Assembly completeness
113 |
114 | Use [BUSCO](https://busco.ezlab.org/) for evaluating the completeness of the genome assembly. BUSCO has a lot of dependencies, so it is better to install a conda recipe.
115 |
116 | ```
117 | source activate conda-env # activate conda environment. If you don't have one, you may need to create a conda environment.
118 | conda install -c bioconda busco
119 | conda deactivate # deactivate conda environment
120 | ```
121 | Download the BUSCO database for the species and run BUSCO
122 |
123 | `run_busco -i assembly.fa -o output_folder -l species_odb -m geno`
124 |
125 | BUSCO is also the abbreviation for Benchmarking Universal Single-Copy Orthologs, which are single-copy orthologs found in >90% of species. The more BUSCOs are present, the more complete the genome assembly is.
126 |
--------------------------------------------------------------------------------