├── .gitignore ├── README.md ├── RNAVelocity ├── scVelo_tutorial.md └── velocyto_tutorial.md ├── SpatialTranscriptomics ├── Baysor │ ├── LoadingBaysorSegmentationIntoSeurat.R │ ├── RunningBaysor.md │ └── readme.txt └── readme.txt ├── _config.yml ├── admin ├── acknowledging_funding ├── archive_folders_to_standby.md ├── chargeback_models.md ├── consulting_resources.md ├── data_management.md ├── download_data.md ├── getting_started.md ├── initial_consults.md ├── method_snippets.md ├── reproducible_research.md ├── scripts_for_data_management_on_o2 ├── setting_up_an_analysis_guidelines.md └── using_globus ├── bcbio ├── Creating_Hybrid_Mammal_Viral_Reference_Genome.md ├── bcbio_genomes.md ├── bcbio_tips.md ├── bcbio_workflow_mary.md ├── building_a_hybrid_murine_transgene_reference_genome.md ├── git.md └── gtf_gff_validator.md ├── bcbio_chip_userstory_draft.md ├── chipseq ├── bcbio_output_summary.sh ├── cutandrun.md ├── metadata.md └── tools.md ├── img ├── can_not_connect.png ├── images.md ├── noor_umap.png ├── r_taking_longer.png ├── simpsons.gif └── zhu_umap.png ├── long_read_data ├── Jihe's presentation at ABRF.pptx ├── Jihe's summary of ABRF discussion at core meeting.pptx ├── Jihe's_long_read_presentation_core_meeting.pptx ├── genome_assembly_tools.md └── nanopore_DRS_workflow.md ├── misc ├── Core_members_September_2019.key ├── FAQs.md ├── GEO_submissions.md ├── OSX.md ├── Reform_python.md ├── aws.md ├── core_resources.md ├── general_ngs.md ├── git.md ├── miRNA.md ├── mounting_o2_mac.md ├── mtDNA_variants.md ├── multiomics_factor_analysis.md ├── new_to_remote_github_CLI_start_here.md ├── organized_papers.md ├── orphan_improvements.md ├── power_calc_simulations.md └── snakemake-example-pipeline ├── python └── conda.md ├── r ├── .Rprofile ├── R-tips-and-tricks.md ├── Shiny_images │ ├── Added_tabs.png │ ├── Adding_panels.png │ ├── Adding_theme.png │ ├── Altered_action_button.png │ ├── Check_boxes_with_action_button.png │ ├── R_Shiny_hello_world.gif │ ├── R_shiny_req_after.gif │ ├── R_shiny_req_initial.gif │ ├── Return_table.png │ ├── Return_text_app_blank.png │ ├── Return_text_app_hello.png │ ├── Sample_size_hist_100.png │ ├── Sample_size_hist_5.png │ ├── Shiny_UI_server.png │ ├── Shiny_process.png │ ├── Squaring_number_app.png │ └── mtcars_table.png ├── htmlwidgets └── rshiny_server.md ├── rc ├── O2-tips.md ├── O2_portal_errors.md ├── arrays_in_slurm.md ├── connection-to-hpc.md ├── ipython-notebook-on-O2.md ├── jupyter_notebooks.md ├── keepalive.md ├── manage-files.md ├── openondemand.md ├── scheduler.md └── tmux.md ├── rnaseq ├── IRFinder_report.md ├── RepEnrich2_guide.md ├── Volcano_Plots.md ├── ase.md ├── bcbio_rnaseq.bib ├── bibliography.md ├── dexseq.Rmd ├── failure_types ├── img │ ├── test │ └── volcano.png ├── running_IRFinder.md ├── running_leafcutter.md ├── running_leafviz.md ├── running_rMATS.md ├── strandedness.md └── tools.md ├── scrnaseq ├── 10XVisium.md ├── CellRanger.md ├── Demuxafy_HowTo.md ├── MDS_plot.png ├── README.md ├── SNP_demultiplex.md ├── Single-Cell-conda.md ├── Thoughts_on_lymphocyte_Antigen_Receptor_transcripts_in_sc_RNA_Seq_analyses.md ├── bcbio_indrops3.md ├── bcbio_sc.bib ├── bibliography.md ├── cite_seq.md ├── doublets.md ├── pseudobulkDE_edgeR.md ├── pub_quality_umaps.md ├── pySCENIC.md ├── rstudio_sc_docker.md ├── running_MAST.md ├── running_doubletfinder.md ├── saturation_qc.md ├── seurat_clustering_analysis.md ├── seurat_markers.md ├── tinyatlas.md ├── tutorials.md ├── velocity.md ├── write10Xcounts.md └── zinbwaver.md ├── training └── mkdocs.md ├── variants 
└── clonal_evolution.md └── wgs ├── crispr-offtarget.md └── pacbio_genome_assembly.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.DS_Store 2 | -------------------------------------------------------------------------------- /SpatialTranscriptomics/Baysor/RunningBaysor.md: -------------------------------------------------------------------------------- 1 | 3_6_25 Billingsley 2 | 3 | ### Running Baysor for spatial cell segmentation 4 | 5 | 6 | For review 7 | 8 | https://www.10xgenomics.com/analysis-guides/using-baysor-to-perform-xenium-cell-segmentation 9 | 10 | https://kharchenkolab.github.io/Baysor/dev/ 11 | 12 | https://nanostring-biostats.github.io/CosMx-Analysis-Scratch-Space/posts/flat-file-exports/flat-files-compare.html/ 13 | 14 | _above has some description of AtoMx transcript file output_

15 | 16 | 17 | 18 | https://datadryad.org/stash/dataset/doi:10.5061/dryad.37pvmcvsg#readme/ 19 | 20 | *above has some description of Baysor output data in the segmentation.csv file*

21 | 22 | https://github.com/Bassi-git/Baysor_edit

23 | 24 | 25 | https://vimeo.com/558564804/ 26 | 27 | *above is a very helpful, short video*



28 | 29 | 30 | ### Running Baysor on CosMx output 31 | 32 | 33 | Basic usage 34 | 35 | 36 | 37 | Run Baysor on O2 from here: 38 | /n/app/bcbio/baysor
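Before the preview step it can save time to confirm that the binary actually runs from that install. A minimal sketch, assuming the executable sits under the directory above (the exact path is an assumption — adjust it to the real layout; the commands further down call it via the relative path `../bin/baysor/bin/baysor`):

```bash
# get a small interactive session first -- don't run Baysor on a login node
srun --pty -p interactive --mem 8G -t 0-02:00 /bin/bash

# check the binary is callable (path below is an assumption; adjust to the install layout)
/n/app/bcbio/baysor/bin/baysor --help
```

If `--help` is not recognized by your build, running the binary with no arguments should also print the usage.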

40 | 41 | 42 | 43 | First run **Baysor preview**. It requires a transcript coordinate file. With CosMx output the file will look something like this: BWH_20240509_WC0933_tx_file.csv.gz (unzipped and renamed here to tx_file.csv). 44 | 45 | 46 | Baysor will group spatially neighboring transcripts into Neighborhood Composition Vectors (NCVs), which can be viewed as "pseudocells", and will also estimate which transcripts are noise. It will perform unsupervised clustering of the NCVs, assigning them to clusters based on expression profile similarity. You can use NCVs much like single-cell data and do things like create UMAPs and perform marker identification. The NCVs don't have centroids, so you can't plot them spatially the same way you can plot cell centroids. The number of expected NCV clusters can be selected, or it will use a default of 4. Without even having cells called, this can give you an idea of the different cell types present. 47 | 48 | The output HTML file will show the individual transcript locations on a spatial plot. The NCVs are colored such that NCVs with similar expression profiles get similar colors. This can give you an idea of the different cell types present and their locations, and it provides a helpful reference when later choosing segmentation parameters so that segments come out at a sensible size. 49 | 50 | 51 | In the Baysor preview command, -x, -y, and -z point to their respective coordinate column names in the transcript file, -g gives the column holding the gene/target name, and -m gives the minimum number of transcripts to be included in an NCV "pseudocell". 52 | 53 | 54 | ```../bin/baysor/bin/baysor preview -x x_global_px -y y_global_px -z z -g target tx_file.csv -m 10 ```
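If the preview command errors on column names, a quick way to check the exact headers in the transcript file is with standard Unix tools (the values passed to -x/-y/-z/-g must match them exactly; file names below are just the example names used above):

```bash
# list the header columns of the gzipped CosMx transcript file, one per line
zcat BWH_20240509_WC0933_tx_file.csv.gz | head -n 1 | tr ',' '\n' | nl

# or, if you have already unzipped and renamed it as above
head -n 1 tx_file.csv | tr ',' '\n' | nl
```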



55 | 56 | 57 | 58 | Next, **Segmentation** can be done using a couple of different approaches. It can use prior information from previous segmentation analyses, or it can run without priors. It can use an image or other information as a prior. Here I use CosMx cell identifiers as a prior (each transcript is assigned to a called cell). You can select how much confidence to give the priors, from 0 (none) to 1 (full). (In an attempt to give full confidence, I tried running with --prior-segmentation-confidence 1 and it gave very odd results, so you might need to try something like 8 for high confidence.) 59 | 60 | 61 | -p will plot the segments to an HTML file; you'll need this. 62 | 63 | This example gives confidence of 8 to the CosMx cell calls; the cell IDs are found in the "cell" column of tx_file.csv and are indicated by :cell in the Baysor run command. I also set the number of unsupervised clusters to 13 here. 64 | 65 | 66 | ```../bin/baysor/bin/baysor run -x x_global_px -y y_global_px -z z -g target tx_file.csv -p --prior-segmentation-confidence 8 :cell -m 10 --n-clusters=13```
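On a full CosMx section the run step can take a while, so it is usually safer to submit it as a batch job rather than running it interactively. A minimal slurm wrapper around the same command — the partition, walltime, core and memory numbers below are placeholder guesses, not tested values:

```bash
#!/bin/bash
#SBATCH -p medium                  # placeholder partition; pick one that fits your runtime
#SBATCH -t 1-00:00                 # placeholder walltime; scale to your section size
#SBATCH -c 4                       # placeholder core count
#SBATCH --mem=64G                  # placeholder; large sections may need much more
#SBATCH -o baysor_run_%j.out
#SBATCH -e baysor_run_%j.err

# same run command as above, just wrapped for sbatch
../bin/baysor/bin/baysor run -x x_global_px -y y_global_px -z z -g target tx_file.csv \
    -p --prior-segmentation-confidence 8 :cell -m 10 --n-clusters=13
```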



67 | 68 | A second way to run Baysor is without a prior. This requires supplying an -s "scale" argument, which corresponds to the expected cell diameter. With my CosMx coordinate system (pixels), s = 5 was much, much too small and s = 150 was too large. You can view the segmentation output in the output HTML file and compare it to the preview HTML file to assess the quality of the segmentation. 69 | 70 | 71 | ```../bin/baysor/bin/baysor run -x x_global_px -y y_global_px -z z -g target tx_file.csv -p -s 50 -m 10 --n-clusters=13``` 72 | 73 | 74 | Output will look like this: 75 | 76 | -rw-rw-r-- 1 jmb17 bcbio 16467552 Feb 26 13:20 segmentation_borders.html
77 | -rw-rw-r-- 1 jmb17 bcbio 2986301 Feb 26 13:18 segmentation_cell_stats.csv
78 | -rw-rw-r-- 1 jmb17 bcbio 8105239 Feb 26 13:18 segmentation_counts.loom
79 | -rw-rw-r-- 1 jmb17 bcbio 374830291 Feb 26 13:18 segmentation.csv
80 | -rw-rw-r-- 1 jmb17 bcbio 1084869 Feb 26 13:18 segmentation_diagnostics.html
81 | -rw-rw-r-- 1 jmb17 bcbio 985 Feb 26 13:20 segmentation_log.log
82 | -rw-rw-r-- 1 jmb17 bcbio 651 Feb 26 10:59 segmentation_params.dump.toml
83 | -rw-rw-r-- 1 jmb17 bcbio 11374607 Feb 26 13:18 segmentation_polygons_2d.json
84 | -rw-rw-r-- 1 jmb17 bcbio 51628559 Feb 26 13:18 segmentation_polygons_3d.json
85 | 86 | segmentation_borders.html is the segmentation image. 87 | 88 | segmentation_cell_stats.csv is the called-cell metadata, which can be loaded into a Seurat object. It includes the x, y cell centroid coordinates and the cell area, which is very useful for filtering out unreasonably small or large segments. 89 | 90 | segmentation.csv contains the per-transcript quality metrics, which can be used for transcript filtering in Seurat. It also contains the x and y transcript coordinates and the cell and NCV assignments for each transcript, which can be used to build transcript-by-cell (or transcript-by-NCV) count matrices and, from those, Seurat objects. 91 | 92 | segmentation_polygons_2d.json can be used for plotting the segments.
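A couple of quick command-line sanity checks on these outputs before moving into Seurat (standard Unix tools only; exact column names and positions can differ between Baysor versions):

```bash
# number of called cells (header excluded)
tail -n +2 segmentation_cell_stats.csv | wc -l

# list the cell-metadata columns, e.g. to locate the area column referenced below
head -n 1 segmentation_cell_stats.csv | tr ',' '\n' | nl
```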



93 | 94 | 95 | After running baysor run, check the segmentation HTML image for segmentation quality and compare it to the preview output. 96 | 97 | Then also check the range of the segmentation_cell_stats area column (column 9): 98 | 99 | ```awk -F',' 'NR>1 {if(min==""){min=max=$9}; if($9<min){min=$9}; if($9>max){max=$9}} END {print "Min:", min, "Max:", max}' segmentation_cell_stats.csv``` 100 | 101 | 102 | My CosMx data use pixels, with a conversion of 0.12028 px per µm, so 0.12028^2 px per µm^2. Adjust the scale -s parameter if you're not returning the cell areas you expect in your experiment. 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | -------------------------------------------------------------------------------- /SpatialTranscriptomics/Baysor/readme.txt: -------------------------------------------------------------------------------- 1 | This directory contains files demonstrating how to run Baysor segmentation on imaging-based spatial transcriptomics data, how to load these data into a Seurat object, and how to use Baysor output for QC filtering of transcripts and cells 2 | -------------------------------------------------------------------------------- /SpatialTranscriptomics/readme.txt: -------------------------------------------------------------------------------- 1 | This directory contains knowledge regarding Spatial Transcriptomics analyses. 2 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-minimal 2 | google_analytics: UA-150923842-1 3 | -------------------------------------------------------------------------------- /admin/acknowledging_funding: -------------------------------------------------------------------------------- 1 | ## Adding funding section to reports: 2 | 3 | Information on how to acknowledge the core can be found in the individual [MOUs](https://www.dropbox.com/sh/vfnjq2buj3i329v/AADNClhaY6wwBnJu5l4b5CoKa?dl=0) for each institution. 4 | 5 | See this instruction to add the different funding information to reports for clients to be added to papers: http://bioinformatics.sph.harvard.edu/hbcABC/articles/general_start.html#adding-funding-to-template 6 | -------------------------------------------------------------------------------- /admin/archive_folders_to_standby.md: -------------------------------------------------------------------------------- 1 | # archive folders 2 | 3 | ## 1) Login to specified login node on O2, via ssh 4 | In the example below, my login ID is `jnh7` and the login node I am using is `login05.o2.rc.hms.harvard.edu` 5 | You will need to change `jnh7` to your ID to log in. 6 | 7 | Example command: 8 | `ssh -XY -l jnh7 login05.o2.rc.hms.harvard.edu` 9 | 10 | Login with your usual password and DUO challenge 11 | 12 | ## 2) Run tmux 13 | Running tmux will let you disconnect from your “session” while still letting your command run in the background. This is super useful if the command is going to take a long time to run and you need to shut down your computer or disconnect from the wifi for some reason 14 | 15 | The command below will set up a named tmux session called `foo`. Change `foo` to something that will help you remember what you are working on in this tmux session.
16 | 17 | Example command: 18 | `tmux new -s foo` 19 | 20 | 21 | ## 3) Start an interactive session 22 | 23 | You should run the folder compression in an interactive session instead of the login node as the login nodes are shared and running commands in them can slow down the cluster for everyone. As such, RC may kill commands that take too much memory on login nodes without notifying you. 24 | 25 | The command below will give you an interactive node with 8 gigs of RAM (`--mem 8000M`) for 8 hours (`-t 0-08:00`) 26 | 27 | Example command: 28 | `srun --pty -p interactive --mem 8000M --x11 -t 0-08:00 /bin/bash` 29 | 30 | 31 | ## 4) Compress the folder 32 | Go to the folder on O2 33 | In this example I am pretending to work on the project we did with Arlene Sharpe under the hbc code hbc03895 34 | This project is in the following folder: 35 | `/n/data1/cores/bcbio/PIs/arlene_sharpe/Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895` 36 | 37 | To compress it I can issue the following commands 38 | Go to the PI folder: 39 | `cd /n/data1/cores/bcbio/PIs/arlene_sharpe/` 40 | Compress the project folder with tar and gzip in a single command and 41 | a) use the same project name as part of the compressed file name **AND** 42 | b) add the date to the compressed file (I’ll use the date I wrote this document in the format YYYYMMDD, or 20201020) 43 | c) to ensure the compression finished, add the `--remove-files` option to the tar command; this will remove the files once the compression has succeeded. 44 | In general, the command to compress files would then look like this (`--remove-files` has to be before the other options as `-f` indicates a file name coming after): 45 | `tar --remove-files -cvzf YYYYMMDD_folder.tar.gz folder` 46 | 47 | 48 | And in the case of this Arlene Sharpe hbc03895 example, it would look like this: 49 | `tar --remove-files -cvzf 20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895` 50 | This will compress the folder and remove the original files (which will still be around in the backed up .snapshot folder for some weeks or months!) 51 | 52 | ## 5) Login to the transfer node 53 | The standby folder can only be accessed from the transfer node 54 | 55 | SSH to the transfer node 56 | `ssh transfer` 57 | Login with your password and DUO challenge, it shouldn’t require your login ID 58 | 59 | ## 6) Move compressed folder to the standby folder 60 | 61 | Our standby folder is located at `/n/standby/cores/bcbio/compute/archived_reports/tier2` 62 | You can see that there are already a bunch of tar gzipped folders in there. 63 | 64 | Move your newly compressed folder to the standby folder using rsync. 65 | rsync follows the general pattern of 66 | `rsync -options source destination` 67 | 68 | I typically use the following rsync options (as `-ravzuP`) 69 | * r = recursive, i.e.
transfer all the subfolders too 70 | * a = archive, preserve everything 71 | * v= verbose, print updates during tnansfer to screen 72 | * u = update, skip files on receiver that are newer (this is useful if we screwup and have already archived the folder) 73 | * P = show estimate progress as a bar or percentage transferred 74 | 75 | Here I will continue with the example from 5 using Arlene Sharpe’s analysis using our compressed folder as source and the standby folder as destination 76 | 77 | Example command: 78 | `rsync -ravzuP /n/data1/cores/bcbio/PIs/arlene_sharpe/20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz /n/standby/cores/bcbio/compute/archived_reports/tier2/` 79 | 80 | This will copy/sync the compressed folder over to the standby folder. 81 | 82 | Once the file is done copying/syncing, you can erase the source file 83 | 84 | Example command: 85 | `rm /n/data1/cores/bcbio/PIs/arlene_sharpe/20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz` 86 | 87 | ## 7) Link the standby copy of the compressed folder back to the PI folder 88 | This should help locate it again in the future 89 | We do this using a “symlink”, which is similar to a “Shortcut” in Windows or and “Alias” in OSX. 90 | The command to make a symlink in Unix is `ln -s` and typically has the format of `ln -s source destination` 91 | 92 | Using our example of the Sharpe analysis, the command would be: 93 | 94 | `ln -s /n/standby/cores/bcbio/compute/archived_reports/tier2/20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz /n/data1/cores/bcbio/PIS/arlene_sharpe/ 20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz` 95 | 96 | This will drop a symlink in the arelene_sharpe PI directory. W 97 | You can see that you have a symlink by running `ls -lh ` in the PI directory, you should see at least one line with the `20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz` file listed with an arrow beside it pointing to the standby folder 98 | i.e. `20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz -> /n/standby/cores/bcbio/compute/archived_reports/tier2/20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz` 99 | 100 | 101 | 102 | 103 | 104 | 105 | ## NOTES 106 | 107 | If you get disconnected from your session , you will be able to go back to this session by logging in as above in step 1 again and running the command to reconnect to tmux session `foo` 108 | 109 | Example command: 110 | `tmux attach -t foo` 111 | -------------------------------------------------------------------------------- /admin/chargeback_models.md: -------------------------------------------------------------------------------- 1 | # Parameters and limits for how we determine discounted rates for projects. 2 | 3 | ## NIEHS 4 | Discretionary. 5 | PI must have appointment to Chan Environmental Health department or be a member of the NIEHS Center. 6 | 7 | ## CFAR 8 | Some discretion. 9 | Generally limited to a maximum of 80 hours for junior PI analyses but can be less, depending on funding. 10 | Can support core infrastructure and expertise development for senior PIs. 11 | In all cases work must be HIV related. 
12 | 13 | ## CATALYST 14 | 5 hours total including initial emails, consults and analysis. 15 | Exceptions may be made in consultation with Shannan for: 16 | - Grant submissions 17 | - manuscript resubmissions 18 | - GEO (or other publication database) uploads 19 | 20 | ## HMS 21 | On quad researchers only 22 | Confirm via APEX ( apex.hms.harvard.edu) 23 | Generally 10 hours free and 30 hours at 50% rate 24 | Can vary depending on funding availability 25 | 26 | -------------------------------------------------------------------------------- /admin/consulting_resources.md: -------------------------------------------------------------------------------- 1 | ## Collaborating 2 | Presentation for Boston Women in Bioinformatics meetup by Dr. Eleanor Howe, founder and CEEO of [Diamond Age Data Science](https://diamondage.com/) on collaborating well. 3 | [Link to slide deck](https://www.dropbox.com/s/rrq7a3ozbmkvmul/Eleanor%20Howe%20-%20Collaboration%20as%20a%20bioinformatician.pdf?dl=1) 4 | 5 | ## Writing 6 | Slides From "Writing Clearly and Concisely (3 Parts)" by Donald Halstead, Lecturer on Epidemiology and Director of Writing Programs. 7 | 1) [Part1 - Writing Clearly](https://www.dropbox.com/s/fzr1nzda5ivs49t/201203_Writing_Clearly1.pdf?dl=1) 8 | 2) [Part2 - Writing Clearly, Voice Stress](https://www.dropbox.com/s/yor51v66ofxr9m9/201210_Writing_Clearly2_Voice_Stress.pdf?dl=1) 9 | 3) [Part3 - Writing Clearly, Combining Sentences](https://www.dropbox.com/s/4kbycikvg0wknkb/201217_Combining_Sentences.Handout.pdf?dl=1) 10 | - [Part3 - Sentence Connectors](https://www.dropbox.com/s/oa22q11a1pvr6as/Sentence_Connector_matrix_complete.pdf?dl=1) 11 | -------------------------------------------------------------------------------- /admin/getting_started.md: -------------------------------------------------------------------------------- 1 | https://github.com/hbc/hbc_admin/blob/master/Getting_Started.md 2 | -------------------------------------------------------------------------------- /admin/initial_consults.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Information related to initial consults 3 | description: This document contains information it is useful to get from initial consults 4 | category: admin 5 | subcategory: guide 6 | tags: [consults] 7 | --- 8 | 9 | # We need to talk about 10 | 11 | ## Who they are 12 | - PI for the lab 13 | - PI seniority (implications for HSCI funding eligibility) 14 | - institute 15 | - how they found out about us 16 | 17 | ## The actual science and biology 18 | - what the lab works on 19 | - what the researcher works on 20 | - applicability to disease (translational research) 21 | - is it stem cell related (potential HSCI funding) 22 | 23 | ## Technical issues 24 | - organism 25 | - type and number of comparisons 26 | - samples groups 27 | - technique 28 | - potential batches 29 | 30 | ## Funding 31 | - who is paying for the work 32 | - what commitments are entailed by that 33 | 34 | ## Time frame 35 | - is this urgent 36 | - for paper, manuscript or grant 37 | - upcoming meetings? 38 | - need to do quik QC before continuing with experiments? 39 | 40 | ## Who will be doing the work 41 | - advice or analysis? 
42 | - advice - important to assess skill of researcher, have clarity that we cannot train, debug code or mentor, can only point to resources 43 | - analysis - we will do it all and share results and code## 44 | 45 | ## Training needs 46 | - point to courses 47 | - point to materials 48 | 49 | ## Time estimate 50 | - when we can start, 51 | - how long it will take 52 | - how much it will cost (if paying) 53 | 54 | ## Authorship expectations 55 | - basic analysis - acknowlegement only, including any funding source 56 | - advanced analysis - middle author for analyst, acknowledgement for funding source 57 | - we follow standard practices on intellectual contributions and authorship 58 | - offering authorship shows you value our efforts, helps us attract and retain qualified personnel and helps ensure continued funding 59 | - we are skilled professionals and approach no project as routine, taking complete responsibility for the integrity of the data and the accuracy of the data analysis, viewing your projects as opportunities to engage intellectually and venues to develop professionally 60 | 61 | ## Process outline/ Next steps 62 | - Basecamp 63 | - MOU delivery 64 | - how billing works 65 | - will followup with email when analyst ready 66 | -------------------------------------------------------------------------------- /admin/reproducible_research.md: -------------------------------------------------------------------------------- 1 | # Guidelines for managing HBC Research Data 2 | 3 | [Docker cheatsheet](https://dockerlabs.collabnix.com/docker/cheatsheet/) 4 | 5 | ## Motivation 6 | To handle data in a manner that allows it to be FAIR, i.e. 7 | 8 | * Findable: associated with a unique identifier 9 | * Accessible: easily retrievable 10 | * Interoperable: "use and speak the same language" via use of standardized vocabularies 11 | * Reusable: adequately described to a new user, have clear information about data usage, and have a traceable "owner's manual" or provenance 12 | 13 | As a rule of thumb it may help to think of anything that is *only* on your local machine as being not reproducible or reusable. 14 | 15 | Many of the Core's standard operating procedures are geared towards reproducibility/reusability. Please try to adhere to the following general guidelines. 16 | 17 | 18 | ### O2 19 | In general, O2 is for the big stuff. Also, anything needed to reproduce the results on run on the server should be here. 20 | * The Core's shared space is located at `/n/data1/cores/bcbio/`. 21 | * We keep projects for researcher in the PIs folder 22 | * this folder has the following structure 23 | * PIfirstname_PIlastname/project_folder 24 | * ideally the project folder would match the github repo name and look similar to the Trello/Harvest name 25 | * it should at least contain the hbc code for tracking 26 | * Refer to the [Setting up an analysis guidelines](https://github.com/hbc/knowledgebase/blob/master/admin/setting_up_an_analysis_guidelines.md) for how to name directories and which folders to include in the project folder. 27 | * Store a copy of the yaml config, metadata csv file, and slurm script used to run the analysis along with the raw data so that someone else can access the project and rerun it if necessary. 28 | * It is also very helpful if you keep a copy of the bcbio object for downstream analysis with your data. 
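As a minimal illustration of the PI/project folder layout described in the bullets above (names are placeholders, and the subfolders simply echo the config/meta/data/docs directories used elsewhere in these notes; follow the setting-up-an-analysis guidelines for the exact set):

```bash
# create a new project folder under the shared PIs directory (illustrative names only)
cd /n/data1/cores/bcbio/PIs
mkdir -p jane_doe/doe_rnaseq_of_example_tissue_hbc01234/{config,meta,data,docs}
```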
29 | 30 | 31 | ### HBC org on github 32 | In general, code is for a continually updated, searchable record of your code, and will mainly be made up of the type of code you run locally. 33 | Commit the following: 34 | * Rmarkdowns for analysis 35 | * any auxillary scripts required to analyze the data and/or present it in finished form (i.e. data object conversion scripts, yaml files for bcbioRNAseq, etc.) 36 | 37 | ### Dropbox 38 | 39 | Use dropbox to share results and code with collaborators (`HBC Team Folder (1)/Consults/piFirstName_piLastName/project_folder`). 40 | * html files 41 | * Rmarkdown files used to generate the html files 42 | * text files and "small" processed data files (i.e. files included in the reports, such as normalized counts) 43 | * documents: manuscripts, extra metadata, presentations 44 | * anything else the collaborator may need to reproduce the R-based analysis *IF* they wish to reproduce it 45 | * *RNA-seq* example: create folders for QC, DE and FA. These may be in the base project folder, or if the project is complex, in folders labeled by dates. 46 | * Within the QC folder, store the QC Rmd and the html report 47 | * In the DE folder, store the DE Rmd file, the DE report, and the raw and normalized counts matrices. Also store the DESeq2 results tables with gene symbols added 48 | * In the FA folder, store the FA Rmd, the html report and the FA results tables 49 | 50 | ### Basecamp 51 | 52 | Use Basecamp for discussions of project progress. 53 | * Link to Dropbox for results 54 | * You may post small files on basecamp to share with collaborators 55 | 56 | ### Cases to discuss: 57 | 58 | * Many times, the client sends us a presentation (ppt/pptx) or a paper to better understand the project or to provide us with some information that may end up in the metadata. Where should we store these files? 59 | *John - a "docs" directory on O2 works for this* 60 | * Similarly, should we store original metadata files or only the csv file that we end up using. 61 | *John - I generally do, as it can contain important information or be easier to run past the original researcher. I mark it as "original" so I can clearly differentiate it from the metadata we ended up using. If available, I save any code I used to modify the original metadata with the metadata.* 62 | * Reviews. Consults that consist on reviewing the client paper (i.e code). Where to keep all the provided documents (if we have to keep them). 63 | *John - I could see keeping them on Dropbox, as we would typically want to share any reviewer responses with the researcher.* 64 | * In the case we have downloaded data from multiple flowcells, should we keep the originals or the concatenated files. Note that errors can happen when "preparing samples" (concatenating the wrong files for example). 65 | *John - would be great if we could only keep the concatenated files, but only after confirming no lane effects and proper concatenation.* 66 | -------------------------------------------------------------------------------- /admin/scripts_for_data_management_on_o2: -------------------------------------------------------------------------------- 1 | # data management on O2 sop 2 | 3 | ## General 4 | Make a folder with the current date in format to put results in 5 | Year-month-day (eg. 
2020-09-08) 6 | Code is in the /n/data1/cores/bcbio/PIs/size_monitoring/hmsrc_space_scripts folder 7 | 8 | ## Finding large and old folders 9 | "dir_sizes_owners.sh -$date“ 10 | Then run the R script dir_sizes_owners.R and point at the resulting tsv file 11 | 12 | `sbatch -t 12:00:00 -p short —wrap=“ bash /n/data1/cores/bcbio/PIs/size_monitoring/hmsrc_data_management/dir_sizes_owners.sh 2021-01-06”` 13 | 14 | ## Finding redundant versions of fastq and fq files 15 | 16 | ## Finding uncompressed fastq and fq files 17 | Run find_fastqs.sh 18 | `sbatch -t 24:00:00 -p medium --wrap="bash /n/data1/core/PIs/size_monitoring/hmsrc_data_management/find_fastqs.sh"` 19 | Then run fastq_dirs.sh to reduce down to the directories 20 | `bash fastq_dirs.sh < ../2021-01-29/fastq_files.txt | sort | uniq` 21 | 22 | ## Finding leftover work directories 23 | Run find_work_dirs.sh 24 | `sbatch -t 24:00:00 -p medium --wrap="bash /n/data1/core/PIs/size_monitoring/hmsrc_data_management/find_work_dirs.sh"` 25 | Run workdir_details.sh 26 | -------------------------------------------------------------------------------- /admin/using_globus: -------------------------------------------------------------------------------- 1 | 2 | ## Scripts for setting up transfers 3 | 4 | 5 | >Hi, 6 | 7 | I’ll be handling the data transfer. 8 | 9 | Currently, we have your data on HMS RC's O2 server. If you have an account there and appropriate space, we can simply transfer ownership to you there. You can then move it to your directory on O2 and decide whether to keep it on O2 (note that they will start charging for data storage in the summer) or move it to another server/drive. 10 | 11 | Alternatively, we can use the new Globus secure transfer service that HMS offers. It has worked quite well for us to transfer from O2 to external servers and individual laptops/computers. I'm attaching some of their guidance docs on setting up the client and initiating a transfer. We would need you to sign into the system with your Harvard ID first (see below) and give us your Globus user ID (see the “Account” section in the Globus interface) or the email you used to sign in. We'd then share the folder with you to access via Globus and you will receive an email with instructions. 12 | 13 | Globus LogIN help 14 | Go to https://www.globus.org/ and click on "LogIn/Use your existing organizational login/Harvard University", this will take you to Harvard Key. You can also login with your ecommons ID. If you don’t have Harvard Key you can create a globus id - (https://www.globusid.org/create). 15 | 16 | Please let us know if you have O2 access and if not, please share your Globus ID and we'll set up your data for transfer. 17 | 18 | 19 | Note, share these two docs: 20 | Globus-Connect-Personal-install-Windows - setting up the client 21 | GlobusOneTimeTransfer - initiate the data transfer 22 | 23 | (Note SelfServiceCollections is for us - how to setup a share. No need to give this doc to anyone.) 24 | 25 | 26 | Alternate script 27 | HI – 28 | 29 | With the help of HMS, we’ve been using globus to facilitate transferring the data. (globus.org) It requires you to sign up for a free account and install their client on your system (mac/windows, etc.). We’ll give you access to the data on the HMS cluster from your globus account and globus can transfer it to whatever storage you have access to at MGH, on an external drive, etc. (ie it’s a glorified FTP, file transfer capability and takes care of error recovery & handling etc.) 
30 | 31 | If this sounds good, let us know your globus account info and we can go from there. 32 | 33 | Thanks, 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | ## Potential issues 42 | ### Permissions - initiator needs proper permissions to actually share data 43 | ### Non-standard characters in filenames 44 | - files whose names contain non-alphanumeric characters (eg. ; ? =) may be rejected by windows machines during transfer 45 | ### Changes in files on host or client 46 | - due to Globus monitoring for file integrity, changes that are made to files or directories on either side during transfer can interrupt the transfer 47 | 48 | 49 | ## Notes 50 | * Email from Globus only contains share information, unclear where the researcher learns about the app 51 | * CAn add additional endpoint folders through Preferences/Options on OSX and Windows. 52 | * Console interface or client may not tell you that transfer is complete 53 | 54 | -------------------------------------------------------------------------------- /bcbio/bcbio_genomes.md: -------------------------------------------------------------------------------- 1 | **Homo Sapiens + Covid19** 2 | 3 | *GRCh38_SARSCov2 * - built from ensembl 4 | - Assembly GRCh38, release 99. 5 | - Files: 6 | - Genomic sequence: ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 7 | - Annotation: ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz 8 | - Covid sequence and annotation added: https://www.ncbi.nlm.nih.gov/nuccore/MN988713.1?report=GenBank 9 | 10 | **Mus musculus** 11 | 12 | *GRCm38_98* - built from ensembl 13 | - Assembly GRCm38, release 98. 14 | - Files: 15 | - Genomic sequence: ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz 16 | 17 | - Annotation: ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz 18 | 19 | **Caenorhabditis elegans** 20 | 21 | *WBcel235_WS272* - built from wormbase 22 | - Assembly WBcel235, relase WS272. Project PRJNA13578 (N2 strain) 23 | - Files: 24 | - Genomic sequence: ftp://ftp.wormbase.org/pub/wormbase/releases/WS272/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS272.genomic.fa.gz 25 | 26 | - Annotation: ftp://ftp.wormbase.org/pub/wormbase/releases/WS272/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS272.canonical_geneset.gtf.gz 27 | 28 | **Drosophila melanogaster** 29 | 30 | *DGP6* - built from Flybase 31 | - has a different format for annotations for non-coding genes in the gtf 32 | - only protein coding genes will make it into Salmon and downstream 33 | 34 | *DGP6.92* - built from Ensembl info 35 | - will have all non-coding RNAs in Salmon and downstream results 36 | - shows lower gene detection rates than Flybase 37 | 38 | **Updating supported transcriptomes** 39 | 1. clone cloudbiolinux 40 | 2. update transcriptome 41 | ```bash 42 | bcbio_python cloudbiolinux/utils/prepare_tx_gff.py --cores 8 --gtf Macaca_mulatta.Mmul_8.0.1.95.chr.gtf.gz --fasta /n/app/bcbio/biodata/genomes/Mmulatta/mmul8noscaffold/seq/mmul8noscaffold.fa Mmulatta mmul8noscaffold 43 | ``` 44 | 3. upload the xz file to the bucket 45 | ```bash 46 | aws s3 cp hg19-rnaseq-2019-02-28_75.tar.xz s3://biodata/annotation/ --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers full=emailaddress=chapmanb@50mail.com 47 | ``` 48 | 4. edit cloudbiolinux ggd transcripts.yaml recipe to point to the new file uploaded on the bucket 49 | 5. 
edit the cloudbiolinux ggd gtf.yaml to show where you got the GTF from and what you did to it 50 | 6. test before pushing 51 | ```bash 52 | mkdir tmpbcbio-install 53 | ln -s `pwd`/cloudbiolinux tmpbcbio-install/cloudbiolinux 54 | log into bcbio user: sudo -su bcbio /bin/bash 55 | bcbio_nextgen.py upgrade --data 56 | ``` 57 | 7. push changes back to cloudbiolinux 58 | 59 | **Factual list of genomes in O2:/n/shared_db/bcbio/biodata/genomes as of 2020-03-13** 60 | ``` 61 | . 62 | ├── Ad37 63 | │   ├── GW7619026 64 | │   └── GW76-19026 65 | ├── Adenovirus 66 | │   └── Ad37 67 | ├── Amexicanus 68 | │   └── Amexicanus2 69 | ├── Amis 70 | │   ├── ASM28112v4 71 | │   └── ASM28112v4.a 72 | ├── Anidulans 73 | │   └── FGSC_A4 74 | ├── Atta_cephalotes 75 | │   └── Attacep1.0 76 | ├── bcbiotx 77 | ├── Btaurus 78 | │   └── UMD3.1 79 | ├── Celegans 80 | │   ├── WBcel235 81 | │   ├── WBcel235_90 82 | │   ├── WBcel235_raw 83 | │   └── WBcel235_WS272 84 | ├── Dmelanogaster 85 | │   ├── BDGP6 86 | │   ├── BDGP6.15 87 | │   ├── BDGP6.19 88 | │   ├── BDGP6.92 89 | │   ├── flybase 90 | │   └── flybase_dmel_r6.28 91 | ├── Drerio 92 | │   ├── Zv10 93 | │   ├── Zv11 94 | │   └── Zv9 95 | ├── Ecoli 96 | │   ├── EDL933 97 | │   ├── k12 98 | │   ├── MB0009 99 | │   ├── MB2409 100 | │   ├── MB2455 101 | │   ├── MG1655 102 | │   ├── MG1655_v2 103 | │   ├── MG1655_virus 104 | │   ├── MG1655_wrong_name 105 | │   └── NC_000913.3 106 | ├── Gallus_gallus 107 | │   └── galgal5 108 | ├── gdc-virus 109 | │   └── gdc-virus-hsv 110 | ├── haD37 111 | │   └── DQ900900.1 112 | ├── Hsapiens 113 | │   ├── GRCh37 114 | │   ├── hg19 115 | │   ├── hg19-ercc 116 | │   ├── hg19-mt 117 | │   ├── hg19-subset 118 | │   ├── hg19-test 119 | │   └── hg38 120 | ├── humanAd37 121 | │   └── Ad37.hg19 122 | ├── kraken 123 | │   ├── bcbio 124 | │   ├── micro 125 | │   ├── minikraken_20141208 126 | │   ├── minimal 127 | │   └── old_20141302 128 | ├── Lafricana 129 | │   └── loxAfr3 130 | ├── Macaca 131 | │   ├── Mfascicularis 132 | │   ├── Mmul8 133 | │   └── mmul8noscaffold 134 | ├── Mmulatta 135 | │   ├── mmul8 136 | │   └── mmul8noscaffold 137 | ├── Mmusculus 138 | │   ├── cloudbiolinux 139 | │   ├── GRCm38_90 140 | │   ├── GRCm38_98 141 | │   ├── greenberg-mm9 142 | │   ├── mm10 143 | │   └── mm9 144 | ├── Oaires 145 | │   └── Oar_v31 146 | ├── phiX174 147 | │   └── phix 148 | ├── Pintermedia 149 | │   └── ASM195395v1 150 | ├── Rnorvegicus 151 | │   └── rn6 152 | ├── Scerevisiae 153 | │   └── sacCer3 154 | ├── Spombe 155 | │   ├── ASM284v2.25 156 | │   └── ASM284v2.30 157 | ├── spombe 158 | │   └── ASM294v2 159 | ├── Sscrofa 160 |    ├── ss11.1 161 |    └── Sscrofa10.2 162 | ``` 163 | 164 | **How to install a custom genome in O2** 165 | - `sudo -su bcbio /bin/bash` 166 | - `cd /n/app/bcbio` 167 | - https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#reference-genome-files 168 | - https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#adding-custom-genomes 169 | 170 | **Workflow4: Whole genome trio (50x) - hg38** 171 | 172 | Inputs (FASTQ files) and results (BAM files, etc) of the [whole genome BWA alignment and GATK variant calling workflow](https://bcbio-nextgen.readthedocs.io/en/latest/contents/germline_variants.html#workflow4-whole-genome-trio-50x-hg38) are stored in `/n/data1/cores/bcbio/shared/NA12878-trio-eval` 173 | 174 | **Use an updated hg38 transcriptome** 175 | ```bash 176 | wget ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.101.gtf.gz 177 | 
gtf=Homo_sapiens.GRCh38.101.chr.gtf.gz 178 | remap_url=http://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh38_ensembl2UCSC.txt 179 | wget --no-check-certificate -qO- $remap_url | awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed 180 | gzip -cd ${gtf} | sed -f remap.sed | grep -v "*_*_alt" > hg38-remapped.gtf 181 | ``` 182 | Then pass `hg38-remapped.gtf` as the `transcriptome_gtf` option. 183 | -------------------------------------------------------------------------------- /bcbio/bcbio_tips.md: -------------------------------------------------------------------------------- 1 | ## Installing a private bcbio development repository on O2 2 | ```bash 3 | wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py 4 | python bcbio_nextgen_install.py ${HOME}/local/share/bcbio --tooldir=${HOME}/local --nodata 5 | ln -s /n/app/bcbio/biodata/genomes/ ${HOME}/local/share/genomes 6 | mkdir -p ${HOME}/local/share/galaxy 7 | ln -s /n/app/bcbio/biodata/galaxy/tool-data ${HOME}/local/share/galaxy/tool-data 8 | export PATH="${HOME}/local/bin:$PATH" 9 | ``` 10 | 11 | ## How to fix potential conda errors during installation 12 | Add the following to your `${HOME}/.condarc`: 13 | ```yaml 14 | channels: 15 | - bioconda 16 | - defaults 17 | - conda-forge 18 | safety_checks: disabled 19 | add_pip_as_python_dependency : false 20 | rollback_enabled: false 21 | notify_outdated_conda: false 22 | ``` 23 | 24 | ## Using shared bcbio installation on O2 25 | To use bcbio installation in `/n/app/bcbio` add the corresponding tool and Conda directories to your `$PATH`: 26 | ```shell 27 | export PATH="/n/app/bcbio/tools/bin:/n/app/bcbio/dev/anaconda/bin:${PATH}" 28 | ``` 29 | 30 | ## How to fix jobs bcbio jobs timing out 31 | The O2 cluster can take a really long time to schedule jobs. 
If you are having problems with bcbio timing out, set your --timeout parameter to something high, like this: 32 | ```bash 33 | /n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py ../config/bcbio_ensembl.yaml -n 72 -t ipython -s slurm -q short -r --tag feany --timeout 6000 -t 0-11:00 34 | ``` 35 | 36 | ## How to run a one-node bcbio job (multicore, not multinode) 37 | it just runs a bcbio job on one node of the cluster (no IPython) 38 | [More slurm options](https://wiki.rc.hms.harvard.edu/display/O2/Using+Slurm+Basic#UsingSlurmBasic-sbatchoptionsquickreference) 39 | 40 | ``` 41 | #!/bin/bash 42 | 43 | # https://slurm.schedmd.com/sbatch.html 44 | 45 | #SBATCH --partition=priority # Partition (queue) 46 | #SBATCH --time=3-00:00 # Runtime in D-HH:MM format 47 | #SBATCH --job-name=bcbio # Job name - any name 48 | #SBATCH -c 10 # cores per task 49 | #SBATCH --mem-per-cpu=10G # Memory needed per CPU or use --mem to limit total memory 50 | #SBATCH --output=project_%j.out # File to which STDOUT will be written, including job ID 51 | #SBATCH --error=project_%j.err # File to which STDERR will be written, including job ID 52 | #SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL) by default goes to the email associated with O2 accounts 53 | #SBATCH --mail-user=abc123@hms.harvard.edu # Email to which notifications will be sent 54 | 55 | bcbio_nextgen.py ../config/illumina_rnaseq.yaml -n 10 56 | ``` 57 | 58 | ## Upgrading shared installation of bcbio on O2 59 | How to upgrade `/n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py` installation: 60 | * switch to `bcbio` user account: 61 | ``` 62 | sudo -su bcbio 63 | ``` 64 | * make sure `umask` is set correctly: 65 | ``` 66 | umask 0002 67 | ``` 68 | * edit `/n/app/bcbio/bcbio.upgrade.sh`: set `--mail-user` and other options as necessary 69 | * run the upgrade: 70 | ``` 71 | sbatch /n/app/bcbio/bcbio.upgrade.sh 72 | ``` 73 | * copy install log (job output) to `/n/app/bcbio/bcbio.upgrade.sh_YYYY-MM-DD.{err,out}` where YYYY-MM-DD is today's date 74 | * test the installation 75 | 76 | ## conda tricks 77 | Packages dependent on a given one: 78 | ``` 79 | grep r-base /n/app/bcbio/dev/anaconda/pkgs/*/info/index.json 80 | ``` 81 | -------------------------------------------------------------------------------- /bcbio/bcbio_workflow_mary.md: -------------------------------------------------------------------------------- 1 | # Bcbio workflow by Mary Piper 2 | https://github.com/marypiper/bcbio_rnaseq_workflow/blob/master/bcbio_rna-seq_workflow.md 3 | 4 | **Documentation for bcbio:** [bcbio-nextgen readthedocs](http://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html#rna-seq) 5 | 6 | ## Set-up 7 | 1. Follow instructions for starting an analysis using https://github.com/hbc/knowledgebase/blob/master/admin/setting_up_an_analysis_guidelines.md. 8 | 9 | 3. 
Download fastq files from facility to data folder 10 | 11 | - Download fastq files from a non-password protected url 12 | - `wget --mirror url` (for each file of sample in each lane) 13 | - Rory's code to concatenate files for the same samples on multiple lanes: 14 | 15 | barcodes="BC1 BC2 BC3 BC4" 16 | for barcode in $barcodes 17 | do 18 | find folder -name $barcode_*R1.fastq.gz -exec cat {} \; > data/${barcode}_R1.fastq.gz 19 | find folder -name $barcode_*R2.fastq.gz -exec cat {} \; > data/${barcode}_R2.fastq.gz 20 | done 21 | 22 | - Download from password protected FTP such as Dana Farber 23 | - `wget -r --user --password ` 24 | 25 | - Download fastq files from BioPolymers: 26 | - `rsync -avr username@bpfngs.med.harvard.edu:./folder_name .` 27 | 28 | --OR-- 29 | 30 | - `sftp username@bpfngs.med.harvard.edu` 31 | - `cd` to correct folder 32 | - `mget *.tab` 33 | - `mget *.bz2` 34 | 35 | - Download from the Broad using Aspera: 36 | - To download data I use this [script](https://github.com/marypiper/bcbio_rnaseq_workflow/blob/master/aspera_connect_lsf). 37 | 38 | 4. Create metadata in Excel create sym links by concatenate("ln -s ", column $A2 with path_to_where_files_are_stored, " ", column with name of sym link $D2). Can extract parts of column using delimiters in Data tab column to text. 39 | 40 | 5. Save Excel as text and replace ^M with new lines in vim: 41 | 42 | `:%s//\r/g` 43 | 44 | 6. Settings for bcbio- make sure you have following settings in `~/.bashrc` file: 45 | 46 | ```bash 47 | unset PYTHONHOME 48 | unset PYTHONPATH 49 | export PATH=/n/app/bcbio/tools/bin:$PATH 50 | ``` 51 | 52 | 7. Within the `meta` folder, add your comma-separated metadata file (`projectname_rnaseq.csv`) 53 | - first column is `samplename` and is the names of the fastq files as they appear in the directory (should be the file name without the extension (no .fastq or R#.fastq for paired-end reads)) 54 | - second column is `description` and is unique names to call samples - provide the names you want to have the samples called by 55 | - **FOR CHIP-SEQ** need additional columns: 56 | - `phenotype`: `chip` or `input` for each sample 57 | - `batch`: batch1, batch2, batch3, ... for grouping each input with it's appropriate chip(s) 58 | - additional specifics regarding the metadata file: [http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration](http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration) 59 | 60 | 8. 
Within the `config` folder, add your custom Illumina template 61 | - Example template for human RNA-seq using Illumina prepared samples (genome_build for mouse = mm10, human = hg19 or hg38 (need to change star to hisat2 if using hg38): 62 | 63 | ```yaml 64 | # Template for mouse RNA-seq using Illumina prepared samples 65 | --- 66 | details: 67 | - analysis: RNA-seq 68 | genome_build: mm10 69 | algorithm: 70 | aligner: star 71 | quality_format: standard 72 | strandedness: firststrand 73 | tools_on: bcbiornaseq 74 | bcbiornaseq: 75 | organism: mus musculus 76 | interesting_groups: [genotype] 77 | upload: 78 | dir: /n/data1/cores/bcbio/PIs/vamsi_mootha/hbc_mootha_rnaseq_of_metabolite_transporter_KO_mouse_livers_hbc03618_1/bcbio_final 79 | ``` 80 | 81 | - List of genomes available can be found by running `bcbio_setup_genome.py` 82 | - strandedness options: `unstranded`, `firststrand`, `secondstrand` 83 | - Additional parameters can be found: [http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration](http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration) 84 | - Best practice templates can be found: [https://github.com/chapmanb/bcbio-nextgen/tree/master/config/templates](https://github.com/chapmanb/bcbio-nextgen/tree/master/config/templates) 85 | 86 | 87 | 9. Within the `data` folder, add all your fastq files to analyze. 88 | 89 | ## Analysis 90 | 91 | 1. Go to `/n/scratch2/your_ECommonsID/PI` and create an `analysis` folder. Change directories to `analysis` folder and create the full Illumina instructions using the Illumina template created in Set-up: step #6. 92 | - `srun --pty -p interactive -t 0-12:00 --mem 8G bash` start interactive job 93 | - `cd path-to-folder/analysis` change directories to analysis folder 94 | - `bcbio_nextgen.py -w template /n/data1/cores/bcbio/PIs/path_to_templates/star-illumina-rnaseq.yaml /n/data1/cores/bcbio/PIs/path_to_meta/*-rnaseq.csv /n/data1/cores/bcbio/PIs/path_to_data/*fastq.gz` run command to create the full yaml file 95 | 96 | 2. Create script for running the job (in analysis folder) 97 | 98 | For a larger job: 99 | 100 | ```bash 101 | #!/bin/sh 102 | #SBATCH -p medium 103 | #SBATCH -J mootha 104 | #SBATCH -o run.o 105 | #SBATCH -e run.e 106 | #SBATCH -t 0-100:00 107 | #SBATCH --cpus-per-task=1 108 | #SBATCH --mem-per-cpu=8G 109 | #SBATCH --mail-type=ALL 110 | #SBATCH --mail-user=piper@hsph.harvard.edu 111 | 112 | export PATH=/n/app/bcbio/tools/bin:$PATH 113 | 114 | /n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py ../config/\*\_rnaseq.yaml -n 48 -t ipython -s slurm -q medium -r t=0-100:00 --timeout 300 --retries 3 115 | ``` 116 | 117 | For a smaller job, it might be faster in overall time to just run the job on the priority queue. If you only have a few samples, and your fairshare score is low, running on the priority queue could end up being faster since you will quickly get a job there and not have to wait. 118 | 119 | ```bash 120 | #!/bin/sh 121 | #SBATCH -p priority 122 | #SBATCH -J mootha 123 | #SBATCH -o run.o 124 | #SBATCH -e run.e 125 | #SBATCH -t 0-100:00 126 | #SBATCH --cpus-per-task=8 127 | #SBATCH --mem-per-cpu=64G 128 | #SBATCH --mail-type=ALL 129 | #SBATCH --mail-user=piper@hsph.harvard.edu 130 | export PATH=/n/app/bcbio/tools/bin:$PATH 131 | /n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py ../config/\*\_rnaseq.yaml -n 8 132 | ``` 133 | 134 | 3. 
Go to work folder and start the job - make sure in an interactive session 135 | 136 | ```bash 137 | cd /n/scratch2/path_to_folder/analysis/\*\_rnaseq/work 138 | sbatch ../../runJob-\*\_rnaseq.slurm 139 | ``` 140 | 141 | ### Exploration of region of interest 142 | 143 | 1. The bam files will be located here: `path-to-folder/*-rnaseq/analysis/*-rnaseq/work/align/SAMPLENAME/NAME_*-rnaseq_star/` # needs to be updated 144 | 145 | 2. Extracting interesting region (example) 146 | - `samtools view -h -b sample1.bam "chr2:176927474-177089906" > sample1_hox.bam` 147 | 148 | - `samtools index sample1_hox.bam` 149 | 150 | 151 | ## Mounting bcbio 152 | 153 | `sshfs mp298@transfer.orchestra.med.harvard.edu:/n/data1/cores/bcbio ~/bcbio -o volname=bcbio -o follow_symlinks` 154 | -------------------------------------------------------------------------------- /bcbio/building_a_hybrid_murine_transgene_reference_genome.md: -------------------------------------------------------------------------------- 1 | ## Building hybrid murine/transgene reference genome 2 | 3 | Heather Wick 4 | 5 | This document is based on [prior contributions by James Billingsley](https://github.com/hbc/knowledgebase/blob/master/bcbio/Creating_Hybrid_Mammal_Viral_Reference_Genome.md) 6 | 7 | 1) Download reference genome from ensembl 8 | * Example: latest mouse reference (GRCm9) fasta (primary assembly) and gtf downloaded [here](http://useast.ensembl.org/Mus_musculus/Info/Index![image](https://github.com/hbc/knowledgebase/assets/33556230/98d91abd-5cd9-4651-b541-48e3bf413483) 9 | ) 10 | 11 | 2) Acquire transgenes and format to match standard fasta format 12 | * In this case, genes were provided by the client 13 | * For our purposes, each transgene was considered to be on its own chromosome, the length of which was the length of the individual gene 14 | * Editing was done via plain text editor 15 | 16 | Format: 17 | ``` 18 | > [GENE_NAME] dna:chromosom chromosome:[GENOME]:[GENE_NAME]:[CHR_START]:[CHR_END]:1 REF 19 | BASEPAIRS_ALLCAPS_60_CHARACTERS_WIDE 20 | ``` 21 | Example: 22 | ``` 23 | >H2B-GFP dna:chromosome chromosome:GRCm39:H2B-GFP:1:1116:1 REF 24 | ATGCCAGAGCCAGCGAAGTCTGCTCCCGCCCCGAAAAAGGGCTCCAAGAAGGCGGTGACT 25 | AAGGCGCAGAAGAAAGGCGGCAAGAAGCGCAAGCGCAGCCGCAAGGAGAGCTATTCCATC 26 | TATGTGTACAAGGTTCTGAAGCAGGTCCACCCTGACACCGGCATTTCGTCCAAGGCCATG 27 | GGCATCATGAATTCGTTTGTGAACGACATTTTCGAGCGCATCGCAGGTGAGGCTTCCCGC 28 | CTGGCGCATTACAACAAGCGCTCGACCATCACCTCCAGGGAGATCCAGACGGCCGTGCGC 29 | CTGCTGCTGCCTGGGGAGTTGGCCAAGCACGCCGTGTCCGAGGGTACTAAGGCCATCACC 30 | AAGTACACCAGCGCTAAGGATCCACCGGTCGCCACCATGGTGAGCAAGGGCGAGGAGCTG 31 | TTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTC 32 | AGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATC 33 | TGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGC 34 | GTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCC 35 | ATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAG 36 | ACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGC 37 | ATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGC 38 | CACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATC 39 | CGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCC 40 | ATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTG 41 | AGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCC 42 | GGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA 43 | ``` 44 | 45 | 3) Create GTF for transgenes 46 | * [Information about GTF file format can be found here](https://useast.ensembl.org/info/website/upload/gff.html) 47 | * For our 
purposes, each transgene was considered to be on its own chromosome, the length of which was the length of the individual gene 48 | * There are only 9 columns. The Attributes in the last column are separated by semicolons/spaces, not tabs. 49 | 50 | Format: 51 | ``` 52 | GENE_NAME SOURCE FEATURE CHR_START CHR_END SCORE STRAND FRAME ATTRIBUTES;SEPARATED;BY;SEMI-COLONS;NOT;TABS!; 53 | ``` 54 | Example: 55 | ``` 56 | H2B-GFP unknown exon 1 1116 . + . gene_id "H2B-GFP"; transcript_id "H2B-GFP"; gene_name "H2B-GFP"; gene_biotype "protein_coding"; 57 | ``` 58 | 59 | 4) Concatenate reference fasta and gtf with transgene fasta and gtfs 60 | 61 | Format: 62 | ``` 63 | cat GENOME.dna.primary_assembly.fa TRANSGENE.fa > GENOME.dna.primary_assembly_TRANSGENE.fa 64 | cat GENOME.gtf TRANSGENE.gtf > GENOME_TRANSGENE.gtf 65 | ``` 66 | Example (two transgenes were added): 67 | ``` 68 | cat Mus_musculus.GRCm39.dna.primary_assembly.fa H2B-GFP.fa tTA.fa > Mus_musculus.GRCm39.dna.primary_assembly_GFP_tTA.fa 69 | cat Mus_musculus.GRCm39.110.gtf H2B-GFP.gtf tTA.gtf > Mus_musculus.GRCm39.110_GFP_tTA.gtf 70 | ``` 71 | * Check your formatting!! Sometimes extra new lines or tabs are easy to accidentally add 72 | 73 | 5) Create folder for new reference genome 74 | ``` 75 | sudo -su bcbio /bin/bash 76 | cd /n/app/bcbio/1.2.9/genomes/Mmusculus/ 77 | mkdir GRCm39 78 | ``` 79 | * If you wish to move your new fasta and gtf to the new directory, you may need to move the files to your home directory, then use sudo to sign in as bcbio before copying them to their final location. You may also need to log into an interactive session because the files are quite large 80 | ``` 81 | cd GRCm39 82 | srun --pty -p interactive --mem 500M -t 0-06:00 83 | mv /path/to/home/dir/GENOME.dna.primary_assembly_TRANSGENE.fa . 84 | mv /path/to/home/dir/GENOME_TRANSGENE.gtf . 
85 | ``` 86 | 87 | 6) Run bcbio_setup_genome.py 88 | * Process is too long to run interactively, so use a script: 89 | 90 | Format: 91 | ``` 92 | #!/bin/bash 93 | #SBATCH -t 5-00:00 # Runtime in D-HH:MM format 94 | #SBATCH --job-name=genome_setupbcbio # Job name 95 | #SBATCH -c 10 # cores 96 | #SBATCH -p medium 97 | #SBATCH --mem-per-cpu=5G # Memory needed per CPU or --mem 98 | #SBATCH --output=project_%j.out # File to which STDOUT will be written, including job ID 99 | #SBATCH --error=project_%j.err # File to which STDERR will be written, including job ID 100 | #SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL) 101 | #SBATCH --mail-user=[USER]k@hsph.harvard.edu 102 | 103 | #this script submits bcbio_setup_genome.py 104 | #must sudo into bcbio first: 105 | sudo -su bcbio /bin/bash 106 | 107 | date 108 | 109 | bcbio_setup_genome.py -f GENOME.dna.primary_assembly_TRANSGENE.fa -g GENOME_TRANSGENE.gtf -i bwa star seq -n SPECIES -b GENOME --buildversion BUILD 110 | 111 | date 112 | ``` 113 | Example of actual script can be found here: `/n/app/bcbio/1.2.9/genomes/Mmusculus/GRCm39_GFP_tTA/submit_genome_setup.sh` 114 | 115 | -------------------------------------------------------------------------------- /bcbio/git.md: -------------------------------------------------------------------------------- 1 | # Git tips 2 | 3 | - [Pro git book](https://git-scm.com/book/en/v2) 4 | - https://github.com/Kunena/Kunena-Forum/wiki/Create-a-new-branch-with-git-and-manage-branches 5 | - https://nvie.com/posts/a-successful-git-branching-model/ 6 | - http://sandofsky.com/blog/git-workflow.html 7 | - https://blog.izs.me/2012/12/git-rebase 8 | 9 | # Sync with upstream/master, delete all commits in origin/master branch 10 | ``` 11 | git checkout master 12 | git reset --hard upstream/master 13 | git push --force 14 | ``` 15 | 16 | # Sync with upstream/master 17 | ``` 18 | git fetch upstream 19 | git checkout master 20 | git merge upstream/master 21 | ``` 22 | 23 | # Feature workflow 24 | ``` 25 | git checkout -b feature_branch 26 | # 1 .. N 27 | git add -A . 28 | git commit -m "sync" 29 | git push? 30 | 31 | git checkout master 32 | git merge --squash private_feature_branch 33 | git commit -v 34 | git push 35 | # pull request to upstream 36 | # code review 37 | # request merged 38 | git branch -d feature_branch 39 | git push origin :feature_branch 40 | ``` 41 | 42 | # Migrating github.com repos to [code.harvard.edu](https://code.harvard.edu/) 43 | 44 | 1. Set up your ssh keys. You can use your old keys (if you remember your passphrase) by going to `Settings --> SSH and GPG keys --> New SSH key` 45 | 2. Create your repo in code.harvard.edu. Copy the 'Clone with SSH link`: `git@code.harvard.edu:HSPH/repo_name.git` (*NOTE: some of us have had trouble with the HTTPS link*) 46 | 3. Go to your local repo that you would like to migrate. Enter the directory. 47 | 48 | ``` 49 | # this will add a second remote location 50 | git remote add harvard git@code.harvard.edu:HSPH/repo_name.git 51 | 52 | # this will get rid of the old origin remote 53 | git push -u harvard --all 54 | ``` 55 | 56 | 4. You should see the contents of your local repo in Enterprise. Now go to 'Settings' for the repo and 'Collaborators and Teams'. Here you will need to add Bioinformatics Core and give 'Admin' priveleges. 57 | 58 | 59 | > **NOTE:** If you decide to compile all your old repos into one giant repo (i.e. 
[hbc_mistrm_reports_legacy](https://code.harvard.edu/HSPH/hbc_mistrm_reports_legacy)), make sure that you remove all `.git` folders from each of them before committing. Otherwise you will not be able to see the contents on each folder on Enterprise. 60 | 61 | -------------------------------------------------------------------------------- /bcbio/gtf_gff_validator.md: -------------------------------------------------------------------------------- 1 | # GTF and GFF validators 2 | 3 | > ### Use case example: bosTau9 genome 4 | Building a new genome in bcbio. Reference files were retrieved from NCBI (RefSeq genome and gtf files). There are some additional whitespaces in the file causing errors in the build. Solution: download the gff file instead and validate using `gff3validator` and use as input to bcbio with the added parameter `-gff3`. More info on genometools installs and commands found below. Another option woud be to use the GTF validator (perl-based), also listed below. 5 | 6 | > **NOTE:** Initially found during Moazed consult 7 | 8 | 9 | ## GFF Validator 10 | - includes gtf to gff converter 11 | - Download latest from: http://genometools.org/pub/ 12 | - Documentation: http://genometools.org/tools.html 13 | 14 | ### Usage 15 | ```{bash, eval=FALSE} 16 | {installed_path}/gt -help 17 | {installed_path}/gt gff3validator {gff_file} 18 | {installed_path}/gt gtf_to_gff3 {gtf_file} 19 | {installed_path}/gt gff3_to_gtf {gff_file} 20 | ``` 21 | ### Errors 22 | Report bugs to https://github.com/genometools/genometools/issues. 23 | 24 | ## GTF validator (perl-based) 25 | - Download: https://mblab.wustl.edu/software.html#evalLink (press 'RELEASES") 26 | - Documentation: https://mblab.wustl.edu/media/software/eval-documentation.pdf 27 | 28 | ### Usage 29 | let's say tar is downloaded and extracted at **/home/eval-2.2.8** 30 | 31 | That folder is noted as **{eval}** in the code: 32 | 33 | ```{bash, eval=FALSE} 34 | perl -I {eval} {eval}/validate_gtf.pl -f {gtf_file} {fasta_file_associated_with_the_gtf} 35 | ``` 36 | '-f' is an option, it creates a fixed file with same title as the origial gtf with '.fixed.gtf' extension. 37 | A custom hg38 gtf ran for an hour. 38 | Memory ran out with 8GB for some reason, so I ran with 64GB just in case. Since it might be using information from the genome.fa extensively. 39 | -------------------------------------------------------------------------------- /chipseq/bcbio_output_summary.sh: -------------------------------------------------------------------------------- 1 | #/bin/bash 2 | # Written by Will Gammerdinger at HSPH on September 15th, 2022. 
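# Added summary: this script pulls per-sample metrics (total/mapped reads, mapping %, peak counts,
# RiP %, PBC1/PBC2, bottlenecking, NRF, complexity, GC %) from the bcbio metadata CSV and the
# MultiQC general stats table and writes them to a single tab-delimited report.
# Assumed usage: edit the variable block below for your project, then run `bash bcbio_output_summary.sh`.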
3 | 4 | # Assign input and output files to variables 5 | metadata_file=/n/data1/cores/bcbio/PIs/andrew_lassar/hbc_lassar_mouse_TF_profiling_FoxC_atac_cutnrun_rnaseq_hbc04930/cutnrun/meta/FOXC.csv 6 | multiqc_file=/n/data1/cores/bcbio/PIs/andrew_lassar/hbc_lassar_mouse_TF_profiling_FoxC_atac_cutnrun_rnaseq_hbc04930/cutnrun/final/2024-02-01_FOXC/multiqc/multiqc_data/multiqc_general_stats.txt 7 | sample_directory=/n/data1/cores/bcbio/PIs/andrew_lassar/hbc_lassar_mouse_TF_profiling_FoxC_atac_cutnrun_rnaseq_hbc04930/cutnrun/final/ 8 | sample_prefix=X 9 | input_antibody_label=input 10 | output_file=/n/data1/cores/bcbio/PIs/andrew_lassar/hbc_lassar_mouse_TF_profiling_FoxC_atac_cutnrun_rnaseq_hbc04930/cutnrun/final/summarized_report.txt 11 | 12 | # Print Header line 13 | echo -e "sample\tantibody\ttreatment\treads\tmapped_reads\tmapping_percent\tpeaks\tRiP_percent\tPBC1\tPBC2\tBottlenecking\tNRF\tComplexity\tGC_percent" > $output_file; 14 | 15 | # Determine the column number for the columns on interest 16 | antibody_column=`awk -F ',' 'NR==1{for (i=1; i<=NF; i++) { if ($i == "antibody") { print i } }}' $metadata_file` 17 | phenotype_column=`awk -F ',' 'NR==1{for (i=1; i<=NF; i++) { if ($i == "phenotype") { print i } }}' $metadata_file` 18 | treatment_column=`awk -F ',' 'NR==1{for (i=1; i<=NF; i++) { if ($i == "treatment") { print i } }}' $metadata_file` 19 | reads_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-Total_reads") { print i } }}' $multiqc_file` 20 | mapped_reads_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-Mapped_reads") { print i } }}' $multiqc_file` 21 | mapping_percent_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "Samtools_mqc-generalstats-samtools-reads_mapped_percent") { print i } }}' $multiqc_file` 22 | RiP_percent_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-RiP_pct") { print i } }}' $multiqc_file` 23 | PBC1_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-PBC1") { print i } }}' $multiqc_file` 24 | PBC2_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-PBC2") { print i } }}' $multiqc_file` 25 | Bottlenecking_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-bottlenecking") { print i } }}' $multiqc_file` 26 | NRF_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-NRF") { print i } }}' $multiqc_file` 27 | Complexity_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-complexity") { print i } }}' $multiqc_file` 28 | GC_percent_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "FastQC_mqc-generalstats-fastqc-percent_gc") { print i } }}' $multiqc_file` 29 | 30 | # For each sample gather the various statistics and print to a the summarized report 31 | for i in ${sample_directory}${sample_prefix}*; do 32 | sample=`basename $i` 33 | antibody=`grep $sample $metadata_file | awk -F ',' -v antibody_column=$antibody_column '{print $antibody_column}'` 34 | sample_type=`grep $sample $metadata_file | awk -F ',' -v phenotype_column=$phenotype_column '{print $phenotype_column}'` 35 | treatment=`grep $sample $metadata_file | awk -F ',' -v treatment_column=$treatment_column '{print $treatment_column}'` 36 | reads=`grep $sample $multiqc_file | awk -F'\t' -v reads_column=$reads_column '{print $reads_column}' | sed 's/.0$//g'` 37 | mapped_reads=`grep $sample $multiqc_file | awk -F'\t' -v 
mapped_reads_column=$mapped_reads_column '{print $mapped_reads_column}' | sed 's/.0$//g'` 38 | mapping_percent=`grep $sample $multiqc_file | awk -F'\t' -v mapping_percent_column=$mapping_percent_column '{print $mapping_percent_column}' | head -c 5` 39 | if [[ $antibody == $input_antibody_label ]]; then 40 | peaks="NA" 41 | else 42 | peaks=`wc -l ${sample_directory}${sample}/macs2/${sample}_peaks.* | awk 'NR==1{print $1}'` 43 | fi 44 | RiP_percent=`grep $sample $multiqc_file | awk -F'\t' -v RiP_percent_column=$RiP_percent_column '{print $RiP_percent_column}'` 45 | PBC1=`grep $sample $multiqc_file | awk -F'\t' -v PBC1_column=$PBC1_column '{print $PBC1_column}' | head -c 5` 46 | PBC2=`grep $sample $multiqc_file | awk -F'\t' -v PBC2_column=$PBC2_column '{print $PBC2_column}' | head -c 5` 47 | Bottlenecking=`grep $sample $multiqc_file | awk -F'\t' -v Bottlenecking_column=$Bottlenecking_column '{print $Bottlenecking_column}'` 48 | NRF=`grep $sample $multiqc_file | awk -F'\t' -v NRF_column=$NRF_column '{print $NRF_column}' | head -c 5` 49 | Complexity=`grep $sample $multiqc_file | awk -F'\t' -v Complexity_column=$Complexity_column '{print $Complexity_column}'` 50 | GC_percent=`grep $sample $multiqc_file | awk -F'\t' -v GC_percent_column=$GC_percent_column '{print $GC_percent_column}'` 51 | echo -e "$sample\t$antibody\t$treatment\t$reads\t$mapped_reads\t$mapping_percent\t$peaks\t$RiP_percent\t$PBC1\t$PBC2\t$Bottlenecking\t$NRF\t$Complexity\t$GC_percent" >> $output_file; 52 | done 53 | 54 | echo -e "The summarized report has been created and can be found here:\n\t$output_file" 55 | -------------------------------------------------------------------------------- /chipseq/metadata.md: -------------------------------------------------------------------------------- 1 | # Note on Metadata for chipseq 2 | ## Linking the inputs together with one line. (Thanks to Meeta) 3 | ## antibody column matters! (needs to be included in vignette maybe?) 4 | ``` 5 | samplename,description,batch,phenotype,replicate,treatment,antibody 6 | Lib4.R1.bc.2.WTMTF2.fq,WTMTF2_1,pair1,chip,1,WT,MTF2 7 | Lib9.R1R2.bc.19.WTMTF2.fq,WTMTF2_2,pair2,chip,2,WT,MTF2 8 | Lib3.bc.1.WTH3K27ME3.fq,WTH3k27ME3_1,pair3,chip,1,WT,H3k27ME3 9 | Lib9.R1R2.bc.1.WTH3K27ME3.fq,WTH3k27ME3_2,pair4,chip,2,WT,H3k27ME3 10 | Lib2.bc.1.MKOFLAG.fq,MTF2KO_1,pair5,chip,1,MTF2KO,FLAG 11 | Lib9.R1R2.bc.30.MKOFLAG.fq,MTF2KO_2,pair6,chip,2,MTF2KO,FLAG 12 | Lib2.bc.2.MKOWTFLAG.fq,MTF2KO_WTRES_1,pair7,chip,1,MTF2KO_WTRES,WT-FLAG 13 | Lib9.R1R2.bc.31.MKOWTFLAG.fq,MTF2KO_WTRES_2,pair8,chip,2,MTF2KO_WTRES,WT-FLAG 14 | Lib2.bc.15.MKOMUTFLAG.fq,MTF2KO_MUTRES_1,pair9,chip,1,MTF2KO_MUTRES,MUT-FLAG 15 | Lib9.R1R2.bc.32.MKOMUTFLAG.fq,MTF2KO_MUTRES_2,pair10,chip,2,MTF2KO_MUTRES,MUT-FLAG 16 | Lib3.bc.7.EKOH3K27ME3.fq,EEDKO_1,pair11,chip,1,EEDKO,H3k27ME3 17 | Lib3.bc.8.EKOH3K27ME3.fq,EEDKO_2,pair12,chip,2,EEDKO,H3k27ME3 18 | Lib10.R1R2.bc.3.EKOWTRES.fq,EKOWT_1,pair13,chip,1,EKO_WT,H3k27ME3 19 | Lib10.R1R2.bc.5.EKOWTRES.fq,EKOWT_2,pair14,chip,2,EKO_WT,H3k27ME3 20 | Lib8.R1R2.bc.16.EKOMUTRES.fq,EKOMUT_1,pair15,chip,1,EKO_MUT,H3k27ME3 21 | Lib10.R1R2.bc.4.EKOMUTRES.fq,EKOMUT_2,pair16,chip,2,EKO_MUT,H3k27ME3 22 | Lib2.bc.14.INPUT.fq,input_global,pair1;pair2;pair3;pair4;pair5;pair6;pair7;pair8;pair9;pair10;pair11;pair12;pair13;pair14;pair15;pair16,input,1,WT,Input 23 | ``` 24 | 25 | ## I am getting these warnings and some samples ran with broadpeak. (samples that have H3K27ME3) 26 | > Going through the log, I found this.... 
as I didn't get any peaks for samples that didn't have H3k27ME3 in the antibody column. 27 | ``` 28 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 29 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 30 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 31 | 32 | [2021-04-10T05:28Z] mut-flag specified, but not listed as a supported antibody. Valid antibodies are {'h3k36me3', 'narrow', 'h3k4me1', 33 | 'h2afz', 'h3ac', 'h4k20me1', 'h3k4me3', 'h3k4me2', 'h3k9ac', 'h3k79me2', 'h3k9me2', 'h3f3a', 'h3k79me3', 'h3k27me3', 'broad', 'h3k9me3', 'h3k9me1', 'h3k27ac'}. 34 | If you know your antibody should be called with narrow or broad peaks, supply 'narrow' or 'broad' as the antibody. 35 | 36 | [2021-04-10T05:28Z] flag specified, but not listed as a supported antibody. Valid antibodies are {'h3k36me3', 'narrow', 'h3k4me1', 37 | 'h2afz', 'h3ac', 'h4k20me1', 'h3k4me3', 'h3k4me2', 'h3k9ac', 'h3k79me2', 'h3k9me2', 'h3f3a', 'h3k79me3', 'h3k27me3', 'broad', 'h3k9me3', 'h3k9me1', 'h3k27ac'}. 38 | If you know your antibody should be called with narrow or broad peaks, supply 'narrow' or 'broad' as the antibody. 39 | 40 | [2021-04-10T05:28Z] wt-flag specified, but not listed as a supported antibody. Valid antibodies are {'h3k36me3', 'narrow', 'h3k4me1', 41 | 'h2afz', 'h3ac', 'h4k20me1', 'h3k4me3', 'h3k4me2', 'h3k9ac', 'h3k79me2', 'h3k9me2', 'h3f3a', 'h3k79me3', 'h3k27me3', 'broad', 'h3k9me3', 'h3k9me1', 'h3k27ac'}. 42 | If you know your antibody should be called with narrow or broad peaks, supply 'narrow' or 'broad' as the antibody. 43 | 44 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 45 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 46 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 47 | ``` 48 | -------------------------------------------------------------------------------- /img/can_not_connect.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/can_not_connect.png -------------------------------------------------------------------------------- /img/images.md: -------------------------------------------------------------------------------- 1 | images go here! 
2 | -------------------------------------------------------------------------------- /img/noor_umap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/noor_umap.png -------------------------------------------------------------------------------- /img/r_taking_longer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/r_taking_longer.png -------------------------------------------------------------------------------- /img/simpsons.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/simpsons.gif -------------------------------------------------------------------------------- /img/zhu_umap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/zhu_umap.png -------------------------------------------------------------------------------- /long_read_data/Jihe's presentation at ABRF.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/long_read_data/Jihe's presentation at ABRF.pptx -------------------------------------------------------------------------------- /long_read_data/Jihe's summary of ABRF discussion at core meeting.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/long_read_data/Jihe's summary of ABRF discussion at core meeting.pptx -------------------------------------------------------------------------------- /long_read_data/Jihe's_long_read_presentation_core_meeting.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/long_read_data/Jihe's_long_read_presentation_core_meeting.pptx -------------------------------------------------------------------------------- /long_read_data/genome_assembly_tools.md: -------------------------------------------------------------------------------- 1 | ## Hybrid Assembly Strategies (smaller/prokaryotic genomes) 2 | 3 | 4 | ## Hybrid Assembly Strategies (larger/eukaryotic genomes) 5 | 6 | ### Oxford nanopore and Illumina - 7 | 8 | > Note that these were suggestions for an **algal assembly** from Dr. Chris Fields and Kim Walden at [UIUC's bioinformatics core, HPCBio](https://hpcbio.illinois.edu/). 9 | 10 | * Get a good estimate of the genome size using your illumina reads and [Genome Scope](http://qb.cshl.edu/genomescope/). You will need to get a kmer histogram from your illumina data to use as input to Genome Scope, and you can use [KMC](https://github.com/refresh-bio/KMC) or [Jellyfish](http://www.genome.umd.edu/jellyfish.html) for that. 
11 | * Workflow 12 | * Assemble Nanopore reads using something like [wtdbg2](https://github.com/ruanjue/wtdbg2), [Flye](https://github.com/fenderglass/Flye), or [miniasm](https://github.com/lh3/miniasm) (might need to use multiple assemblers and test which works best) 13 | * Do a first round of assembly polishing with [nanopolish](https://github.com/jts/nanopolish), using only nanopore data 14 | * Do a second round of assembly polishing with [Racon](https://github.com/isovic/racon) or [Pilon](https://github.com/broadinstitute/pilon/wiki), using illumina data this time 15 | * Use [BUSCO](https://busco.ezlab.org/) for assessment 16 | * If your Nanopore data was generated using tech from before 2020 using older flow cells, you may have to re run the basecalling using something like [Guppy](https://esr-nz.github.io/gpu_basecalling_testing/gpu_benchmarking.html) (speedy if you can use GPUs) 17 | 18 | ## Genome Annotation tools 19 | 20 | * MAKER 21 | * Braker 22 | * GeneMark 23 | * antiSMASH 24 | * PGAP (prokaryotes) 25 | 26 | ## Assembly assessment 27 | 28 | * [BUSCO](https://busco.ezlab.org/) 29 | -------------------------------------------------------------------------------- /misc/Core_members_September_2019.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/misc/Core_members_September_2019.key -------------------------------------------------------------------------------- /misc/FAQs.md: -------------------------------------------------------------------------------- 1 | # Snippets for Frequently Asked Questions from clients, when they go over the reports. 2 | ``` 3 | Feel free to add the FAQs you have received. 4 | ``` 5 | 6 | ## General 7 | ### Functional Analysis 8 | 1. What is geneRatio and bgRatio in overerpresentation analysis? 9 | 10 | - The geneRatio is the {# of annotated genes assigned to term from input}/{# of input genes annotated} 11 | 12 | - The bgRatio is the {# of annotated genes assigned to term from background}/{# of background genes annotated} 13 | 14 | Please note that the denominator may be different between MF, BP, and CC as there are different number of genes annotated for those categories. 15 | 16 | - The input is a list of candidate genes (i.e. list of significant DEGs) while the background is a list of all the genes in the study. 17 | 18 | 2. How is the p-value calculated? 19 | 20 | Simplistic link: https://www.pathwaycommons.org/guide/primers/statistics/fishers_exact_test/ 21 | 22 | A bit more mathematical link: http://www.nonlinear.com/progenesis/qi/v2.0/faq/should-i-use-enrichment-or-over-representation-analysis-for-pathways-data.aspx 23 | 24 | Good video link: https://www.coursera.org/lecture/bd2k-lincs/enrichment-analysis-part-1-xLgN5 25 | 26 | ### Multiple testing correction 27 | What is q-value and why do we need this? 28 | 29 | Here is a pretty good slide about the need for multiple testing and how FDR is calculated : 30 | https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf 31 | 32 | ### t-test and Wilcoxon {credits to [Preeti](https://github.com/orgs/hbc/people/preetida)} 33 | Blog by Jonathan Bartlett, super helpful for many stat related questions. 
34 | 35 | https://thestatsgeek.com/2014/04/12/is-the-wilcoxon-mann-whitney-test-a-good-non-parametric-alternative-to-the-t-test/ 36 | 37 | ### UpSetR plots 38 | To visualize the overlaps, we use the UpSetR package in R to draw bar plots that demonstrate the overlap, instead of Venn diagrams. The bar plots drawn by this package, and their associated annotations, are a cleaner way to demonstrate/observe overlaps. Here is brief guide to reading the UpSetR overlap plots: 39 | 40 | *These plots are relatively intuitive for 2 or 3 categories, but can tend to get more complex for >3 categories. In all cases, you will find the categories being compared and their size listed below the bar plots on the left. As you look to the right (directly below each bar) there are dots with connecting lines that denote which categories the overlap is between, or if there is no overlap (just a dot). The numbers at the top of the bars denote the size of the overlap.* 41 | 42 | ### PCA 43 | For understanding PCA: 44 | Our lesson - https://hbctraining.github.io/scRNA-seq_online/lessons/05_normalization_and_PCA.html#principal-component-analysis-pca 45 | A youtube video - https://www.youtube.com/watch?v=_UVHneBUBW0 46 | 47 | 48 | ## ChIP-seq and ATAC-seq 49 | 50 | ## BULK RNA-seq 51 | 52 | ## scRNA-seq 53 | 54 | -------------------------------------------------------------------------------- /misc/GEO_submissions.md: -------------------------------------------------------------------------------- 1 | Main guide is here: 2 | https://www.ncbi.nlm.nih.gov/geo/info/submission.html 3 | 4 | Highh throughput sequencing is here: 5 | https://www.ncbi.nlm.nih.gov/geo/info/seq.html 6 | 7 | Adding the RC guide (Joon): 8 | https://wiki.rc.hms.harvard.edu/display/O2/Submitting+data+to+GEO 9 | 10 | # The preparation 11 | 12 | ## Analyst responsibilities 13 | You will need 14 | 1) [GEO metadata sheet](https://www.ncbi.nlm.nih.gov/geo/info/seq.html) 15 | 2) Raw fastq files 16 | 3) Derived files for data 17 | a) RNAseq 18 | - raw counts table (as tsv/csv), can put as supplementary file 19 | - TPM (as tsv/csv), can put as supplementary file 20 | - bams are OK too but I have never been asked for them 21 | 22 | 4) details on the analysis for the GEO metadata sheet, including 23 | - which sequencer was used 24 | - paired or single end reads? 25 | - insert size if paired 26 | - programs and versions used in the analysis, including the bcbio and R portions 27 | 28 | Example metadata sheets can be found in this Dropbox folder: 29 | https://www.dropbox.com/sh/88035zd8h9qhvzh/AACmHB7xsXhdgrSyZY42uwLYa?dl=0 30 | 31 | For all of the raw and derived data files, you will need to run md5 checksums. 32 | 33 | ## Researcher responsibilities 34 | The client wil need to give you the details about things that were involved in the experiment and library preparation. 35 | These include 36 | - growth protocol 37 | - treatment protocol 38 | - extract protocol 39 | - library construction protocol 40 | They will also need to supply the general info about the experiment including: 41 | - title 42 | - summary 43 | - overall design 44 | - who they want to be a contributor 45 | 46 | I usually fill out what I can and then send them the metadata sheet with their areas to fill out highlighted. 47 | 48 | # The upload 49 | Once you have the data, derived data and metadata sheet, its time to upload to the GEO FTP server. 
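If you still need the md5 checksums mentioned above, one pass over the submission folder is usually enough. A minimal sketch, assuming all of the raw and derived files sit in a single upload directory (adjust the globs to your file names):

```bash
# run from inside the folder you are about to upload; record the values in the GEO metadata sheet
md5sum *.fastq.gz *.csv *.tsv > md5sums.txt
```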
50 | Sign into your NCBI and GEO account and go to the [Transfer Files](https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html) link on the GEO submission page. 51 | 52 | There they will tell you what your directory is on the GEO FTP server (for example, uploads/jnhutchinson_AtsZaoGM) as well as the server address (e.g. ftp-private.ncbi.nlm.nih.gov), login (geoftp) and password (rebUzyi1). 53 | 54 | Go to your upload directory on O2 with the GEO submission files and log in to the ftp server using `lftp geoftp:rebUzyi1@ftp-private.ncbi.nlm.nih.gov`. Note that lftp is not available on login or interactive nodes, so you will need to ssh to the O2 transfer node (`ssh user@transfer.rc.hms.harvard.edu`) to use it. *You can also use Filezilla if your files are on your local machine.* Then move to your remote upload directory (cd /uploads/jnhutchinson_AtsZaoGM, *for Filezilla, you should enter this into the Remote site: directory box*) and start your upload. For lftp, you can use 55 | ```mirror -R``` or ```mput *``` to upload the files. For Filezilla, just drag the files over to the remote directory. Then sit back and maybe work on something else, or, like, take a break from the bioinformatics mine while everything uploads. If you have a ton of files, you may want to use something like tmux to prevent your session from being terminated. 56 | 57 | 58 | When the upload is complete, notify GEO of the submission using the cleverly named [Notify GEO](https://submit.ncbi.nlm.nih.gov/geo/submission/) link. 59 | 60 | You will receive an email confirming your upload and GEO staff will contact you if there are any issues. Common issues to watch out for are: 61 | 1) column headings in derived data not matching fastq sample names 62 | 2) missing gene ids in derived data 63 | 64 | Less commonly they may ask you to fix an insufficiently descriptive summary or overall design. 65 | 66 | # The aftermath 67 | 68 | Note that unless you specifically set things up otherwise, the submission will be tied to your name and you will have to be responsible for updates and releases (i.e. you will be the "Investigator"). You can deal with this in one of two ways: 69 | 1) set things up from your initial login to have you as the submitter and the researcher as the Investigator. I personally find this inconvenient as I may be doing multiple GEO submissions for different researchers, but YMMV. 70 | 2) do the submission yourself as the Investigator and submitter and, once the submission is accepted, email GEO to have the submission transferred to the researcher. 71 | Note that both of these methods will require the researcher to obtain an NCBI account (if they don't already have one) and share the login id and email address with you. 
72 | -------------------------------------------------------------------------------- /misc/OSX.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: DS_Store tips 3 | description: Don't leave .DS_Store files on network volumes 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [osx] 7 | --- 8 | 9 | # Don't leave .DS_Store files on network volumes 10 | defaults write com.apple.desktopservices DSDontWriteNetworkStores true 11 | -------------------------------------------------------------------------------- /misc/Reform_python.md: -------------------------------------------------------------------------------- 1 | # Using Reform to create custom genome 2 | > https://gencore.bio.nyu.edu/reform/ 3 | 4 | - You can find good example with cellRanger in this link. 5 | 6 | 7 | ## Usage guide 8 | ### Conda env. and install 9 | ``` 10 | module load gcc/6.2.0 conda2/4.2.13 11 | conda activate python_3.6.5 # you need an environment with python3 activated. (this one is custom for me) 12 | pip3 install biopython 13 | git clone https://github.com/gencorefacility/reform.git 14 | cd reform/ 15 | ``` 16 | 17 | ### The command code and the files we need. 18 | ``` 19 | --chrom= \ 20 | --position= \ 21 | --in_fasta= \ 22 | --in_gff= \ 23 | --ref_fasta= \ 24 | --ref_gff= 25 | ``` 26 | - **chrom** ID of the chromsome to modify 27 | 28 | - **position** Position in chromosome at which to insert . Can use -1 to add to end of chromosome. Note: Either position, or upstream AND downstream sequence must be provided. 29 | 30 | - **upstream_fasta** Path to Fasta file with upstream sequence. Note: Either *position*, or *upstream AND downstream* sequence must be provided. 31 | 32 | - **downstream_fasta** Path to Fasta file with downstream sequence. Note: Either *position*, or *upstream AND downstream* sequence must be provided. 33 | 34 | - **in_fasta** Path to new sequence to be inserted into reference genome in fasta format. 35 | 36 | - **in_gff** Path to GFF file describing new fasta sequence to be inserted. 37 | 38 | - **ref_fasta** Path to reference fasta file. 39 | 40 | - **ref_gff** Path to reference gff file. 41 | 42 | Example: 43 | ```ruby 44 | python3 reform.py 45 | --chrom=X \ 46 | --position=3 \ 47 | --in_fasta=in.fa \ 48 | --in_gff=in.gff3 \ 49 | --ref_fasta=ref.fa \ 50 | --ref_gff=ref.gff3 51 | ``` 52 | We will put the in.fa sequence in the X chromosome position 3. 53 | 54 | Sequence is 10bp, so we expect a new transcript at X:4-13. 55 | 56 | > in.fa 57 | ``` 58 | >input_sequence 59 | TGGAGGATCG 60 | ``` 61 | 62 | > ref.gff 63 | 64 | 65 | > in.gff 66 | 67 | 68 | > reformed.gff 69 | 70 | -------------------------------------------------------------------------------- /misc/aws.md: -------------------------------------------------------------------------------- 1 | Can be run from ` /n/app/bcbio/dev/anaconda/bin/aws` or from `/usr/bin/aws` on the O2 transfer nodes. 2 | 3 | ## Setup S3 bucket 4 | Setup your S3 bucket with: 5 | `foo/aws configure` 6 | You will need your 7 | - AWS Access Key ID 8 | - AWS Secret Access Key 9 | - Default region name 10 | - Default output format 11 | 12 | 13 | ## Interact with AWS bucket 14 | 15 | - no dirs in AWS, those strings are just prefixes 16 | - use `--dryrun` to test a command 17 | 18 | |Problem | Unix rationale | AWS spell | 19 | |---------------|------------------------|-------------------------------------------------------------------------------| 20 | |copy files |`cp * destination_dir` |`aws s3 sync . 
s3://bucket/dir/` | 21 | |get file sizes |`ls -lh` |`aws s3 ls --human-readable s3://bucket/dir/` | 22 | |copy bam files |`cp */*.bam /target_dir`|options order matters!
`aws s3 sync s3://bucket/dir/ . --exclude "*" --include "*.bam"`| 23 | -------------------------------------------------------------------------------- /misc/core_resources.md: -------------------------------------------------------------------------------- 1 | # Sequencing cores 2 | |Name|Affiliation|Website|Contact(s)|Services/Capabilities| 3 | |---|---|---|---|---| 4 | |Biopolymers Facility|HMS|Bob Steen|https://genome.med.harvard.edu/|NGS| 5 | |Molecular Biology Core Facility|DFCI,CFAR|Zach Herbert|http://mbcf.dfci.harvard.edu/|NGS| 6 | |Bauer Core Facility|FAS| |https://bauercore.fas.harvard.edu/|NGS| 7 | 8 | Partners core 9 | 10 | # Single Cell Encapsulation cores 11 | |Name|Affiliation|Website|Contact(s)|Services/Capabilities| 12 | |---|---|---|---|---| 13 | |Single Cell Core|HMS|Sarah Boswell|https://singlecellcore.hms.harvard.edu/|10X,InDrops| 14 | |Bauer Core Facility|FAS| |https://bauercore.fas.harvard.edu/|10X,| 15 | 16 | BWH single cell core 17 | 18 | # Analytical Cores 19 | |Name|Affiliation|Website|Contact(s)|Services/Capabilities| 20 | |---|---|---|---|---| 21 | |Joslin Diabeters Center Biostatistics and Bioinformatics Cores|Joslin|https://joslinresearch.org/drc-cores/bioinformatics-and-biostatistics-core|Jon Dreyfuss|NGS,proteomics,metabolomics,microarray| 22 | 23 | BWH single cell core 24 | 25 | # Sorting cores (FACS, CyTOF) 26 | CyTOF core at Dana 27 | 28 | Keith Reeves' FACS core at Dana 29 | 30 | 31 | 32 | 33 | 34 | -------------------------------------------------------------------------------- /misc/general_ngs.md: -------------------------------------------------------------------------------- 1 | # 3' DGE from LSP demultiplexing example 2 | ``` 3 | bcl2fastq --adapter-stringency 0.9 --barcode-mismatches 0 --fastq-compression-level 4 --min-log-level INFO --minimum-trimmed-read-length 0 --sample-sheet /n/boslfs/INSTRUMENTS/illumina/180604_NB501677_0276_AHVMT2BGX5/SampleSheet.csv --runfolder-dir /n/boslfs/INSTRUMENTS/illumina/180604_NB501677_0276_AHVMT2BGX5 --output-dir /n/boslfs/ANALYSIS/180604_NB501677_0276_AHVMT2BGX5 --processing-threads 8 --no-lane-splitting --mask-short-adapter-reads 0 --use-bases-mask y*,y*,y*,y* 4 | ``` 5 | 6 | # Illumina instrument by FASTQ read name 7 | - @HWI-Mxxxx or @Mxxxx - MiSeq 8 | - @Kxxxx - HiSeq 3000(?)/4000 9 | - @Nxxxx - NextSeq 500/550 10 | - @Axxxxx - NovaSeq 11 | - @HWI-Dxxxx - HiSeq 2000/2500 12 | - AAXX, @HWUSI - GAIIx 13 | - BCXX = HiSeq v1.5 14 | - ACXX = HiSeq High-Output v3 15 | - ANXX = HiSeq High-Output v4 16 | - ADXX = HiSeq RR v1 17 | - AMXX, BCXX =HiSeq RR v2 18 | - ALXX = HiSeqX 19 | - BGXX, AGXX = High-Output NextSeq 20 | - AFXX = Mid-Output NextSeq 21 | 22 | # Illumina BaseSpace CLI 23 | https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-overview 24 | 25 | # Miscellaneous 26 | 27 | **Add text "chr" to #CHROM column of VCF** 28 | ``` 29 | $ bcftools annotate --rename-chrs sample.vcf.gz 30 | ``` 31 | map file should contain "`old_name new_name`" pairs separated by whitespaces, each on a separate line 32 | -------------------------------------------------------------------------------- /misc/git.md: -------------------------------------------------------------------------------- 1 | # Git tips 2 | 3 | - [Pro git book](https://git-scm.com/book/en/v2) 4 | - https://github.com/Kunena/Kunena-Forum/wiki/Create-a-new-branch-with-git-and-manage-branches 5 | - https://nvie.com/posts/a-successful-git-branching-model/ 6 | - http://sandofsky.com/blog/git-workflow.html 7 | - 
https://blog.izs.me/2012/12/git-rebase 8 | - https://benmarshall.me/git-rebase/ 9 | - [find big files in history](https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history) 10 | - [remove a big file from history](https://www.czettner.com/2015/07/16/deleting-big-files-from-git-history.html) 11 | - [git-tips](https://github.com/git-tips/tips) 12 | 13 | 14 | # merge master branch into (empty) main and delete master 15 | ``` 16 | module load git 17 | git fetch origin main 18 | git branch -a 19 | git checkout main 20 | git merge master --allow-unrelated-histories 21 | git add -A . 22 | git commit 23 | git push 24 | 25 | git branch -d master 26 | git push origin :master 27 | ``` 28 | 29 | # Add remote upstream 30 | ```bash 31 | git remote -v 32 | git remote add upstream https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git 33 | ``` 34 | 35 | # Create a tag in the upstream 36 | ```bash 37 | git fetch upstream 38 | git checkout master 39 | git reset --hard upstream/master 40 | git tag -a -m "project tag (date)" vx.y.z 41 | git push upstream vx.y.z 42 | git push origin vx.y.z 43 | ``` 44 | 45 | # Sync with upstream/main, delete all commits in origin/main 46 | ``` 47 | git fetch upstream 48 | git checkout main 49 | git reset --hard upstream/main 50 | git push --force 51 | ``` 52 | 53 | # Sync with upstream/main 54 | ``` 55 | git fetch upstream 56 | git checkout main 57 | git merge upstream/main 58 | ``` 59 | # big feature workflow - rebase - squash 60 | ``` 61 | # sync master with upstream first 62 | # create new branch and switch to it 63 | git checkout -b feature1 64 | # create many commits with meaningful messages 65 | git add -A . 66 | git commit 67 | # upstream accumulated some commits 68 | git fetch upstream 69 | # rebasing the branch not the master 70 | # to PR from the branch later not from the master 71 | # automatic rebase - replay all commits on top of master 72 | git rebase upstream/master 73 | 74 | # alternative - interactive rebase 75 | # 1. see latest commits from HEAD down to the start of feature1 76 | # on top of upstream 77 | # git log --oneline --decorate --all --graph 78 | # 2. interactive rebase for the last 13 commits (including head) 79 | # git rebase -i HEAD~13 80 | # set s (squash) in the interactive editor for all commits except for the top one 81 | # alter commit message 82 | 83 | # force push since origin still has the 13 separate commits 84 | git push --force --set-upstream origin feature1 85 | # PR from feature1 branch to upstream/master 86 | ``` 87 | 88 | # 2 Feature workflow 89 | ``` 90 | git checkout -b feature1 91 | git add -A . 92 | git commit 93 | git push --set-upstream origin feature1 94 | # pull request 1 95 | git checkout master 96 | git checkout -b feature2 97 | git add -A . 98 | git commit 99 | git push --set-upstream origin feature2 100 | # pull request 2 101 | ``` 102 | 103 | # Feature workflow w squash 104 | ``` 105 | git checkout -b feature_branch 106 | # 1 .. N 107 | git add -A . 
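# optional: run git status --short here to see exactly what is staged before the commit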
108 | git commit -m "sync" 109 | 110 | git checkout master 111 | git merge --squash private_feature_branch 112 | git commit -v 113 | git push 114 | # pull request to upstream 115 | # code review 116 | # request merged 117 | git branch -d feature_branch 118 | git push origin :feature_branch 119 | ``` 120 | 121 | # get commits from maintainers in a pull request and push back 122 | ``` 123 | git fetch upstream pull/[PR_Number]/head:new_branch 124 | git checkout new_branch 125 | git add 126 | git commit 127 | git push --set-upstream origin new_branch 128 | ``` 129 | 130 | # ~/.ssh/config 131 | ``` 132 | Host github.com 133 | HostName github.com 134 | PreferredAuthentications publickey 135 | IdentityFIle ~/.ssh/id_rsa_git 136 | User git 137 | ``` 138 | 139 | # Migrating github.com repos to [code.harvard.edu](https://code.harvard.edu/) 140 | 141 | See [this page](https://gist.github.com/niksumeiko/8972566) for good general guidance 142 | 143 | 1. Set up your ssh keys. You can use your old keys (if you remember your passphrase) by going to `Settings --> SSH and GPG keys --> New SSH key` 144 | 2. Create your repo in code.harvard.edu. Copy the 'Clone with SSH link`: `git@code.harvard.edu:HSPH/repo_name.git` (*NOTE: some of us have had trouble with the HTTPS link*) 145 | 3. Go to your local repo that you would like to migrate. Enter the directory. 146 | 147 | ``` 148 | # this will add a second remote location 149 | git remote add harvard git@code.harvard.edu:HSPH/repo_name.git 150 | 151 | # this will get rid of the old origin remote 152 | git push -u harvard --all 153 | ``` 154 | 155 | 4. You should see the contents of your local repo in Enterprise. Now go to 'Settings' for the repo and 'Collaborators and Teams'. Here you will need to add Bioinformatics Core and give 'Admin' priveleges. 156 | 157 | 158 | > **NOTE:** If you decide to compile all your old repos into one giant repo (i.e. [hbc_mistrm_reports_legacy](https://code.harvard.edu/HSPH/hbc_mistrm_reports_legacy)), make sure that you remove all `.git` folders from each of them before committing. Otherwise you will not be able to see the contents on each folder on Enterprise. 159 | 160 | # Remove sensitive information from the file and from the history 161 | ``` 162 | Make a backup 163 | # cd ~/backup 164 | # git clone git@github.com:hbc/knowledgebase.git 165 | cd ~/work 166 | git clone git@github.com:hbc/knowledgebase.git 167 | git filter-branch --tree-filter 'rm -f admin/download_data.md' HEAD 168 | git push --force-with-lease origin master 169 | # commit saved copy of download_data.md without secrets 170 | ``` 171 | -------------------------------------------------------------------------------- /misc/miRNA.md: -------------------------------------------------------------------------------- 1 | * https://github.com/lpantano/bcbioSmallRna 2 | -------------------------------------------------------------------------------- /misc/mounting_o2_mac.md: -------------------------------------------------------------------------------- 1 | ## For OSX 2 | 3 | To have O2 accessible on your laptop/desktop as a folder, you need to use something called [`sshfs`](https://en.wikipedia.org/wiki/SSHFS) (ssh filesystem). This is a command that is not native to OSXand you need to go through several steps in order to get it. Once you have `sshfs`, then you need to set up ssh keys to connect O2 to your laptop without having to type in a password. 4 | 5 | ### 1. 
Installing sshfs on OSX 6 | 7 | Download macFUSE from [https://github.com/osxfuse/osxfuse/releases](https://github.com/osxfuse/osxfuse/releases/download/macfuse-4.6.0/macfuse-4.6.0.dmg), and install it. 8 | 9 | NOTE: In order to install macFUSE, you may need to first enable system extensions, following [this guideline from Apple](https://support.apple.com/guide/mac-help/change-security-settings-startup-disk-a-mac-mchl768f7291/mac), which will require restarting your computer. 10 | 11 | Download sshfs from [https://github.com/osxfuse/sshfs/releases](https://github.com/osxfuse/sshfs/releases/download/osxfuse-sshfs-2.5.0/sshfs-2.5.0.pkg), and install it. 12 | 13 | > #### Use this only if the above option fails! 14 | > 15 | > Step 1. Install [Xcode](https://developer.apple.com/xcode/) 16 | > ```bash 17 | > $ xcode-select --install 18 | > ``` 19 | > 20 | > Step 2. Install Homebrew using ruby (from Xcode) 21 | > ```bash 22 | > $ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" 23 | > 24 | > # Uninstall Homebrew 25 | > # /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/uninstall)" 26 | > ``` 27 | > 28 | > Step 2.1. Check to make sure that Homebrew is working properly 29 | > ```bash 30 | > $ brew doctor 31 | > ``` 32 | > 33 | > Step 3. Install Cask from Homebrew's caskroom 34 | > ```bash 35 | > $ brew tap caskroom/cask 36 | > ``` 37 | > 38 | > Step 4. Install OSXfuse using Cask 39 | > ```bash 40 | > $ brew cask install osxfuse 41 | > ``` 42 | > 43 | > Step 5. Install sshfs from fuse 44 | > ```bash 45 | > $ brew install sshfs 46 | > ``` 47 | 48 | ### 2. Set up "ssh keys" 49 | 50 | Once `sshfs` is installed, the next step is to connect O2 (or a remote server) to our laptops. To make this process seamless, first set up ssh keys which can be used to connect to the server without having to type in a password every time. 51 | 52 | Log into O2 and use `vim` to open `~/.ssh/authorized_keys` and paste the code below copied from your computer to this file and save it. NOTE: make sure to replace `ecommonsID` with your actual username! 53 | 54 | ```bash 55 | # set up ssh keys 56 | $ ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -C "ecommonsID" 57 | $ ssh-add -K ~/.ssh/id_rsa 58 | ``` 59 | 60 | Arguments for `ssh-keygen`: 61 | * `-t` = Specifies the type of key to create. The possible values are "rsa1" for protocol version 1 and "rsa" or "dsa" for protocol version 2. *We want rsa.* 62 | * `-b` = Specifies the number of bits in the key to create. For RSA keys, the minimum size is 768 bits and the default is 2048 bits. *We want 4096* 63 | * `-f` = name of output "keyfile" 64 | * `-C` = Provides a new comment 65 | 66 | Arguments for `ssh-add`: 67 | * `-K` = Store passphrases in your keychain 68 | 69 | ```bash 70 | # copy the contents of `id_rsa.pub` to ~/.ssh/authorized_keys on O2 71 | $ cat ~/.ssh/id_rsa.pub | pbcopy 72 | ``` 73 | 74 | > `pbcopy` puts the output of `cat` into the clipboard (in other words, it is equivalent to copying with ctrl + c) so you can just paste it as usual with ctrl + v. 75 | 76 | ### 3. Mount O2 using sshfs 77 | 78 | Now, let's set up for running `sshfs` on our laptops (local machines), by creating a folder with an intuitive name for your home directory on the cluster to be mounted in. 79 | 80 | ```bash 81 | $ mkdir ~/O2_mount 82 | ``` 83 | 84 | Finally, let's run the `sshfs` command to have O2 mount as a folder in the above space. Again, replace `ecommonsID` with your username. 
85 | ```bash 86 | $ sshfs ecommonsID@transfer.rc.hms.harvard.edu:. ~/O2_mount -o volname="O2" -o compression=no -o Cipher=arcfour -o follow_symlinks 87 | ``` 88 | 89 | Now we can browse through our home directory on O2 as though it was a folder on our laptop. 90 | 91 | > If you want to access your lab's directory in `/groups/` or your directory in `/n/scratch2`, you will need to create sym links to those in your home directory and you will be able to access those as well. 92 | 93 | Once you are finished using O2 in its mounted form, you can cancel the connection using `umount` and the name of the folder. 94 | 95 | ```bash 96 | $ umount ~/O2_mount 97 | ``` 98 | 99 | ### 4. Set up alias (optional) 100 | 101 | It is optional to set shorter commands using `alias` for establishing and canceling `sshfs` connection. Use `vim` to create or open `~/.bashrc` and paste the following `alias` commands and save it. 102 | 103 | ```bash 104 | $ alias mounto2='sshfs ecommonsID@transfer.rc.hms.harvard.edu:. ~/O2_mount -o volname="O2" -o follow_symlinks' 105 | $ alias umounto2='umount ~/O2_mount' 106 | ``` 107 | 108 | > If your default shell is `zsh` instead of `bash`, use `vim` to create or open `~/.zshrc` and paste the `alias` commands. 109 | 110 | Update changes in `.bashrc` 111 | 112 | ```bash 113 | $ source .bashrc 114 | ``` 115 | Now we can type `mounto2` and `umounto2` to mount and unmount O2. 116 | 117 | *** 118 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* 119 | -------------------------------------------------------------------------------- /misc/mtDNA_variants.md: -------------------------------------------------------------------------------- 1 | # SNV and indels 2 | - when starting from WGS or WES, subset MT chromosome 3 | - estimate coverage, callable >=100X: https://github.com/naumenko-sa/bioscripts/blob/master/scripts/bam.coverage.bamstats05.sh 4 | - use template for bcbio: https://github.com/bcbio/bcbio-nextgen/pull/3059 5 | 6 | # Large deletions 7 | - MitoDel: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657046/ 8 | - eKLIPse: https://www.ncbi.nlm.nih.gov/pubmed/30393377 9 | 10 | # Databases 11 | - https://www.mitomap.org/foswiki/bin/view/MITOMAP/WebHome 12 | - https://www.mitomap.org/foswiki/bin/view/MITOMAP/TopVariants 13 | - mvTool V2: https://mseqdr.org/mv.php 14 | - MSeqDR, ClinVar, ICGC, COSMIC 15 | 16 | -------------------------------------------------------------------------------- /misc/multiomics_factor_analysis.md: -------------------------------------------------------------------------------- 1 | # Uploading a program called MOFA2 2 | ## It is used for finding factors from multiomics datasets. 3 | 4 | The author presented an example usage in scRNAseq & scATACseq as well as other datasets (e.g. bulkRNAseq with proteomics). 5 | 6 | I think it will be nice to look over. 7 | 8 | https://biofam.github.io/MOFA2/ 9 | 10 | Nice Review article of integrating multiomics, by one of the presenters. 
11 | 12 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7034308/ 13 | 14 | iOMICSPass - co-expression based data integration using network 15 | 16 | https://www.nature.com/articles/s41540-019-0099-y 17 | -------------------------------------------------------------------------------- /misc/new_to_remote_github_CLI_start_here.md: -------------------------------------------------------------------------------- 1 | # Putting local git repos into the HBC Github organization remotely via command line interface 2 | 3 | Heather Wick 4 | 5 | Has your experience with github primarily been through the browser? This document has the basics to begin turning your working directories into github repositories which can be pushed to the HBC location remotely via command line. 6 | 7 | ### Wait, back up, what do you mean by push? 8 | 9 | Confused by push, pull, branch, main, commit? If you're not sure, it's worthwhile to familiarize yourself with the basics of git/github. There are some great resources and tutorials to learn from out there. Here's an interactive one (I could only get it to work in safari, not chrome): 10 | https://learngitbranching.js.org 11 | 12 | This won't teach you how to put your things into the HBC Github organization though. 13 | 14 | ## Set up/configuration 15 | 16 | You only need to do these once! 17 | 18 | ### 1. Configure git locally to link to your github account 19 | Open up a terminal and type 20 | 21 | ```bash 22 | git config --global user.email EMAIL_YOU_USE_TO_SIGN_IN_TO_GITHUB 23 | git config --global user.name YOUR_GITHUB_USERNAME 24 | ``` 25 | 26 | ### 2. Make personal access token 27 | Configuring your local git isn't enough, as github is moving away from passwords. You will need to make a personal access token through your github account. Follow the instructions here: 28 | https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens 29 | 30 | **Copy this personal access token and save it somewhere or keep the window open for now. You will be prompted to enter your personal access token the first time you type `push -u origin main`. You will not be able to access this token again!** 31 | 32 | ## Creating your git repo 33 | 34 | ### 1. Initialize git repo on the HBC Github via web browser 35 | 36 | I have yet to find a way to do this remotely via CLI, but as far as I can tell this step is a necessary pre-requisite to pushing a local repo to the HBC github. Will update to add CLI if possible. 37 | 38 | Go to https://github.com/hbctraining and click the green "New Repository" button. Initialize a new, empty repository. 39 | 40 | Once you do this, there will be some basic code you can copy under `Quick Setup`, including the `https` location of your repo which can be used below 41 | 42 | ### 2. 
Create a local git repo and push it to the HBC Github via CLI 43 | 44 | In your terminal, navigate to the folder you would like to turn into a github repo and type the following: 45 | 46 | ```bash 47 | echo "# text to add to readme" >> README.md 48 | git init 49 | git add README.md 50 | git commit -m "first commit" 51 | git branch -M main 52 | git remote add origin https://github.com/hbctraining/NAME_OF_REPOT.git 53 | git push -u origin main 54 | ``` 55 | **You will be prompted to enter your personal access token the first time you type `push -u origin main`.** 56 | 57 | ## Useful tips/tricks 58 | 59 | If you are doing this in a directory with folders/data/files you don't necessarily want to put on github, you will want to pick/choose what you upload. Here are some tips and notes: 60 | 61 | ### Add all, but exclude some 62 | 63 | **note: will not exclude if already pushed to HBC repo! Just untracks them!** 64 | 65 | The best time to implement this is when you are making your first upload 66 | 67 | ```bash 68 | git add . 69 | git reset -- path/to/thing/to/exclude 70 | git reset -- path/to/more/things/to/exclude 71 | git commit -m "NAME_OF_COMMIT" 72 | git push -u origin main 73 | ``` 74 | 75 | ### Add specific files/folders: 76 | 77 | ```bash 78 | git add path/to/files* 79 | git commit -m "NAME_OF_COMMIT" 80 | git push -u origin main 81 | ``` 82 | 83 | ### Add all files/folder except this file/folder: 84 | 85 | ```bash 86 | git add -- . ':!THING_TO_EXCLUDE' 87 | git commit -m "NAME_OF_COMMIT" 88 | git push -u origin main 89 | ``` 90 | 91 | ### Remove a file you already pushed to Github 92 | 93 | You might be tempted to just do this in the browser, but be warned! It will break your local repo until you pull from the HBC location. This could be a problem if that was important data you want to continue to store locally. Fortunately, you can "delete" files on the HBC Github without deleting them locally. Here's how: 94 | 95 | ```bash 96 | git rm --cached NAME_OF_FILE 97 | ``` 98 | 99 | Or for a folder: 100 | ```bash 101 | git rm -r --cached NAME_OF_FILE 102 | ``` 103 | **Side effects of resorting to `git rm -r --cached REALLY_IMPORTANT_DATA_DIRECTORY` include anxiety, sweating, heart palpitations, and appeals to spiritual beings** 104 | 105 | ### Check what changes have been made to the current commit 106 | 107 | Very useful to see what will actually be added, removed, etc or if everything is up to date. 
108 | ```bash 109 | git status 110 | ``` 111 | 112 | ### .gitignore: coming soon 113 | 114 | -------------------------------------------------------------------------------- /misc/organized_papers.md: -------------------------------------------------------------------------------- 1 | # Human genome reference T2T - CHR13 - 2022 2 | - https://www.genome.gov/about-genomics/telomere-to-telomere 3 | - https://www.science.org/doi/pdf/10.1126/science.abj6987 4 | - https://www.science.org/doi/10.1126/science.abl3533 5 | - https://www.science.org/doi/epdf/10.1126/science.abl3533 6 | 7 | # GWAS 8 | - https://nature.com/articles/nrg1521 9 | - https://nature.com/articles/nrg1916 10 | - https://nature.com/articles/nrg2344 11 | - https://nature.com/articles/nrg2544 12 | - https://nature.com/articles/nrg2796 13 | - https://nature.com/articles/nrg2813 14 | - https://nature.com/articles/nrg.2016.142 15 | - https://nature.com/articles/s4157 16 | 17 | # bulk-RNA-seq 18 | - Systematic evaluation of splicing calling tools - 2019: https://academic.oup.com/bib/article/21/6/2052/5648232 19 | - [RPKM/TPM misuse](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7373998/) 20 | -------------------------------------------------------------------------------- /misc/orphan_improvements.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Improvements for the analysis 3 | description: List of things to try. 4 | category: research 5 | subcategory: orphans 6 | tags: [hbc] 7 | --- 8 | 9 | 10 | 1. Try out Alevin from Salmon for a more principled single-cell quantification (https://www.biorxiv.org/content/early/2018/06/01/335000) 11 | 2. Add retained intron analysis with IRFinder to bcbio-nextgen 12 | 3. See if adding support for grolar to convert pizzly output to something more parseable makes sense. It's an R script and hasn't really been worked on so might not be useable: https://github.com/MattBashton/grolar 13 | 4. Add automatic loading/QC of bcbioSingleCell data from bcbio 14 | 5. Convert bcbio-nextgen singlecell matrices to HDF5 format in bcbio 15 | 6. Swap bcbioSingleCell to read the already-combined matrices for speed purposes 16 | 7. Add bcbioRNASeq template to do DTU usage using DRIMseq (https://f1000research.com/articles/7-952/v1) 17 | 8. Update installed genomes to use newest Ensembl build for RNA-seq for bcbio-supported genomes. 18 | -------------------------------------------------------------------------------- /misc/power_calc_simulations.md: -------------------------------------------------------------------------------- 1 | Code and sample metadata and data for running simulation based power calculations can be found [here](https://github.com/hbc/power_calc_simulations). 2 | 3 | 4 | This code will use the mean and variance of the data set to derive simulated datasets with a defined number of values with defined fold changes. The simulated data will the be tested to determine precision and recall estimates for the comparison. 
5 | 6 | 7 | 8 | *original code written by Lorena Pantano and adapted by John Hutchinson* 9 | -------------------------------------------------------------------------------- /misc/snakemake-example-pipeline: -------------------------------------------------------------------------------- 1 | --- 2 | title: Example of snakemake pipeline 3 | description: An example of snakemake file to run a pipeline applied to a bunch of files 4 | category: research 5 | subcategory: general_ngs 6 | tags: [snakemake] 7 | --- 8 | 9 | 10 | This file shows how to run a pipeline with snakemake for a bunch of files defined in `SAMPLES` variables. 11 | 12 | It shows how to put together different steps and how they are related to each other. 13 | 14 | The tricky part is to have always `rule all` step and get all the output filenames you want to generate. If you miss 15 | some files that you want to generate and they are not the input in any other step then that step is not happening. 16 | 17 | ``` 18 | from os.path import join 19 | 20 | # Globals --------------------------------------------------------------------- 21 | 22 | # Full path to a FASTA file. 23 | GENOME_DIR = '../reference' 24 | 25 | # Full path to a folder that holds all of your FASTQ files. 26 | FASTQ_DIR = '../rawdata' 27 | 28 | # A Snakemake regular expression matching the forward mate FASTQ files. 29 | SAMPLES, = glob_wildcards(join(FASTQ_DIR, '{sample,[^/]+}_R1_001.fastq.gz')) 30 | 31 | # Patterns for the 1st mate and the 2nd mate using the 'sample' wildcard. 32 | PATTERN_R1 = '{sample}_R1_001.fastq.gz' 33 | PATTERN_R2 = '{sample}_R2_001.fastq.gz' 34 | PATTERN_GENOME = '{sample}.fa' 35 | 36 | 37 | rule all: 38 | input: 39 | index = expand(join(GENOME_DIR, '{sample}.fa.bwt'), sample = SAMPLES), 40 | vcf = expand(join('vcf', '{sample}.vcf'), sample = SAMPLES), 41 | vcfpileup = expand(join('pileup', '{sample}.vcf'), sample = SAMPLES), 42 | sam = expand(join('stats', '{sample}.txt'), sample = SAMPLES) 43 | 44 | rule index: 45 | input: 46 | join(GENOME_DIR, '{sample}.fa') 47 | output: 48 | join(GENOME_DIR, '{sample}.fa.bwt') 49 | shell: 50 | 'bwa index {input}' 51 | 52 | rule map: 53 | input: 54 | genome = join(GENOME_DIR, PATTERN_GENOME), 55 | index = join(GENOME_DIR, '{sample}.fa.bwt'), 56 | r1 = join(FASTQ_DIR, PATTERN_R1), 57 | r2 = join(FASTQ_DIR, PATTERN_R2) 58 | output: 59 | 'bam/{sample}.bam' 60 | shell: 61 | 'bwa mem -c 250 -M -t 6 -v 1 {input.genome} {input.r1} {input.r2} | samtools sort - > {output}' 62 | 63 | rule stats: 64 | input: 65 | bam = 'bam/{sample}.bam' 66 | output: 67 | 'stats/{sample}.txt' 68 | shell: 69 | 'samtools stats {input} > {output}' 70 | 71 | rule pileup: 72 | input: 73 | bam = 'bam/{sample}.bam', 74 | genome = join(GENOME_DIR, PATTERN_GENOME) 75 | output: 76 | 'pileup/{sample}.mp' 77 | shell: 78 | 'samtools mpileup -f {input.genome} -t DP -t AD -d 10000 -u -g {input.bam} > {output}' 79 | 80 | 81 | rule mpconvert: 82 | input: 83 | 'pileup/{sample}.mp', 84 | output: 85 | 'pileup/{sample}.vcf' 86 | shell: 87 | 'bcftools convert -O v {input} > {output}' 88 | 89 | 90 | rule bcf: 91 | input: 92 | 'pileup/{sample}.mp', 93 | output: 94 | 'vcf/{sample}.vcf' 95 | shell: 96 | 'bcftools call -v -m {input} > {output}' 97 | 98 | ``` 99 | -------------------------------------------------------------------------------- /python/conda.md: -------------------------------------------------------------------------------- 1 | ## Conda 2 | 3 | Every system has a Python installation, but you don't necessarily want to use that. Why not? 
That version is typically outdated and configured to support system functions. Most tools require specific versions of Python and dependencies, so you need more flexibility.

**Solution?**

Set up a full-stack scientific Python deployment **using a Python distribution** (Anaconda or Miniconda). It is an installation of Python with a set of curated packages which are guaranteed to work together.


## Setting up Python distribution on O2

You can install it in your home directory, though this is not needed as O2 has a miniconda module available.

By default, miniconda and conda envs are installed under user home space.

### Conda Environments
Environments allow you to create isolated, reproducible environments where you have fine-tuned control over the Python version, all packages and configuration. _This is always recommended over using the default environment._

To create an environment using Python 3.9 and the numpy package:

```bash
$ conda create --name my_environment python=3.9 numpy
```

Now that you have created it, you need to activate it. Once activated, all tool installations (done using `conda install`) are specific to that environment. It is a configured space where you can run analyses reproducibly.

```bash
$ conda activate my_environment
```

When you are done you can deactivate the environment or close it:

```bash
conda deactivate
```

The environments and associated libs are located in `~/miniconda3/envs`. Both miniconda and the created environments can occupy a lot of space and max out your home directory!

**Solution: Create the conda env in another space**

For conda envs, you can use the full path outside of home when creating the env:

```bash
module purge
module load miniconda3/23.1.0
conda create -p /path/to/somewhere/not/home/myEnv python=3.9 numpy
```

> **NOTE:** It's common that installing packages using Conda is slow or fails because Conda is unable to resolve dependencies. To get around this, we suggest the use of Mamba.

**Installing lots of dependency packages?**

You can do this easily by creating a yaml file, for example `environment.yaml` below was used to install PyTables:

```yaml
name: pytables
channels:
  - defaults
dependencies:
  - python=3.9*
  - numpy >= 1.19.0
  - zlib
  - cython >= 0.29.32
  - hdf5=1.14.0
  - numexpr >= 2.6.2
  - packaging
  - py-cpuinfo
  - python-blosc2 >= 2.3.0
```

Now to create the environment we reference the file in the command:

```bash
conda env create -f environment.yaml
```

### Channels

Where do conda packages come from? The packages are hosted on conda “channels”. From the conda pages:

_"Conda channels are the locations where packages are stored. They serve as the base for hosting and managing packages. Conda packages are downloaded from remote channels, which are URLs to directories containing conda packages. The conda command searches a set of channels."_

Using `-c` you can specify which channels you want conda to search for packages.
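For example (the package and channel choice here are just an illustration):

```bash
# look in bioconda first, then conda-forge, for samtools and its dependencies
conda install -c bioconda -c conda-forge samtools
```
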
84 | 85 | > Adapted from [An Introduction to Earth and Environmental Data Science](https://earth-env-data-science.github.io/lectures/environment/python_environments.html) 86 | -------------------------------------------------------------------------------- /r/.Rprofile: -------------------------------------------------------------------------------- 1 | version <- paste0(R.Version()$major,".",R.Version()$minor) 2 | if (version == "3.6.1") { 3 | .libPaths("~/R-3.6.1/library") 4 | }else if (version == "3.5.1") { 5 | .libPaths("/R-3.5.1/library") 6 | } 7 | 8 | #Add this to your home folder, and make modifications to the version numbers if you need to. 9 | #This will let you load the correct library folders for the different versions you have on O2. 10 | #R-3.6.1 and R-3.5.1 will load their corresponding library path for mine, but feel free to modify them to fit your needs. 11 | # This has been created as ChIPQC was problematic in R-3.6.1 and I had to load R-3.5.1 to generate a html report. (Joon) 12 | -------------------------------------------------------------------------------- /r/R-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: R tips 3 | description: This code helps with regular data improving efficiency 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [R, visualization] 7 | --- 8 | 9 | # Import/Export of files 10 | Stop using write.csv, write.table and use the [rio](https://cran.r-project.org/web/packages/rio/index.html) library instead. All rio needs is the file extension to figure out what file type you're dealing with. Easy import and export to Excel files for clients. 11 | 12 | # Parsing in R using Tidyverse 13 | This is a link to a nice tutorial from Ista Zahn from IQSS using stringr and tidyverse for parsing files in R. It is from the Computefest 2017 workshop: 14 | http://tutorials-live.iq.harvard.edu:8000/user/zwD2ioESyGbS/notebooks/workshops/R/RProgramming/Rprogramming.ipynb 15 | 16 | # Better clean default ggplot 17 | install cowplot (https://cran.r-project.org/web/packages/cowplot/index.html) 18 | ```r 19 | library(cowplot) 20 | ``` 21 | 22 | # Nice looking log scales 23 | Example for x-axis 24 | ```r 25 | library(scales) 26 | p + scale_x_log10( 27 | breaks = scales::trans_breaks("log10", function(x) 10^x), 28 | labels = scales::trans_format("log10", scales::math_format(10^.x))) + 29 | annotation_logticks(sides='b') 30 | ``` 31 | 32 | # Read a bunch of files into one dataframe 33 | ```r 34 | library(tidyverse) 35 | read_files = function(files) { 36 | data_frame(filename = files) %>% 37 | mutate(contents = map(filename, ~ read_tsv(.))) %>% 38 | unnest() 39 | } 40 | ``` 41 | 42 | # remove a layer from a ggplot2 object with ggedit 43 | ``` 44 | plotGeneSaturation(bcb, interestingGroups=NULL) + 45 | ggrepel::geom_text_repel(aes(label=description, color=NULL)) 46 | p %>% 47 | ggedit::remove_geom('point', 1) + 48 | geom_point(aes(color=NULL)) 49 | ``` 50 | 51 | # [Link to information about count normalization methods](https://github.com/hbc/knowledgebase/wiki/Count-normalization-methods) 52 | The images currently break, but I will update when the course materials are in a more permanent state. 

# .Rprofile usefulness
```R
## don't ask for CRAN repository
options("repos" = c(CRAN = "http://cran.rstudio.com/"))
## for the love of god don't open up tcl/tk ever
options(menu.graphics=FALSE)
## set seed for reproducibility
set.seed(123456)
## don't print out more than 100 lines at once
options(max.print=100)
## helps with debugging Bioconductor/S4 code
options(showErrorCalls = TRUE, showWarnCalls = TRUE)

## Create a new invisible environment for all the functions to go in
## so it doesn't clutter your workspace.
.env <- new.env()

## ht==headtail, i.e., show the first and last 10 items of an object
.env$ht <- function(d, n=10) rbind(head(d, n), tail(d, n))

## copy from clipboard
.env$pbcopy = function(x) {
  capture.output(x, file=pipe("pbcopy"))
}

## update your local bcbioRNASeq and bcbioSingleCell installations
.env$update_bcbio = function(x) {
  devtools::install_github("steinbaugh/basejump")
  devtools::install_github("hbc/bcbioBase")
  devtools::install_github("hbc/bcbioRNASeq")
  devtools::install_github("hbc/bcbioSingleCell")
}

attach(.env)
```

# Make density plot without underline
```R
ggplot(colData(sce) %>%
         as.data.frame(), aes(log10GenesPerUMI)) +
  stat_density(geom="line") +
  facet_wrap(~period + intervention)
```

# Archive a file to Dropbox with a link to it
Run this inside an R Markdown chunk with `results='asis'`:
```R
dropbox_dir = "HSPH/eggan/hbc02067"
archive_data_with_link = function(data, filename, description, dropbox_dir) {
  readr::write_csv(data, filename)
  links = bcbioBase::copyToDropbox(filename, dropbox_dir)
  link = gsub("dl=0", "dl=1", links[[1]]$url)
  basejump::markdownLink(filename, link, paste0(" ", description))
}
archive_data_with_link(als, "dexseq-all.csv", "All DEXSeq results", dropbox_dir)
archive_data_with_link(als %>%
                         filter(padj < 0.1), "dexseq-sig.csv",
                       "All significant DEXSeq results", dropbox_dir)
```

# Novel operators from magrittr
The “%<>%” operator lets you pipe an object to a function and then back into the same object.
So:
`foo <- foo %>% bar()`
is the same as
`foo %<>% bar()`

# gghelp: Converts a natural language query into a 'ggplot2' command
This [package](https://rdrr.io/github/brandmaier/ggx/) allows users to issue natural language commands
related to theme-related styling of plots (colors, font size and such), which then are translated into
valid 'ggplot2' commands.
125 | 126 | ### Examples: 127 | ```R 128 | gghelp("rotate x-axis labels by 90 degrees") 129 | gghelp("increase font size on x-axis label") 130 | gghelp("set x-axis label to 'Length of Sepal'") 131 | ``` 132 | -------------------------------------------------------------------------------- /r/Shiny_images/Added_tabs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Added_tabs.png -------------------------------------------------------------------------------- /r/Shiny_images/Adding_panels.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Adding_panels.png -------------------------------------------------------------------------------- /r/Shiny_images/Adding_theme.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Adding_theme.png -------------------------------------------------------------------------------- /r/Shiny_images/Altered_action_button.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Altered_action_button.png -------------------------------------------------------------------------------- /r/Shiny_images/Check_boxes_with_action_button.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Check_boxes_with_action_button.png -------------------------------------------------------------------------------- /r/Shiny_images/R_Shiny_hello_world.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/R_Shiny_hello_world.gif -------------------------------------------------------------------------------- /r/Shiny_images/R_shiny_req_after.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/R_shiny_req_after.gif -------------------------------------------------------------------------------- /r/Shiny_images/R_shiny_req_initial.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/R_shiny_req_initial.gif -------------------------------------------------------------------------------- /r/Shiny_images/Return_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Return_table.png -------------------------------------------------------------------------------- /r/Shiny_images/Return_text_app_blank.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Return_text_app_blank.png 
-------------------------------------------------------------------------------- /r/Shiny_images/Return_text_app_hello.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Return_text_app_hello.png -------------------------------------------------------------------------------- /r/Shiny_images/Sample_size_hist_100.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Sample_size_hist_100.png -------------------------------------------------------------------------------- /r/Shiny_images/Sample_size_hist_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Sample_size_hist_5.png -------------------------------------------------------------------------------- /r/Shiny_images/Shiny_UI_server.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Shiny_UI_server.png -------------------------------------------------------------------------------- /r/Shiny_images/Shiny_process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Shiny_process.png -------------------------------------------------------------------------------- /r/Shiny_images/Squaring_number_app.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Squaring_number_app.png -------------------------------------------------------------------------------- /r/Shiny_images/mtcars_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/mtcars_table.png -------------------------------------------------------------------------------- /r/htmlwidgets: -------------------------------------------------------------------------------- 1 | (See http://gallery.htmlwidgets.org/ for more awesome widgets.) 2 | 3 | Using some basic R libraries, you can setup some interactive visualizations wihtout using Rshiny 4 | 5 | Here is some example code illustrating what I am thinking about, using the iris dataset from R 6 | 7 | `library(crosstalk)` 8 | `library(lineupjs)` 9 | `library(d3scatter)` 10 | 11 | `shared_iris = SharedData$new(iris)` 12 | `d3scatter(shared_iris, ~Petal.Length, ~Petal.Width, ~Species, width="100%")` 13 | `lineup(shared_iris, width="100%")` 14 | 15 | Similarly, the morpheus.js html widget makes for fantastic, interactive heatmaps. 
16 | `library(morpheus)` 17 | 18 | rowAnnotations <- data.frame(annotation1=1:32, annotation2=sample(LETTERS[1:3], nrow(mtcars), replace = TRUE))` 19 | `morpheus(mtcars, colorScheme=list(scalingMode="fixed", colors=heat.colors(3)), rowAnnotations=rowAnnotations, overrideRowDefaults=FALSE, rows=list(list(field='annotation2', highlightMatchingValues=TRUE, display=list('color'))))` 20 | -------------------------------------------------------------------------------- /rc/O2-tips.md: -------------------------------------------------------------------------------- 1 | # O2 tips 2 | 3 | ## Making conda not slow down your login 4 | If you have a complex base environment that gets loaded on login, you can end up having freezes of 30 seconds or more when 5 | logging into O2. It is ultra annoying. You can fix this by not running the `_conda_setup` script in your .bashrc, like this: 6 | 7 | ```bash 8 | # >>> conda initialize >>> 9 | # !! Contents within this block are managed by 'conda init' !! 10 | #__conda_setup="$('/home/rdk4/local/share/bcbio/anaconda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" 11 | #if [ $? -eq 0 ]; then 12 | # eval "$__conda_setup" 13 | #else 14 | if [ -f "/home/rdk4/local/share/bcbio/anaconda/etc/profile.d/conda.sh" ]; then 15 | . "/home/rdk4/local/share/bcbio/anaconda/etc/profile.d/conda.sh" 16 | else 17 | export PATH="/home/rdk4/local/share/bcbio/anaconda/bin:$PATH" 18 | fi 19 | #fi 20 | #unset __conda_setup 21 | # <<< conda initialize <<< 22 | ``` 23 | 24 | ## Interactive function to request memory and hours 25 | 26 | Can be added to .bashrc (or if you don't want to clutter it, put it in .o2_aliases and then source it from .bashrc) 27 | 28 | Defaults: 4G mem, 8 hours. 29 | ``` 30 | function interactive() { 31 | mem=${1:-4} 32 | hours=${2:-8} 33 | 34 | srun --pty -p interactive --mem ${mem}G -t 0-${hours}:00 /bin/bash 35 | } 36 | ``` 37 | 38 | -------------------------------------------------------------------------------- /rc/O2_portal_errors.md: -------------------------------------------------------------------------------- 1 | # O2 Portal - R 2 | 3 | These are common errors found when running R on the O2 portal and ways to fix them. 4 | 5 | ## How to launch Rstudio 6 | 7 | - Besides your private R library, we have now platform shared R library, add this to *Shared R Personal Library* section 8 | : "/n/data1/cores/bcbio/R/library/4.3.1" (RNAseq and scRNAseq) 9 | - Minimum modules to load for R4.3.*: `cmake/3.22.2 gcc/9.2.0 R/4.3.1 ` 10 | - Minimum modules to load for 4.2.1 single cell analyses (some might be specific to trajectory analysis): 11 | `gcc/9.2.0 imageMagick/7.1.0 geos/3.10.2 cmake/3.22.2 R/4.2.1 fftw/3.3.10 gdal/3.1.4 udunits/2.2.28` 12 | - Sometimes specific nodes work better: under "Slurm Custom Arguments": `-x compute-f-17-[09-25]` 13 | 14 | # Issues 15 | 16 | ## Issue 1 - You can make a session and open Rstudio on O2 but cannot actually type. 17 | 18 | Potential solution: Make a new session and put the following under "Slurm Custom Arguments": 19 | ``` 20 | -x compute-f-17-[09-25] 21 | ``` 22 | 23 | ## Issue 2 - Everything was fine but then you lost connection. 24 | 25 | When you attempt to reload you see: 26 | 27 |
<em>[screenshot of the reconnect error message]</em>
</p>
30 | 31 | Potential solutions: Refresh your interactive sessions page first then refresh your R page. 32 | If that doesn't work close your R session and re-open from the interactive sessions page. 33 | If that doesn't work wait 5-10 min then repeat. 34 | 35 | ## Issue 3 - You made a session but cannot connect 36 | 37 | When you attempt to connect you see: 38 | 39 |
<em>[screenshot of the connection error message]</em>
</p>
42 | 43 | Potential solutions: This error indicates that either you did not load a gcc module or you loaded the incorrect one for the version of R you are running. 44 | Kill the current session and start a new one with the correct gcc loaded in the modules to be loaded tab. 45 | 46 | ## Issue 4 - When you finally refresh your environment is gone (THE WORST) 47 | 48 | What happened is you ran out of memory and R restarted itself behind the scenes. You will NOT get an error message for this of any kind. The best thing to do is quit your session and restart a new one with more memory. 49 | 50 | ## Issue 5 - Crashing 51 | 52 | Also, previous issues with O2portal RStudio crashing - “the compute-f architecture is not good enough and this part of the process fails because (maybe) it was built/installed on a newer node” . 53 | Solution: add the flag when you start the session to just exclude those nodes -x compute-f-17-[09-25] 54 | 55 | ## Issue 6 - commands using cores fail 56 | 57 | ``` 58 | Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length 59 | In addition: Warning message: 60 | In mclapply(X, function(...) { : 61 | scheduled cores 1, 2 did not deliver results, all values of the jobs will be affected 62 | ``` 63 | -------------------------------------------------------------------------------- /rc/arrays_in_slurm.md: -------------------------------------------------------------------------------- 1 | 2 | # Arrays in Slurm 3 | 4 | When I am working on large data sets my mind often drifts back to an old Simpsons episode. Bart is in France and being taught to pick grapes. They show him a detailed technique and he does it successfully. Then they say: 5 | 6 | 7 |
<em>[gif: the Simpsons grape-picking scene]</em>
</p>

<em>We've all been here</em>

A pipeline or process may seem easy or fast when you have 1-3 samples but totally daunting when you have 50. When scaling up you need to consider file overwriting, computational resources, and time.

One easy way to scale up is to use the array feature in slurm.

## What is a job array?

The O2 documentation (on Atlassian) says this about job arrays: "Job arrays can be leveraged to quickly submit a number of similar jobs. For example, you can use job arrays to start multiple instances of the same program on different input files, or with different input parameters. A job array is technically one job, but with multiple tasks." [link](https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Job-Arrays).

Array jobs run simultaneously rather than one at a time, which means they are very fast! Additionally, running a job array is very simple!

```bash
sbatch --array=1-10 my_script.sh
```

This will run my_script.sh 10 times with the job IDs 1,2,3,4,5,6,7,8,9,10

We can also put this directly into the bash script itself (although we will continue with the command line version here).
```bash
#SBATCH --array=1-10
```

We can specify any job IDs we want.

```bash
sbatch --array=1,7,12 my_script.sh
```
This will run my_script.sh 3 times with the job IDs 1,7,12

Of course we don't want to run the same job on the same input files over and over, that would be pointless. We can use the job IDs within our script to specify different input or output files. In bash the job ID is available in the special variable `${SLURM_ARRAY_TASK_ID}`.


## How can I use ${SLURM_ARRAY_TASK_ID}?

The value of `${SLURM_ARRAY_TASK_ID}` is simply the job ID. If I run

```bash
sbatch --array=1,7 my_script.sh
```
this will start two jobs, one where `${SLURM_ARRAY_TASK_ID}` is 1 and one where it is 7.

There are several ways we can use this. If we plan ahead and name our files with these numbers (e.g., sample_1.fastq, sample_2.fastq) we can directly refer to these files in our script: `sample_${SLURM_ARRAY_TASK_ID}.fastq`. However, using the ID for input files is often not a great idea as it means you need to strip away most of the information that you might put in these names.

Instead we can keep our sample names in a separate file and use [awk](awk.md) to pull the file names.

Here is our complete list of long sample names, which is found in our file `samples.txt`:

```
DMSO_control_day1_rep1
DMSO_control_day1_rep2
DMSO_control_day2_rep1
DMSO_control_day2_rep2
DMSO_KO_day1_rep1
DMSO_KO_day1_rep2
DMSO_KO_day2_rep1
DMSO_KO_day2_rep2
Drug_control_day1_rep1
Drug_control_day1_rep2
Drug_control_day2_rep1
Drug_control_day2_rep2
Drug_KO_day1_rep1
Drug_KO_day1_rep2
Drug_KO_day2_rep1
Drug_KO_day2_rep2
```

If we renamed all of these to 1-16 we would lose a lot of information that may be helpful to have on hand.
If these are all sam files and we want to convert them to bam files our script could look like this 81 | 82 | ```bash 83 | 84 | file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt) 85 | 86 | samtools view -S -b ${file}.sam > ${file}.bam 87 | 88 | ``` 89 | 90 | Since we have sixteen samples we would run this as 91 | 92 | ```bash 93 | sbatch --array=1-16 my_script.sh 94 | ``` 95 | 96 | So what is this script doing? `file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)` pulls the line of `samples.txt` that matched the job ID. Then we assign that to a variable called `${file}` and use that to run our command. 97 | 98 | Job IDs can also be helpful for output files or folders. We saw above how we used the job ID to help name our output bam file. But creating and naming folders is helpful in some instances as well. 99 | 100 | ```bash 101 | 102 | file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt) 103 | 104 | PREFIX="Folder_${SLURM_ARRAY_TASK_ID}" 105 | mkdir $PREFIX 106 | cd $PREFIX 107 | 108 | samtools view -S -b ../${file}.sam > ${file}.bam 109 | 110 | ``` 111 | 112 | This script differs from our previous one in that it makes a folder with the job ID (Folder_1 for job ID 1) then moves inside of it to execute the command. Instead of getting all 16 of our bam files output in a single folder each of them will be in its own folder labled Folder_1 to Folder_16. 113 | 114 | **NOTE** That we define `${file}` BEFORE we move into our new folder as samples.txt is only present in the main directory. 115 | 116 | 117 | 118 | -------------------------------------------------------------------------------- /rc/connection-to-hpc.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Connecting to hpc from local 3 | description: This code helps with connecting to hpc computers 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [ssh, hpc] 7 | --- 8 | 9 | 10 | # osx 11 | 12 | Use [Homebrew](http://brew.sh/) to get linux-like functionality on OSX 13 | 14 | Use [XQuartz](https://www.xquartz.org/) for X11 window functionality in OSX. 15 | 16 | # Odyssey with 2FA 17 | Enter one time password into the current window (https://github.com/jwm/os-x-otp-token-paster) 18 | 19 | # Fix 'Warning: No xauth data; using fake authentication data for X11 forwarding' 20 | Add this to your ~/.ssh/config on your OSX machine: 21 | 22 | ``` 23 | Host * 24 | XAuthLocation /opt/X11/bin/xauth 25 | ``` 26 | 27 | # Use ssh keys on remote server 28 | This will add your key to the OSX keychain, here your private key is assumed to be named "id_rsa": 29 | 30 | ``` 31 | ssh-add -K ~/.ssh/id_rsa 32 | ``` 33 | 34 | Now tell ssh to use the keychain. Add this to the ~/.ssh/config on your OSX machine: 35 | 36 | ``` 37 | Host * 38 | AddKeysToAgent yes 39 | UseKeychain yes 40 | IdentityFile ~/.ssh/id_rsa 41 | XAuthLocation /opt/X11/bin/xauth 42 | ``` 43 | -------------------------------------------------------------------------------- /rc/ipython-notebook-on-O2.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: IPython notebook on O2 3 | description: How to open up an ipython notebook running on O2 4 | category: computing 5 | subcategory: tips_tricsk 6 | tags: [python, ipython, singlecell] 7 | --- 8 | 9 | 1. First connect to O2 and open up an interactive session with all of the cores and memory you want to use. Here I'm connecting to the short queue so I can get more cores to use. 

```bash
srun -n 8 --pty -p short --mem 64G -t 0-12:00 --x11 /bin/bash
```

2. Note the name of the compute node you are on:

```bash
uname --nodename
```

3. Start a jupyter notebook server on a specific port:

```bash
jupyter notebook --no-browser --port=1234
```

This command will open up a notebook server on port 1234. You might have to pick
a different port if 1234 is being used. Note the token it provides for you, you
will need this token to use your notebook server.

4. Create an auto-closing SSH tunnel from your local machine to the jupyter notebook:

On your local machine do:

```bash
ssh -f -L 9999:localhost:9999 o2 -t 'ssh -f -L 9999:localhost:1234 compute-a-16-49 "sleep 60"'
```

This sets up two SSH tunnels. The first one connects port 9999 on your laptop to port 9999 on `login02` on o2 (134.174.159.22). The second connects port 9999 on `login02` to port 1234 on `compute-a-16-49` (the compute node you noted above). This script will auto-close the tunnel if you don't connect to it in 60 seconds, and will auto-close the tunnel when your session is closed.

5. Open a web browser and put `localhost:9999` as the address.

This should now connect you to the jupyter notebook server. It will ask you for
the token. If you put the token in, you can now log in and will be in your
home directory on O2.

You are now running a notebook server. This is just running using a single core now-- we want to hook up our computing that we reserved. We asked for 8 cores, so we'll set up a cluster with 8 cores. Click on "IPython Clusters", set the number of engines on
"default" to 8, and you will have your notebook connected to the 8 cores.

6. Start working!

You can open up a terminal by going to the Files tab and clicking on new and opening
the terminal. You can start a new notebook by going to the Files tab, clicking on
new and opening a python notebook.
-------------------------------------------------------------------------------- /rc/jupyter_notebooks.md: --------------------------------------------------------------------------------
# Jupyter notebooks

This post is for those who are interested in running notebooks seamlessly on O2. There is well-written documentation about running jupyter notebooks on O2; you can find it here: https://wiki.rc.hms.harvard.edu/display/O2/Jupyter+on+O2. However, this involves multiple steps, opening a bunch of terminals at times, and importantly finding an unused port every time. I found it quite cumbersome and annoying, so I spent some time solving it. It took me a while to nail it down, with the help of FAC RC, but they suggested a simpler solution. If you wish to run jupyter/R notebooks on O2 (where your data sits),
here is what you need to do:

Install https://github.com/aaronkollasch/jupyter-o2 by running `pip install jupyter-o2` in your local terminal.

Run `jupyter-o2 --generate-config` on the command line.
This will generate the configuration file and will tell you where it is located. Uncomment the fields that you need. Since the configuration file is the key, a template is attached for use; you will need to change your credentials though.

You are all set to run notebooks on O2 from your local machine now, without logging into the server.
Now, at your local terminal, run `jupyter-o2 notebook` for python notebooks.
Alternatively you can also do `jupyter-o2 lab` for R/python 13 | This will ask you a paraphrase, you should enter your ecommons password as paraphrase. 14 | Boom!!! you are good to go! Happy Pythoning :):) 15 | If you wish you run R notebooks on O2, refer this. https://docs.anaconda.com/anaconda/navigator/tutorials/r-lang/ 16 | 17 | 18 | # Example code 19 | 20 | Just to add, in the HMS-RC documentation they suggested any ports over 50000. To give examples of logging into a jupyter notebook session I have provided the code below. 21 | 22 | ## Creating a Jupyter notebook 23 | 24 | Log onto a login node 25 | 26 | ``` 27 | # Log onto O2 using a specific port - I used '50000' in this instance - you can choose a different port and just replace the 50000 with the number of your specific port 28 | ssh -Y -L 50000:127.0.0.1:50000 ecommons_id@o2.hms.harvard.edu 29 | ``` 30 | 31 | Once on the login node, you can start an interactive session specifying the port with `--tunnel` 32 | 33 | ``` 34 | # Create interactive session 35 | srun --pty -p interactive -t 0-12:00 --x11 --mem 128G --tunnel 50000:50000 /bin/bash 36 | ``` 37 | 38 | Load the modules that you will need 39 | 40 | ``` 41 | # Load modules 42 | module load gcc/9.2.0 python/3.8.12 43 | ``` 44 | 45 | Create environment for running analysis (example here is for velocity) 46 | 47 | ``` 48 | # Create virtual environment (only do this once) 49 | virtualenv velocyto --system-site-packages 50 | ``` 51 | 52 | Activate virtual environment 53 | 54 | ``` 55 | # Activate virtual environment 56 | source velocyto/bin/activate 57 | ``` 58 | 59 | Install Jupyter notebook and any other libraries (only need to do this once) 60 | 61 | ``` 62 | # Install juypter notebook 63 | pip3 install jupyter 64 | 65 | # Install any other libraries needed for analysis (this is for velocity) 66 | pip3 install numpy scipy cython numba matplotlib scikit-learn h5py click 67 | pip3 install velocyto 68 | pip3 install scvelo 69 | ``` 70 | 71 | To create a Jupyter notebook run the following (again instead of 50000, use your port #): 72 | 73 | ``` 74 | # Start jupyter notebook 75 | jupyter notebook --port=50000 --browser='none' 76 | ``` 77 | 78 | ## Logging onto an existing notebook 79 | 80 | ``` 81 | # Log onto O2 using a specific port - I used '50000' in this instance - you can choose a different port and just replace the 50000 with the number of your specific port 82 | ssh -Y -L 50000:127.0.0.1:50000 ecommons_id@o2.hms.harvard.edu 83 | 84 | # Create interactive session 85 | srun --pty -p interactive -t 0-12:00 --x11 --mem 128G --tunnel 50000:50000 /bin/bash 86 | 87 | # Load modules 88 | module load gcc/9.2.0 python/3.8.12 89 | 90 | # Activate virtual environment 91 | source velocyto/bin/activate 92 | 93 | # Open existing notebook 94 | jupyter notebook name_of_notebook.ipynb --port=50000 --browser='none' 95 | ``` 96 | 97 | ## Sharing your notebook 98 | To share the contents of your notebook, you can either upload the notebook directly to Github and add your client as a collaborator on the repo, or export the report as a markdown or PDF. 
99 | 100 | To export as a PDF, you need to have additional modules loaded and python packages installed: 101 | 102 | ``` 103 | module load texlive/2007 104 | 105 | pip3 install Pyppeteer 106 | pip3 install nbconvert 107 | ``` 108 | -------------------------------------------------------------------------------- /rc/keepalive.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Transfer files inside cluster 3 | description: This code helps with transfer files inside cluster 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [ssh, hpc] 7 | --- 8 | 9 | Useful for file transfers on O2's new transfer cluster (transfer.rc.hms.harvard.edu). 10 | 11 | The nohup command can be prepended to the bash command and the command will keep running after you logout (or have your connection interrupted). 12 | 13 | From HMS RC: 14 | 15 | `From one of the file transfer systems under transfer.rc.hms.harvard.edu , you can prefix your command with "nohup" to put it in the background and be able to log out without interrupting the process.` 16 | 17 | `For example, after logging in to e.g. the transfer01 host, run your command:` 18 | 19 | `nohup rsync -av /dir1 /dir2` 20 | 21 | `and then log out. rsync will keep running.` 22 | 23 | `To check in on the process later, just remember which machine you ran rsync and you can directly re-login to that system if you like.` 24 | 25 | `For example:` 26 | 27 | `1. ssh transfer.rc.hms.harvard.edu (let's say you land on transfer03), and then:` 28 | `2. ssh transfer01` 29 | `-- from there you can run the "ps" command or however you like to monitor the process.` 30 | 31 | 32 | ## Another option from John 33 | If you run tmux from the login node before you ssh to the transfer node to xfer files, you can drop your connection and then re-attach to your tmux session later. It should still be running your transfer. 34 | 35 | **General steps** 36 | 1) Login to O2 37 | 2) write down what login node your are on (usually something like login0#) 38 | *at login node* 39 | 3) Start a new tmux session 40 | `tmux new -s myname` 41 | 4) SSH to the transfer node 42 | `ssh user@transfer.rc.hms.harvard.edu` 43 | *on transfer node* 44 | 5) start transfer with rsync, scp etc. 45 | 6) close terminal window without logging out 46 | *time passes* 47 | 7) Login to O2 again 48 | 8) ssh to the login node you wrote down above 49 | `ssh user@login0#` 50 | 9) Reattach to your tmux session 51 | `tmux a -t myname` 52 | 10) Profit 53 | 54 | You can get around having to remember which node you logged into by alwasys logging into the same node. For example you can add this to your .bash_profile on OSX: 55 | `alias ssho2='ssh -XY -l user login05.o2.rc.hms.harvard.edu'` 56 | 57 | -------------------------------------------------------------------------------- /rc/manage-files.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Managing files 3 | description: This code helps with managing file names 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [bash, osx, linux] 7 | --- 8 | 9 | ## How to remove all files except the ones you want: 10 | 11 | First, expand rm capabilities by: 12 | `shopt -s extglob` 13 | 14 | Then use find and remove: 15 | `find . ! 
-name 'file.txt' -type f -exec rm -f {} +`


## Rename files

The `rename` utility has a lot of good options, but one pretty useful combination for removing whitespace and setting everything to lowercase is:

`rename -c --nows <files>`

## Use umask to restrict default permissions for users outside of the group

Set `umask 007` in your .bashrc. Then newly created directories will have 770 (rwxrwx---) permissions,
and files will have 660 (rw-rw----).
-------------------------------------------------------------------------------- /rc/openondemand.md: --------------------------------------------------------------------------------
## My initial notes on using the FAS-RC Open on Demand virtual desktop system

### Steps to get going
- Download and install the Cisco connect VPN

- Log in to the VPN using your FAS-RC credentials and two-factor authentication code

- Navigate to https://vdi.rc.fas.harvard.edu/pun/sys/dashboard

- Log in to the page using your FAS-RC user id and password

- Click on the Interactive Apps pulldown and select "Rstudio Server" under the Server heading. DO NOT select "RStudio Server (bioconductor + tidyverse)"

Here most of the settings are self-explanatory.
- I have tried out multiple cores (12) and while I am unsure it is using the full 48 cores parallel::detectCores finds, my simulation did run substantially faster (4 cores = 5% done when "12 cores" was at 75%). I would be interested to hear other people's experiences and how transparently/well it works with R.

- Maximum memory allocation is supposed to be 120GB. I haven't tried asking for more.

- I've been loading the R/4.02-fasrrc01 Core R versions. I tried the R/4.02-fasrc Core & gcc 9.3.0 first and ran into package compilation issues.

- You can set the R_LIBS_USER folder to use which will contain your library packages. Using this approach, I was able to install packages in session, delete the server and come back to the installed packages in a new session. You could theoretically also switch between R versions using this and the version selector.

- I haven't tried executing a script before starting Rstudio, but theoretically, I could see using this to launch a conda environment.

- I don't know about reservations but they sound interesting for getting a high mem machine.
-------------------------------------------------------------------------------- /rc/scheduler.md: --------------------------------------------------------------------------------
---
title: Alias for cluster jobs stats
description: This code helps with commands related to job submission
category: computing
subcategory: tips_tricks
tags: [bash, hpc]
---

# SLURM

* Useful aliases

```bash
alias bjobs='sacct -u ${USER} --format="JobID,JobName%25,NodeList,State,ncpus,start,elapsed" -s PD,R'
alias bjobs_all='sacct -u ${USER} --format="JobID,JobName%25,NodeList,State,ncpus,AveCPU,AveRSS,MaxRSS,MaxRSSTask,start,elapsed"'
```
-------------------------------------------------------------------------------- /rc/tmux.md: --------------------------------------------------------------------------------
Tmux is a great way to work on the server as it allows you to:
1) keep your session alive.
3) have multiple named sessions open to firewall tasks/projects.
4) run multiple windows/command lines from a single login (O2 allows a maximum of 2-3 logins to their system).
5 | 5) quickly spin off windows to do small commands (see 3). 6 | 7 | ### Useful resources 8 | 9 | #### [Tmux cheat sheet](https://tmuxcheatsheet.com/) 10 | 11 | 12 | #### Tmux configuration 13 | Tmux works great but can have some issues upon first use that make it challenging to use for those of us used to a GUI: 14 | a) the default command key is not great. 15 | b) it doesn't work well with a mouse. 16 | c) it doesn't let you copy text easily. 17 | d) it doesn't scroll your window easily. 18 | e) resizing the windows can be challenging. 19 | 20 | The confirguation file code below should make some of these issues easier: 21 | 22 | set -g default-terminal "screen-256color" 23 | set -g status-bg red 24 | set -g status-fg black 25 | 26 | # Use instead of the default as Tmux prefix 27 | set-option -g prefix C-a 28 | unbind-key C-b 29 | bind-key C-a send-prefix 30 | 31 | 32 | # Options enable mouse support in Tmux 33 | #set -g terminal-overrides 'xterm*:smcup@:rmcup@' 34 | # For Tmux >= 2.1 35 | #set -g mouse on 36 | # For Tmux <2.1 37 | # Make mouse useful in copy mode 38 | setw -g mode-mouse on 39 | # 40 | # # Allow mouse to select which pane to use 41 | set -g mouse-select-pane on 42 | # 43 | # # Allow mouse dragging to resize panes 44 | set -g mouse-resize-pane on 45 | # 46 | # # Allow mouse to select windows 47 | set -g mouse-select-window on 48 | 49 | 50 | # set colors for the active window 51 | # START:activewindowstatuscolor 52 | setw -g window-status-current-fg white 53 | setw -g window-status-current-bg red 54 | setw -g window-status-current-attr bright 55 | # END:activewindowstatuscolor 56 | 57 | 58 | ## Optional- act more like vim: 59 | #set-window-option -g mode-keys vi 60 | #bind h select-pane -L 61 | #bind j select-pane -D 62 | #bind k select-pane -U 63 | #bind l select-pane -R 64 | #unbind p 65 | #bind p paste-buffer 66 | #bind -t vi-copy v begin-selection 67 | #bind -t vi-copy y copy-selection 68 | 69 | 70 | # moving between panes 71 | # START:paneselect 72 | bind h select-pane -L 73 | bind j select-pane -D 74 | bind k select-pane -U 75 | bind l select-pane -R 76 | # END:paneselect 77 | 78 | 79 | # START:panecolors 80 | set -g pane-border-fg green 81 | set -g pane-border-bg black 82 | set -g pane-active-border-fg white 83 | set -g pane-active-border-bg yellow 84 | # END:panecolors 85 | 86 | # Command / message line 87 | # START:cmdlinecolors 88 | set -g message-fg white 89 | set -g message-bg black 90 | set -g message-attr bright 91 | # END:cmdlinecolorsd -g pane-active-border-bg yellow 92 | 93 | 94 | bind-key C-a last-window 95 | setw -g aggressive-resize on 96 | 97 | #### [Restoring your tmux session after reboot](https://andrewjamesjohnson.com/restoring-tmux-sessions/) 98 | -------------------------------------------------------------------------------- /rnaseq/RepEnrich2_guide.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to run repeat enrichment analysis 3 | description: This guide shows how to run RepEnrich2 4 | category: research 5 | subcategory: rnaseq 6 | tags: [annotation] 7 | --- 8 | 9 | RepEnrich2 tries to look at something that standard RNA-seq pipelines miss, the 10 | enrichment of repeats in NGS data. It is extremely slow and is a pain to get 11 | going. Below is a guide getting it working and has some links to a fork of 12 | RepEnrich2 I made that makes it more friendly to use. 13 | 14 | I have not actually validated the RepEnrich2 output, so caveat emptor. 
15 | 16 | # Preparing RepEnrich2 17 | 18 | ## Create isolated conda environment 19 | 20 | ```bash 21 | conda create -c bioconda -n repenrich2 python=2.7 biopython bedtools samtools bowtie2 bcbio-nextgen 22 | ``` 23 | 24 | ## Download my fork of RepEnrich2 25 | This has quality of life fixes such as memoization of outputs so if it fails you don't 26 | have to redo steps. 27 | 28 | ```bash 29 | git clone git@github.com:nerettilab/RepEnrich2.git 30 | ``` 31 | 32 | ## Download a pre-created index 33 | You can make your own, for example I made 34 | [hg38](https://www.dropbox.com/s/lefkk38q6bbj76b/Repenrich2_setup_hg38.tar.gz?dl=1) 35 | and the RepEnrich2 folks have mm9 and hg19 36 | [here](https://drive.google.com/drive/folders/0B8_2gE04f4QWNmdpWlhaWEYwaHM). But the RepeatMasker 37 | file it uses needs to be cleaned first and I'm not sure how they cleaned it. They had a hg38 one cleaned 38 | already from RepEnrich so I just used that. 39 | 40 | ## Download bcbio_RepEnrich2 41 | Download [bcbio_RepEnrich2](https://github.com/roryk/bcbio_RepEnrich2). This will need modification if you 42 | want to use it, but it is simple, I just didn't bother as I don't anticipate us running this again. 43 | 44 | # Running RepEnrich2 45 | `bcbio_RepEnrich2` is all you need to run it, the help should give you enough information to go on. 46 | annotation here is the file from RepeatMasker that was used to generate the RepEnrich setup. The 47 | bowtie index is a bowtie2 index of the genome you aligned to. Running RepEnrich2 takes FOREVER, so 48 | be sure to run it on the long queue. 49 | 50 | Example command: 51 | 52 | ```bash 53 | python bcbio_RepEnrich2.py --threads 16 ../human-dsrna/config/human-dsrna.yaml /n/app/bcbio/biodata/genomes/Hsapiens/hg38/bowtie2/hg38 metadata/hg38_repeatmasker_clean.txt metadata/RepEnrich2_setup_hg38/ 54 | ``` 55 | 56 | # RepEnrich2 outputs 57 | 58 | You will get three files for each sample, for example: 59 | 60 | ``` 61 | P1722_class_fraction_counts.txt 62 | P1722_family_fraction_counts.txt 63 | P1722_fraction_counts.txt 64 | ``` 65 | 66 | The `class` and `family` files are the counts in the `samplename_fraction_counts.txt` file aggregated by family or 67 | class. Those could be used as aggregate analyses, but the `fraciton_counts` looks at the different repeat 68 | types individually, so is more what folks are probably looking for. 69 | -------------------------------------------------------------------------------- /rnaseq/Volcano_Plots.md: -------------------------------------------------------------------------------- 1 | ## [Enhanced Volcano](https://bioconductor.org/packages/devel/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html) is a great and flexible way to creat volcano plots. 2 | 3 | Input is a dataframe of test statistics. It works well with the output of `lfcShrink()` 4 | Below is an example call and output: 5 | 6 | ``` 7 | library(EnhancedVolcano) 8 | 9 | EnhancedVolcano(shrunken_res_treatment, 10 | lab= NA, 11 | x = 'log2FoldChange', 12 | y = 'pvalue', title="Volcano Plot for Treatment", subtitle = "") 13 | ``` 14 | 15 |
<em>[example volcano plot output (see rnaseq/img/volcano.png)]</em>
</p>
18 | 19 | 20 | Almost every aspect is flexible and changable. 21 | -------------------------------------------------------------------------------- /rnaseq/ase.md: -------------------------------------------------------------------------------- 1 | # ASE = allele specific expression, allelic imbalance 2 | 3 | - https://stephanecastel.wordpress.com/2017/02/15/how-to-generate-ase-data-with-phaser/ 4 | 5 | 6 | ## Installation of Phaser: 7 | ``` 8 | git clone https://github.com/secastel/phaser.git 9 | module load gcc/6.2.0 10 | module load python/2.7.12 11 | cython/0.25.1 12 | pip install intervaltree --user 13 | cd phaser/phaser 14 | python setup.py build_ext --inplace 15 | ``` 16 | 17 | ASE QC: 18 | - https://github.com/gimelbrantlab/Qllelic 19 | -------------------------------------------------------------------------------- /rnaseq/bibliography.md: -------------------------------------------------------------------------------- 1 | # Normalization 2 | * [Comparing the normalization methods for 3 | the differential analysis of Illumina high- 4 | throughput RNA-Seq data](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0778-7). Paper that compares different normalization methods on RNASeq data. 5 | 6 | # Power 7 | * [Power in pairs: assessing the statistical value of paired samples in tests for differential expression](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302489/). Paper that looks at effect of paired-design on power in RNA-seq. 8 | 9 | # Functional analysis 10 | https://yulab-smu.github.io/clusterProfiler-book/index.html 11 | 12 | 13 | [RNA-seq qc](https://seqqc.wordpress.com/) 14 | -------------------------------------------------------------------------------- /rnaseq/failure_types: -------------------------------------------------------------------------------- 1 | Different ways that RNAseq can fail with examples. 2 | 3 | https://docs.google.com/presentation/d/1d5hyuTJMei0myG_vr7YR3vFewajwF9I9loS3I--kpnw/edit?usp=sharing 4 | -------------------------------------------------------------------------------- /rnaseq/img/test: -------------------------------------------------------------------------------- 1 | d 2 | -------------------------------------------------------------------------------- /rnaseq/img/volcano.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/rnaseq/img/volcano.png -------------------------------------------------------------------------------- /rnaseq/running_IRFinder.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to run intron retention analysis 3 | description: This code helps to run IRFinder in the cluster. 4 | category: research 5 | subcategory: rnaseq 6 | tags: [hpc, intro_retention] 7 | --- 8 | 9 | To run any of these commands, need to activate the bioconda IRFinder environment prior to running script. 10 | 11 | 1. 
First script creates reference build required for IRFinder 12 | 13 | ```bash 14 | #SBATCH -t 24:00:00 # Runtime in minutes 15 | #SBATCH -n 4 16 | #SBATCH -p medium # Partition (queue) to submit to 17 | #SBATCH --mem=128G # 128 GB memory needed (memory PER CORE) 18 | #SBATCH -o %j.out # Standard out goes to this file 19 | #SBATCH -e %j.err # Standard err goes to this file 20 | #SBATCH --mail-type=END # Mail when the job ends 21 | 22 | IRFinder -m BuildRefProcess -r reference_data/ 23 | ``` 24 | 25 | >**NOTE:** The files in the `reference_data` folder are sym links to the bcbio ref files and need to be named specifically `genome.fa` and `transcripts.gtf`: 26 | > 27 | >`genome.fa -> /n/app/bcbio/biodata/genomes/Hsapiens/hg19/seq/hg19.fa` 28 | > 29 | >`transcripts.gtf -> /n/app/bcbio/biodata/genomes/Hsapiens/hg19/rnaseq/ref-transcripts.gtf` 30 | 31 | 2. Second script (.sh) runs IRFinder and STAR on input file 32 | 33 | ```bash 34 | #!/bin/bash 35 | 36 | module load star/2.5.4a 37 | 38 | IRFinder -r /path/to/irfinder/reference_data \ 39 | -t 4 -d results \ 40 | $1 41 | ``` 42 | 43 | 3. Third script (.sh) runs a batch job for each input file in directory 44 | 45 | ```bash 46 | #!/bin/bash 47 | 48 | for fq in /path/to/*fastq 49 | do 50 | 51 | sbatch -p medium -t 0-48:00 -n 4 --job-name irfinder --mem=128G -o %j.out -e %j.err --wrap="sh /path/to/irfinder/irfinder_input_file.sh $fq" 52 | sleep 1 # wait 1 second between each job submission 53 | 54 | done 55 | ``` 56 | 57 | 4. Fourth script takes output (IRFinder-IR-dir.txt) and uses the replicates to determine differential expression using the Audic and Claverie test (# replicates < 4). analysisWithLowReplicates.pl script comes with the IRFinder github repo clone, so I cloned the repo at https://github.com/williamritchie/IRFinder/. Notes on the Audic and Claverie test can be found at: https://github.com/williamritchie/IRFinder/wiki/Small-Amounts-of-Replicates-via-Audic-and-Claverie-Test. 58 | 59 | ```bash 60 | #!/bin/bash 61 | 62 | #SBATCH -t 24:00:00 # Runtime in minutes 63 | #SBATCH -n 4 64 | #SBATCH -p medium # Partition (queue) to submit to 65 | #SBATCH --mem=128G # 8 GB memory needed (memory PER CORE) 66 | #SBATCH -o %j.out # Standard out goes to this file 67 | #SBATCH -e %j.err # Standard err goes to this file 68 | #SBATCH --mail-type=END # Mail when the job ends 69 | 70 | analysisWithLowReplicates.pl \ 71 | -A A_ctrl/Pooled/IRFinder-IR-dir.txt A_ctrl/AJ_1/IRFinder-IR-dir.txt A_ctrl/AJ_2/IRFinder-IR-dir.txt A_ctrl/AJ_3/IRFinder-IR-dir.txt \ 72 | -B B_nrde2/Pooled/IRFinder-IR-dir.txt B_nrde2/AJ_4/IRFinder-IR-dir.txt B_nrde2/AJ_5/IRFinder-IR-dir.txt B_nrde2/AJ_6/IRFinder-IR-dir.txt \ 73 | > KD_ctrl-v-nrde2.tab 74 | ``` 75 | 76 | 5. Output `KD_ctrl-v-nrde2.tab` file can be read directly into R for filtering and results exploration. 77 | 78 | 6. Rmarkdown workflow (included in report): IRFinder_report.md 79 | -------------------------------------------------------------------------------- /rnaseq/running_leafviz.md: -------------------------------------------------------------------------------- 1 | # Leafviz - Visualize Leafcutter Results 2 | 3 | 4 | **All scripts found in `/HBC Team Folder (1)/Resources/LeafCutter_2023`** 5 | 6 | ### Step 1 - Get annotation files 7 | 8 | Annotation files already prepared for hg38 are in the folder listed above in a folder named new_hg38. 
To make annotation files for another organism run 9 | 10 | ```bash 11 | ./gtf2leafcutter.pl -o /full/path/to/directory/annotation_directory_name/annotation_file_prefix \ 12 | /full/path/to/gencode/annotation.gtf 13 | ``` 14 | 15 | ### Step 2 - Make RData files from your results 16 | 17 | Use the `prepare_results.R` script in the linked folder. The one on the leafcutter github has not been updated. 18 | Your leafcutter should have output `leafcutter_ds_cluster_significance_XXXXX.txt`, `leafcutter_ds_effect_sizes_XXXXX.txt`, and `XXXX_perind_numers.counts.gz`. 19 | You will need all three of these files and the groups file you made. An example groups file for my PD comparison is below: 20 | 21 | ```bash 22 | NCIH1568_RB1_KO_DMSO_Replicate_3-ready.bam DMSO 23 | NCIH1568_RB1_KO_DMSO_Replicate_2-ready.bam DMSO 24 | NCIH1568_RB1_KO_DMSO_Replicate_1-ready.bam DMSO 25 | NCIH1568_RB1_KO_PD_Replicate_1-ready.bam PD 26 | NCIH1568_RB1_KO_PD_Replicate_3-ready.bam PD 27 | NCIH1568_RB1_KO_PD_Replicate_2-ready.bam PD 28 | ``` 29 | 30 | Below is an example to make the RData from the PD comparison. This code outputs `PD_new.RData`. 31 | Note that the annotation has the file path and the annotation file prefix. 32 | 33 | ```bash 34 | ./prepare_results.R -m PD_groups.txt NCIH_PD_perind_numers.counts.gz \ 35 | leafcutter_ds_cluster_significance_PD.txt leafcutter_ds_effect_sizes_PD.txt \ 36 | new_hg38/new_hg38 -o PD_new.RData 37 | ``` 38 | 39 | ### Step 3 - Visualize 40 | 41 | Once you have RDatas made for all of your contrasts it is time to actually run leafviz. 42 | The critical scripts to run leafviz are `run_leafviz.R `, `ui.r`, and `server.R`. 43 | Before you can run it you **must** change the path on line 41 of `run_leafviz.R`. 44 | This path needs to reflect the location of the `ui.R` and `server.R` files. 45 | To run leafviz with my PD data set I give 46 | 47 | ```bash 48 | ./run_leafviz.R PD_new.RData 49 | ``` 50 | 51 | This will open a new tab in your browswer with your results! 52 | 53 | For more on what is being shown check the leafviz [documentation](http://davidaknowles.github.io/leafcutter/articles/Visualization.html) 54 | -------------------------------------------------------------------------------- /rnaseq/running_rMATS.md: -------------------------------------------------------------------------------- 1 | ## rMATS for differential splicing analysis 2 | 3 | * Event-based analysis of splicing (e.g. skipped exon, retained intron, alternative 5' and 3' splice site) 4 | * rMATS handles replicate RNA-Seq data from both paired and unpaired study design 5 | * statistical model of rMATS calculates the P-value and false discovery rate that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold 6 | 7 | Software: https://rnaseq-mats.sourceforge.net/ 8 | 9 | Paper: https://www.pnas.org/doi/full/10.1073/pnas.1419161111 10 | 11 | GitHub: https://github.com/Xinglab/rmats-turbo 12 | 13 | ### Installation 14 | 15 | Issues with the conda build installation provided on the GitHub page `./build_rmats --conda`. Had problem with shared libraries (" "loading shared libraries" error ). 16 | 17 | Instead install from bioconda. 
Reference: https://groups.google.com/g/rmats-user-group/c/S1GFEqB9TE8/m/YV9R27CoCwAJ?pli=1 18 | 19 | ```bash 20 | 21 | # Need speicific python version 22 | conda create -n "rMATS_python3.7" python=3.7 23 | 24 | conda activate rMATS_python3.7 25 | 26 | conda install -c conda-forge -c bioconda rmats=4.1.0 27 | 28 | ``` 29 | 30 | If you are running as a script on O2: 31 | 32 | ```bash 33 | #! /bin/bash 34 | 35 | #SBATCH -t 0-24:00 # Runtime 36 | #SBATCH -p medium # Partition (queue) 37 | #SBATCH -J rmats # Job name 38 | #SBATCH -o rmats_frag.out # Standard out 39 | #SBATCH -e rmats_frag.err # Standard error 40 | #SBATCH --mem=50G # Memory needed per core 41 | #SBATCH -c 6 42 | 43 | 44 | # USAGE: For paired-end BAM files;run rMATS 45 | 46 | # Define the project path 47 | path=/n/data1/cores/bcbio/PIs/peter_sicinski/sicinski_inhibition_RNAseq_human_hbc04676 48 | 49 | # Change directories 50 | cd ${path}/rMATS 51 | 52 | # Activate conda env for rmats 53 | source ~/miniconda3/bin/activate 54 | conda init bash 55 | source ~/.bashrc 56 | conda activate rMATS_python3.7 57 | 58 | ``` 59 | -------------------------------------------------------------------------------- /rnaseq/strandedness.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Stranded RNA-seq libraries. 3 | description: Explains strandedness and where to find info in bcbio. 4 | category: research 5 | subcategory: rnaseq 6 | --- 7 | 8 | Bulk RNA-seq libraries retaining strand information (stranded) are useful to quantify expression with higher accuracy for opposite 9 | strand transcripts which overlap or have overlapping UTRs. 10 | https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1876-7. 11 | 12 | Bcbio RNA-seq pipeline has a 'strandedness' parameter: [unstranded|firststrand|secondstrand] 13 | https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html?highlight=strand#configuration. <- link not working* 14 | 15 | The terminology was inherited from Tophat, see the detailed description in the Salmon doc. 16 | https://salmon.readthedocs.io/en/latest/library_type.html 17 | Note, that firstrand = ISR for PE and SR for SE. 18 | 19 | If the strandedness is unknown, run a small subset of reads with 'unstranded' in bcbio and check out what Salmon reports in 20 | `bcbio_project/final/sample/salmon/lib_format_counts.json`: 21 | ``` 22 | { 23 | "read_files": [ 24 | "/dev/fd/63", 25 | "/dev/fd/62" 26 | ], 27 | "expected_format": "IU", 28 | "compatible_fragment_ratio": 1.0, 29 | "num_compatible_fragments": 721856, 30 | "num_assigned_fragments": 721856, 31 | "num_frags_with_concordant_consistent_mappings": 692049, 32 | "num_frags_with_inconsistent_or_orphan_mappings": 47441, 33 | "strand_mapping_bias": 0.9477291347866986, 34 | "MSF": 0, 35 | "OSF": 0, 36 | "ISF": 36174, 37 | "MSR": 0, 38 | "OSR": 0, 39 | "ISR": 655875, 40 | "SF": 37676, 41 | "SR": 9765, 42 | "MU": 0, 43 | "OU": 0, 44 | "IU": 0, 45 | "U": 0 46 | } 47 | ``` 48 | Here the majority of reads are ISR. 49 | 50 | Another way to check strand bias is 51 | `bcbio_project/final/sample/qc/qualimap_rnaseq/rnaseq_qc_results.txt`. 52 | It has `SSP estimation (fwd/rev) = 0.04 / 0.96` meaning strand bias (ISR, firststrand). 53 | 54 | Yet another way to confirm strand bias is seqc. 55 | http://rseqc.sourceforge.net/#infer-experiment-py. 
56 | It uses a small subset of the input bam file: 57 | `infer_experiment.py -r /bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.bed -i test.bam` 58 | 59 | ``` 60 | This is PairEnd Data 61 | Fraction of reads failed to determine: 0.1461 62 | Fraction of reads explained by "1++,1--,2+-,2-+": 0.0177 63 | Fraction of reads explained by "1+-,1-+,2++,2--": 0.8362 64 | ``` 65 | -------------------------------------------------------------------------------- /rnaseq/tools.md: -------------------------------------------------------------------------------- 1 | - [IsoformSwitchAnalyzer](https://bioconductor.org/packages/release/bioc/vignettes/IsoformSwitchAnalyzeR/inst/doc/IsoformSwitchAnalyzeR.html) 2 | - LP/VB?, 2019/02? 3 | - version#? 4 | - helps to detect alternative splicing 5 | - output very nice figures 6 | - what requirements are needed (e.g. R-3.5.1, etc.)? 7 | - no tutorials available 8 | - not incorporated into bcbio 9 | - I tried it and an example of a consults is here:https://code.harvard.edu/HSPH/hbc_RNAseq_christiani_RNAediting_on_lung_in_humna_hbc02307. This packages has very nice figures: https://www.dropbox.com/work/HBC%20Team%20Folder%20(1)/consults/david_christiani/RNAseq_christiani_RNAediting_on_lung_in_humna?preview=dtu.html (see at the end of the report). 10 | 11 | - [DEXseq](https://bioconductor.riken.jp/packages/3.0/bioc/html/DEXSeq.html) 12 | - LP/VB/RK?, date? 13 | - version#? 14 | - used to call isoform switching 15 | - not recommended - use DTU tool instead 16 | - what requirements are needed (e.g. R-3.5.1, etc.)? 17 | - no tutorials available 18 | - yes, DEXseq is incorporated in bcbio 19 | - Following this paper from MLove et al: https://f1000research.com/articles/7-952/v3 I used salmon and DEXseq to call isoform switching. This consult has an example: https://code.harvard.edu/HSPH/hbc_RNAseq_christiani_RNAediting_on_lung_in_humna_hbc02307. I found that normally one isomform changes a lot and another very little, but I found some examples were the switching is more evident. 20 | 21 | - [clusterProfiler](https://yulab-smu.github.io/clusterProfiler-book/index.html) 22 | -------------------------------------------------------------------------------- /scrnaseq/10XVisium.md: -------------------------------------------------------------------------------- 1 | ## Analysis of 10X Visium data 2 | 3 | > Download and install spaceranger: https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/installation 4 | 5 | Helpful resource: https://lmweber.org/OSTA-book/ 6 | 7 | ### Analysis software/packages 8 | * [scanpy](https://scanpy-tutorials.readthedocs.io/en/latest/spatial/basic-analysis.html) 9 | * [Spatial transcriptomics with Seurat](https://yu-tong-wang.github.io/talk/sc_st_data_analysis_R.html) 10 | * [Spatial single-cell quantification with alevin-fry](https://combine-lab.github.io/alevin-fry-tutorials/2021/af-spatial/) 11 | 12 | 13 | ### 1. BCL to FASTQ 14 | 15 | `spaceranger mkfastq` can be used here. Input is the flow cell directory. 16 | 17 | Note that **if your SampleSheet is formatted for BCL Convert**, which is Illumina's new demultiplexing software that is soon going to replace bcl2fastq, you will get an error. 18 | 19 | You will need to change the formatting slightly. 20 | 21 | https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/mkfastq#simple_csv 22 | 23 | If you are creating the simple csv samplesheet with specific oligo sequences for each sample index you may need to make edits. 
The I2 auto-orientation detector cannot be activated when supplying a simple csv. For example, the NextSeq instrument needs the index 2 in reverse complement. So if you had: 24 | 25 | ``` 26 | Lane,Sample,Index,Index2 27 | *,CD_3-GEX_03,GCGGGTAAGT,TAGCACTAAG 28 | ``` 29 | 30 | You can either: 31 | 32 | 1. Specify the reverse complement Index2 oligo 33 | 34 | ``` 35 | *,CD_3-GEX_03,GCGGGTAAGT,CTTAGTGCTA 36 | ``` 37 | 38 | 2. Use the 10x index names, e.g. SI-TT-G6 39 | 40 | ``` 41 | *,CD_3-GEX_03,SI-TT-G6 42 | ``` 43 | 44 | You can find more information [linked here on the 10X website](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/bcl2fastq-direct#sample-sheet) 45 | 46 | Additionally, if you find you have `AdapterRead1` or `AdapterRead2` under "Settings" in this file, you will want to remove that. For any 10x library, regardless of how the demultiplexing is being done, we do not recommend adapter trimming via the Illumina samplesheet settings -- **this will cause problems with reads in downstream analyses**. 47 | 48 | 49 | **To create FASTQ files:** 50 | 51 | ```bash 52 | spaceranger mkfastq --run data/11-7-2022-DeVries-10x-GEX-Visium/Files \ 53 | --simple-csv samplesheets/11-7-2022-DeVries-10x-GEX-Visium_Samplesheet.csv \ 54 | --output-dir fastq/11-7-2022-DeVries-10x-GEX-Visium 55 | 56 | ``` 57 | 58 | ### 2. Image files 59 | 60 | Each slide has 4 capture areas and therefore for a single slide you should have 4 image files. 61 | 62 | More on image types [from 10X docs here](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/image-recommendations) 63 | 64 | * Check what type of image you have (you will need to specify it in `spaceranger` with the correct flag) 65 | * Open up the image to make sure you have the fiducial border. It's probably done for you. If there are issues with the fiducial alignment (i.e. too tall, too wide) given to you, you may need to manually align using the Loupe browser 66 | 67 | 68 | ### 3. Counting expression data 69 | 70 | The next step is to quantify expression for each capture area. To do this we will use [`spaceranger count`](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/tutorials/count-ff-tutorial). This command will need to be run for each capture area. Below is the command for a single capture area (in this case Slide 1, capture area A1). You may find your files have not been named with A-D, so map them accordingly.
71 | 72 | A few things to note if you have samples that were run on multiple flow cells: 73 | 74 | * include the `--sample` argument to specify the samplename which corresponds to the capture area 75 | * for the `--fastqs` you can add multiple paths to the different flow cell folders and separate them by a comma 76 | 77 | ```bash 78 | 79 | spaceranger count --id="CD_Visium_01" \ 80 | --sample=CD_Visium_01 \ 81 | --description="Slide1_CaptureArea1" \ 82 | --transcriptome=refdata-gex-mm10-2020-A \ 83 | --fastqs=mkfastq/11-10-2022-DeVries-10x-GEX-Visium/AAAW33YHV/,mkfastq/11-4-2022-Devries-10x-GEX-Visium/AAAW3N3HV/,mkfastq/11-7-2022-DeVries-10x-GEX-Visium/AAAW352HV/,mkfastq/11-8-2022-DeVries-10x-GEX-Visium/AAAW3FCHV/,mkfastq/11-9-2022-DeVries-10x-GEX-Visium/AAAW3F3HV/ \ 84 | --image=images/100622_Walker_Slides1_and_2/V11S14-092_20221006_01_Field1.tif\ 85 | --slide=V11S14-092 \ 86 | --area=A1 \ 87 | --localcores=6 \ 88 | --localmem=20 89 | 90 | ``` 91 | -------------------------------------------------------------------------------- /scrnaseq/CellRanger.md: -------------------------------------------------------------------------------- 1 | # When running Cell Ranger on O2 2 | ## Shared by Victor 3 | > 10x documentation is the one he uses: 4 | 5 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger 6 | 7 | > For the custom genome: 8 | 9 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr 10 | 11 | > For GFP: 12 | 13 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr#marker 14 | 15 | ## Build custom genome 16 | > High level steps (based on the 10x tutorial): 17 | 1. Download the gtf and fasta files for the species of interest; 18 | 2. Filter gtf with `cellranger mkgtf` command; 19 | 3. Create the fasta file for the additional gene (for example, GFP); 20 | 4. Create the corresponding gtf file for the additional gene; 21 | 5. Append fasta file of the additional gene to the end of the fasta file for the genome; 22 | 6. Append gtf file of the additional gene to the end of the gtf file for the genome; 23 | 7. Make custom genome with `cellranger mkref` command; 24 | 25 | -------------------------------------------------------------------------------- /scrnaseq/Demuxafy_HowTo.md: -------------------------------------------------------------------------------- 1 | # How to run Demuxafy on O2 2 | 3 | For detailed instructions and updates on `demuxafy`, see the comprehensive [Read the Docs](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/index.html#) 4 | 5 | 6 | ## Installation 7 | 8 | I originally downloaded the `Demuxafy.sif` singularity image for use on O2 as instructed [here](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/Installation.html). However, **this singularity image did not pass O2's security checks**. The folks at HMS-RC were kind enough to amend the image for me so that it would pass the security checks. The working image is found at: `/n/app/singularity/containers/Demuxafy.sif` allowing anyone to use it. 9 | 10 | Of note, this singularity image includes a bunch of software, including popscle, demuxlet, freemuxlet, souporcell and other demultiplexing as well as doublet detection tools, so very useful to have installed! 
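As a quick check that the image works for your account, you can `exec` a bundled tool straight from the shared image (a minimal sketch; the `DEMUXAFY` variable is just a convenience used in the commands further down, and the exact Singularity module name on O2 may differ):

```bash
# Load Singularity on O2 first if it is not already on your PATH (module name/version may differ)
module load singularity

# Convenience variable pointing at the shared Demuxafy image
DEMUXAFY=/n/app/singularity/containers/Demuxafy.sif

# General pattern: run any bundled tool through the image;
# most of them (e.g. popscle) print their usage when run without arguments
singularity exec $DEMUXAFY popscle
```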
11 | 12 | 13 | ## Input data 14 | 15 | Each tool included in `demuxafy` requires slightly different input (see [Read the Docs](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/index.html#)). 16 | 17 | For the demultiplexing tools, in most cases, you will need: 18 | 19 | - A common SNP genotypes VCF file (pre-processed VCF files can be downloaded [here](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/DataPrep.html), which is what I did after repeatedly failing to re-generate my own VCF file from the 1000 genome dataset following the provided instructions...) 20 | - A Barcode file (`outs/raw_feature_bc_matrix/barcodes.tsv.gz` from a typical `cellranger count` run) 21 | - A BAM file of aligned single-cell reads (`outs/possorted_genome_bam.bam` from a typical `cellranger count` run) 22 | - Knowledge of the number of samples in the pool you're trying to demultiplex 23 | - Potentially, a FASTA file of the genome your sample was aligned to 24 | 25 | _NOTE_: When working from a multiplexed dataset (e.g. cell hashing experiment), you may have to re-run `cellranger count` instead of `cellranger multi` to generate the proper barcodes and BAM files. In addition, it may be necessary to use the `barcodes.tsv.gz` file from the `filtered_feature_bc_matrix` (instead of raw) in such cases (see for example this [issue](https://github.com/wheaton5/souporcell/issues/128) when running `souporcell`). 26 | 27 | 28 | ## Pre-processing steps 29 | 30 | Once you've collated those files, you need to make sure your VCF and BAM files are sorted in the same way. This can be achieved by running the following command after sourcing `sort_vcf_same_as_bam.sh` from the Aerts' Lab popscle helper tool GitHub repo (available [here](https://github.com/aertslab/popscle_helper_tools/blob/master/sort_vcf_same_as_bam.sh)): 31 | 32 | ``` 33 | # Sort VCF file in same order as BAM file 34 | sort_vcf_same_as_bam.sh $BAM $VCF > demuxafy/data/GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding_sorted.vcf 35 | ``` 36 | 37 | ### dsc pileup 38 | 39 | If you wish to run `freemuxlet` (and possibly other tools I haven't piloted), you will also need to run `dsc-pileup` (available within the singularity image) ahead of `freemuxlet` itself. For larger samples (>30k cells), it also helps (= significantly speeds up computational time, from several days to a couple of hours) to pre-filter the BAM file using another of the Aerts' Lab popscle helper tool scripts: `filter_bam_file_for_popscle_dsc_pileup.sh` (available [here](https://github.com/aertslab/popscle_helper_tools/blob/master/filter_bam_file_for_popscle_dsc_pileup.sh)) 40 | 41 | ``` 42 | # [OPTIONAL but recommended] 43 | module load gcc/9.2.0 samtools/1.14 44 | scripts/filter_bam_file_for_popscle_dsc_pileup.sh $BAM $BARCODES demuxafy/data/GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding_sorted.vcf demuxafy/data/possorted_genome_bam_filtered.bam 45 | 46 | # Run popscle pileup ahead of freemuxlet 47 | singularity exec $DEMUXAFY popscle dsc-pileup --sam demuxafy/data/possorted_genome_bam_filtered.bam --vcf demuxafy/data/GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding_sorted.vcf --group-list $BARCODES --out $FREEMUXLET_OUTDIR/pileup 48 | ``` 49 | 50 | _NOTE_: When running the dsc-pileup step on O2, at some point the job might get stalled despite no error message being issued. From my experience, this usually means that the requested memory needs to be increased (I used 48G-56G for most samples I processed, and encountered issues when lowering down to 32G). 
After filtering the BAM file and with the appropriate amount of memory available, the dsc-pileup step usually completes within 2-3 hours. 51 | 52 | 53 | ## Workflow 54 | 55 | After that, you should be set to run whichever demultiplexing tool you want! See sample scripts for a simple case (small 10X study) in the following [GitHub repo](https://github.com/hbc/neuhausser_scRNA-seq_human_embryo_hbc04528/tree/main/pilot_scRNA-seq/demuxafy/scripts); and for a more complex case (large study, multiplexed 10X data using cell hashing) [here](https://github.com/hbc/hbc_10xCITESeq_Pregizer-Visterra-_hbc04485/tree/main/demuxafy/scripts) 56 | 57 | You also have the option to generate combined results files to contrast results from different software more easily, as described [here](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/CombineResults.html), and as implemented in the `combine_results.sbatch` script in the first GitHub repo linked above. 58 | -------------------------------------------------------------------------------- /scrnaseq/MDS_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/scrnaseq/MDS_plot.png -------------------------------------------------------------------------------- /scrnaseq/README.md: -------------------------------------------------------------------------------- 1 | # scRNA-seq 2 | 3 | * **[Tools for scRNA-seq analysis](tools.md):** This document lists the various tools that are currently being used for scRNA-seq analysis (and who has used/tested them) in addition to new tools that we are interested in but have yet to be tested. 4 | 5 | * **Tutorials for scRNA-seq analysis:** These documents are tutorials to help you with various types of scRNA-seq analysis. 6 | 7 | - **[Single-Cell-conda.md](https://github.com/hbc/knowledgebase/blob/master/research/scrnaseq/Single-Cell-conda.md):** installing tools for scRNA-seq analysis with conda. 8 | - **[Single-Cell.md](https://github.com/hbc/knowledgebase/blob/master/research/scrnaseq/Single-Cell.md):** installing tools and setting up docker for single cell rnaseq 9 | - **[rstudio_sc_docker.md](https://github.com/hbc/knowledgebase/blob/master/research/scrnaseq/rstudio_sc_docker.md):** This docker image contains an rstudio installation with some helpful packages for singlecell analysis. It also includes a conda environment to deal with necessary python packages (like umap-learn). 10 | - **[Single-cell analysis workflow](https://github.com/hbc/tutorials/tree/master/scRNAseq/scRNAseq_analysis_tutorial):** tutorials walking through the steps in a single-cell RNA-seq analysis, including differential expression analysis, power analysis, and creating a SPRING interface 11 | 12 | * **[Bibliography](bibliography.md):** This document lists relevant papers pertaining to scRNA-seq analysis 13 | -------------------------------------------------------------------------------- /scrnaseq/SNP_demultiplex.md: -------------------------------------------------------------------------------- 1 | # Demultiplexing SC data using SNP information 2 | 3 | ## Overview 4 | 5 | ## Methods: 6 | 7 | * scSplit: 8 | 9 | 10 | **References**: 11 | 12 | * [Paper](https://doi.org/10.1186/s13059-019-1852-7) 13 | 14 | * [Repo](https://github.com/jon-xu/scSplit) 15 | 16 | * Demuxlet/Freemuxlet/Popscle: 17 | 18 | Demuxlet is the first iteration of the software. 
Popscle is a suite that includes an improved version of demuxlet and also freemuxlet. It is recommended 19 | by the authors to use popscle. 20 | 21 | **Running it:** 22 | 23 | Installing demuxlet is not straightforward with very particular instructions. A similar situation might happen with popscle which it's not published yet. 24 | I recommend using Docker. The repo contains the [Dockerfile](https://github.com/statgen/popscle/blob/master/Dockerfile). You can use it to create your own 25 | docker image. One available is [here](https://hub.docker.com/repository/docker/vbarrerab/popscle). This image can also be used to create a singularity container 26 | on O2. 27 | 28 | _Running on O2: singularity_ 29 | 30 | 31 | 32 | singularity exec -B :/bam_files,:/vcf_files,:/results 33 | /n/app/singularity/containers// popscle dsc-pileup --sam /bam_files/ --vcf /vcf_files/ 34 | --out /results/ 35 | 36 | **Recommendations:** 37 | 38 | It is highly reccomended to reduce the number of reads and SNPs before running 39 | 40 | 41 | 42 | **References**: 43 | 44 | _Demuxlet_ 45 | 46 | * [Paper - Demuxlet](https://www.nature.com/articles/nbt.4042) 47 | 48 | * [Repo](https://github.com/statgen/demuxlet) 49 | 50 | _Popscle (Demuxlet/Freemuxlet)_ 51 | 52 | * [Repo](https://github.com/statgen/popscle) 53 | 54 | _popscle helper tools_ 55 | 56 | * [Repo](https://github.com/aertslab/popscle_helper_tools) 57 | -------------------------------------------------------------------------------- /scrnaseq/Single-Cell-conda.md: -------------------------------------------------------------------------------- 1 | *Single cell analyses require a lot of memory and often fail on the laptops. 2 | Having R + Seurat installed in a conda environment + interactive session or batch jobs with 50-100G RAM helps.* 3 | 4 | # 1. Use conda from bcbio 5 | ``` 6 | which conda 7 | /n/app/bcbio/dev/anaconda/bin/conda 8 | conda --version 9 | conda 4.6.14 10 | ``` 11 | 12 | # 2. Create and setup r conda environment 13 | ``` 14 | conda create -n r r-essentials r-base zlib pandoc 15 | conda init bash 16 | conda config --set auto_activate_base false 17 | . ~/.bashrc 18 | ``` 19 | 20 | # 3. Activate conda env 21 | ``` 22 | conda activate r 23 | which R 24 | 25 | ``` 26 | 27 | # 4. Install packages from within R 28 | 29 | ## 4.1 Install Seurat 30 | ``` 31 | R 32 | install.packages("Seurat") 33 | library(Seurat) 34 | q() 35 | ``` 36 | 37 | ## 4.2 Install Monocle 38 | ``` 39 | R 40 | install.packages(c("BiocManager", "remotes")) 41 | BiocManager::install("monocle") 42 | q() 43 | ``` 44 | 45 | ## 4.3 Install liger 46 | ``` 47 | R 48 | install.packages("devtools") 49 | library(devtools) 50 | install_github("MacoskoLab/liger") 51 | library(liger) 52 | q() 53 | ``` 54 | 55 | # 5. Install umap-learn for UMAP clustering 56 | ``` 57 | pip install umap-learn 58 | ``` 59 | 60 | # 6. Deactivate conda 61 | ``` 62 | conda deactivate 63 | ``` 64 | 65 | # 7. (Troubleshooting) 66 | - It may ask you to install github token - too many packages loaded from github. 67 | I generated token on my laptop and placed it in ~/.Renviron 68 | - BiocManager::install("slingshot") - I failed to install it due to gsl issues. 
69 | - when running a batch job, use source activate r/ source deactivate 70 | - if conda is trying to write in bcbio cache, check and set cache priority, your home cache should be first: 71 | `conda info`, 72 | ~/.condarc 73 | ``` 74 | pkgs_dirs: 75 | - /home/[UID]/.conda/pkgs 76 | - /n/app/bcbio/dev/anaconda/pkgs 77 | ``` 78 | -------------------------------------------------------------------------------- /scrnaseq/bcbio_indrops3.md: -------------------------------------------------------------------------------- 1 | # Counting cells with bcbio for inDrops3 data - proto-SOP 2 | 3 | ## Last use - 2020-03-17 4 | 5 | ## 1. Check reference genome and transcriptome - is it a mouse project? 6 | - mm10 reference genome: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10 7 | - transcriptome_fasta: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.fa 8 | - transcriptome_gtf: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf 9 | 10 | ## 2. Create bcbio project structure in /scratch 11 | ``` 12 | mkdir sc_mouse 13 | cd sc_mouse 14 | mkdir config input final work 15 | ``` 16 | 17 | ## 3. Prepare fastq input in sc_mouse/input 18 | - some FC come in 1..4 lanes, merge lanes for every read: 19 | ``` 20 | cat lane1_r1.fq.gz lane2_r1.fq.gz > project_1.fq.gz 21 | cat lane1_r2.fq.gz lane2_r2.fq.gz > project_2.fq.gz 22 | ``` 23 | - cat'ing gzip files sounds ridiculous, but works for the most part, for purists: 24 | ``` 25 | zcat KM_lane1_R1.fastq KM_lane2_R1.fastq.gz | gzip > KM_1.fq.gz 26 | ``` 27 | 28 | - some cores send bz2 files not gz 29 | ``` 30 | bunzip2 *.bz2 31 | cat *R1.fastq | gzip > sample_1.fq.gz 32 | ``` 33 | 34 | - some cores produce R1,R2,R3,R4, others R1,R2,I1,I2, rename them 35 | ``` 36 | bcbio_R1 = R1 = 86 or 64 bp transcript read 37 | bcbio_R2 = I1 = 8 bp part 1 of cell barcode 38 | bcbio_R3 = I2 = 8 bp sample (library) barcode 39 | bcbio_R4 = R2 = 14 bp = 8 bp part 2 of cell barcode + 6 bp of transcript UMI 40 | ``` 41 | - files in sc_mouse/input should be (KM here is project name): 42 | ``` 43 | KM_1.fq.gz 44 | KM_2.fq.gz 45 | KM_3.fq.gz 46 | KM_4.fq.gz 47 | ``` 48 | 49 | ## 4. Create `sc_mouse/config/sample_barcodes.csv` 50 | Check out if the sample barcodes provided match the actual barcodes in the data. 51 | ``` 52 | gunzip -c FC_X_3.fq.gz | awk '{if(NR%4 == 2) print $0}' | head -n 400000 | sort | uniq -c | sort -k1,1rn | awk '{print $2","$1}' | head 53 | 54 | AGGCTTAG,112303 55 | ATTAGACG,95212 56 | TACTCCTT,94906 57 | CGGAGAGA,62461 58 | CGGAGATA,1116 59 | CGGATAGA,944 60 | GGGGGGGG,852 61 | ATTAGACC,848 62 | ATTAGCCG,840 63 | ATTATACG,699 64 | ``` 65 | 66 | Sometimes you need to reverse complement sample barcodes: 67 | ``` 68 | cat barcodes_original.csv | awk -F ',' '{print $1}' | tr ACGTacgt TGCAtgca | rev 69 | ``` 70 | 71 | sample_barcodes.csv 72 | ``` 73 | TCTCTCCG,S01 74 | GCGTAAGA,S02 75 | CCTAGAGT,S03 76 | TCGACTAG,S04 77 | TTCTAGAG,S05 78 | ``` 79 | 80 | ## 5. 
Create `sc_mouse/config/sc-mouse.yaml` 81 | ``` 82 | details: 83 | - algorithm: 84 | cellular_barcode_correction: 1 85 | minimum_barcode_depth: 1000 86 | sample_barcodes: /full/path/sc_mouse/config/sample_barcodes.csv 87 | transcriptome_fasta: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.fa 88 | transcriptome_gtf: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf 89 | umi_type: harvard-indrop-v3 90 | analysis: scRNA-seq 91 | description: PI_name 92 | files: 93 | - /full/path/sc_mouse/input/KM_1.fq.gz 94 | - /full/path/sc_mouse/input/KM_2.fq.gz 95 | - /full/path/sc_mouse/input/KM_3.fq.gz 96 | - /full/path/sc_mouse/input/KM_4.fq.gz 97 | genome_build: mm10 98 | metadata: {} 99 | fc_name: sc-mouse 100 | upload: 101 | dir: /full/path/sc_mouse/final 102 | ``` 103 | Use `cd sc_mouse/input; readlink -f *` to grab full path to each file and paste into yaml. 104 | 105 | ## 6. Create `sc_mouse/config/bcbio.sh` 106 | ``` 107 | #!/bin/bash 108 | 109 | # https://slurm.schedmd.com/sbatch.html 110 | 111 | #SBATCH --partition=priority # Partition (queue) 112 | #SBATCH --time=10-00:00 # Runtime in D-HH:MM format 113 | #SBATCH --job-name=km # Job name 114 | #SBATCH -c 20 115 | #SBATCH --mem-per-cpu=5G # Memory needed per CPU 116 | #SBATCH --output=project_%j.out # File to which STDOUT will be written, including job ID 117 | #SBATCH --error=project_%j.err # File to which STDERR will be written, including job ID 118 | #SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL) 119 | 120 | bcbio_nextgen.py ../config/sc-mouse.yaml -n 20 121 | ``` 122 | - most projects take < 5days, but some large 4 lane could take more, like 7-8 123 | 124 | ## 7. Run bcbio 125 | ``` 126 | cd sc_mouse_work 127 | sbatch ../config/bcbio.sh 128 | ``` 129 | 130 | ## 1a. (Optional). 131 | If you care, download fresh transcriptome annotation from Gencode (https://www.gencodegenes.org/mouse/) 132 | (it has chrom names with chr matching mm10 assembly). 133 | ``` 134 | cd sc_mouse/input 135 | wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz 136 | gunzip gencode.vM23.annotation.gtf.gz 137 | gffread -g /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/seq/mm10.fa gencode.vM23.annotation.gtf -x gencode.vM23.annotation.cds.fa 138 | ``` 139 | update sc_mouse/config/sc_mouse.yaml: 140 | ``` 141 | transcriptome_fasta: gencode.vM23.annotation.cds.fa 142 | transcriptome_gtf: gencode.vM23.annotation.gtf 143 | ``` 144 | ## References 145 | - indrops3 library structure: https://singlecellcore.hms.harvard.edu/resources 146 | - [Even shorter guide](https://github.com/bcbio/bcbio-nextgen/blob/master/config/templates/indrop-singlecell.yaml) 147 | - [Much more comprehensive guide](https://github.com/hbc/tutorials/blob/master/scRNAseq/scRNAseq_analysis_tutorial/lessons/01_bcbio_run.md) 148 | -------------------------------------------------------------------------------- /scrnaseq/bibliography.md: -------------------------------------------------------------------------------- 1 | # Integration 2 | - [Seurat](https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8) 3 | - [Harmony](https://www.biorxiv.org/content/10.1101/461954v2) 4 | 5 | # More references 6 | 1. [A collection of resources from seandavi](https://github.com/seandavi/awesome-single-cell) 7 | 2. https://scrnaseq-course.cog.sanger.ac.uk/website/index.html 8 | 3. https://broadinstitute.github.io/2019_scWorkshop/index.html 9 | 4. 
https://github.com/SingleCellTranscriptomics/ISMB2018_SingleCellTranscriptomeTutorial 10 | 5. [Bibliography in bib](bcbio_sc.bib) 11 | -------------------------------------------------------------------------------- /scrnaseq/cite_seq.md: -------------------------------------------------------------------------------- 1 | 2 | - https://cite-seq.com 3 | - https://en.wikipedia.org/wiki/CITE-Seq 4 | - https://github.com/Hoohm/CITE-seq-Count 5 | - https://sites.google.com/site/fredsoftwares/products/cite-seq-counter 6 | -------------------------------------------------------------------------------- /scrnaseq/doublets.md: -------------------------------------------------------------------------------- 1 | # Doublet identification 2 | 3 | - gene number filter is not effective in identifying doublets (Scrublet2019 article). 4 | - there is no good unsupervised doublets detection method for now 5 | - DoubletIdentification works for a group of cells we suspect they might be doublets (a cluster or a group of clusters) - if we see mixed marker signature and we know that these cells are not in transitional state, i.e. expert review of clusters is needed before doublet deconvolution 6 | - dump counts from suspected counts from Seurat 7 | - identify doublets with Scrublet 8 | - get back to Seurat 9 | 10 | R based DoubletFinder and DoubletDecon have issues 11 | - https://github.com/chris-mcginnis-ucsf/DoubletFinder/issues/64 12 | - https://github.com/EDePasquale/DoubletDecon/issues/21 13 | 14 | -------------------------------------------------------------------------------- /scrnaseq/pub_quality_umaps.md: -------------------------------------------------------------------------------- 1 | # Here is a collaction of code for nice looking umaps from people in the core. Please add! 2 | 3 | 4 | ## Zhu's pretty white boxes 5 | 6 | 7 | 8 | ### Code 9 | 10 | **Note: Zhu says "The gist is to add cluster numbers to the ggplot data, then using LableClusters to plot."** 11 | 12 | ```R 13 | Idents(seurat_stroma_SCT) <- "celltype" 14 | 15 | p1 <- DimPlot(object = seurat_stroma_SCT, 16 | reduction = "umap", 17 | label = FALSE, 18 | label.size = 4, 19 | repel = TRUE) + xlab("UMAP 1") + ylab("UMAP 2") + labs(title="UMAP") 20 | 21 | # add a new column of clusterNo to ggplot data 22 | p1$data$clusterNo <- as.factor(sapply(strsplit(as.character(p1$data$ident), " "), "[", 1)) 23 | 24 | LabelClusters(plot = p1, id = "clusterNo", box = T, repel = F, fill = "white") 25 | ``` 26 | 27 | ## Noor's embedded labels 28 | 29 | 30 | 31 | ### Code 32 | 33 | 34 | ```R 35 | LabelClusters(p, id = "ident", fontface = "bold", size = 3, bg.colour = "white", bg.r = .2, force = 0) 36 | ``` 37 | 38 | 39 | -------------------------------------------------------------------------------- /scrnaseq/rstudio_sc_docker.md: -------------------------------------------------------------------------------- 1 | # Docker image with rstudio for single cell analysis 2 | 3 | ## Description 4 | 5 | Docker images for single cell analysis. 6 | 7 | All docker images contain an rstudio installation with some helpful packages for singlecell analysis. It also includes a conda environment to deal with necessary python packages (like umap-learn). 8 | 9 | Docker Rstudio images are obtained from [rocker/rstudio](https://hub.docker.com/r/rocker/rstudio). 
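For example, to grab one of the published single-cell images ahead of time (the tag shown is one of the available images listed further down):

```bash
# Pull a specific image/tag from Docker Hub before creating the container
docker pull vbarrerab/singlecell-base:R.4.0.3-BioC.3.11-ubuntu_20.04
```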
10 | 11 | ## R version and Bioconductor 12 | 13 | The R and Bioconductor versions are specified in the image name (along with the OS version): 14 | 15 | Example: 16 | `singlecell-base:R.4.0.3-BioC.3.11-ubuntu_20.04` 17 | 18 | ## Use 19 | 20 | `docker run -d -p 8787:8787 --name -e USER='rstudio' -e PASSWORD='rstudioSC' -e ROOT=TRUE -v :/home/rstudio/projects vbarrerab/)` 21 | 22 | `-e DISABLE_AUTH=true` option can be added to avoid Rstudio login prompt. Only use on local machine. 23 | 24 | This instruction will download and launch a container using the singlecell image. Once launch, it can be access through a web browser with the URL 8787:8787 or localhost:8787. 25 | 26 | ### Important parameters 27 | 28 | * -v option is mounting a folder from the host in the container. This allows for data transfer between the host and the container. **This can only be done when creating the container!** 29 | 30 | * --name assigns a name to the container. Helpful to keep thins tidy. 31 | * -e ROOT=TRUE options provides root access, in case more tweaking inside the container is necessary. 32 | * -p 8787: Change the local port to access the container. **This can only be done when creating the container!** 33 | * FYI: The working directory will be set as /home/rstudio, not /home/rstudio/projects as default behavior. 34 | 35 | ## Resources 36 | 37 | The dockerfile and other configuration files can be found on: 38 | 39 | https://github.com/vbarrera/docker_configuration 40 | 41 | The docker images: 42 | 43 | vbarrerab/singlecell-base 44 | 45 | ## Available images: 46 | 47 | - R.4.0.2-BioC.3.11-ubuntu_20.04 48 | - R.4.0.3-BioC.3.11-ubuntu_20.04 49 | 50 | **Important:** 51 | 52 | Docker changed its policies to only keep images that have been modified in the last 6 months. This means that previous images will eventually disappear. For previous versions. Check with availability with @vbarrera. 53 | 54 | # Bibliography 55 | 56 | Inspired by: 57 | 58 | https://www.r-bloggers.com/running-your-r-script-in-docker/ 59 | 60 | # Other resources 61 | Using Singularity Containers on the Cluster: https://docs.rc.fas.harvard.edu/kb/singularity-on-the-cluster/ 62 | -------------------------------------------------------------------------------- /scrnaseq/running_MAST.md: -------------------------------------------------------------------------------- 1 | # Running MAST 2 | 3 | [MAST](https://github.com/RGLab/MAST) analyzes differential expression using the cell as the unit of replication rather than the sample (as is done for pseduobulk) 4 | 5 | **NOTES** 6 | 7 | -MAST uses a hurdle model designed for zero heavy data. 8 | -MAST "expects" log transformed count data. 9 | -A [recent paper](https://doi.org/10.1038/s41467-021-21038-1) advises the use of sample id as a random factor to prevent pseudoreplication 10 | -Most MAST models include the total number of genes expressed in the cell. 11 | 12 | ## Where to run MAST? 13 | 14 | MAST can be run directly in Seurat (via [FindMarkers](https://satijalab.org/seurat/reference/findmarkers) or by itself. 15 | 16 | While MAST is easier to run with Seurat there are two big downsides: 17 | 18 | 1. Seurat does not log transform the data 19 | 2. You cannot edit the model with seurat, meaning that you cannot add sample ID or number of genes expressed. 
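For reference, the Seurat route mentioned above looks roughly like this (a sketch with placeholder idents; `seurat_obj` stands in for whatever clustered object you are working with). Because the hurdle model is fixed there, the rest of this doc runs MAST directly instead:

```r
library(Seurat)

# Seurat's wrapper around MAST: convenient, but you cannot add a random
# effect for sample ID or the ngeneson covariate to the model here
mast_de <- FindMarkers(
  object   = seurat_obj,      # placeholder: your Seurat object
  ident.1  = "condition_A",   # placeholder group labels
  ident.2  = "condition_B",
  test.use = "MAST"
)
```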
20 | 21 | ## Running MAST (1) - from seurat to a SCA object 22 | 23 | 24 | ```r 25 | # Seurat to SCE 26 | sce <- as.SingleCellExperiment(seurat_obj) 27 | 28 | # Add log counts 29 | assay(sce, "log") = log2(counts(sce) + 1) 30 | 31 | # Create new sce object (only 'log' count data) 32 | sce.1 = SingleCellExperiment(assays = list(log = assay(sce, "log"))) 33 | colData(sce.1) = colData(sce) 34 | 35 | # Change to SCA 36 | sca = SceToSingleCellAssay(sce.1) 37 | 38 | ``` 39 | 40 | ## Running MAST (2) - Filter SCA object 41 | 42 | Here we are only filtering for genes expressed in 10% of cells but this can be altered and other filters can be added. 43 | 44 | ```r 45 | expressed_genes <- freq(sca) > 0.1 46 | sca_filtered <- sca[expressed_genes, ] 47 | 48 | ``` 49 | 50 | ## Format SCA metadata 51 | 52 | We add the total number of genes expressed per cell as well as setting factors as factors and scaling all continuous variables as suggested by MAST. 53 | 54 | ```r 55 | cdr2 <- colSums(SummarizedExperiment::assay(sca_filtered)>0) 56 | 57 | SummarizedExperiment::colData(sca_filtered)$ngeneson <- scale(cdr2) 58 | SummarizedExperiment::colData(sca_filtered)$orig.ident <- factor(SummarizedExperiment::colData(sca_filtered)$orig.ident) 59 | SummarizedExperiment::colData(sca_filtered)$Gestational_age_scaled <- scale(SummarizedExperiment::colData(sca_filtered)$Gestational_age) 60 | ``` 61 | 62 | ## Run MAST 63 | 64 | This is the most computationally instensive step and takes the longest. 65 | Here our model includes the number of genes epxressed (ngeneson), sample id as a random variable ((1 | orig.ident)), Gender, and Gestational age scaled. 66 | 67 | We extract the results from our model for our factor of interest (Gestational_age_scaled) 68 | 69 | 70 | ```r 71 | zlmCond <- suppressMessages(MAST::zlm(~ ngeneson + Gestational_age_scaled + Gender + (1 | orig.ident), sca_filtered, method='glmer',ebayes = F,strictConvergence = FALSE)) 72 | summaryCond <- suppressMessages(MAST::summary(zlmCond,doLRT='Gestational_age_scaled')) 73 | ``` 74 | 75 | ## Format Results 76 | 77 | MAST results look quite different than DESeq2 results so we need to apply a bit of formatting to make them readable. 78 | 79 | After formatting outputs can be written directly to csv files. 80 | ```r 81 | summaryDt <- summaryCond$datatable 82 | 83 | # Create reable results table for all genes tested 84 | fcHurdle <- merge(summaryDt[contrast == "Gestational_age_scaled" 85 | & component == 'H', .(primerid, `Pr(>Chisq)`)], # This extracts hurdle p-values 86 | summaryDt[contrast == "Gestational_age_scaled" & component == 'logFC', 87 | .(primerid, coef, ci.hi, ci.lo)], 88 | by = 'primerid') # This extract LogFC data 89 | 90 | fcHurdle <- stats::na.omit(as.data.frame(fcHurdle)) 91 | 92 | fcHurdle$fdr <- p.adjust(fcHurdle$`Pr(>Chisq)`, 'fdr') 93 | 94 | 95 | # Create reable results table for significant genes 96 | fcHurdleSig <- merge(fcHurdle[fcHurdle$fdr < .05,], 97 | as.data.table(mcols(sca_filtered)), by = 'primerid') 98 | setorder(fcHurdleSig, fdr) 99 | 100 | ``` 101 | -------------------------------------------------------------------------------- /scrnaseq/running_doubletfinder.md: -------------------------------------------------------------------------------- 1 | # Running DoubletFinder 2 | 3 | 4 | [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) is one of the most popular doublet finding methods with over 1200 citations since 2019 (as of Sept 2023). 
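If you do not have DoubletFinder installed yet, it comes from the GitHub repo linked above (a minimal sketch):

```r
# DoubletFinder is installed from GitHub (not CRAN/Bioconductor)
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("chris-mcginnis-ucsf/DoubletFinder")
library(DoubletFinder)
```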
5 | 6 | ## Preparing to run DoubletFinder 7 | 8 | The key notes for running DoubletFinder are: 9 | - Each sample MUST be run separately 10 | - Various parameters can be tweaked in the run (see the DoubletFinder website for details), but the most critical is the prior value for the expected percentage of doublets. 11 | - DoubletFinder is not fast, so it is best to run it on O2 and save the output as an RDS file. 12 | 13 | 14 | ## Step 1 - Generate subsets 15 | 16 | Starting with your post-QC Seurat object, separate out each sample. Then make a list of these new objects and a vector of object names. 17 | 18 | ```r 19 | sR01 <- subset(x = seurat_qc, subset = orig.ident %in% c("R01")) 20 | sW01 <- subset(x = seurat_qc, subset = orig.ident %in% c("W01")) 21 | s3N00 <- subset(x = seurat_qc, subset = orig.ident %in% c("3N00")) 22 | subsets = list(sR01,sW01,s3N00) 23 | names = c('sR01','sW01',"s3N00") 24 | ``` 25 | 26 | ## Step 2 - Run loop 27 | 28 | This is the most computationally intensive step. Here we will loop through the list we created and run DoubletFinder. 29 | 30 | ```r 31 | for (i in seq_along(subsets)) { 32 | 33 | # SCT Transform and Run UMAP 34 | obj <- subsets[[i]] 35 | obj <- SCTransform(obj) 36 | obj <- RunPCA(obj) 37 | obj <- RunUMAP(obj, dims = 1:10) 38 | 39 | # Run DoubletFinder 40 | sweep.res.list_obj <- paramSweep_v3(obj, PCs = 1:10, sct = TRUE) 41 | sweep.stats_obj <- summarizeSweep(sweep.res.list_obj, GT = FALSE) 42 | bcmvn_obj <- find.pK(sweep.stats_obj) 43 | nExp_poi <- round(0.09*nrow(obj@meta.data)) ## Assuming a 9% doublet formation rate; can be changed. 44 | obj <- doubletFinder_v3(obj, PCs = 1:10, pN = 0.25, pK = 0.1, nExp = nExp_poi, reuse.pANN = FALSE, sct = TRUE) 45 | 46 | # Rename output columns from DoubletFinder to be consistent across samples for easy merging 47 | colnames(obj@meta.data)[c(22,23)] <- c("pANN","doublet_class") ## change column indices based on your own metadata size 48 | assign(names[i], obj) 49 | } 50 | ``` 51 | 52 | ## Step 3 - Merge DoubletFinder output and save 53 | 54 | After DoubletFinder is run, merge the processed objects into a single Seurat object and save that as an RDS. 55 | 56 | ```r 57 | obj_list <- mget(names) # the DoubletFinder-processed objects created with assign() in the loop above 58 | seurat_doublet <- merge(x = obj_list[[1]], y = obj_list[2:length(obj_list)]) 59 | saveRDS(seurat_doublet, file = "seurat_postQC_doubletFinder.rds") 60 | ``` 61 | 62 | ## Step 4 (optional) - Add doublet info to a pre-existing Seurat object for plotting 63 | 64 | If you have gone ahead and run most of the Seurat pipeline before running DoubletFinder, you can add the doublet information to any object for plotting on a UMAP 65 | 66 | ```r 67 | doublet_info <- seurat_doublet@meta.data$doublet_class 68 | names(doublet_info) <- colnames(x = seurat_doublet) 69 | seurat_norm <- AddMetaData(seurat_norm, metadata=doublet_info, col.name="doublet") 70 | ``` 71 | 72 | ## Step 5 - Remove doublets 73 | 74 | You can remove doublets from any Seurat object that has the doublet info (the column is `doublet_class` here, or `doublet` if you added it via Step 4). 75 | 76 | ```r 77 | seurat_qc_nodub <- subset(x = seurat_doublet, subset = doublet_class == "Singlet") 78 | saveRDS(seurat_qc_nodub, file = "seurat_qc_nodoublets.rds") 79 | ``` 80 | 81 | -------------------------------------------------------------------------------- /scrnaseq/saturation_qc.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Single Cell Quality control with saturation 3 | category: Single Cell 4 | --- 5 | 6 | People often ask how many cells they need to sequence in their next experiment.
7 | Saturation analysis helps to answer that question looking at the current experiment. 8 | Would adding more coverage to the current experiment result in getting more transcripts, 9 | genes, or in just more duplicated reads? 10 | 11 | First, use [from bcbio to single cell script](https://github.com/hbc/hbcABC/blob/master/inst/rmarkdown/Rscripts/singlecell/from_bcbio_to_singlecell.R) 12 | to load data from bcbio into SingleCellExperiment object. 13 | 14 | Second, use [this Rmd template](https://github.com/hbc/hbcABC/blob/master/inst/rmarkdown/templates/simple_qc_single_cell/skeleton/skeleton.rmd) 15 | to create report. 16 | -------------------------------------------------------------------------------- /scrnaseq/seurat_markers.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Seurat Markers 3 | description: This code is for finding Seurat markers 4 | category: research 5 | subcategory: scrnaseq 6 | tags: [differential_analysis] 7 | --- 8 | 9 | ```bash 10 | ssh -XY username@o2.hms.harvard.edu 11 | 12 | srun --pty -p interactive -t 0-12:00 --x11 --mem 128G /bin/bash 13 | 14 | module load gcc/6.2.0 R/3.4.1 hdf5/1.10.1 15 | 16 | R 17 | ``` 18 | 19 | ```r 20 | library(Seurat) 21 | library(tidyverse) 22 | 23 | set.seed(1454944673L) 24 | data_dir <- "data" 25 | seurat <- readRDS(file.path(data_dir, "seurat_tsne_all_res0.6.rds")) 26 | ``` 27 | 28 | Make sure the TSNEPlot looks as expected 29 | 30 | ```r 31 | TSNEPlot(seurat) 32 | ``` 33 | 34 | Check markers for any particular cluster against all others 35 | 36 | ```r 37 | cluster14_markers <- FindMarkers(object = seurat, ident.1 = 14, min.pct = 0.25) 38 | ``` 39 | 40 | Or look for markers of every cluster against all others 41 | 42 | ```r 43 | seurat_markers <- FindAllMarkers(object = seurat, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25) 44 | ``` 45 | 46 | >**NOTE:** The `seurat_markers` object with be a dataframe with the row names as Ensembl IDs; however, since row names need to be unique, if a gene is a marker for more than one cluster, then Seurat will add a number to the end of the Ensembl ID. Therefore, do not use the row names as the gene identifiers. Use the `gene` column. 
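For example, to pull the top few markers per cluster out of this data frame before saving (a sketch; the fold-change column is `avg_logFC` in older Seurat versions and `avg_log2FC` in Seurat >= 4):

```r
# dplyr is already loaded via tidyverse above
top5_markers <- seurat_markers %>%
  group_by(cluster) %>%
  top_n(n = 5, wt = avg_logFC)   # use wt = avg_log2FC with Seurat >= 4
```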
47 | 48 | Save the markers for report generation 49 | 50 | ```r 51 | saveRDS(seurat_markers, "data/seurat_markers_all_res0.6.rds") 52 | ``` 53 | -------------------------------------------------------------------------------- /scrnaseq/tinyatlas.md: -------------------------------------------------------------------------------- 1 | - [Mouse brain markers](https://www.brainrnaseq.org/) 2 | - [Mouse and human markers](https://panglaodb.se) 3 | - https://bioconductor.org/packages/release/bioc/html/SingleR.html 4 | - [tinyatlas](https://github.com/hbc/tinyatlas) 5 | - https://www.flyrnai.org/tools/biolitmine/web/ 6 | - [human blood](http://scrna.sklehabc.com/), [article](https://academic.oup.com/nsr/article/8/3/nwaa180/5896476) 7 | - [human skin](https://www.nature.com/articles/s42003-020-0922-4/figures/1) 8 | - [A big atlas from Hemberg's lab](https://scfind.sanger.ac.uk/) 9 | -------------------------------------------------------------------------------- /scrnaseq/tutorials.md: -------------------------------------------------------------------------------- 1 | - [HSPH](https://github.com/hbc/tutorials/blob/master/scRNAseq/scRNAseq_analysis_tutorial/README.md) 2 | - [HBC's tutorials for scRNA-seq analysis workflows](https://github.com/hbc/tutorials/tree/master/scRNAseq/scRNAseq_analysis_tutorial) 3 | - [Seurat, Satija lab](https://satijalab.org/seurat/vignettes.html) 4 | - [Hemberg lab, Cambridge](https://scrnaseq-course.cog.sanger.ac.uk/website/index.html) 5 | - [Broad](https://broadinstitute.github.io/2019_scWorkshop/) 6 | - [DS pseudobulk edgeR](http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/page/muscWorkshop__vignette/) 7 | - [MIG2019](https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/index.html) 8 | - [OSCA, Bioconductor](https://bioconductor.org/books/release/OSCA/) 9 | -------------------------------------------------------------------------------- /scrnaseq/velocity.md: -------------------------------------------------------------------------------- 1 | RNA Velocity analysis is a trajectory analysis based on spliced/unspliced RNA ratio. 2 | 3 | It is quite popular https://www.nature.com/articles/s41586-018-0414-6, 4 | however, the original pipeline is not well supported: 5 | https://github.com/velocyto-team/velocyto.R/issues 6 | 7 | There is a new one from kallisto team: 8 | https://bustools.github.io/BUS_notebooks_R/velocity.html 9 | 10 | # 1. Install R4.0 (development version) on O2 11 | - module load gcc/6.2.0 12 | - installed R-devel: https://www.r-bloggers.com/r-devel-in-parallel-to-regular-r-installation/ 13 | because one of the packages wanted R4.0 14 | - configure R with `./configure --enable-R-shlib` for rstudio 15 | - remove conda from PATH to avoid using its libcurl 16 | - module load boost/1.62.0 17 | - module load hdf5/1.10.1 18 | - installing velocyto.R: https://github.com/velocyto-team/velocyto.R/issues/86 19 | 20 | # 2. Install velocyto.R with R3.6.3 (Fedora 30 example) 21 | bash: 22 | ``` 23 | sudo dnf update R 24 | sudo dnf install boost boost-devel hdf5 hdf5-devel 25 | git clone https://github.com/velocyto-team/velocyto.R 26 | ``` 27 | rstudio/R: 28 | ``` 29 | BiocManager::install("pcaMethods") 30 | setwd("/where/you/cloned/velocyto.R") 31 | devtools::install_local("velocyto.R") 32 | ``` 33 | 34 | # 3. 
Generate reference files 35 | - `Rscriptdev `[01_get_velocity_files.R](https://github.com/naumenko-sa/crt/blob/master/velocity/01_get_velocity_files.R) 36 | - output: 37 | ``` 38 | cDNA_introns.fa 39 | cDNA_tx_to_capture.txt 40 | introns_tx_to_capture.txt 41 | tr2g.tsv 42 | ``` 43 | 44 | # 4. Index reference 45 | This step takes ~1-2h and 100G or RAM: 46 | `sbatch `[02_kallisto_index.sh](https://github.com/naumenko-sa/crt/blob/master/velocity/02_kallisto_index.sh) 47 | 48 | - inDrops3 support: https://github.com/BUStools/bustools/issues/4 49 | 50 | # 5. Split reads by sample with barcode_splitter 51 | 52 | - merge reads from multiple flowcells first 53 | - https://pypi.org/project/barcode-splitter/ 54 | ``` 55 | barcode_splitter --bcfile samples.tsv Undetermined_S0_L001_R1.fastq Undetermined_S0_L001_R2.fastq Undetermined_S0_L001_R3.fastq Undetermined_S0_L001_R4.fastq --idxread 3 --suffix .fq 56 | ``` 57 | 58 | kallisto bus counting procedure works on per sample basis, so we need to split samples to separate fastq files, and merge samples across lanes. 59 | 60 | - [split_barcodes.sh](https://github.com/naumenko-sa/crt/blob/master/velocity/03_split_barcodes.sh) 61 | 62 | # 6. Count spliced and unspliced transcripts 63 | - [kallisto_count](https://github.com/naumenko-sa/crt/blob/master/velocity/04_kallisto_count.sh) 64 | - output: 65 | ``` 66 | spliced.barcodes.txt 67 | spliced.genes.txt 68 | spliced.mtx 69 | unspliced.barcodes.txt 70 | unspliced.genes.txt 71 | unspliced.mtx 72 | ``` 73 | 74 | # 7. Create Seurat objects for every sample 75 | - [create_seurat_sample.Rmd](https://github.com/naumenko-sa/crt/blob/master/velocity/05.create_seurat_sample.Rmd) 76 | - also removes empty droplets 77 | 78 | # 8. Merge seurat objects 79 | - [merge_seurats](https://github.com/naumenko-sa/crt/blob/master/velocity/06.merge_seurats.Rmd) 80 | 81 | # 9. Velocity analysis 82 | - [velocity_analysis](https://github.com/naumenko-sa/crt/blob/master/velocity/07.velocity_analysis.Rmd) 83 | 84 | # 10. Plot velocity picture 85 | - [plot_velocity](https://github.com/naumenko-sa/crt/blob/master/velocity/08.plot_velocity.Rmd) 86 | 87 | # 11. Repeat marker analysis 88 | - [velocity_markers](https://github.com/naumenko-sa/crt/blob/master/velocity/09.velocity_markers.Rmd) 89 | 90 | # 11. References 91 | - https://www.kallistobus.tools/tutorials 92 | - https://github.com/satijalab/seurat-wrappers/blob/master/docs/velocity.md 93 | - [preprocessing influences velociy analysis](https://www.biorxiv.org/content/10.1101/2020.03.13.990069v1) 94 | 95 | # Velocity analysis in Python: 96 | - http://velocyto.org/ 97 | - https://github.com/pachterlab/MBGBLHGP_2019/blob/master/Figure_3_Supplementary_Figure_8/supp_figure_9.ipynb 98 | -------------------------------------------------------------------------------- /scrnaseq/write10Xcounts.md: -------------------------------------------------------------------------------- 1 | ## Need to go from counts in a Seurat object to the 10X format? 2 | 3 | Recently found that some tools like Scrublet (doublet detection), require scRNA-seq counts to be in 10X format. 4 | * barcodes.tsv 5 | * genes.tsv 6 | * matrix.mtx 7 | 8 | How to do this easily? 
9 | 10 | ```r 11 | library(Seurat) 12 | library(tidyverse) 13 | library(rio) 14 | library(DropletUtils) 15 | 16 | # Read in Seurat object 17 | seurat_stroma <- readRDS("./seurat_stroma_replicatePaper.rds") 18 | 19 | # Output data 20 | write10xCounts(x = seurat_stroma@assays$RNA@counts, path = "./cell_ranger_data_format_test") 21 | 22 | ``` 23 | -------------------------------------------------------------------------------- /scrnaseq/zinbwaver.md: -------------------------------------------------------------------------------- 1 | # When asked why don't you use zinbwave 2 | 3 | - [zinbwave-deseq2-comparison.Rmd](https://github.com/roryk/zinbwave-deseq2-indrop/blob/master/zinbwave-deseq2-comparison.Rmd) 4 | - [Soneson2018](https://experiments.springernature.com/articles/10.1038/nmeth.4612) 5 | - [Original ZINB](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1406-4) 6 | - [Crowell2020](https://www.biorxiv.org/content/10.1101/713412v2) 7 | - [Droplet scRNA-seq is not zero-inflated](https://www.nature.com/articles/s41587-019-0379-5) 8 | 9 | -------------------------------------------------------------------------------- /training/mkdocs.md: -------------------------------------------------------------------------------- 1 | # MK-Docs 2 | 3 | Basic prerequisites 4 | -Python 5 | -Github CLI 6 | -VS Code 7 | 8 | 9 | Download Visual Studio Code 10 | 11 | https://code.visualstudio.com/download 12 | 13 | 1. Open new terminal within VS-Code 14 | 2. Navigate to where you want to host the GitHub repo. 15 | 3. Create a new GitHub repo you want to work on or clone an existing one: git clone link_to_repo.git 16 | 4. Cd to the GitHub repo folder 17 | 5. Lets make a virtual python environment: python -m venv venv 18 | 6. Source the python environment you just created: source venv/bin/activate 19 | 7. Need pip installed check: pip —version 20 | 8. Install mkdocs: pip install mkdocs-material 21 | 9. Open visual code from here: code . 22 | 10. Open terminal within vscode 23 | 11. To open up the website mkdocs serve 24 | 12. To change to the “material theme” open mkdocs.yml file and below site_name: My Docs, type 25 | site_name: My Docs 26 | theme: 27 | name: material 28 | 13. Save it 29 | 14. To deploy: type mkdocs serve on terminal. It will restart in the same host. 30 | 15. We can change the appearance of the site by editing and adding plugins to mkdocs.yml (google for common settings) 31 | 16. To add another page. Go inside docs add another nage: eg page2.md 32 | 17. Add 33 | # Page 2 34 | 35 | ## sub heading 36 | 37 | Text inside 38 | 18. Save it 39 | 19. In the terminal type: git add . 40 | 20. 
Git commit -m $’updating instructor’ 41 | git push origin main 42 | git config http.postBuffer 524288000 #(if error comes up related to html) 43 | git pull && git push 44 | -------------------------------------------------------------------------------- /variants/clonal_evolution.md: -------------------------------------------------------------------------------- 1 | # Variant analysis 2 | 3 | - [SciClone](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003665) 4 | - [FishPlot](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3195-z) 5 | - [PhylogicNDT](https://github.com/broadinstitute/PhylogicNDT) 6 | -------------------------------------------------------------------------------- /wgs/crispr-offtarget.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | This guide is how to call offtarget edits in a CRISPR edited genome. This is 3 | pretty easy to and only takes a few steps. First, we need to figure out what is 4 | different between the CRISPR edited samples and the (hopefully they gave you 5 | these) control samples. Then we need to find a set of predicted off-target 6 | CRISPR sites. Finally, once we know what is different, we need to overlap the 7 | differences with predicted off-target sites, allowing some mismatches. Then we 8 | can report the overall differences and differences that could be due to 9 | offtarget edits. 10 | 11 | # Call CRISPR-edited specific variants 12 | You want to call edits that are in the CRISPRed sample but not the unedited 13 | sample. You can do that by plugging into the tumor-normal calling part of bcbio 14 | and pretending the CRISPR-edited sample is a tumor sample and the non-edited 15 | sample is a normal sample. 16 | 17 | To get tumor-normal calling to work you need to use a variant caller that 18 | can handle that, I recommend mutect2. 19 | 20 | To tell bcbio that a pair of samples is a tumor-normal pair you need to 21 | 22 | 1. Put the tumor and normal sample in the same **batch** by setting **batch** in the metadata to the same batch. 23 | 2. Set **phenotype** of the CRISPR-edited sample to **tumor**. 24 | 3. Set the **phenotype** of the non-edited sample to **normal**. 25 | 26 | And kick off the **variant2** pipeline, the normal whole genome sequencing pipeline. An example YAML template is below: 27 | 28 | ```yaml 29 | --- 30 | details: 31 | - analysis: variant2 32 | genome_build: hg38 33 | algorithm: 34 | aligner: bwa 35 | variantcaller: mutect2 36 | tools_on: [gemini] 37 | ``` 38 | 39 | And an example metadata file: 40 | 41 | ```csv 42 | samplename,description,batch,phenotype,sex,cas9,gRNA 43 | Hs27_HSV1.cram,Hs27_HSV1,noCas9_nogRNA,normal,male,no,yes 44 | Hs27_HSV1_Cas9.cram,Hs27_HSV1_Cas9,noCas9,normal,male,yes,no 45 | Hs27_HSV1_UL30_5.cram,Hs27_HSV1_UL30_5,noCas9_nogRNA,tumor,male,yes,yes 46 | Hs27_HSV1_UL30_5_repeat.cram,Hs27_HSV1_UL30_5_repeat,noCas9,tumor,male,yes,yes 47 | ``` 48 | 49 | # Find predicted off-target sites 50 | There are several tools to do this, a common one folks use is cas-offinder, so 51 | that is what we will use. There is a [web app](http://www.rgenome.net/cas-offinder/) but it will only return 1,000 events per 52 | class. Usually this is fine, but if you allow bulges you can get a lot more offtarget sites so you might bump into this limit. 
# Find predicted off-target sites
There are several tools to do this; a common one folks use is cas-offinder, so
that is what we will use. There is a [web app](http://www.rgenome.net/cas-offinder/) but it will only return 1,000 events per
class. Usually this is fine, but if you allow bulges you can get many more off-target sites and might bump into this limit.

First install cas-offinder; there is a conda package, so this is easy:

```bash
conda create -n crispr -c bioconda cas-offinder
```

There is a companion python wrapper, cas-offinder-bulge, that can also predict
off-target sites taking bulges into account. You can download it
[here](https://raw.githubusercontent.com/hyugel/cas-offinder-bulge/master/cas-offinder-bulge) if
you need to do that.

You'll need to know the sequence of one or more guides you want to check. You will also need to know
the PAM sequence for the endonuclease that is being used.

You can run cas-offinder like this (the `C` tells it to run on the CPU):

```bash
cas-offinder input.txt C output.txt
```

where input.txt has this format:

```
hg38.fa
NNNNNNNNNNNNNNNNNNNNNNNGRRT
ACACGTGAAAGACGGTGACGGNNGRRT 6
```

`hg38.fa` is the path to the FASTA file of the hg38 genome. The second line is a run of `N`s the length of the guide sequence you are interested in, with the PAM sequence tacked on the end.
The third line is the guide sequence itself, again with the PAM sequence tacked on the end; the `6` is the number of mismatches you are allowing, so cas-offinder will report sites with that many
or fewer mismatches.

If you want to look for bulges, use cas-offinder-bulge with this format:

```
hg38.fa
NNNNNNNNNNNNNNNNNNNNNNNGRRT 2 1
ACACGTGAAAGACGGTGACGGNNGRRT 6
```

Here the `2` sets the allowed DNA bulge and the `1` the allowed RNA bulge. You can look for one or the other, neither, or both.

After you run cas-offinder you can convert the output to a sorted BED file for intersecting with your variants:

```bash
cat output.txt | sed 1d | awk '{printf("%s\t%s\t%s\n",$4, $5-10,$5+10)}' | sort -V -k 1,1 -k2,2n > output.bed
```

# Overlap variants
Finally, use the BED file of predicted off-target sites to pull out possible off-target variant calls:

```bash
bedtools intersect -header -u -a noCas9_nogRNA-mutect2-annotated.vcf.gz -b output.bed
```

And you are done!

# More tools
[CRISPResso2](https://github.com/pinellolab/CRISPResso2)

-------------------------------------------------------------------------------- /wgs/pacbio_genome_assembly.md: --------------------------------------------------------------------------------

---
tags:
title: Genome Assembly Using PacBio Reads Only
author: Zhu Zhuo
created: '2019-09-13'
---

# Genome Assembly Using PacBio Reads Only

This tutorial is based on a bacterial genome assembly project using PacBio sequencing reads only, but it can be followed for genome assembly of other species, or with other types of long reads, with little or no modification.

## Demultiplex

If the sequencing core hasn't demultiplexed the data, [`lima`](https://github.com/PacificBiosciences/barcoding) can be used for demultiplexing; a minimal invocation is sketched below.
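A minimal `lima` run could look like the sketch below. The file names are placeholders, and the right options depend on your barcoding design (for example, whether both ends carry the same barcode), so check the lima documentation before running it.

```bash
# Demultiplex subreads by barcode -- file names are placeholders
# (add the barcode-pairing / output-splitting options appropriate for your design)
lima movie.subreads.bam barcodes.fasta demux.subreads.bam
```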
## Convert `bam` file to `fastq` file

`subreads.bam` files contain the subread data, and we will convert them from `bam` format to `fastq` format, as most assemblers take `fastq` as input.

[A note on the output from PacBio:](https://pacbiofileformats.readthedocs.io/en/5.1/Primer.html)
> Unaligned BAM files representing the subreads will be produced natively by the PacBio instrument. The subreads BAM will be the starting point for secondary analysis. In addition, the scraps arising from cutting out adapter and barcode sequences will be retained in a `scraps.bam` file, to enable reconstruction of HQ regions of the ZMW reads, in case the customer needs to rerun barcode finding with a different option.

Below is an example slurm script using `bedtools bamtofastq` to convert `bam` to `fastq`.

```
#!/bin/sh
#SBATCH -p medium
#SBATCH -J bam2fq
#SBATCH -o %x_%j.o
#SBATCH -e %x_%j.e
#SBATCH -t 00-23:59:00
#SBATCH -c 2
#SBATCH --mem=4G
#SBATCH --array=1-n%5  # change n to the number of subreads.bam files

module load bedtools/2.27.1

files=(/path/to/*subreads.bam)  # all subreads.bam files to convert
file=${files[$SLURM_ARRAY_TASK_ID-1]}
sample=`basename $file .bam`

echo $file
echo $sample

bedtools bamtofastq -i $file -fq $sample".fq"
```

The PacBio Sequel sequencer reports all base qualities as PHRED 0 (ASCII `!`), so the quality scores for Sequel data are all `!` in the generated `fastq` files.

## Genome assembly

### Using Canu for genome assembly

[Canu](https://github.com/marbl/canu) does correction, trimming and assembly in a single command. Follow its GitHub page to install the software to a location of your preference.

An example slurm script for a single sample:
```
#!/bin/sh
#SBATCH -p priority
#SBATCH -J canu
#SBATCH -o %x_%j.o
#SBATCH -e %x_%j.e
#SBATCH -t 0-23:59:00
#SBATCH -c 1
#SBATCH --mem=1G

module load java/jdk-1.8u112

export PATH=/path/to/canu:$PATH

canu -p sampleName -d sampleName genomeSize=7m \
    gridOptions="--time=1-23:59:00 --partition=medium" \
    -pacbio-raw /path/to/converted.fq
```
This script is only for the 'master' job, so 1 CPU and 1 GB of memory should be sufficient. Canu will evaluate the resources available and automatically submit jobs to the queue given in `gridOptions`.

_Note_: An alternative is to install the bioconda recipe, but the conda version is not up to date and some additional parameters may need to be specified in the command.

### Using Unicycler for bacterial genome assembly

Follow the [Unicycler](https://github.com/rrwick/Unicycler#method-long-read-only-assembly) instructions. [Racon](https://github.com/isovic/racon) is also required and should be installed before running Unicycler.

```
module load gcc/6.2.0 python/3.6.0
git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
python3 setup.py install --user
```
An example slurm script:
```
#!/bin/sh
#SBATCH -p priority
#SBATCH -J unicycler
#SBATCH -o %x_%j.o
#SBATCH -e %x_%j.e
#SBATCH -t 29-23:59:00
#SBATCH -c 20
#SBATCH --mem=200G

module load gcc/6.2.0 python/3.6.0 bowtie2/2.3.4.3 samtools/1.9 blast/2.6.0+
export PATH=/path/to/racon:$PATH

/path/to/unicycler-runner.py -l /path/to/fastq -t 20 -o sample
```

## Assembly quality

### Basic assembly metrics

Download [Quast](http://bioinf.spbau.ru/quast) for basic assembly metrics, such as total length, number of contigs, and N50.

`/path/to/quast.py -o output_folder -t 6 assembly.fa`
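Quast can also take several assemblies at once and report their metrics side by side, which is a convenient way to compare the Canu and Unicycler results. The file names below (`sampleName.contigs.fasta` from Canu, `assembly.fasta` from Unicycler) are what the two assemblers typically produce with the commands above, but verify them against your actual output directories.

```bash
# Compare the Canu and Unicycler assemblies in a single Quast report
# (output paths are assumptions -- check what your runs actually produced)
/path/to/quast.py -o quast_comparison -t 6 \
    sampleName/sampleName.contigs.fasta \
    sample/assembly.fasta
```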
### Assembly completeness

Use [BUSCO](https://busco.ezlab.org/) for evaluating the completeness of the genome assembly. BUSCO has a lot of dependencies, so it is better to install it via conda.

```
source activate conda-env  # activate a conda environment; if you don't have one, you may need to create it first
conda install -c bioconda busco
conda deactivate  # deactivate the conda environment
```
Download the BUSCO database for the species and run BUSCO:

`run_busco -i assembly.fa -o output_folder -l species_odb -m geno`

BUSCO is also the abbreviation for Benchmarking Universal Single-Copy Orthologs, which are single-copy orthologs found in >90% of species. The more BUSCOs are present, the more complete the genome assembly is.
--------------------------------------------------------------------------------