├── .gitignore ├── README.md ├── RNAVelocity ├── scVelo_tutorial.md └── velocyto_tutorial.md ├── SpatialTranscriptomics ├── Baysor │ ├── LoadingBaysorSegmentationIntoSeurat.R │ ├── RunningBaysor.md │ └── readme.txt └── readme.txt ├── _config.yml ├── admin ├── acknowledging_funding ├── archive_folders_to_standby.md ├── chargeback_models.md ├── consulting_resources.md ├── data_management.md ├── download_data.md ├── getting_started.md ├── initial_consults.md ├── method_snippets.md ├── reproducible_research.md ├── scripts_for_data_management_on_o2 ├── setting_up_an_analysis_guidelines.md └── using_globus ├── bcbio ├── Creating_Hybrid_Mammal_Viral_Reference_Genome.md ├── bcbio_genomes.md ├── bcbio_tips.md ├── bcbio_workflow_mary.md ├── building_a_hybrid_murine_transgene_reference_genome.md ├── git.md └── gtf_gff_validator.md ├── bcbio_chip_userstory_draft.md ├── chipseq ├── bcbio_output_summary.sh ├── cutandrun.md ├── metadata.md └── tools.md ├── img ├── can_not_connect.png ├── images.md ├── noor_umap.png ├── r_taking_longer.png ├── simpsons.gif └── zhu_umap.png ├── long_read_data ├── Jihe's presentation at ABRF.pptx ├── Jihe's summary of ABRF discussion at core meeting.pptx ├── Jihe's_long_read_presentation_core_meeting.pptx ├── genome_assembly_tools.md └── nanopore_DRS_workflow.md ├── misc ├── Core_members_September_2019.key ├── FAQs.md ├── GEO_submissions.md ├── OSX.md ├── Reform_python.md ├── aws.md ├── core_resources.md ├── general_ngs.md ├── git.md ├── miRNA.md ├── mounting_o2_mac.md ├── mtDNA_variants.md ├── multiomics_factor_analysis.md ├── new_to_remote_github_CLI_start_here.md ├── organized_papers.md ├── orphan_improvements.md ├── power_calc_simulations.md └── snakemake-example-pipeline ├── python └── conda.md ├── r ├── .Rprofile ├── R-tips-and-tricks.md ├── Shiny_images │ ├── Added_tabs.png │ ├── Adding_panels.png │ ├── Adding_theme.png │ ├── Altered_action_button.png │ ├── Check_boxes_with_action_button.png │ ├── R_Shiny_hello_world.gif │ ├── R_shiny_req_after.gif │ ├── R_shiny_req_initial.gif │ ├── Return_table.png │ ├── Return_text_app_blank.png │ ├── Return_text_app_hello.png │ ├── Sample_size_hist_100.png │ ├── Sample_size_hist_5.png │ ├── Shiny_UI_server.png │ ├── Shiny_process.png │ ├── Squaring_number_app.png │ └── mtcars_table.png ├── htmlwidgets └── rshiny_server.md ├── rc ├── O2-tips.md ├── O2_portal_errors.md ├── arrays_in_slurm.md ├── connection-to-hpc.md ├── ipython-notebook-on-O2.md ├── jupyter_notebooks.md ├── keepalive.md ├── manage-files.md ├── openondemand.md ├── scheduler.md └── tmux.md ├── rnaseq ├── IRFinder_report.md ├── RepEnrich2_guide.md ├── Volcano_Plots.md ├── ase.md ├── bcbio_rnaseq.bib ├── bibliography.md ├── dexseq.Rmd ├── failure_types ├── img │ ├── test │ └── volcano.png ├── running_IRFinder.md ├── running_leafcutter.md ├── running_leafviz.md ├── running_rMATS.md ├── strandedness.md └── tools.md ├── scrnaseq ├── 10XVisium.md ├── CellRanger.md ├── Demuxafy_HowTo.md ├── MDS_plot.png ├── README.md ├── SNP_demultiplex.md ├── Single-Cell-conda.md ├── Thoughts_on_lymphocyte_Antigen_Receptor_transcripts_in_sc_RNA_Seq_analyses.md ├── bcbio_indrops3.md ├── bcbio_sc.bib ├── bibliography.md ├── cite_seq.md ├── doublets.md ├── pseudobulkDE_edgeR.md ├── pub_quality_umaps.md ├── pySCENIC.md ├── rstudio_sc_docker.md ├── running_MAST.md ├── running_doubletfinder.md ├── saturation_qc.md ├── seurat_clustering_analysis.md ├── seurat_markers.md ├── tinyatlas.md ├── tutorials.md ├── velocity.md ├── write10Xcounts.md └── zinbwaver.md ├── training └── mkdocs.md ├── variants 
└── clonal_evolution.md └── wgs ├── crispr-offtarget.md └── pacbio_genome_assembly.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.DS_Store 2 | -------------------------------------------------------------------------------- /SpatialTranscriptomics/Baysor/RunningBaysor.md: -------------------------------------------------------------------------------- 1 | 3_6_25 Billingsley 2 | 3 | ### Running Baysor for spatial cell segmentation 4 | 5 | 6 | For review 7 | 8 | https://www.10xgenomics.com/analysis-guides/using-baysor-to-perform-xenium-cell-segmentation 9 | 10 | https://kharchenkolab.github.io/Baysor/dev/ 11 | 12 | https://nanostring-biostats.github.io/CosMx-Analysis-Scratch-Space/posts/flat-file-exports/flat-files-compare.html/ 13 | 14 | _above has some description of AtoMx transcript file output_

15 | 16 | 17 | 18 | https://datadryad.org/stash/dataset/doi:10.5061/dryad.37pvmcvsg#readme/ 19 | 20 | *above has some description of Baysor output data in the segmentation.csv file*

21 | 22 | https://github.com/Bassi-git/Baysor_edit

23 | 24 | 25 | https://vimeo.com/558564804/ 26 | 27 | *above is a very helpful, short video*



28 | 29 | 30 | ### Running Baysor on CosMx output 31 | 32 | 33 | Basic usage 34 | 35 | 36 | 37 | Run Baysor on O2 from here: 38 | /n/app/bcbio/baysor
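Before the preview step it can save time to confirm that the binary actually runs from that install. A minimal sketch, assuming the executable sits under the directory above (the exact path is an assumption — adjust it to the real layout; the commands further down call it via the relative path `../bin/baysor/bin/baysor`):

```bash
# get a small interactive session first -- don't run Baysor on a login node
srun --pty -p interactive --mem 8G -t 0-02:00 /bin/bash

# check the binary is callable (path below is an assumption; adjust to the install layout)
/n/app/bcbio/baysor/bin/baysor --help
```

If `--help` is not recognized by your build, running the binary with no arguments should also print the usage.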

40 | 41 | 42 | 43 | First run **Baysor preview**. It requires a transcript coordinate file. With CosMx output the file will look something like this: BWH_20240509_WC0933_tx_file.csv.gz (unzipped and renamed here to tx_file.csv). 44 | 45 | 46 | Baysor will group spatially neighboring transcripts into Neighborhood Composition Vectors (NCVs), which can be viewed as "pseudocells", and will also estimate which transcripts are noise. It will perform unsupervised clustering of the NCVs, assigning them to clusters based on expression profile similarity. You can use NCVs much like single-cell data and do things like create UMAPs and perform marker identification. The NCVs don't have centroids, so you can't plot them spatially the same way you can plot cell centroids. The number of expected NCV clusters can be selected, or it will use a default of 4. Without even having cells called, this can give you an idea of the different cell types present. 47 | 48 | The output HTML file will show the individual transcript locations on a spatial plot. The NCVs are colored such that NCVs with similar expression profiles get similar colors. This can give you an idea of the different cell types present and their locations, and it provides a helpful reference when later choosing segmentation parameters so that segments come out at a sensible size. 49 | 50 | 51 | In the Baysor preview command, -x, -y, and -z point to their respective coordinate column names in the transcript file, -g gives the column holding the gene/target name, and -m gives the minimum number of transcripts to be included in an NCV "pseudocell". 52 | 53 | 54 | ```../bin/baysor/bin/baysor preview -x x_global_px -y y_global_px -z z -g target tx_file.csv -m 10 ```
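If the preview command errors on column names, a quick way to check the exact headers in the transcript file is with standard Unix tools (the values passed to -x/-y/-z/-g must match them exactly; file names below are just the example names used above):

```bash
# list the header columns of the gzipped CosMx transcript file, one per line
zcat BWH_20240509_WC0933_tx_file.csv.gz | head -n 1 | tr ',' '\n' | nl

# or, if you have already unzipped and renamed it as above
head -n 1 tx_file.csv | tr ',' '\n' | nl
```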



55 | 56 | 57 | 58 | Next, **Segmentation** can be done using a couple of different approaches. It can use prior information from previous segmentation analyses, or it can run without priors. It can use an image or other information as a prior. Here I use CosMx cell identifiers as a prior (each transcript is assigned to a called cell). You can select how much confidence to give the priors, from 0 (none) to 1 (full). (In an attempt to give full confidence, I tried running with --prior-segmentation-confidence 1 and it gave very odd results, so you might need to try something like 8 for high confidence.) 59 | 60 | 61 | -p will plot the segments to an HTML file; you'll need this. 62 | 63 | This example gives confidence of 8 to the CosMx cell calls; the cell IDs are found in the "cell" column of tx_file.csv and are indicated by :cell in the Baysor run command. I also set the number of unsupervised clusters to 13 here. 64 | 65 | 66 | ```../bin/baysor/bin/baysor run -x x_global_px -y y_global_px -z z -g target tx_file.csv -p --prior-segmentation-confidence 8 :cell -m 10 --n-clusters=13```
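On a full CosMx section the run step can take a while, so it is usually safer to submit it as a batch job rather than running it interactively. A minimal slurm wrapper around the same command — the partition, walltime, core and memory numbers below are placeholder guesses, not tested values:

```bash
#!/bin/bash
#SBATCH -p medium                  # placeholder partition; pick one that fits your runtime
#SBATCH -t 1-00:00                 # placeholder walltime; scale to your section size
#SBATCH -c 4                       # placeholder core count
#SBATCH --mem=64G                  # placeholder; large sections may need much more
#SBATCH -o baysor_run_%j.out
#SBATCH -e baysor_run_%j.err

# same run command as above, just wrapped for sbatch
../bin/baysor/bin/baysor run -x x_global_px -y y_global_px -z z -g target tx_file.csv \
    -p --prior-segmentation-confidence 8 :cell -m 10 --n-clusters=13
```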



67 | 68 | A second way to run Baysor is without a prior. This requires supplying an -s "scale" argument, which corresponds to the expected cell diameter. With my CosMx coordinate system (pixels), s = 5 was much, much too small and s = 150 was too large. You can view the segmentation output in the output HTML file and compare it to the preview HTML file to assess the quality of the segmentation. 69 | 70 | 71 | ```../bin/baysor/bin/baysor run -x x_global_px -y y_global_px -z z -g target tx_file.csv -p -s 50 -m 10 --n-clusters=13``` 72 | 73 | 74 | Output will look like this: 75 | 76 | -rw-rw-r-- 1 jmb17 bcbio 16467552 Feb 26 13:20 segmentation_borders.html
77 | -rw-rw-r-- 1 jmb17 bcbio 2986301 Feb 26 13:18 segmentation_cell_stats.csv
78 | -rw-rw-r-- 1 jmb17 bcbio 8105239 Feb 26 13:18 segmentation_counts.loom
79 | -rw-rw-r-- 1 jmb17 bcbio 374830291 Feb 26 13:18 segmentation.csv
80 | -rw-rw-r-- 1 jmb17 bcbio 1084869 Feb 26 13:18 segmentation_diagnostics.html
81 | -rw-rw-r-- 1 jmb17 bcbio 985 Feb 26 13:20 segmentation_log.log
82 | -rw-rw-r-- 1 jmb17 bcbio 651 Feb 26 10:59 segmentation_params.dump.toml
83 | -rw-rw-r-- 1 jmb17 bcbio 11374607 Feb 26 13:18 segmentation_polygons_2d.json
84 | -rw-rw-r-- 1 jmb17 bcbio 51628559 Feb 26 13:18 segmentation_polygons_3d.json
85 | 86 | segmentation_borders.html is the segmentation image. 87 | 88 | segmentation_cell_stats.csv is the called-cell metadata, which can be loaded into a Seurat object. It includes the x, y cell centroid coordinates and the cell area, which is very useful for filtering out unreasonably small or large segments. 89 | 90 | segmentation.csv contains the per-transcript quality metrics, which can be used for transcript filtering in Seurat. It also contains the x and y transcript coordinates and the cell and NCV assignments for each transcript, which can be used to build transcript-by-cell (or transcript-by-NCV) count matrices and, from those, Seurat objects. 91 | 92 | segmentation_polygons_2d.json can be used for plotting the segments.
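A couple of quick command-line sanity checks on these outputs before moving into Seurat (standard Unix tools only; exact column names and positions can differ between Baysor versions):

```bash
# number of called cells (header excluded)
tail -n +2 segmentation_cell_stats.csv | wc -l

# list the cell-metadata columns, e.g. to locate the area column referenced below
head -n 1 segmentation_cell_stats.csv | tr ',' '\n' | nl
```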



93 | 94 | 95 | After running baysor run, check the segmentation HTML image for segmentation quality and compare it to the preview output. 96 | 97 | Then also check the range of the segmentation_cell_stats area column (column 9): 98 | 99 | ```awk -F',' 'NR>1 {if(min==""){min=max=$9}; if($9<min){min=$9}; if($9>max){max=$9}} END {print "Min:", min, "Max:", max}' segmentation_cell_stats.csv``` 100 | 101 | 102 | My CosMx data use pixels, with a conversion of 0.12028 px per µm, so 0.12028^2 px per µm^2. Adjust the scale -s parameter if you're not returning the cell areas you expect in your experiment. 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | -------------------------------------------------------------------------------- /SpatialTranscriptomics/Baysor/readme.txt: -------------------------------------------------------------------------------- 1 | This directory contains files demonstrating how to run Baysor segmentation on imaging-based spatial transcriptomics data, how to load these data into a Seurat object, and how to use Baysor output for QC filtering of transcripts and cells 2 | -------------------------------------------------------------------------------- /SpatialTranscriptomics/readme.txt: -------------------------------------------------------------------------------- 1 | This directory contains knowledge regarding Spatial Transcriptomics analyses. 2 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-minimal 2 | google_analytics: UA-150923842-1 3 | -------------------------------------------------------------------------------- /admin/acknowledging_funding: -------------------------------------------------------------------------------- 1 | ## Adding funding section to reports: 2 | 3 | Information on how to acknowledge the core can be found in the individual [MOUs](https://www.dropbox.com/sh/vfnjq2buj3i329v/AADNClhaY6wwBnJu5l4b5CoKa?dl=0) for each institution. 4 | 5 | See this instruction to add the different funding information to reports for clients to be added to papers: http://bioinformatics.sph.harvard.edu/hbcABC/articles/general_start.html#adding-funding-to-template 6 | -------------------------------------------------------------------------------- /admin/archive_folders_to_standby.md: -------------------------------------------------------------------------------- 1 | # archive folders 2 | 3 | ## 1) Login to specified login node on O2, via ssh 4 | In the example below, my login ID is `jnh7` and the login node I am using is `login05.o2.rc.hms.harvard.edu` 5 | You will need to change `jnh7` to your ID to log in. 6 | 7 | Example command: 8 | `ssh -XY -l jnh7 login05.o2.rc.hms.harvard.edu` 9 | 10 | Login with your usual password and DUO challenge 11 | 12 | ## 2) Run tmux 13 | Running tmux will let you disconnect from your “session” while still letting your command run in the background. This is super useful if the command is going to take a long time to run and you need to shut down your computer or disconnect from the wifi for some reason 14 | 15 | The command below will set up a named tmux session called `foo`. Change `foo` to something that will help you remember what you are working on in this tmux session.
16 | 17 | Example command: 18 | `tmux new -s foo` 19 | 20 | 21 | ## 3) Start an interactive session 22 | 23 | You should run the folder compression in an interactive session instead of the login node as the login nodes are shared and running commands in them can slow down the cluster for everyone. As such, RC may kill commands that take too much memory on login nodes without notifying you. 24 | 25 | The command below will give you an interactive node with 8 gigs of RAM (`--mem 8000M`) for 8 hours (`-t 0-08:00`) 26 | 27 | Example command: 28 | `srun --pty -p interactive --mem 8000M --x11 -t 0-08:00 /bin/bash` 29 | 30 | 31 | ## 4) Compress the folder 32 | Go to the folder on O2 33 | In this example I am pretending to work on the project we did with Arlene Sharpe under the hbc code hbc03895 34 | This project is in the following folder: 35 | `/n/data1/cores/bcbio/PIs/arlene_sharpe/Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895` 36 | 37 | To compress it I can issue the following commands 38 | Go to the PI folder: 39 | `cd /n/data1/cores/bcbio/PIs/arlene_sharpe/` 40 | Compress the project folder with tar and gzip in a single command and 41 | a) use the same project name as part of the compressed file name **AND** 42 | b) add the date to the compressed file (I’ll use the date I wrote this document in the format YYYYMMDD, or 20201020) 43 | c) to ensure the compression finished, add the `--remove-files` option to the tar command; this will remove the files once the compression has succeeded. 44 | In general, the command to compress files would then look like this (`--remove-files` has to be before the other options as `-f` indicates a file name coming after): 45 | `tar --remove-files -cvzf YYYYMMDD_folder.tar.gz folder` 46 | 47 | 48 | And in the case of this Arlene Sharpe hbc03895 example, it would look like this: 49 | `tar --remove-files -cvzf 20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895` 50 | This will compress the folder and remove the original files (which will still be around in the backed up .snapshot folder for some weeks or months!) 51 | 52 | ## 5) Login to the transfer node 53 | The standby folder can only be accessed from the transfer node 54 | 55 | SSH to the transfer node 56 | `ssh transfer` 57 | Login with your password and DUO challenge, it shouldn’t require your login ID 58 | 59 | ## 6) Move compressed folder to the standby folder 60 | 61 | Our standby folder is located at `/n/standby/cores/bcbio/compute/archived_reports/tier2` 62 | You can see that there are already a bunch of tar gzipped folders in there. 63 | 64 | Move your newly compressed folder to the standby folder using rsync. 65 | rsync follows the general pattern of 66 | `rsync -options source destination` 67 | 68 | I typically use the following rsync options (as `-ravzuP`) 69 | * r = recursive, i.e.
transfer all the subfolders too 70 | * a = archive, preserve everything 71 | * v= verbose, print updates during tnansfer to screen 72 | * u = update, skip files on receiver that are newer (this is useful if we screwup and have already archived the folder) 73 | * P = show estimate progress as a bar or percentage transferred 74 | 75 | Here I will continue with the example from 5 using Arlene Sharpe’s analysis using our compressed folder as source and the standby folder as destination 76 | 77 | Example command: 78 | `rsync -ravzuP /n/data1/cores/bcbio/PIs/arlene_sharpe/20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz /n/standby/cores/bcbio/compute/archived_reports/tier2/` 79 | 80 | This will copy/sync the compressed folder over to the standby folder. 81 | 82 | Once the file is done copying/syncing, you can erase the source file 83 | 84 | Example command: 85 | `rm /n/data1/cores/bcbio/PIs/arlene_sharpe/20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz` 86 | 87 | ## 7) Link the standby copy of the compressed folder back to the PI folder 88 | This should help locate it again in the future 89 | We do this using a “symlink”, which is similar to a “Shortcut” in Windows or and “Alias” in OSX. 90 | The command to make a symlink in Unix is `ln -s` and typically has the format of `ln -s source destination` 91 | 92 | Using our example of the Sharpe analysis, the command would be: 93 | 94 | `ln -s /n/standby/cores/bcbio/compute/archived_reports/tier2/20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz /n/data1/cores/bcbio/PIS/arlene_sharpe/ 20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz` 95 | 96 | This will drop a symlink in the arelene_sharpe PI directory. W 97 | You can see that you have a symlink by running `ls -lh ` in the PI directory, you should see at least one line with the `20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz` file listed with an arrow beside it pointing to the standby folder 98 | i.e. `20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz -> /n/standby/cores/bcbio/compute/archived_reports/tier2/20201020_Sharpe_RNAseq_analysis_of_siRNA_treated_PANCafs_and_myeloid_cells_after_coculture_hbc03895.tar.gz` 99 | 100 | 101 | 102 | 103 | 104 | 105 | ## NOTES 106 | 107 | If you get disconnected from your session , you will be able to go back to this session by logging in as above in step 1 again and running the command to reconnect to tmux session `foo` 108 | 109 | Example command: 110 | `tmux attach -t foo` 111 | -------------------------------------------------------------------------------- /admin/chargeback_models.md: -------------------------------------------------------------------------------- 1 | # Parameters and limits for how we determine discounted rates for projects. 2 | 3 | ## NIEHS 4 | Discretionary. 5 | PI must have appointment to Chan Environmental Health department or be a member of the NIEHS Center. 6 | 7 | ## CFAR 8 | Some discretion. 9 | Generally limited to a maximum of 80 hours for junior PI analyses but can be less, depending on funding. 10 | Can support core infrastructure and expertise development for senior PIs. 11 | In all cases work must be HIV related. 
12 | 13 | ## CATALYST 14 | 5 hours total including initial emails, consults and analysis. 15 | Exceptions may be made in consultation with Shannan for: 16 | - Grant submissions 17 | - manuscript resubmissions 18 | - GEO (or other publication database) uploads 19 | 20 | ## HMS 21 | On quad researchers only 22 | Confirm via APEX ( apex.hms.harvard.edu) 23 | Generally 10 hours free and 30 hours at 50% rate 24 | Can vary depending on funding availability 25 | 26 | -------------------------------------------------------------------------------- /admin/consulting_resources.md: -------------------------------------------------------------------------------- 1 | ## Collaborating 2 | Presentation for Boston Women in Bioinformatics meetup by Dr. Eleanor Howe, founder and CEEO of [Diamond Age Data Science](https://diamondage.com/) on collaborating well. 3 | [Link to slide deck](https://www.dropbox.com/s/rrq7a3ozbmkvmul/Eleanor%20Howe%20-%20Collaboration%20as%20a%20bioinformatician.pdf?dl=1) 4 | 5 | ## Writing 6 | Slides From "Writing Clearly and Concisely (3 Parts)" by Donald Halstead, Lecturer on Epidemiology and Director of Writing Programs. 7 | 1) [Part1 - Writing Clearly](https://www.dropbox.com/s/fzr1nzda5ivs49t/201203_Writing_Clearly1.pdf?dl=1) 8 | 2) [Part2 - Writing Clearly, Voice Stress](https://www.dropbox.com/s/yor51v66ofxr9m9/201210_Writing_Clearly2_Voice_Stress.pdf?dl=1) 9 | 3) [Part3 - Writing Clearly, Combining Sentences](https://www.dropbox.com/s/4kbycikvg0wknkb/201217_Combining_Sentences.Handout.pdf?dl=1) 10 | - [Part3 - Sentence Connectors](https://www.dropbox.com/s/oa22q11a1pvr6as/Sentence_Connector_matrix_complete.pdf?dl=1) 11 | -------------------------------------------------------------------------------- /admin/getting_started.md: -------------------------------------------------------------------------------- 1 | https://github.com/hbc/hbc_admin/blob/master/Getting_Started.md 2 | -------------------------------------------------------------------------------- /admin/initial_consults.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Information related to initial consults 3 | description: This document contains information it is useful to get from initial consults 4 | category: admin 5 | subcategory: guide 6 | tags: [consults] 7 | --- 8 | 9 | # We need to talk about 10 | 11 | ## Who they are 12 | - PI for the lab 13 | - PI seniority (implications for HSCI funding eligibility) 14 | - institute 15 | - how they found out about us 16 | 17 | ## The actual science and biology 18 | - what the lab works on 19 | - what the researcher works on 20 | - applicability to disease (translational research) 21 | - is it stem cell related (potential HSCI funding) 22 | 23 | ## Technical issues 24 | - organism 25 | - type and number of comparisons 26 | - samples groups 27 | - technique 28 | - potential batches 29 | 30 | ## Funding 31 | - who is paying for the work 32 | - what commitments are entailed by that 33 | 34 | ## Time frame 35 | - is this urgent 36 | - for paper, manuscript or grant 37 | - upcoming meetings? 38 | - need to do quik QC before continuing with experiments? 39 | 40 | ## Who will be doing the work 41 | - advice or analysis? 
42 | - advice - important to assess skill of researcher, have clarity that we cannot train, debug code or mentor, can only point to resources 43 | - analysis - we will do it all and share results and code## 44 | 45 | ## Training needs 46 | - point to courses 47 | - point to materials 48 | 49 | ## Time estimate 50 | - when we can start, 51 | - how long it will take 52 | - how much it will cost (if paying) 53 | 54 | ## Authorship expectations 55 | - basic analysis - acknowlegement only, including any funding source 56 | - advanced analysis - middle author for analyst, acknowledgement for funding source 57 | - we follow standard practices on intellectual contributions and authorship 58 | - offering authorship shows you value our efforts, helps us attract and retain qualified personnel and helps ensure continued funding 59 | - we are skilled professionals and approach no project as routine, taking complete responsibility for the integrity of the data and the accuracy of the data analysis, viewing your projects as opportunities to engage intellectually and venues to develop professionally 60 | 61 | ## Process outline/ Next steps 62 | - Basecamp 63 | - MOU delivery 64 | - how billing works 65 | - will followup with email when analyst ready 66 | -------------------------------------------------------------------------------- /admin/reproducible_research.md: -------------------------------------------------------------------------------- 1 | # Guidelines for managing HBC Research Data 2 | 3 | [Docker cheatsheet](https://dockerlabs.collabnix.com/docker/cheatsheet/) 4 | 5 | ## Motivation 6 | To handle data in a manner that allows it to be FAIR, i.e. 7 | 8 | * Findable: associated with a unique identifier 9 | * Accessible: easily retrievable 10 | * Interoperable: "use and speak the same language" via use of standardized vocabularies 11 | * Reusable: adequately described to a new user, have clear information about data usage, and have a traceable "owner's manual" or provenance 12 | 13 | As a rule of thumb it may help to think of anything that is *only* on your local machine as being not reproducible or reusable. 14 | 15 | Many of the Core's standard operating procedures are geared towards reproducibility/reusability. Please try to adhere to the following general guidelines. 16 | 17 | 18 | ### O2 19 | In general, O2 is for the big stuff. Also, anything needed to reproduce the results on run on the server should be here. 20 | * The Core's shared space is located at `/n/data1/cores/bcbio/`. 21 | * We keep projects for researcher in the PIs folder 22 | * this folder has the following structure 23 | * PIfirstname_PIlastname/project_folder 24 | * ideally the project folder would match the github repo name and look similar to the Trello/Harvest name 25 | * it should at least contain the hbc code for tracking 26 | * Refer to the [Setting up an analysis guidelines](https://github.com/hbc/knowledgebase/blob/master/admin/setting_up_an_analysis_guidelines.md) for how to name directories and which folders to include in the project folder. 27 | * Store a copy of the yaml config, metadata csv file, and slurm script used to run the analysis along with the raw data so that someone else can access the project and rerun it if necessary. 28 | * It is also very helpful if you keep a copy of the bcbio object for downstream analysis with your data. 
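As a minimal illustration of the PI/project folder layout described in the bullets above (names are placeholders, and the subfolders simply echo the config/meta/data/docs directories used elsewhere in these notes; follow the setting-up-an-analysis guidelines for the exact set):

```bash
# create a new project folder under the shared PIs directory (illustrative names only)
cd /n/data1/cores/bcbio/PIs
mkdir -p jane_doe/doe_rnaseq_of_example_tissue_hbc01234/{config,meta,data,docs}
```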
29 | 30 | 31 | ### HBC org on github 32 | In general, code is for a continually updated, searchable record of your code, and will mainly be made up of the type of code you run locally. 33 | Commit the following: 34 | * Rmarkdowns for analysis 35 | * any auxillary scripts required to analyze the data and/or present it in finished form (i.e. data object conversion scripts, yaml files for bcbioRNAseq, etc.) 36 | 37 | ### Dropbox 38 | 39 | Use dropbox to share results and code with collaborators (`HBC Team Folder (1)/Consults/piFirstName_piLastName/project_folder`). 40 | * html files 41 | * Rmarkdown files used to generate the html files 42 | * text files and "small" processed data files (i.e. files included in the reports, such as normalized counts) 43 | * documents: manuscripts, extra metadata, presentations 44 | * anything else the collaborator may need to reproduce the R-based analysis *IF* they wish to reproduce it 45 | * *RNA-seq* example: create folders for QC, DE and FA. These may be in the base project folder, or if the project is complex, in folders labeled by dates. 46 | * Within the QC folder, store the QC Rmd and the html report 47 | * In the DE folder, store the DE Rmd file, the DE report, and the raw and normalized counts matrices. Also store the DESeq2 results tables with gene symbols added 48 | * In the FA folder, store the FA Rmd, the html report and the FA results tables 49 | 50 | ### Basecamp 51 | 52 | Use Basecamp for discussions of project progress. 53 | * Link to Dropbox for results 54 | * You may post small files on basecamp to share with collaborators 55 | 56 | ### Cases to discuss: 57 | 58 | * Many times, the client sends us a presentation (ppt/pptx) or a paper to better understand the project or to provide us with some information that may end up in the metadata. Where should we store these files? 59 | *John - a "docs" directory on O2 works for this* 60 | * Similarly, should we store original metadata files or only the csv file that we end up using. 61 | *John - I generally do, as it can contain important information or be easier to run past the original researcher. I mark it as "original" so I can clearly differentiate it from the metadata we ended up using. If available, I save any code I used to modify the original metadata with the metadata.* 62 | * Reviews. Consults that consist on reviewing the client paper (i.e code). Where to keep all the provided documents (if we have to keep them). 63 | *John - I could see keeping them on Dropbox, as we would typically want to share any reviewer responses with the researcher.* 64 | * In the case we have downloaded data from multiple flowcells, should we keep the originals or the concatenated files. Note that errors can happen when "preparing samples" (concatenating the wrong files for example). 65 | *John - would be great if we could only keep the concatenated files, but only after confirming no lane effects and proper concatenation.* 66 | -------------------------------------------------------------------------------- /admin/scripts_for_data_management_on_o2: -------------------------------------------------------------------------------- 1 | # data management on O2 sop 2 | 3 | ## General 4 | Make a folder with the current date in format to put results in 5 | Year-month-day (eg. 
2020-09-08) 6 | Code is in the /n/data1/cores/bcbio/PIs/size_monitoring/hmsrc_space_scripts folder 7 | 8 | ## Finding large and old folders 9 | "dir_sizes_owners.sh -$date“ 10 | Then run the R script dir_sizes_owners.R and point at the resulting tsv file 11 | 12 | `sbatch -t 12:00:00 -p short —wrap=“ bash /n/data1/cores/bcbio/PIs/size_monitoring/hmsrc_data_management/dir_sizes_owners.sh 2021-01-06”` 13 | 14 | ## Finding redundant versions of fastq and fq files 15 | 16 | ## Finding uncompressed fastq and fq files 17 | Run find_fastqs.sh 18 | `sbatch -t 24:00:00 -p medium --wrap="bash /n/data1/core/PIs/size_monitoring/hmsrc_data_management/find_fastqs.sh"` 19 | Then run fastq_dirs.sh to reduce down to the directories 20 | `bash fastq_dirs.sh < ../2021-01-29/fastq_files.txt | sort | uniq` 21 | 22 | ## Finding leftover work directories 23 | Run find_work_dirs.sh 24 | `sbatch -t 24:00:00 -p medium --wrap="bash /n/data1/core/PIs/size_monitoring/hmsrc_data_management/find_work_dirs.sh"` 25 | Run workdir_details.sh 26 | -------------------------------------------------------------------------------- /admin/using_globus: -------------------------------------------------------------------------------- 1 | 2 | ## Scripts for setting up transfers 3 | 4 | 5 | >Hi, 6 | 7 | I’ll be handling the data transfer. 8 | 9 | Currently, we have your data on HMS RC's O2 server. If you have an account there and appropriate space, we can simply transfer ownership to you there. You can then move it to your directory on O2 and decide whether to keep it on O2 (note that they will start charging for data storage in the summer) or move it to another server/drive. 10 | 11 | Alternatively, we can use the new Globus secure transfer service that HMS offers. It has worked quite well for us to transfer from O2 to external servers and individual laptops/computers. I'm attaching some of their guidance docs on setting up the client and initiating a transfer. We would need you to sign into the system with your Harvard ID first (see below) and give us your Globus user ID (see the “Account” section in the Globus interface) or the email you used to sign in. We'd then share the folder with you to access via Globus and you will receive an email with instructions. 12 | 13 | Globus LogIN help 14 | Go to https://www.globus.org/ and click on "LogIn/Use your existing organizational login/Harvard University", this will take you to Harvard Key. You can also login with your ecommons ID. If you don’t have Harvard Key you can create a globus id - (https://www.globusid.org/create). 15 | 16 | Please let us know if you have O2 access and if not, please share your Globus ID and we'll set up your data for transfer. 17 | 18 | 19 | Note, share these two docs: 20 | Globus-Connect-Personal-install-Windows - setting up the client 21 | GlobusOneTimeTransfer - initiate the data transfer 22 | 23 | (Note SelfServiceCollections is for us - how to setup a share. No need to give this doc to anyone.) 24 | 25 | 26 | Alternate script 27 | HI – 28 | 29 | With the help of HMS, we’ve been using globus to facilitate transferring the data. (globus.org) It requires you to sign up for a free account and install their client on your system (mac/windows, etc.). We’ll give you access to the data on the HMS cluster from your globus account and globus can transfer it to whatever storage you have access to at MGH, on an external drive, etc. (ie it’s a glorified FTP, file transfer capability and takes care of error recovery & handling etc.) 
30 | 31 | If this sounds good, let us know your globus account info and we can go from there. 32 | 33 | Thanks, 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | ## Potential issues 42 | ### Permissions - initiator needs proper permissions to actually share data 43 | ### Non-standard characters in filenames 44 | - files whose names contain non-alphanumeric characters (eg. ; ? =) may be rejected by windows machines during transfer 45 | ### Changes in files on host or client 46 | - due to Globus monitoring for file integrity, changes that are made to files or directories on either side during transfer can interrupt the transfer 47 | 48 | 49 | ## Notes 50 | * Email from Globus only contains share information, unclear where the researcher learns about the app 51 | * CAn add additional endpoint folders through Preferences/Options on OSX and Windows. 52 | * Console interface or client may not tell you that transfer is complete 53 | 54 | -------------------------------------------------------------------------------- /bcbio/bcbio_genomes.md: -------------------------------------------------------------------------------- 1 | **Homo Sapiens + Covid19** 2 | 3 | *GRCh38_SARSCov2 * - built from ensembl 4 | - Assembly GRCh38, release 99. 5 | - Files: 6 | - Genomic sequence: ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 7 | - Annotation: ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz 8 | - Covid sequence and annotation added: https://www.ncbi.nlm.nih.gov/nuccore/MN988713.1?report=GenBank 9 | 10 | **Mus musculus** 11 | 12 | *GRCm38_98* - built from ensembl 13 | - Assembly GRCm38, release 98. 14 | - Files: 15 | - Genomic sequence: ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz 16 | 17 | - Annotation: ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz 18 | 19 | **Caenorhabditis elegans** 20 | 21 | *WBcel235_WS272* - built from wormbase 22 | - Assembly WBcel235, relase WS272. Project PRJNA13578 (N2 strain) 23 | - Files: 24 | - Genomic sequence: ftp://ftp.wormbase.org/pub/wormbase/releases/WS272/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS272.genomic.fa.gz 25 | 26 | - Annotation: ftp://ftp.wormbase.org/pub/wormbase/releases/WS272/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS272.canonical_geneset.gtf.gz 27 | 28 | **Drosophila melanogaster** 29 | 30 | *DGP6* - built from Flybase 31 | - has a different format for annotations for non-coding genes in the gtf 32 | - only protein coding genes will make it into Salmon and downstream 33 | 34 | *DGP6.92* - built from Ensembl info 35 | - will have all non-coding RNAs in Salmon and downstream results 36 | - shows lower gene detection rates than Flybase 37 | 38 | **Updating supported transcriptomes** 39 | 1. clone cloudbiolinux 40 | 2. update transcriptome 41 | ```bash 42 | bcbio_python cloudbiolinux/utils/prepare_tx_gff.py --cores 8 --gtf Macaca_mulatta.Mmul_8.0.1.95.chr.gtf.gz --fasta /n/app/bcbio/biodata/genomes/Mmulatta/mmul8noscaffold/seq/mmul8noscaffold.fa Mmulatta mmul8noscaffold 43 | ``` 44 | 3. upload the xz file to the bucket 45 | ```bash 46 | aws s3 cp hg19-rnaseq-2019-02-28_75.tar.xz s3://biodata/annotation/ --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers full=emailaddress=chapmanb@50mail.com 47 | ``` 48 | 4. edit cloudbiolinux ggd transcripts.yaml recipe to point to the new file uploaded on the bucket 49 | 5. 
edit the cloudbiolinux ggd gtf.yaml to show where you got the GTF from and what you did to it 50 | 6. test before pushing 51 | ```bash 52 | mkdir tmpbcbio-install 53 | ln -s `pwd`/cloudbiolinux tmpbcbio-install/cloudbiolinux 54 | log into bcbio user: sudo -su bcbio /bin/bash 55 | bcbio_nextgen.py upgrade --data 56 | ``` 57 | 7. push changes back to cloudbiolinux 58 | 59 | **Factual list of genomes in O2:/n/shared_db/bcbio/biodata/genomes as of 2020-03-13** 60 | ``` 61 | . 62 | ├── Ad37 63 | │   ├── GW7619026 64 | │   └── GW76-19026 65 | ├── Adenovirus 66 | │   └── Ad37 67 | ├── Amexicanus 68 | │   └── Amexicanus2 69 | ├── Amis 70 | │   ├── ASM28112v4 71 | │   └── ASM28112v4.a 72 | ├── Anidulans 73 | │   └── FGSC_A4 74 | ├── Atta_cephalotes 75 | │   └── Attacep1.0 76 | ├── bcbiotx 77 | ├── Btaurus 78 | │   └── UMD3.1 79 | ├── Celegans 80 | │   ├── WBcel235 81 | │   ├── WBcel235_90 82 | │   ├── WBcel235_raw 83 | │   └── WBcel235_WS272 84 | ├── Dmelanogaster 85 | │   ├── BDGP6 86 | │   ├── BDGP6.15 87 | │   ├── BDGP6.19 88 | │   ├── BDGP6.92 89 | │   ├── flybase 90 | │   └── flybase_dmel_r6.28 91 | ├── Drerio 92 | │   ├── Zv10 93 | │   ├── Zv11 94 | │   └── Zv9 95 | ├── Ecoli 96 | │   ├── EDL933 97 | │   ├── k12 98 | │   ├── MB0009 99 | │   ├── MB2409 100 | │   ├── MB2455 101 | │   ├── MG1655 102 | │   ├── MG1655_v2 103 | │   ├── MG1655_virus 104 | │   ├── MG1655_wrong_name 105 | │   └── NC_000913.3 106 | ├── Gallus_gallus 107 | │   └── galgal5 108 | ├── gdc-virus 109 | │   └── gdc-virus-hsv 110 | ├── haD37 111 | │   └── DQ900900.1 112 | ├── Hsapiens 113 | │   ├── GRCh37 114 | │   ├── hg19 115 | │   ├── hg19-ercc 116 | │   ├── hg19-mt 117 | │   ├── hg19-subset 118 | │   ├── hg19-test 119 | │   └── hg38 120 | ├── humanAd37 121 | │   └── Ad37.hg19 122 | ├── kraken 123 | │   ├── bcbio 124 | │   ├── micro 125 | │   ├── minikraken_20141208 126 | │   ├── minimal 127 | │   └── old_20141302 128 | ├── Lafricana 129 | │   └── loxAfr3 130 | ├── Macaca 131 | │   ├── Mfascicularis 132 | │   ├── Mmul8 133 | │   └── mmul8noscaffold 134 | ├── Mmulatta 135 | │   ├── mmul8 136 | │   └── mmul8noscaffold 137 | ├── Mmusculus 138 | │   ├── cloudbiolinux 139 | │   ├── GRCm38_90 140 | │   ├── GRCm38_98 141 | │   ├── greenberg-mm9 142 | │   ├── mm10 143 | │   └── mm9 144 | ├── Oaires 145 | │   └── Oar_v31 146 | ├── phiX174 147 | │   └── phix 148 | ├── Pintermedia 149 | │   └── ASM195395v1 150 | ├── Rnorvegicus 151 | │   └── rn6 152 | ├── Scerevisiae 153 | │   └── sacCer3 154 | ├── Spombe 155 | │   ├── ASM284v2.25 156 | │   └── ASM284v2.30 157 | ├── spombe 158 | │   └── ASM294v2 159 | ├── Sscrofa 160 |    ├── ss11.1 161 |    └── Sscrofa10.2 162 | ``` 163 | 164 | **How to install a custom genome in O2** 165 | - `sudo -su bcbio /bin/bash` 166 | - `cd /n/app/bcbio` 167 | - https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#reference-genome-files 168 | - https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#adding-custom-genomes 169 | 170 | **Workflow4: Whole genome trio (50x) - hg38** 171 | 172 | Inputs (FASTQ files) and results (BAM files, etc) of the [whole genome BWA alignment and GATK variant calling workflow](https://bcbio-nextgen.readthedocs.io/en/latest/contents/germline_variants.html#workflow4-whole-genome-trio-50x-hg38) are stored in `/n/data1/cores/bcbio/shared/NA12878-trio-eval` 173 | 174 | **Use an updated hg38 transcriptome** 175 | ```bash 176 | wget ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.101.gtf.gz 177 | 
gtf=Homo_sapiens.GRCh38.101.chr.gtf.gz 178 | remap_url=http://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh38_ensembl2UCSC.txt 179 | wget --no-check-certificate -qO- $remap_url | awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed 180 | gzip -cd ${gtf} | sed -f remap.sed | grep -v "*_*_alt" > hg38-remapped.gtf 181 | ``` 182 | Then pass `hg38-remapped.gtf` as the `transcriptome_gtf` option. 183 | -------------------------------------------------------------------------------- /bcbio/bcbio_tips.md: -------------------------------------------------------------------------------- 1 | ## Installing a private bcbio development repository on O2 2 | ```bash 3 | wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py 4 | python bcbio_nextgen_install.py ${HOME}/local/share/bcbio --tooldir=${HOME}/local --nodata 5 | ln -s /n/app/bcbio/biodata/genomes/ ${HOME}/local/share/genomes 6 | mkdir -p ${HOME}/local/share/galaxy 7 | ln -s /n/app/bcbio/biodata/galaxy/tool-data ${HOME}/local/share/galaxy/tool-data 8 | export PATH="${HOME}/local/bin:$PATH" 9 | ``` 10 | 11 | ## How to fix potential conda errors during installation 12 | Add the following to your `${HOME}/.condarc`: 13 | ```yaml 14 | channels: 15 | - bioconda 16 | - defaults 17 | - conda-forge 18 | safety_checks: disabled 19 | add_pip_as_python_dependency : false 20 | rollback_enabled: false 21 | notify_outdated_conda: false 22 | ``` 23 | 24 | ## Using shared bcbio installation on O2 25 | To use bcbio installation in `/n/app/bcbio` add the corresponding tool and Conda directories to your `$PATH`: 26 | ```shell 27 | export PATH="/n/app/bcbio/tools/bin:/n/app/bcbio/dev/anaconda/bin:${PATH}" 28 | ``` 29 | 30 | ## How to fix jobs bcbio jobs timing out 31 | The O2 cluster can take a really long time to schedule jobs. 
If you are having problems with bcbio timing out, set your --timeout parameter to something high, like this: 32 | ```bash 33 | /n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py ../config/bcbio_ensembl.yaml -n 72 -t ipython -s slurm -q short -r --tag feany --timeout 6000 -t 0-11:00 34 | ``` 35 | 36 | ## How to run a one-node bcbio job (multicore, not multinode) 37 | it just runs a bcbio job on one node of the cluster (no IPython) 38 | [More slurm options](https://wiki.rc.hms.harvard.edu/display/O2/Using+Slurm+Basic#UsingSlurmBasic-sbatchoptionsquickreference) 39 | 40 | ``` 41 | #!/bin/bash 42 | 43 | # https://slurm.schedmd.com/sbatch.html 44 | 45 | #SBATCH --partition=priority # Partition (queue) 46 | #SBATCH --time=3-00:00 # Runtime in D-HH:MM format 47 | #SBATCH --job-name=bcbio # Job name - any name 48 | #SBATCH -c 10 # cores per task 49 | #SBATCH --mem-per-cpu=10G # Memory needed per CPU or use --mem to limit total memory 50 | #SBATCH --output=project_%j.out # File to which STDOUT will be written, including job ID 51 | #SBATCH --error=project_%j.err # File to which STDERR will be written, including job ID 52 | #SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL) by default goes to the email associated with O2 accounts 53 | #SBATCH --mail-user=abc123@hms.harvard.edu # Email to which notifications will be sent 54 | 55 | bcbio_nextgen.py ../config/illumina_rnaseq.yaml -n 10 56 | ``` 57 | 58 | ## Upgrading shared installation of bcbio on O2 59 | How to upgrade `/n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py` installation: 60 | * switch to `bcbio` user account: 61 | ``` 62 | sudo -su bcbio 63 | ``` 64 | * make sure `umask` is set correctly: 65 | ``` 66 | umask 0002 67 | ``` 68 | * edit `/n/app/bcbio/bcbio.upgrade.sh`: set `--mail-user` and other options as necessary 69 | * run the upgrade: 70 | ``` 71 | sbatch /n/app/bcbio/bcbio.upgrade.sh 72 | ``` 73 | * copy install log (job output) to `/n/app/bcbio/bcbio.upgrade.sh_YYYY-MM-DD.{err,out}` where YYYY-MM-DD is today's date 74 | * test the installation 75 | 76 | ## conda tricks 77 | Packages dependent on a given one: 78 | ``` 79 | grep r-base /n/app/bcbio/dev/anaconda/pkgs/*/info/index.json 80 | ``` 81 | -------------------------------------------------------------------------------- /bcbio/bcbio_workflow_mary.md: -------------------------------------------------------------------------------- 1 | # Bcbio workflow by Mary Piper 2 | https://github.com/marypiper/bcbio_rnaseq_workflow/blob/master/bcbio_rna-seq_workflow.md 3 | 4 | **Documentation for bcbio:** [bcbio-nextgen readthedocs](http://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html#rna-seq) 5 | 6 | ## Set-up 7 | 1. Follow instructions for starting an analysis using https://github.com/hbc/knowledgebase/blob/master/admin/setting_up_an_analysis_guidelines.md. 8 | 9 | 3. 
Download fastq files from facility to data folder 10 | 11 | - Download fastq files from a non-password protected url 12 | - `wget --mirror url` (for each file of sample in each lane) 13 | - Rory's code to concatenate files for the same samples on multiple lanes: 14 | 15 | barcodes="BC1 BC2 BC3 BC4" 16 | for barcode in $barcodes 17 | do 18 | find folder -name $barcode_*R1.fastq.gz -exec cat {} \; > data/${barcode}_R1.fastq.gz 19 | find folder -name $barcode_*R2.fastq.gz -exec cat {} \; > data/${barcode}_R2.fastq.gz 20 | done 21 | 22 | - Download from password protected FTP such as Dana Farber 23 | - `wget -r --user --password ` 24 | 25 | - Download fastq files from BioPolymers: 26 | - `rsync -avr username@bpfngs.med.harvard.edu:./folder_name .` 27 | 28 | --OR-- 29 | 30 | - `sftp username@bpfngs.med.harvard.edu` 31 | - `cd` to correct folder 32 | - `mget *.tab` 33 | - `mget *.bz2` 34 | 35 | - Download from the Broad using Aspera: 36 | - To download data I use this [script](https://github.com/marypiper/bcbio_rnaseq_workflow/blob/master/aspera_connect_lsf). 37 | 38 | 4. Create metadata in Excel create sym links by concatenate("ln -s ", column $A2 with path_to_where_files_are_stored, " ", column with name of sym link $D2). Can extract parts of column using delimiters in Data tab column to text. 39 | 40 | 5. Save Excel as text and replace ^M with new lines in vim: 41 | 42 | `:%s//\r/g` 43 | 44 | 6. Settings for bcbio- make sure you have following settings in `~/.bashrc` file: 45 | 46 | ```bash 47 | unset PYTHONHOME 48 | unset PYTHONPATH 49 | export PATH=/n/app/bcbio/tools/bin:$PATH 50 | ``` 51 | 52 | 7. Within the `meta` folder, add your comma-separated metadata file (`projectname_rnaseq.csv`) 53 | - first column is `samplename` and is the names of the fastq files as they appear in the directory (should be the file name without the extension (no .fastq or R#.fastq for paired-end reads)) 54 | - second column is `description` and is unique names to call samples - provide the names you want to have the samples called by 55 | - **FOR CHIP-SEQ** need additional columns: 56 | - `phenotype`: `chip` or `input` for each sample 57 | - `batch`: batch1, batch2, batch3, ... for grouping each input with it's appropriate chip(s) 58 | - additional specifics regarding the metadata file: [http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration](http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration) 59 | 60 | 8. 
Within the `config` folder, add your custom Illumina template 61 | - Example template for human RNA-seq using Illumina prepared samples (genome_build for mouse = mm10, human = hg19 or hg38 (need to change star to hisat2 if using hg38): 62 | 63 | ```yaml 64 | # Template for mouse RNA-seq using Illumina prepared samples 65 | --- 66 | details: 67 | - analysis: RNA-seq 68 | genome_build: mm10 69 | algorithm: 70 | aligner: star 71 | quality_format: standard 72 | strandedness: firststrand 73 | tools_on: bcbiornaseq 74 | bcbiornaseq: 75 | organism: mus musculus 76 | interesting_groups: [genotype] 77 | upload: 78 | dir: /n/data1/cores/bcbio/PIs/vamsi_mootha/hbc_mootha_rnaseq_of_metabolite_transporter_KO_mouse_livers_hbc03618_1/bcbio_final 79 | ``` 80 | 81 | - List of genomes available can be found by running `bcbio_setup_genome.py` 82 | - strandedness options: `unstranded`, `firststrand`, `secondstrand` 83 | - Additional parameters can be found: [http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration](http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration) 84 | - Best practice templates can be found: [https://github.com/chapmanb/bcbio-nextgen/tree/master/config/templates](https://github.com/chapmanb/bcbio-nextgen/tree/master/config/templates) 85 | 86 | 87 | 9. Within the `data` folder, add all your fastq files to analyze. 88 | 89 | ## Analysis 90 | 91 | 1. Go to `/n/scratch2/your_ECommonsID/PI` and create an `analysis` folder. Change directories to `analysis` folder and create the full Illumina instructions using the Illumina template created in Set-up: step #6. 92 | - `srun --pty -p interactive -t 0-12:00 --mem 8G bash` start interactive job 93 | - `cd path-to-folder/analysis` change directories to analysis folder 94 | - `bcbio_nextgen.py -w template /n/data1/cores/bcbio/PIs/path_to_templates/star-illumina-rnaseq.yaml /n/data1/cores/bcbio/PIs/path_to_meta/*-rnaseq.csv /n/data1/cores/bcbio/PIs/path_to_data/*fastq.gz` run command to create the full yaml file 95 | 96 | 2. Create script for running the job (in analysis folder) 97 | 98 | For a larger job: 99 | 100 | ```bash 101 | #!/bin/sh 102 | #SBATCH -p medium 103 | #SBATCH -J mootha 104 | #SBATCH -o run.o 105 | #SBATCH -e run.e 106 | #SBATCH -t 0-100:00 107 | #SBATCH --cpus-per-task=1 108 | #SBATCH --mem-per-cpu=8G 109 | #SBATCH --mail-type=ALL 110 | #SBATCH --mail-user=piper@hsph.harvard.edu 111 | 112 | export PATH=/n/app/bcbio/tools/bin:$PATH 113 | 114 | /n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py ../config/\*\_rnaseq.yaml -n 48 -t ipython -s slurm -q medium -r t=0-100:00 --timeout 300 --retries 3 115 | ``` 116 | 117 | For a smaller job, it might be faster in overall time to just run the job on the priority queue. If you only have a few samples, and your fairshare score is low, running on the priority queue could end up being faster since you will quickly get a job there and not have to wait. 118 | 119 | ```bash 120 | #!/bin/sh 121 | #SBATCH -p priority 122 | #SBATCH -J mootha 123 | #SBATCH -o run.o 124 | #SBATCH -e run.e 125 | #SBATCH -t 0-100:00 126 | #SBATCH --cpus-per-task=8 127 | #SBATCH --mem-per-cpu=64G 128 | #SBATCH --mail-type=ALL 129 | #SBATCH --mail-user=piper@hsph.harvard.edu 130 | export PATH=/n/app/bcbio/tools/bin:$PATH 131 | /n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py ../config/\*\_rnaseq.yaml -n 8 132 | ``` 133 | 134 | 3. 
Go to work folder and start the job - make sure in an interactive session 135 | 136 | ```bash 137 | cd /n/scratch2/path_to_folder/analysis/\*\_rnaseq/work 138 | sbatch ../../runJob-\*\_rnaseq.slurm 139 | ``` 140 | 141 | ### Exploration of region of interest 142 | 143 | 1. The bam files will be located here: `path-to-folder/*-rnaseq/analysis/*-rnaseq/work/align/SAMPLENAME/NAME_*-rnaseq_star/` # needs to be updated 144 | 145 | 2. Extracting interesting region (example) 146 | - `samtools view -h -b sample1.bam "chr2:176927474-177089906" > sample1_hox.bam` 147 | 148 | - `samtools index sample1_hox.bam` 149 | 150 | 151 | ## Mounting bcbio 152 | 153 | `sshfs mp298@transfer.orchestra.med.harvard.edu:/n/data1/cores/bcbio ~/bcbio -o volname=bcbio -o follow_symlinks` 154 | -------------------------------------------------------------------------------- /bcbio/building_a_hybrid_murine_transgene_reference_genome.md: -------------------------------------------------------------------------------- 1 | ## Building hybrid murine/transgene reference genome 2 | 3 | Heather Wick 4 | 5 | This document is based on [prior contributions by James Billingsley](https://github.com/hbc/knowledgebase/blob/master/bcbio/Creating_Hybrid_Mammal_Viral_Reference_Genome.md) 6 | 7 | 1) Download reference genome from ensembl 8 | * Example: latest mouse reference (GRCm9) fasta (primary assembly) and gtf downloaded [here](http://useast.ensembl.org/Mus_musculus/Info/Index![image](https://github.com/hbc/knowledgebase/assets/33556230/98d91abd-5cd9-4651-b541-48e3bf413483) 9 | ) 10 | 11 | 2) Acquire transgenes and format to match standard fasta format 12 | * In this case, genes were provided by the client 13 | * For our purposes, each transgene was considered to be on its own chromosome, the length of which was the length of the individual gene 14 | * Editing was done via plain text editor 15 | 16 | Format: 17 | ``` 18 | > [GENE_NAME] dna:chromosom chromosome:[GENOME]:[GENE_NAME]:[CHR_START]:[CHR_END]:1 REF 19 | BASEPAIRS_ALLCAPS_60_CHARACTERS_WIDE 20 | ``` 21 | Example: 22 | ``` 23 | >H2B-GFP dna:chromosome chromosome:GRCm39:H2B-GFP:1:1116:1 REF 24 | ATGCCAGAGCCAGCGAAGTCTGCTCCCGCCCCGAAAAAGGGCTCCAAGAAGGCGGTGACT 25 | AAGGCGCAGAAGAAAGGCGGCAAGAAGCGCAAGCGCAGCCGCAAGGAGAGCTATTCCATC 26 | TATGTGTACAAGGTTCTGAAGCAGGTCCACCCTGACACCGGCATTTCGTCCAAGGCCATG 27 | GGCATCATGAATTCGTTTGTGAACGACATTTTCGAGCGCATCGCAGGTGAGGCTTCCCGC 28 | CTGGCGCATTACAACAAGCGCTCGACCATCACCTCCAGGGAGATCCAGACGGCCGTGCGC 29 | CTGCTGCTGCCTGGGGAGTTGGCCAAGCACGCCGTGTCCGAGGGTACTAAGGCCATCACC 30 | AAGTACACCAGCGCTAAGGATCCACCGGTCGCCACCATGGTGAGCAAGGGCGAGGAGCTG 31 | TTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTC 32 | AGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATC 33 | TGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGC 34 | GTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCC 35 | ATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAG 36 | ACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGC 37 | ATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGC 38 | CACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATC 39 | CGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCC 40 | ATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTG 41 | AGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCC 42 | GGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA 43 | ``` 44 | 45 | 3) Create GTF for transgenes 46 | * [Information about GTF file format can be found here](https://useast.ensembl.org/info/website/upload/gff.html) 47 | * For our 
purposes, each transgene was considered to be on its own chromosome, the length of which was the length of the individual gene 48 | * There are only 9 columns. The Attributes in the last column are separated by semicolons/spaces, not tabs. 49 | 50 | Format: 51 | ``` 52 | GENE_NAME SOURCE FEATURE CHR_START CHR_END SCORE STRAND FRAME ATTRIBUTES;SEPARATED;BY;SEMI-COLONS;NOT;TABS!; 53 | ``` 54 | Example: 55 | ``` 56 | H2B-GFP unknown exon 1 1116 . + . gene_id "H2B-GFP"; transcript_id "H2B-GFP"; gene_name "H2B-GFP"; gene_biotype "protein_coding"; 57 | ``` 58 | 59 | 4) Concatenate reference fasta and gtf with transgene fasta and gtfs 60 | 61 | Format: 62 | ``` 63 | cat GENOME.dna.primary_assembly.fa TRANSGENE.fa > GENOME.dna.primary_assembly_TRANSGENE.fa 64 | cat GENOME.gtf TRANSGENE.gtf > GENOME_TRANSGENE.gtf 65 | ``` 66 | Example (two transgenes were added): 67 | ``` 68 | cat Mus_musculus.GRCm39.dna.primary_assembly.fa H2B-GFP.fa tTA.fa > Mus_musculus.GRCm39.dna.primary_assembly_GFP_tTA.fa 69 | cat Mus_musculus.GRCm39.110.gtf H2B-GFP.gtf tTA.gtf > Mus_musculus.GRCm39.110_GFP_tTA.gtf 70 | ``` 71 | * Check your formatting!! Sometimes extra new lines or tabs are easy to accidentally add 72 | 73 | 5) Create folder for new reference genome 74 | ``` 75 | sudo -su bcbio /bin/bash 76 | cd /n/app/bcbio/1.2.9/genomes/Mmusculus/ 77 | mkdir GRCm39 78 | ``` 79 | * If you wish to move your new fasta and gtf to the new directory, you may need to move the files to your home directory, then use sudo to sign in as bcbio before copying them to their final location. You may also need to log into an interactive session because the files are quite large 80 | ``` 81 | cd GRCm39 82 | srun --pty -p interactive --mem 500M -t 0-06:00 83 | mv /path/to/home/dir/GENOME.dna.primary_assembly_TRANSGENE.fa . 84 | mv /path/to/home/dir/GENOME_TRANSGENE.gtf . 
85 | ``` 86 | 87 | 6) Run bcbio_setup_genome.py 88 | * Process is too long to run interactively, so use a script: 89 | 90 | Format: 91 | ``` 92 | #!/bin/bash 93 | #SBATCH -t 5-00:00 # Runtime in D-HH:MM format 94 | #SBATCH --job-name=genome_setupbcbio # Job name 95 | #SBATCH -c 10 # cores 96 | #SBATCH -p medium 97 | #SBATCH --mem-per-cpu=5G # Memory needed per CPU or --mem 98 | #SBATCH --output=project_%j.out # File to which STDOUT will be written, including job ID 99 | #SBATCH --error=project_%j.err # File to which STDERR will be written, including job ID 100 | #SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL) 101 | #SBATCH --mail-user=[USER]k@hsph.harvard.edu 102 | 103 | #this script submits bcbio_setup_genome.py 104 | #must sudo into bcbio first: 105 | sudo -su bcbio /bin/bash 106 | 107 | date 108 | 109 | bcbio_setup_genome.py -f GENOME.dna.primary_assembly_TRANSGENE.fa -g GENOME_TRANSGENE.gtf -i bwa star seq -n SPECIES -b GENOME --buildversion BUILD 110 | 111 | date 112 | ``` 113 | Example of actual script can be found here: `/n/app/bcbio/1.2.9/genomes/Mmusculus/GRCm39_GFP_tTA/submit_genome_setup.sh` 114 | 115 | -------------------------------------------------------------------------------- /bcbio/git.md: -------------------------------------------------------------------------------- 1 | # Git tips 2 | 3 | - [Pro git book](https://git-scm.com/book/en/v2) 4 | - https://github.com/Kunena/Kunena-Forum/wiki/Create-a-new-branch-with-git-and-manage-branches 5 | - https://nvie.com/posts/a-successful-git-branching-model/ 6 | - http://sandofsky.com/blog/git-workflow.html 7 | - https://blog.izs.me/2012/12/git-rebase 8 | 9 | # Sync with upstream/master, delete all commits in origin/master branch 10 | ``` 11 | git checkout master 12 | git reset --hard upstream/master 13 | git push --force 14 | ``` 15 | 16 | # Sync with upstream/master 17 | ``` 18 | git fetch upstream 19 | git checkout master 20 | git merge upstream/master 21 | ``` 22 | 23 | # Feature workflow 24 | ``` 25 | git checkout -b feature_branch 26 | # 1 .. N 27 | git add -A . 28 | git commit -m "sync" 29 | git push? 30 | 31 | git checkout master 32 | git merge --squash private_feature_branch 33 | git commit -v 34 | git push 35 | # pull request to upstream 36 | # code review 37 | # request merged 38 | git branch -d feature_branch 39 | git push origin :feature_branch 40 | ``` 41 | 42 | # Migrating github.com repos to [code.harvard.edu](https://code.harvard.edu/) 43 | 44 | 1. Set up your ssh keys. You can use your old keys (if you remember your passphrase) by going to `Settings --> SSH and GPG keys --> New SSH key` 45 | 2. Create your repo in code.harvard.edu. Copy the 'Clone with SSH link`: `git@code.harvard.edu:HSPH/repo_name.git` (*NOTE: some of us have had trouble with the HTTPS link*) 46 | 3. Go to your local repo that you would like to migrate. Enter the directory. 47 | 48 | ``` 49 | # this will add a second remote location 50 | git remote add harvard git@code.harvard.edu:HSPH/repo_name.git 51 | 52 | # this will get rid of the old origin remote 53 | git push -u harvard --all 54 | ``` 55 | 56 | 4. You should see the contents of your local repo in Enterprise. Now go to 'Settings' for the repo and 'Collaborators and Teams'. Here you will need to add Bioinformatics Core and give 'Admin' priveleges. 57 | 58 | 59 | > **NOTE:** If you decide to compile all your old repos into one giant repo (i.e. 
[hbc_mistrm_reports_legacy](https://code.harvard.edu/HSPH/hbc_mistrm_reports_legacy)), make sure that you remove all `.git` folders from each of them before committing. Otherwise you will not be able to see the contents on each folder on Enterprise. 60 | 61 | -------------------------------------------------------------------------------- /bcbio/gtf_gff_validator.md: -------------------------------------------------------------------------------- 1 | # GTF and GFF validators 2 | 3 | > ### Use case example: bosTau9 genome 4 | Building a new genome in bcbio. Reference files were retrieved from NCBI (RefSeq genome and gtf files). There are some additional whitespaces in the file causing errors in the build. Solution: download the gff file instead and validate using `gff3validator` and use as input to bcbio with the added parameter `-gff3`. More info on genometools installs and commands found below. Another option woud be to use the GTF validator (perl-based), also listed below. 5 | 6 | > **NOTE:** Initially found during Moazed consult 7 | 8 | 9 | ## GFF Validator 10 | - includes gtf to gff converter 11 | - Download latest from: http://genometools.org/pub/ 12 | - Documentation: http://genometools.org/tools.html 13 | 14 | ### Usage 15 | ```{bash, eval=FALSE} 16 | {installed_path}/gt -help 17 | {installed_path}/gt gff3validator {gff_file} 18 | {installed_path}/gt gtf_to_gff3 {gtf_file} 19 | {installed_path}/gt gff3_to_gtf {gff_file} 20 | ``` 21 | ### Errors 22 | Report bugs to https://github.com/genometools/genometools/issues. 23 | 24 | ## GTF validator (perl-based) 25 | - Download: https://mblab.wustl.edu/software.html#evalLink (press 'RELEASES") 26 | - Documentation: https://mblab.wustl.edu/media/software/eval-documentation.pdf 27 | 28 | ### Usage 29 | let's say tar is downloaded and extracted at **/home/eval-2.2.8** 30 | 31 | That folder is noted as **{eval}** in the code: 32 | 33 | ```{bash, eval=FALSE} 34 | perl -I {eval} {eval}/validate_gtf.pl -f {gtf_file} {fasta_file_associated_with_the_gtf} 35 | ``` 36 | '-f' is an option, it creates a fixed file with same title as the origial gtf with '.fixed.gtf' extension. 37 | A custom hg38 gtf ran for an hour. 38 | Memory ran out with 8GB for some reason, so I ran with 64GB just in case. Since it might be using information from the genome.fa extensively. 39 | -------------------------------------------------------------------------------- /chipseq/bcbio_output_summary.sh: -------------------------------------------------------------------------------- 1 | #/bin/bash 2 | # Written by Will Gammerdinger at HSPH on September 15th, 2022. 
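# Added summary: this script pulls per-sample metrics (total/mapped reads, mapping %, peak counts,
# RiP %, PBC1/PBC2, bottlenecking, NRF, complexity, GC %) from the bcbio metadata CSV and the
# MultiQC general stats table and writes them to a single tab-delimited report.
# Assumed usage: edit the variable block below for your project, then run `bash bcbio_output_summary.sh`.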
3 | 4 | # Assign input and output files to variables 5 | metadata_file=/n/data1/cores/bcbio/PIs/andrew_lassar/hbc_lassar_mouse_TF_profiling_FoxC_atac_cutnrun_rnaseq_hbc04930/cutnrun/meta/FOXC.csv 6 | multiqc_file=/n/data1/cores/bcbio/PIs/andrew_lassar/hbc_lassar_mouse_TF_profiling_FoxC_atac_cutnrun_rnaseq_hbc04930/cutnrun/final/2024-02-01_FOXC/multiqc/multiqc_data/multiqc_general_stats.txt 7 | sample_directory=/n/data1/cores/bcbio/PIs/andrew_lassar/hbc_lassar_mouse_TF_profiling_FoxC_atac_cutnrun_rnaseq_hbc04930/cutnrun/final/ 8 | sample_prefix=X 9 | input_antibody_label=input 10 | output_file=/n/data1/cores/bcbio/PIs/andrew_lassar/hbc_lassar_mouse_TF_profiling_FoxC_atac_cutnrun_rnaseq_hbc04930/cutnrun/final/summarized_report.txt 11 | 12 | # Print Header line 13 | echo -e "sample\tantibody\ttreatment\treads\tmapped_reads\tmapping_percent\tpeaks\tRiP_percent\tPBC1\tPBC2\tBottlenecking\tNRF\tComplexity\tGC_percent" > $output_file; 14 | 15 | # Determine the column number for the columns on interest 16 | antibody_column=`awk -F ',' 'NR==1{for (i=1; i<=NF; i++) { if ($i == "antibody") { print i } }}' $metadata_file` 17 | phenotype_column=`awk -F ',' 'NR==1{for (i=1; i<=NF; i++) { if ($i == "phenotype") { print i } }}' $metadata_file` 18 | treatment_column=`awk -F ',' 'NR==1{for (i=1; i<=NF; i++) { if ($i == "treatment") { print i } }}' $metadata_file` 19 | reads_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-Total_reads") { print i } }}' $multiqc_file` 20 | mapped_reads_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-Mapped_reads") { print i } }}' $multiqc_file` 21 | mapping_percent_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "Samtools_mqc-generalstats-samtools-reads_mapped_percent") { print i } }}' $multiqc_file` 22 | RiP_percent_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-RiP_pct") { print i } }}' $multiqc_file` 23 | PBC1_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-PBC1") { print i } }}' $multiqc_file` 24 | PBC2_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-PBC2") { print i } }}' $multiqc_file` 25 | Bottlenecking_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-bottlenecking") { print i } }}' $multiqc_file` 26 | NRF_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-NRF") { print i } }}' $multiqc_file` 27 | Complexity_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "bcbio_mqc-generalstats-bcbio-complexity") { print i } }}' $multiqc_file` 28 | GC_percent_column=`awk 'NR==1{for (i=1; i<=NF; i++) { if ($i == "FastQC_mqc-generalstats-fastqc-percent_gc") { print i } }}' $multiqc_file` 29 | 30 | # For each sample gather the various statistics and print to a the summarized report 31 | for i in ${sample_directory}${sample_prefix}*; do 32 | sample=`basename $i` 33 | antibody=`grep $sample $metadata_file | awk -F ',' -v antibody_column=$antibody_column '{print $antibody_column}'` 34 | sample_type=`grep $sample $metadata_file | awk -F ',' -v phenotype_column=$phenotype_column '{print $phenotype_column}'` 35 | treatment=`grep $sample $metadata_file | awk -F ',' -v treatment_column=$treatment_column '{print $treatment_column}'` 36 | reads=`grep $sample $multiqc_file | awk -F'\t' -v reads_column=$reads_column '{print $reads_column}' | sed 's/.0$//g'` 37 | mapped_reads=`grep $sample $multiqc_file | awk -F'\t' -v 
mapped_reads_column=$mapped_reads_column '{print $mapped_reads_column}' | sed 's/.0$//g'` 38 | mapping_percent=`grep $sample $multiqc_file | awk -F'\t' -v mapping_percent_column=$mapping_percent_column '{print $mapping_percent_column}' | head -c 5` 39 | if [[ $antibody == $input_antibody_label ]]; then 40 | peaks="NA" 41 | else 42 | peaks=`wc -l ${sample_directory}${sample}/macs2/${sample}_peaks.* | awk 'NR==1{print $1}'` 43 | fi 44 | RiP_percent=`grep $sample $multiqc_file | awk -F'\t' -v RiP_percent_column=$RiP_percent_column '{print $RiP_percent_column}'` 45 | PBC1=`grep $sample $multiqc_file | awk -F'\t' -v PBC1_column=$PBC1_column '{print $PBC1_column}' | head -c 5` 46 | PBC2=`grep $sample $multiqc_file | awk -F'\t' -v PBC2_column=$PBC2_column '{print $PBC2_column}' | head -c 5` 47 | Bottlenecking=`grep $sample $multiqc_file | awk -F'\t' -v Bottlenecking_column=$Bottlenecking_column '{print $Bottlenecking_column}'` 48 | NRF=`grep $sample $multiqc_file | awk -F'\t' -v NRF_column=$NRF_column '{print $NRF_column}' | head -c 5` 49 | Complexity=`grep $sample $multiqc_file | awk -F'\t' -v Complexity_column=$Complexity_column '{print $Complexity_column}'` 50 | GC_percent=`grep $sample $multiqc_file | awk -F'\t' -v GC_percent_column=$GC_percent_column '{print $GC_percent_column}'` 51 | echo -e "$sample\t$antibody\t$treatment\t$reads\t$mapped_reads\t$mapping_percent\t$peaks\t$RiP_percent\t$PBC1\t$PBC2\t$Bottlenecking\t$NRF\t$Complexity\t$GC_percent" >> $output_file; 52 | done 53 | 54 | echo -e "The summarized report has been created and can be found here:\n\t$output_file" 55 | -------------------------------------------------------------------------------- /chipseq/metadata.md: -------------------------------------------------------------------------------- 1 | # Note on Metadata for chipseq 2 | ## Linking the inputs together with one line. (Thanks to Meeta) 3 | ## antibody column matters! (needs to be included in vignette maybe?) 4 | ``` 5 | samplename,description,batch,phenotype,replicate,treatment,antibody 6 | Lib4.R1.bc.2.WTMTF2.fq,WTMTF2_1,pair1,chip,1,WT,MTF2 7 | Lib9.R1R2.bc.19.WTMTF2.fq,WTMTF2_2,pair2,chip,2,WT,MTF2 8 | Lib3.bc.1.WTH3K27ME3.fq,WTH3k27ME3_1,pair3,chip,1,WT,H3k27ME3 9 | Lib9.R1R2.bc.1.WTH3K27ME3.fq,WTH3k27ME3_2,pair4,chip,2,WT,H3k27ME3 10 | Lib2.bc.1.MKOFLAG.fq,MTF2KO_1,pair5,chip,1,MTF2KO,FLAG 11 | Lib9.R1R2.bc.30.MKOFLAG.fq,MTF2KO_2,pair6,chip,2,MTF2KO,FLAG 12 | Lib2.bc.2.MKOWTFLAG.fq,MTF2KO_WTRES_1,pair7,chip,1,MTF2KO_WTRES,WT-FLAG 13 | Lib9.R1R2.bc.31.MKOWTFLAG.fq,MTF2KO_WTRES_2,pair8,chip,2,MTF2KO_WTRES,WT-FLAG 14 | Lib2.bc.15.MKOMUTFLAG.fq,MTF2KO_MUTRES_1,pair9,chip,1,MTF2KO_MUTRES,MUT-FLAG 15 | Lib9.R1R2.bc.32.MKOMUTFLAG.fq,MTF2KO_MUTRES_2,pair10,chip,2,MTF2KO_MUTRES,MUT-FLAG 16 | Lib3.bc.7.EKOH3K27ME3.fq,EEDKO_1,pair11,chip,1,EEDKO,H3k27ME3 17 | Lib3.bc.8.EKOH3K27ME3.fq,EEDKO_2,pair12,chip,2,EEDKO,H3k27ME3 18 | Lib10.R1R2.bc.3.EKOWTRES.fq,EKOWT_1,pair13,chip,1,EKO_WT,H3k27ME3 19 | Lib10.R1R2.bc.5.EKOWTRES.fq,EKOWT_2,pair14,chip,2,EKO_WT,H3k27ME3 20 | Lib8.R1R2.bc.16.EKOMUTRES.fq,EKOMUT_1,pair15,chip,1,EKO_MUT,H3k27ME3 21 | Lib10.R1R2.bc.4.EKOMUTRES.fq,EKOMUT_2,pair16,chip,2,EKO_MUT,H3k27ME3 22 | Lib2.bc.14.INPUT.fq,input_global,pair1;pair2;pair3;pair4;pair5;pair6;pair7;pair8;pair9;pair10;pair11;pair12;pair13;pair14;pair15;pair16,input,1,WT,Input 23 | ``` 24 | 25 | ## I am getting these warnings and some samples ran with broadpeak. (samples that have H3K27ME3) 26 | > Going through the log, I found this.... 
as I didn't get any peaks for samples that didn't have H3k27ME3 in the antibody column. 27 | ``` 28 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 29 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 30 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 31 | 32 | [2021-04-10T05:28Z] mut-flag specified, but not listed as a supported antibody. Valid antibodies are {'h3k36me3', 'narrow', 'h3k4me1', 33 | 'h2afz', 'h3ac', 'h4k20me1', 'h3k4me3', 'h3k4me2', 'h3k9ac', 'h3k79me2', 'h3k9me2', 'h3f3a', 'h3k79me3', 'h3k27me3', 'broad', 'h3k9me3', 'h3k9me1', 'h3k27ac'}. 34 | If you know your antibody should be called with narrow or broad peaks, supply 'narrow' or 'broad' as the antibody. 35 | 36 | [2021-04-10T05:28Z] flag specified, but not listed as a supported antibody. Valid antibodies are {'h3k36me3', 'narrow', 'h3k4me1', 37 | 'h2afz', 'h3ac', 'h4k20me1', 'h3k4me3', 'h3k4me2', 'h3k9ac', 'h3k79me2', 'h3k9me2', 'h3f3a', 'h3k79me3', 'h3k27me3', 'broad', 'h3k9me3', 'h3k9me1', 'h3k27ac'}. 38 | If you know your antibody should be called with narrow or broad peaks, supply 'narrow' or 'broad' as the antibody. 39 | 40 | [2021-04-10T05:28Z] wt-flag specified, but not listed as a supported antibody. Valid antibodies are {'h3k36me3', 'narrow', 'h3k4me1', 41 | 'h2afz', 'h3ac', 'h4k20me1', 'h3k4me3', 'h3k4me2', 'h3k9ac', 'h3k79me2', 'h3k9me2', 'h3f3a', 'h3k79me3', 'h3k27me3', 'broad', 'h3k9me3', 'h3k9me1', 'h3k27ac'}. 42 | If you know your antibody should be called with narrow or broad peaks, supply 'narrow' or 'broad' as the antibody. 43 | 44 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 45 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 46 | [2021-04-10T05:28Z] h3k27me3 specified, using broad peak settings. 47 | ``` 48 | -------------------------------------------------------------------------------- /img/can_not_connect.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/can_not_connect.png -------------------------------------------------------------------------------- /img/images.md: -------------------------------------------------------------------------------- 1 | images go here! 
2 | -------------------------------------------------------------------------------- /img/noor_umap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/noor_umap.png -------------------------------------------------------------------------------- /img/r_taking_longer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/r_taking_longer.png -------------------------------------------------------------------------------- /img/simpsons.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/simpsons.gif -------------------------------------------------------------------------------- /img/zhu_umap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/img/zhu_umap.png -------------------------------------------------------------------------------- /long_read_data/Jihe's presentation at ABRF.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/long_read_data/Jihe's presentation at ABRF.pptx -------------------------------------------------------------------------------- /long_read_data/Jihe's summary of ABRF discussion at core meeting.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/long_read_data/Jihe's summary of ABRF discussion at core meeting.pptx -------------------------------------------------------------------------------- /long_read_data/Jihe's_long_read_presentation_core_meeting.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/long_read_data/Jihe's_long_read_presentation_core_meeting.pptx -------------------------------------------------------------------------------- /long_read_data/genome_assembly_tools.md: -------------------------------------------------------------------------------- 1 | ## Hybrid Assembly Strategies (smaller/prokaryotic genomes) 2 | 3 | 4 | ## Hybrid Assembly Strategies (larger/eukaryotic genomes) 5 | 6 | ### Oxford nanopore and Illumina - 7 | 8 | > Note that these were suggestions for an **algal assembly** from Dr. Chris Fields and Kim Walden at [UIUC's bioinformatics core, HPCBio](https://hpcbio.illinois.edu/). 9 | 10 | * Get a good estimate of the genome size using your illumina reads and [Genome Scope](http://qb.cshl.edu/genomescope/). You will need to get a kmer histogram from your illumina data to use as input to Genome Scope, and you can use [KMC](https://github.com/refresh-bio/KMC) or [Jellyfish](http://www.genome.umd.edu/jellyfish.html) for that. 
11 | * Workflow 12 | * Assemble Nanopore reads using something like [wtdbg2](https://github.com/ruanjue/wtdbg2), [Flye](https://github.com/fenderglass/Flye), or [miniasm](https://github.com/lh3/miniasm) (might need to use multiple assemblers and test which works best) 13 | * Do a first round of assembly polishing with [nanopolish](https://github.com/jts/nanopolish), using only nanopore data 14 | * Do a second round of assembly polishing with [Racon](https://github.com/isovic/racon) or [Pilon](https://github.com/broadinstitute/pilon/wiki), using illumina data this time 15 | * Use [BUSCO](https://busco.ezlab.org/) for assessment 16 | * If your Nanopore data was generated using tech from before 2020 using older flow cells, you may have to re run the basecalling using something like [Guppy](https://esr-nz.github.io/gpu_basecalling_testing/gpu_benchmarking.html) (speedy if you can use GPUs) 17 | 18 | ## Genome Annotation tools 19 | 20 | * MAKER 21 | * Braker 22 | * GeneMark 23 | * antiSMASH 24 | * PGAP (prokaryotes) 25 | 26 | ## Assembly assessment 27 | 28 | * [BUSCO](https://busco.ezlab.org/) 29 | -------------------------------------------------------------------------------- /misc/Core_members_September_2019.key: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/misc/Core_members_September_2019.key -------------------------------------------------------------------------------- /misc/FAQs.md: -------------------------------------------------------------------------------- 1 | # Snippets for Frequently Asked Questions from clients, when they go over the reports. 2 | ``` 3 | Feel free to add the FAQs you have received. 4 | ``` 5 | 6 | ## General 7 | ### Functional Analysis 8 | 1. What is geneRatio and bgRatio in overerpresentation analysis? 9 | 10 | - The geneRatio is the {# of annotated genes assigned to term from input}/{# of input genes annotated} 11 | 12 | - The bgRatio is the {# of annotated genes assigned to term from background}/{# of background genes annotated} 13 | 14 | Please note that the denominator may be different between MF, BP, and CC as there are different number of genes annotated for those categories. 15 | 16 | - The input is a list of candidate genes (i.e. list of significant DEGs) while the background is a list of all the genes in the study. 17 | 18 | 2. How is the p-value calculated? 19 | 20 | Simplistic link: https://www.pathwaycommons.org/guide/primers/statistics/fishers_exact_test/ 21 | 22 | A bit more mathematical link: http://www.nonlinear.com/progenesis/qi/v2.0/faq/should-i-use-enrichment-or-over-representation-analysis-for-pathways-data.aspx 23 | 24 | Good video link: https://www.coursera.org/lecture/bd2k-lincs/enrichment-analysis-part-1-xLgN5 25 | 26 | ### Multiple testing correction 27 | What is q-value and why do we need this? 28 | 29 | Here is a pretty good slide about the need for multiple testing and how FDR is calculated : 30 | https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf 31 | 32 | ### t-test and Wilcoxon {credits to [Preeti](https://github.com/orgs/hbc/people/preetida)} 33 | Blog by Jonathan Bartlett, super helpful for many stat related questions. 
34 | 35 | https://thestatsgeek.com/2014/04/12/is-the-wilcoxon-mann-whitney-test-a-good-non-parametric-alternative-to-the-t-test/ 36 | 37 | ### UpSetR plots 38 | To visualize the overlaps, we use the UpSetR package in R to draw bar plots that demonstrate the overlap, instead of Venn diagrams. The bar plots drawn by this package, and their associated annotations, are a cleaner way to demonstrate/observe overlaps. Here is brief guide to reading the UpSetR overlap plots: 39 | 40 | *These plots are relatively intuitive for 2 or 3 categories, but can tend to get more complex for >3 categories. In all cases, you will find the categories being compared and their size listed below the bar plots on the left. As you look to the right (directly below each bar) there are dots with connecting lines that denote which categories the overlap is between, or if there is no overlap (just a dot). The numbers at the top of the bars denote the size of the overlap.* 41 | 42 | ### PCA 43 | For understanding PCA: 44 | Our lesson - https://hbctraining.github.io/scRNA-seq_online/lessons/05_normalization_and_PCA.html#principal-component-analysis-pca 45 | A youtube video - https://www.youtube.com/watch?v=_UVHneBUBW0 46 | 47 | 48 | ## ChIP-seq and ATAC-seq 49 | 50 | ## BULK RNA-seq 51 | 52 | ## scRNA-seq 53 | 54 | -------------------------------------------------------------------------------- /misc/GEO_submissions.md: -------------------------------------------------------------------------------- 1 | Main guide is here: 2 | https://www.ncbi.nlm.nih.gov/geo/info/submission.html 3 | 4 | Highh throughput sequencing is here: 5 | https://www.ncbi.nlm.nih.gov/geo/info/seq.html 6 | 7 | Adding the RC guide (Joon): 8 | https://wiki.rc.hms.harvard.edu/display/O2/Submitting+data+to+GEO 9 | 10 | # The preparation 11 | 12 | ## Analyst responsibilities 13 | You will need 14 | 1) [GEO metadata sheet](https://www.ncbi.nlm.nih.gov/geo/info/seq.html) 15 | 2) Raw fastq files 16 | 3) Derived files for data 17 | a) RNAseq 18 | - raw counts table (as tsv/csv), can put as supplementary file 19 | - TPM (as tsv/csv), can put as supplementary file 20 | - bams are OK too but I have never been asked for them 21 | 22 | 4) details on the analysis for the GEO metadata sheet, including 23 | - which sequencer was used 24 | - paired or single end reads? 25 | - insert size if paired 26 | - programs and versions used in the analysis, including the bcbio and R portions 27 | 28 | Example metadata sheets can be found in this Dropbox folder: 29 | https://www.dropbox.com/sh/88035zd8h9qhvzh/AACmHB7xsXhdgrSyZY42uwLYa?dl=0 30 | 31 | For all of the raw and derived data files, you will need to run md5 checksums. 32 | 33 | ## Researcher responsibilities 34 | The client wil need to give you the details about things that were involved in the experiment and library preparation. 35 | These include 36 | - growth protocol 37 | - treatment protocol 38 | - extract protocol 39 | - library construction protocol 40 | They will also need to supply the general info about the experiment including: 41 | - title 42 | - summary 43 | - overall design 44 | - who they want to be a contributor 45 | 46 | I usually fill out what I can and then send them the metadata sheet with their areas to fill out highlighted. 47 | 48 | # The upload 49 | Once you have the data, derived data and metadata sheet, its time to upload to the GEO FTP server. 
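If you still need the md5 checksums mentioned above, one pass over the submission folder is usually enough. A minimal sketch, assuming all of the raw and derived files sit in a single upload directory (adjust the globs to your file names):

```bash
# run from inside the folder you are about to upload; record the values in the GEO metadata sheet
md5sum *.fastq.gz *.csv *.tsv > md5sums.txt
```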
50 | Sign into your NCBI and GEO account and go to the [Transfer Files](https://www.ncbi.nlm.nih.gov/geo/info/submissionftp.html) link on the GEO submission page. 51 | 52 | There they will tell you what your directory is on the GEO FTP server (for example, uploads/jnhutchinson_AtsZaoGM) as well as the server address (e.g. ftp-private.ncbi.nlm.nih.gov), login (geoftp) and password (rebUzyi1). 53 | 54 | Go to your upload directory on O2 with the GEO submission files and log in to the ftp server using `lftp geoftp:rebUzyi1@ftp-private.ncbi.nlm.nih.gov`. Note that lftp is not available on login or interactive nodes, so you will need to ssh to the O2 transfer node (`ssh user@transfer.rc.hms.harvard.edu`) to use it. *You can also use Filezilla if your files are on your local machine.* Then move to your remote upload directory (cd /uploads/jnhutchinson_AtsZaoGM, *for Filezilla, you should enter this into the Remote site: directory box*) and start your upload. For lftp, you can use 55 | ```mirror -R``` or ```mput *``` to upload the files. For Filezilla, just drag the files over to the remote directory. Then sit back and maybe work on something else, or, like, take a break from the bioinformatics mine while everything uploads. If you have a ton of files, you may want to use something like tmux to prevent your session from being terminated. 56 | 57 | 58 | When the upload is complete, notify GEO of the submission using the cleverly named [Notify GEO](https://submit.ncbi.nlm.nih.gov/geo/submission/) link. 59 | 60 | You will receive an email confirming your upload and GEO staff will contact you if there are any issues. Common issues to watch out for are: 61 | 1) column headings in derived data not matching fastq sample names 62 | 2) missing gene ids in derived data 63 | 64 | Less commonly they may ask you to fix an insufficiently descriptive summary or overall design. 65 | 66 | # The aftermath 67 | 68 | Note that unless you specifically set things up otherwise, the submission will be tied to your name and you will have to be responsible for updates and releases (i.e. you will be the "Investigator"). You can deal with this in one of two ways: 69 | 1) set things up from your initial login to have you as the submitter and the researcher as the Investigator. I personally find this inconvenient as I may be doing multiple GEO submissions for different researchers, but YMMV. 70 | 2) do the submission yourself as the Investigator and submitter and, once the submission is accepted, email GEO to have the submission transferred to the researcher. 71 | Note that both of these methods will require the researcher to obtain an NCBI account (if they don't already have one) and share the login id and email address with you. 
72 | -------------------------------------------------------------------------------- /misc/OSX.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: DS_Store tips 3 | description: Don't leave .DS_Store files on network volumes 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [osx] 7 | --- 8 | 9 | # Don't leave .DS_Store files on network volumes 10 | defaults write com.apple.desktopservices DSDontWriteNetworkStores true 11 | -------------------------------------------------------------------------------- /misc/Reform_python.md: -------------------------------------------------------------------------------- 1 | # Using Reform to create custom genome 2 | > https://gencore.bio.nyu.edu/reform/ 3 | 4 | - You can find good example with cellRanger in this link. 5 | 6 | 7 | ## Usage guide 8 | ### Conda env. and install 9 | ``` 10 | module load gcc/6.2.0 conda2/4.2.13 11 | conda activate python_3.6.5 # you need an environment with python3 activated. (this one is custom for me) 12 | pip3 install biopython 13 | git clone https://github.com/gencorefacility/reform.git 14 | cd reform/ 15 | ``` 16 | 17 | ### The command code and the files we need. 18 | ``` 19 | --chrom= \ 20 | --position= \ 21 | --in_fasta= \ 22 | --in_gff= \ 23 | --ref_fasta= \ 24 | --ref_gff= 25 | ``` 26 | - **chrom** ID of the chromsome to modify 27 | 28 | - **position** Position in chromosome at which to insert . Can use -1 to add to end of chromosome. Note: Either position, or upstream AND downstream sequence must be provided. 29 | 30 | - **upstream_fasta** Path to Fasta file with upstream sequence. Note: Either *position*, or *upstream AND downstream* sequence must be provided. 31 | 32 | - **downstream_fasta** Path to Fasta file with downstream sequence. Note: Either *position*, or *upstream AND downstream* sequence must be provided. 33 | 34 | - **in_fasta** Path to new sequence to be inserted into reference genome in fasta format. 35 | 36 | - **in_gff** Path to GFF file describing new fasta sequence to be inserted. 37 | 38 | - **ref_fasta** Path to reference fasta file. 39 | 40 | - **ref_gff** Path to reference gff file. 41 | 42 | Example: 43 | ```ruby 44 | python3 reform.py 45 | --chrom=X \ 46 | --position=3 \ 47 | --in_fasta=in.fa \ 48 | --in_gff=in.gff3 \ 49 | --ref_fasta=ref.fa \ 50 | --ref_gff=ref.gff3 51 | ``` 52 | We will put the in.fa sequence in the X chromosome position 3. 53 | 54 | Sequence is 10bp, so we expect a new transcript at X:4-13. 55 | 56 | > in.fa 57 | ``` 58 | >input_sequence 59 | TGGAGGATCG 60 | ``` 61 | 62 | > ref.gff 63 | 64 | 65 | > in.gff 66 | 67 | 68 | > reformed.gff 69 | 70 | -------------------------------------------------------------------------------- /misc/aws.md: -------------------------------------------------------------------------------- 1 | Can be run from ` /n/app/bcbio/dev/anaconda/bin/aws` or from `/usr/bin/aws` on the O2 transfer nodes. 2 | 3 | ## Setup S3 bucket 4 | Setup your S3 bucket with: 5 | `foo/aws configure` 6 | You will need your 7 | - AWS Access Key ID 8 | - AWS Secret Access Key 9 | - Default region name 10 | - Default output format 11 | 12 | 13 | ## Interact with AWS bucket 14 | 15 | - no dirs in AWS, those strings are just prefixes 16 | - use `--dryrun` to test a command 17 | 18 | |Problem | Unix rationale | AWS spell | 19 | |---------------|------------------------|-------------------------------------------------------------------------------| 20 | |copy files |`cp * destination_dir` |`aws s3 sync . 
s3://bucket/dir/` | 21 | |get file sizes |`ls -lh` |`aws s3 ls --human-readable s3://bucket/dir/` | 22 | |copy bam files |`cp */*.bam /target_dir`|options order matters!
`aws s3 sync s3://bucket/dir/ . --exclude "*" --include "*.bam"`| 23 | -------------------------------------------------------------------------------- /misc/core_resources.md: -------------------------------------------------------------------------------- 1 | # Sequencing cores 2 | |Name|Affiliation|Website|Contact(s)|Services/Capabilities| 3 | |---|---|---|---|---| 4 | |Biopolymers Facility|HMS|Bob Steen|https://genome.med.harvard.edu/|NGS| 5 | |Molecular Biology Core Facility|DFCI,CFAR|Zach Herbert|http://mbcf.dfci.harvard.edu/|NGS| 6 | |Bauer Core Facility|FAS| |https://bauercore.fas.harvard.edu/|NGS| 7 | 8 | Partners core 9 | 10 | # Single Cell Encapsulation cores 11 | |Name|Affiliation|Website|Contact(s)|Services/Capabilities| 12 | |---|---|---|---|---| 13 | |Single Cell Core|HMS|Sarah Boswell|https://singlecellcore.hms.harvard.edu/|10X,InDrops| 14 | |Bauer Core Facility|FAS| |https://bauercore.fas.harvard.edu/|10X,| 15 | 16 | BWH single cell core 17 | 18 | # Analytical Cores 19 | |Name|Affiliation|Website|Contact(s)|Services/Capabilities| 20 | |---|---|---|---|---| 21 | |Joslin Diabeters Center Biostatistics and Bioinformatics Cores|Joslin|https://joslinresearch.org/drc-cores/bioinformatics-and-biostatistics-core|Jon Dreyfuss|NGS,proteomics,metabolomics,microarray| 22 | 23 | BWH single cell core 24 | 25 | # Sorting cores (FACS, CyTOF) 26 | CyTOF core at Dana 27 | 28 | Keith Reeves' FACS core at Dana 29 | 30 | 31 | 32 | 33 | 34 | -------------------------------------------------------------------------------- /misc/general_ngs.md: -------------------------------------------------------------------------------- 1 | # 3' DGE from LSP demultiplexing example 2 | ``` 3 | bcl2fastq --adapter-stringency 0.9 --barcode-mismatches 0 --fastq-compression-level 4 --min-log-level INFO --minimum-trimmed-read-length 0 --sample-sheet /n/boslfs/INSTRUMENTS/illumina/180604_NB501677_0276_AHVMT2BGX5/SampleSheet.csv --runfolder-dir /n/boslfs/INSTRUMENTS/illumina/180604_NB501677_0276_AHVMT2BGX5 --output-dir /n/boslfs/ANALYSIS/180604_NB501677_0276_AHVMT2BGX5 --processing-threads 8 --no-lane-splitting --mask-short-adapter-reads 0 --use-bases-mask y*,y*,y*,y* 4 | ``` 5 | 6 | # Illumina instrument by FASTQ read name 7 | - @HWI-Mxxxx or @Mxxxx - MiSeq 8 | - @Kxxxx - HiSeq 3000(?)/4000 9 | - @Nxxxx - NextSeq 500/550 10 | - @Axxxxx - NovaSeq 11 | - @HWI-Dxxxx - HiSeq 2000/2500 12 | - AAXX, @HWUSI - GAIIx 13 | - BCXX = HiSeq v1.5 14 | - ACXX = HiSeq High-Output v3 15 | - ANXX = HiSeq High-Output v4 16 | - ADXX = HiSeq RR v1 17 | - AMXX, BCXX =HiSeq RR v2 18 | - ALXX = HiSeqX 19 | - BGXX, AGXX = High-Output NextSeq 20 | - AFXX = Mid-Output NextSeq 21 | 22 | # Illumina BaseSpace CLI 23 | https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-overview 24 | 25 | # Miscellaneous 26 | 27 | **Add text "chr" to #CHROM column of VCF** 28 | ``` 29 | $ bcftools annotate --rename-chrs sample.vcf.gz 30 | ``` 31 | map file should contain "`old_name new_name`" pairs separated by whitespaces, each on a separate line 32 | -------------------------------------------------------------------------------- /misc/git.md: -------------------------------------------------------------------------------- 1 | # Git tips 2 | 3 | - [Pro git book](https://git-scm.com/book/en/v2) 4 | - https://github.com/Kunena/Kunena-Forum/wiki/Create-a-new-branch-with-git-and-manage-branches 5 | - https://nvie.com/posts/a-successful-git-branching-model/ 6 | - http://sandofsky.com/blog/git-workflow.html 7 | - 
https://blog.izs.me/2012/12/git-rebase 8 | - https://benmarshall.me/git-rebase/ 9 | - [find big files in history](https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history) 10 | - [remove a big file from history](https://www.czettner.com/2015/07/16/deleting-big-files-from-git-history.html) 11 | - [git-tips](https://github.com/git-tips/tips) 12 | 13 | 14 | # merge master branch into (empty) main and delete master 15 | ``` 16 | module load git 17 | git fetch origin main 18 | git branch -a 19 | git checkout main 20 | git merge master --allow-unrelated-histories 21 | git add -A . 22 | git commit 23 | git push 24 | 25 | git branch -d master 26 | git push origin :master 27 | ``` 28 | 29 | # Add remote upstream 30 | ```bash 31 | git remote -v 32 | git remote add upstream https://github.com/ORIGINAL_OWNER/ORIGINAL_REPOSITORY.git 33 | ``` 34 | 35 | # Create a tag in the upstream 36 | ```bash 37 | git fetch upstream 38 | git checkout master 39 | git reset --hard upstream/master 40 | git tag -a -m "project tag (date)" vx.y.z 41 | git push upstream vx.y.z 42 | git push origin vx.y.z 43 | ``` 44 | 45 | # Sync with upstream/main, delete all commits in origin/main 46 | ``` 47 | git fetch upstream 48 | git checkout main 49 | git reset --hard upstream/main 50 | git push --force 51 | ``` 52 | 53 | # Sync with upstream/main 54 | ``` 55 | git fetch upstream 56 | git checkout main 57 | git merge upstream/main 58 | ``` 59 | # big feature workflow - rebase - squash 60 | ``` 61 | # sync master with upstream first 62 | # create new branch and switch to it 63 | git checkout -b feature1 64 | # create many commits with meaningful messages 65 | git add -A . 66 | git commit 67 | # upstream accumulated some commits 68 | git fetch upstream 69 | # rebasing the branch not the master 70 | # to PR from the branch later not from the master 71 | # automatic rebase - replay all commits on top of master 72 | git rebase upstream/master 73 | 74 | # alternative - interactive rebase 75 | # 1. see latest commits from HEAD down to the start of feature1 76 | # on top of upstream 77 | # git log --oneline --decorate --all --graph 78 | # 2. interactive rebase for the last 13 commits (including head) 79 | # git rebase -i HEAD~13 80 | # set s (squash) in the interactive editor for all commits except for the top one 81 | # alter commit message 82 | 83 | # force push since origin still has the 13 separate commits 84 | git push --force --set-upstream origin feature1 85 | # PR from feature1 branch to upstream/master 86 | ``` 87 | 88 | # 2 Feature workflow 89 | ``` 90 | git checkout -b feature1 91 | git add -A . 92 | git commit 93 | git push --set-upstream origin feature1 94 | # pull request 1 95 | git checkout master 96 | git checkout -b feature2 97 | git add -A . 98 | git commit 99 | git push --set-upstream origin feature2 100 | # pull request 2 101 | ``` 102 | 103 | # Feature workflow w squash 104 | ``` 105 | git checkout -b feature_branch 106 | # 1 .. N 107 | git add -A . 
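# optional: run git status --short here to see exactly what is staged before the commit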
108 | git commit -m "sync" 109 | 110 | git checkout master 111 | git merge --squash private_feature_branch 112 | git commit -v 113 | git push 114 | # pull request to upstream 115 | # code review 116 | # request merged 117 | git branch -d feature_branch 118 | git push origin :feature_branch 119 | ``` 120 | 121 | # get commits from maintainers in a pull request and push back 122 | ``` 123 | git fetch upstream pull/[PR_Number]/head:new_branch 124 | git checkout new_branch 125 | git add 126 | git commit 127 | git push --set-upstream origin new_branch 128 | ``` 129 | 130 | # ~/.ssh/config 131 | ``` 132 | Host github.com 133 | HostName github.com 134 | PreferredAuthentications publickey 135 | IdentityFIle ~/.ssh/id_rsa_git 136 | User git 137 | ``` 138 | 139 | # Migrating github.com repos to [code.harvard.edu](https://code.harvard.edu/) 140 | 141 | See [this page](https://gist.github.com/niksumeiko/8972566) for good general guidance 142 | 143 | 1. Set up your ssh keys. You can use your old keys (if you remember your passphrase) by going to `Settings --> SSH and GPG keys --> New SSH key` 144 | 2. Create your repo in code.harvard.edu. Copy the 'Clone with SSH link`: `git@code.harvard.edu:HSPH/repo_name.git` (*NOTE: some of us have had trouble with the HTTPS link*) 145 | 3. Go to your local repo that you would like to migrate. Enter the directory. 146 | 147 | ``` 148 | # this will add a second remote location 149 | git remote add harvard git@code.harvard.edu:HSPH/repo_name.git 150 | 151 | # this will get rid of the old origin remote 152 | git push -u harvard --all 153 | ``` 154 | 155 | 4. You should see the contents of your local repo in Enterprise. Now go to 'Settings' for the repo and 'Collaborators and Teams'. Here you will need to add Bioinformatics Core and give 'Admin' priveleges. 156 | 157 | 158 | > **NOTE:** If you decide to compile all your old repos into one giant repo (i.e. [hbc_mistrm_reports_legacy](https://code.harvard.edu/HSPH/hbc_mistrm_reports_legacy)), make sure that you remove all `.git` folders from each of them before committing. Otherwise you will not be able to see the contents on each folder on Enterprise. 159 | 160 | # Remove sensitive information from the file and from the history 161 | ``` 162 | Make a backup 163 | # cd ~/backup 164 | # git clone git@github.com:hbc/knowledgebase.git 165 | cd ~/work 166 | git clone git@github.com:hbc/knowledgebase.git 167 | git filter-branch --tree-filter 'rm -f admin/download_data.md' HEAD 168 | git push --force-with-lease origin master 169 | # commit saved copy of download_data.md without secrets 170 | ``` 171 | -------------------------------------------------------------------------------- /misc/miRNA.md: -------------------------------------------------------------------------------- 1 | * https://github.com/lpantano/bcbioSmallRna 2 | -------------------------------------------------------------------------------- /misc/mounting_o2_mac.md: -------------------------------------------------------------------------------- 1 | ## For OSX 2 | 3 | To have O2 accessible on your laptop/desktop as a folder, you need to use something called [`sshfs`](https://en.wikipedia.org/wiki/SSHFS) (ssh filesystem). This is a command that is not native to OSXand you need to go through several steps in order to get it. Once you have `sshfs`, then you need to set up ssh keys to connect O2 to your laptop without having to type in a password. 4 | 5 | ### 1. 
Installing sshfs on OSX 6 | 7 | Download macFUSE from [https://github.com/osxfuse/osxfuse/releases](https://github.com/osxfuse/osxfuse/releases/download/macfuse-4.6.0/macfuse-4.6.0.dmg), and install it. 8 | 9 | NOTE: In order to install macFUSE, you may need to first enable system extensions, following [this guideline from Apple](https://support.apple.com/guide/mac-help/change-security-settings-startup-disk-a-mac-mchl768f7291/mac), which will require restarting your computer. 10 | 11 | Download sshfs from [https://github.com/osxfuse/sshfs/releases](https://github.com/osxfuse/sshfs/releases/download/osxfuse-sshfs-2.5.0/sshfs-2.5.0.pkg), and install it. 12 | 13 | > #### Use this only if the above option fails! 14 | > 15 | > Step 1. Install [Xcode](https://developer.apple.com/xcode/) 16 | > ```bash 17 | > $ xcode-select --install 18 | > ``` 19 | > 20 | > Step 2. Install Homebrew using ruby (from Xcode) 21 | > ```bash 22 | > $ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" 23 | > 24 | > # Uninstall Homebrew 25 | > # /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/uninstall)" 26 | > ``` 27 | > 28 | > Step 2.1. Check to make sure that Homebrew is working properly 29 | > ```bash 30 | > $ brew doctor 31 | > ``` 32 | > 33 | > Step 3. Install Cask from Homebrew's caskroom 34 | > ```bash 35 | > $ brew tap caskroom/cask 36 | > ``` 37 | > 38 | > Step 4. Install OSXfuse using Cask 39 | > ```bash 40 | > $ brew cask install osxfuse 41 | > ``` 42 | > 43 | > Step 5. Install sshfs from fuse 44 | > ```bash 45 | > $ brew install sshfs 46 | > ``` 47 | 48 | ### 2. Set up "ssh keys" 49 | 50 | Once `sshfs` is installed, the next step is to connect O2 (or a remote server) to our laptops. To make this process seamless, first set up ssh keys which can be used to connect to the server without having to type in a password every time. 51 | 52 | Log into O2 and use `vim` to open `~/.ssh/authorized_keys` and paste the code below copied from your computer to this file and save it. NOTE: make sure to replace `ecommonsID` with your actual username! 53 | 54 | ```bash 55 | # set up ssh keys 56 | $ ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -C "ecommonsID" 57 | $ ssh-add -K ~/.ssh/id_rsa 58 | ``` 59 | 60 | Arguments for `ssh-keygen`: 61 | * `-t` = Specifies the type of key to create. The possible values are "rsa1" for protocol version 1 and "rsa" or "dsa" for protocol version 2. *We want rsa.* 62 | * `-b` = Specifies the number of bits in the key to create. For RSA keys, the minimum size is 768 bits and the default is 2048 bits. *We want 4096* 63 | * `-f` = name of output "keyfile" 64 | * `-C` = Provides a new comment 65 | 66 | Arguments for `ssh-add`: 67 | * `-K` = Store passphrases in your keychain 68 | 69 | ```bash 70 | # copy the contents of `id_rsa.pub` to ~/.ssh/authorized_keys on O2 71 | $ cat ~/.ssh/id_rsa.pub | pbcopy 72 | ``` 73 | 74 | > `pbcopy` puts the output of `cat` into the clipboard (in other words, it is equivalent to copying with ctrl + c) so you can just paste it as usual with ctrl + v. 75 | 76 | ### 3. Mount O2 using sshfs 77 | 78 | Now, let's set up for running `sshfs` on our laptops (local machines), by creating a folder with an intuitive name for your home directory on the cluster to be mounted in. 79 | 80 | ```bash 81 | $ mkdir ~/O2_mount 82 | ``` 83 | 84 | Finally, let's run the `sshfs` command to have O2 mount as a folder in the above space. Again, replace `ecommonsID` with your username. 
85 | ```bash 86 | $ sshfs ecommonsID@transfer.rc.hms.harvard.edu:. ~/O2_mount -o volname="O2" -o compression=no -o Cipher=arcfour -o follow_symlinks 87 | ``` 88 | 89 | Now we can browse through our home directory on O2 as though it was a folder on our laptop. 90 | 91 | > If you want to access your lab's directory in `/groups/` or your directory in `/n/scratch2`, you will need to create sym links to those in your home directory and you will be able to access those as well. 92 | 93 | Once you are finished using O2 in its mounted form, you can cancel the connection using `umount` and the name of the folder. 94 | 95 | ```bash 96 | $ umount ~/O2_mount 97 | ``` 98 | 99 | ### 4. Set up alias (optional) 100 | 101 | It is optional to set shorter commands using `alias` for establishing and canceling `sshfs` connection. Use `vim` to create or open `~/.bashrc` and paste the following `alias` commands and save it. 102 | 103 | ```bash 104 | $ alias mounto2='sshfs ecommonsID@transfer.rc.hms.harvard.edu:. ~/O2_mount -o volname="O2" -o follow_symlinks' 105 | $ alias umounto2='umount ~/O2_mount' 106 | ``` 107 | 108 | > If your default shell is `zsh` instead of `bash`, use `vim` to create or open `~/.zshrc` and paste the `alias` commands. 109 | 110 | Update changes in `.bashrc` 111 | 112 | ```bash 113 | $ source .bashrc 114 | ``` 115 | Now we can type `mounto2` and `umounto2` to mount and unmount O2. 116 | 117 | *** 118 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.* 119 | -------------------------------------------------------------------------------- /misc/mtDNA_variants.md: -------------------------------------------------------------------------------- 1 | # SNV and indels 2 | - when starting from WGS or WES, subset MT chromosome 3 | - estimate coverage, callable >=100X: https://github.com/naumenko-sa/bioscripts/blob/master/scripts/bam.coverage.bamstats05.sh 4 | - use template for bcbio: https://github.com/bcbio/bcbio-nextgen/pull/3059 5 | 6 | # Large deletions 7 | - MitoDel: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5657046/ 8 | - eKLIPse: https://www.ncbi.nlm.nih.gov/pubmed/30393377 9 | 10 | # Databases 11 | - https://www.mitomap.org/foswiki/bin/view/MITOMAP/WebHome 12 | - https://www.mitomap.org/foswiki/bin/view/MITOMAP/TopVariants 13 | - mvTool V2: https://mseqdr.org/mv.php 14 | - MSeqDR, ClinVar, ICGC, COSMIC 15 | 16 | -------------------------------------------------------------------------------- /misc/multiomics_factor_analysis.md: -------------------------------------------------------------------------------- 1 | # Uploading a program called MOFA2 2 | ## It is used for finding factors from multiomics datasets. 3 | 4 | The author presented an example usage in scRNAseq & scATACseq as well as other datasets (e.g. bulkRNAseq with proteomics). 5 | 6 | I think it will be nice to look over. 7 | 8 | https://biofam.github.io/MOFA2/ 9 | 10 | Nice Review article of integrating multiomics, by one of the presenters. 
11 | 12 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7034308/ 13 | 14 | iOMICSPass - co-expression based data integration using network 15 | 16 | https://www.nature.com/articles/s41540-019-0099-y 17 | -------------------------------------------------------------------------------- /misc/new_to_remote_github_CLI_start_here.md: -------------------------------------------------------------------------------- 1 | # Putting local git repos into the HBC Github organization remotely via command line interface 2 | 3 | Heather Wick 4 | 5 | Has your experience with github primarily been through the browser? This document has the basics to begin turning your working directories into github repositories which can be pushed to the HBC location remotely via command line. 6 | 7 | ### Wait, back up, what do you mean by push? 8 | 9 | Confused by push, pull, branch, main, commit? If you're not sure, it's worthwhile to familiarize yourself with the basics of git/github. There are some great resources and tutorials to learn from out there. Here's an interactive one (I could only get it to work in safari, not chrome): 10 | https://learngitbranching.js.org 11 | 12 | This won't teach you how to put your things into the HBC Github organization though. 13 | 14 | ## Set up/configuration 15 | 16 | You only need to do these once! 17 | 18 | ### 1. Configure git locally to link to your github account 19 | Open up a terminal and type 20 | 21 | ```bash 22 | git config --global user.email EMAIL_YOU_USE_TO_SIGN_IN_TO_GITHUB 23 | git config --global user.name YOUR_GITHUB_USERNAME 24 | ``` 25 | 26 | ### 2. Make personal access token 27 | Configuring your local git isn't enough, as github is moving away from passwords. You will need to make a personal access token through your github account. Follow the instructions here: 28 | https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens 29 | 30 | **Copy this personal access token and save it somewhere or keep the window open for now. You will be prompted to enter your personal access token the first time you type `push -u origin main`. You will not be able to access this token again!** 31 | 32 | ## Creating your git repo 33 | 34 | ### 1. Initialize git repo on the HBC Github via web browser 35 | 36 | I have yet to find a way to do this remotely via CLI, but as far as I can tell this step is a necessary pre-requisite to pushing a local repo to the HBC github. Will update to add CLI if possible. 37 | 38 | Go to https://github.com/hbctraining and click the green "New Repository" button. Initialize a new, empty repository. 39 | 40 | Once you do this, there will be some basic code you can copy under `Quick Setup`, including the `https` location of your repo which can be used below 41 | 42 | ### 2. 
Create a local git repo and push it to the HBC Github via CLI 43 | 44 | In your terminal, navigate to the folder you would like to turn into a github repo and type the following: 45 | 46 | ```bash 47 | echo "# text to add to readme" >> README.md 48 | git init 49 | git add README.md 50 | git commit -m "first commit" 51 | git branch -M main 52 | git remote add origin https://github.com/hbctraining/NAME_OF_REPOT.git 53 | git push -u origin main 54 | ``` 55 | **You will be prompted to enter your personal access token the first time you type `push -u origin main`.** 56 | 57 | ## Useful tips/tricks 58 | 59 | If you are doing this in a directory with folders/data/files you don't necessarily want to put on github, you will want to pick/choose what you upload. Here are some tips and notes: 60 | 61 | ### Add all, but exclude some 62 | 63 | **note: will not exclude if already pushed to HBC repo! Just untracks them!** 64 | 65 | The best time to implement this is when you are making your first upload 66 | 67 | ```bash 68 | git add . 69 | git reset -- path/to/thing/to/exclude 70 | git reset -- path/to/more/things/to/exclude 71 | git commit -m "NAME_OF_COMMIT" 72 | git push -u origin main 73 | ``` 74 | 75 | ### Add specific files/folders: 76 | 77 | ```bash 78 | git add path/to/files* 79 | git commit -m "NAME_OF_COMMIT" 80 | git push -u origin main 81 | ``` 82 | 83 | ### Add all files/folder except this file/folder: 84 | 85 | ```bash 86 | git add -- . ':!THING_TO_EXCLUDE' 87 | git commit -m "NAME_OF_COMMIT" 88 | git push -u origin main 89 | ``` 90 | 91 | ### Remove a file you already pushed to Github 92 | 93 | You might be tempted to just do this in the browser, but be warned! It will break your local repo until you pull from the HBC location. This could be a problem if that was important data you want to continue to store locally. Fortunately, you can "delete" files on the HBC Github without deleting them locally. Here's how: 94 | 95 | ```bash 96 | git rm --cached NAME_OF_FILE 97 | ``` 98 | 99 | Or for a folder: 100 | ```bash 101 | git rm -r --cached NAME_OF_FILE 102 | ``` 103 | **Side effects of resorting to `git rm -r --cached REALLY_IMPORTANT_DATA_DIRECTORY` include anxiety, sweating, heart palpitations, and appeals to spiritual beings** 104 | 105 | ### Check what changes have been made to the current commit 106 | 107 | Very useful to see what will actually be added, removed, etc or if everything is up to date. 
108 | ```bash 109 | git status 110 | ``` 111 | 112 | ### .gitignore: coming soon 113 | 114 | -------------------------------------------------------------------------------- /misc/organized_papers.md: -------------------------------------------------------------------------------- 1 | # Human genome reference T2T - CHR13 - 2022 2 | - https://www.genome.gov/about-genomics/telomere-to-telomere 3 | - https://www.science.org/doi/pdf/10.1126/science.abj6987 4 | - https://www.science.org/doi/10.1126/science.abl3533 5 | - https://www.science.org/doi/epdf/10.1126/science.abl3533 6 | 7 | # GWAS 8 | - https://nature.com/articles/nrg1521 9 | - https://nature.com/articles/nrg1916 10 | - https://nature.com/articles/nrg2344 11 | - https://nature.com/articles/nrg2544 12 | - https://nature.com/articles/nrg2796 13 | - https://nature.com/articles/nrg2813 14 | - https://nature.com/articles/nrg.2016.142 15 | - https://nature.com/articles/s4157 16 | 17 | # bulk-RNA-seq 18 | - Systematic evaluation of splicing calling tools - 2019: https://academic.oup.com/bib/article/21/6/2052/5648232 19 | - [RPKM/TPM misuse](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7373998/) 20 | -------------------------------------------------------------------------------- /misc/orphan_improvements.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Improvements for the analysis 3 | description: List of things to try. 4 | category: research 5 | subcategory: orphans 6 | tags: [hbc] 7 | --- 8 | 9 | 10 | 1. Try out Alevin from Salmon for a more principled single-cell quantification (https://www.biorxiv.org/content/early/2018/06/01/335000) 11 | 2. Add retained intron analysis with IRFinder to bcbio-nextgen 12 | 3. See if adding support for grolar to convert pizzly output to something more parseable makes sense. It's an R script and hasn't really been worked on so might not be useable: https://github.com/MattBashton/grolar 13 | 4. Add automatic loading/QC of bcbioSingleCell data from bcbio 14 | 5. Convert bcbio-nextgen singlecell matrices to HDF5 format in bcbio 15 | 6. Swap bcbioSingleCell to read the already-combined matrices for speed purposes 16 | 7. Add bcbioRNASeq template to do DTU usage using DRIMseq (https://f1000research.com/articles/7-952/v1) 17 | 8. Update installed genomes to use newest Ensembl build for RNA-seq for bcbio-supported genomes. 18 | -------------------------------------------------------------------------------- /misc/power_calc_simulations.md: -------------------------------------------------------------------------------- 1 | Code and sample metadata and data for running simulation based power calculations can be found [here](https://github.com/hbc/power_calc_simulations). 2 | 3 | 4 | This code will use the mean and variance of the data set to derive simulated datasets with a defined number of values with defined fold changes. The simulated data will the be tested to determine precision and recall estimates for the comparison. 
5 | 6 | 7 | 8 | *original code written by Lorena Pantano and adapted by John Hutchinson* 9 | -------------------------------------------------------------------------------- /misc/snakemake-example-pipeline: -------------------------------------------------------------------------------- 1 | --- 2 | title: Example of snakemake pipeline 3 | description: An example of snakemake file to run a pipeline applied to a bunch of files 4 | category: research 5 | subcategory: general_ngs 6 | tags: [snakemake] 7 | --- 8 | 9 | 10 | This file shows how to run a pipeline with snakemake for a bunch of files defined in `SAMPLES` variables. 11 | 12 | It shows how to put together different steps and how they are related to each other. 13 | 14 | The tricky part is to have always `rule all` step and get all the output filenames you want to generate. If you miss 15 | some files that you want to generate and they are not the input in any other step then that step is not happening. 16 | 17 | ``` 18 | from os.path import join 19 | 20 | # Globals --------------------------------------------------------------------- 21 | 22 | # Full path to a FASTA file. 23 | GENOME_DIR = '../reference' 24 | 25 | # Full path to a folder that holds all of your FASTQ files. 26 | FASTQ_DIR = '../rawdata' 27 | 28 | # A Snakemake regular expression matching the forward mate FASTQ files. 29 | SAMPLES, = glob_wildcards(join(FASTQ_DIR, '{sample,[^/]+}_R1_001.fastq.gz')) 30 | 31 | # Patterns for the 1st mate and the 2nd mate using the 'sample' wildcard. 32 | PATTERN_R1 = '{sample}_R1_001.fastq.gz' 33 | PATTERN_R2 = '{sample}_R2_001.fastq.gz' 34 | PATTERN_GENOME = '{sample}.fa' 35 | 36 | 37 | rule all: 38 | input: 39 | index = expand(join(GENOME_DIR, '{sample}.fa.bwt'), sample = SAMPLES), 40 | vcf = expand(join('vcf', '{sample}.vcf'), sample = SAMPLES), 41 | vcfpileup = expand(join('pileup', '{sample}.vcf'), sample = SAMPLES), 42 | sam = expand(join('stats', '{sample}.txt'), sample = SAMPLES) 43 | 44 | rule index: 45 | input: 46 | join(GENOME_DIR, '{sample}.fa') 47 | output: 48 | join(GENOME_DIR, '{sample}.fa.bwt') 49 | shell: 50 | 'bwa index {input}' 51 | 52 | rule map: 53 | input: 54 | genome = join(GENOME_DIR, PATTERN_GENOME), 55 | index = join(GENOME_DIR, '{sample}.fa.bwt'), 56 | r1 = join(FASTQ_DIR, PATTERN_R1), 57 | r2 = join(FASTQ_DIR, PATTERN_R2) 58 | output: 59 | 'bam/{sample}.bam' 60 | shell: 61 | 'bwa mem -c 250 -M -t 6 -v 1 {input.genome} {input.r1} {input.r2} | samtools sort - > {output}' 62 | 63 | rule stats: 64 | input: 65 | bam = 'bam/{sample}.bam' 66 | output: 67 | 'stats/{sample}.txt' 68 | shell: 69 | 'samtools stats {input} > {output}' 70 | 71 | rule pileup: 72 | input: 73 | bam = 'bam/{sample}.bam', 74 | genome = join(GENOME_DIR, PATTERN_GENOME) 75 | output: 76 | 'pileup/{sample}.mp' 77 | shell: 78 | 'samtools mpileup -f {input.genome} -t DP -t AD -d 10000 -u -g {input.bam} > {output}' 79 | 80 | 81 | rule mpconvert: 82 | input: 83 | 'pileup/{sample}.mp', 84 | output: 85 | 'pileup/{sample}.vcf' 86 | shell: 87 | 'bcftools convert -O v {input} > {output}' 88 | 89 | 90 | rule bcf: 91 | input: 92 | 'pileup/{sample}.mp', 93 | output: 94 | 'vcf/{sample}.vcf' 95 | shell: 96 | 'bcftools call -v -m {input} > {output}' 97 | 98 | ``` 99 | -------------------------------------------------------------------------------- /python/conda.md: -------------------------------------------------------------------------------- 1 | ## Conda 2 | 3 | Every system has a Python installation, but you don't necessarily want to use that. Why not? 
That version is typically outdated and configured to support system functions. Most tools require specific versions of Python and dependencies, so you need more flexibility.

**Solution?**

Set up a full-stack scientific Python deployment **using a Python distribution** (Anaconda or Miniconda). It is an installation of Python with a set of curated packages which are guaranteed to work together.


## Setting up Python distribution on O2

You can install it in your home directory, though this is not needed as O2 has a miniconda module available.

By default, miniconda and conda envs are installed under user home space.

### Conda Environments
Environments allow you to create isolated, reproducible environments where you have fine-tuned control over the Python version, all packages and configuration. _This is always recommended over using the default environment._

To create an environment using Python 3.9 and the numpy package:

```bash
$ conda create --name my_environment python=3.9 numpy
```

Now that you have created it, you need to activate it. Once activated, all tool installations (done using `conda install`) are specific to that environment. It is a configured space where you can run analyses reproducibly.

```bash
$ conda activate my_environment
```

When you are done you can deactivate the environment or close it:

```bash
conda deactivate
```

The environments and associated libs are located in `~/miniconda3/envs`. Both miniconda and the created environments can occupy a lot of space and max out your home directory!

**Solution: Create the conda env in another space**

For conda envs, you can use the full path outside of home when creating the env:

```bash
module purge
module load miniconda3/23.1.0
conda create -p /path/to/somewhere/not/home/myEnv python=3.9 numpy
```

> **NOTE:** It's common that installing packages using Conda is slow or fails because Conda is unable to resolve dependencies. To get around this, we suggest the use of Mamba.

**Installing lots of dependency packages?**

You can do this easily by creating a yaml file, for example `environment.yaml` below was used to install PyTables:

```yaml
name: pytables
channels:
  - defaults
dependencies:
  - python=3.9*
  - numpy >= 1.19.0
  - zlib
  - cython >= 0.29.32
  - hdf5=1.14.0
  - numexpr >= 2.6.2
  - packaging
  - py-cpuinfo
  - python-blosc2 >= 2.3.0
```

Now to create the environment we reference the file in the command:

```bash
conda env create -f environment.yaml
```

### Channels

Where do conda packages come from? The packages are hosted on conda “channels”. From the conda pages:

_"Conda channels are the locations where packages are stored. They serve as the base for hosting and managing packages. Conda packages are downloaded from remote channels, which are URLs to directories containing conda packages. The conda command searches a set of channels."_

Using `-c` you can specify which channels you want conda to search for packages.
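For example (the package and channel choice here are just an illustration):

```bash
# look in bioconda first, then conda-forge, for samtools and its dependencies
conda install -c bioconda -c conda-forge samtools
```
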
84 | 85 | > Adapted from [An Introduction to Earth and Environmental Data Science](https://earth-env-data-science.github.io/lectures/environment/python_environments.html) 86 | -------------------------------------------------------------------------------- /r/.Rprofile: -------------------------------------------------------------------------------- 1 | version <- paste0(R.Version()$major,".",R.Version()$minor) 2 | if (version == "3.6.1") { 3 | .libPaths("~/R-3.6.1/library") 4 | }else if (version == "3.5.1") { 5 | .libPaths("/R-3.5.1/library") 6 | } 7 | 8 | #Add this to your home folder, and make modifications to the version numbers if you need to. 9 | #This will let you load the correct library folders for the different versions you have on O2. 10 | #R-3.6.1 and R-3.5.1 will load their corresponding library path for mine, but feel free to modify them to fit your needs. 11 | # This has been created as ChIPQC was problematic in R-3.6.1 and I had to load R-3.5.1 to generate a html report. (Joon) 12 | -------------------------------------------------------------------------------- /r/R-tips-and-tricks.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: R tips 3 | description: This code helps with regular data improving efficiency 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [R, visualization] 7 | --- 8 | 9 | # Import/Export of files 10 | Stop using write.csv, write.table and use the [rio](https://cran.r-project.org/web/packages/rio/index.html) library instead. All rio needs is the file extension to figure out what file type you're dealing with. Easy import and export to Excel files for clients. 11 | 12 | # Parsing in R using Tidyverse 13 | This is a link to a nice tutorial from Ista Zahn from IQSS using stringr and tidyverse for parsing files in R. It is from the Computefest 2017 workshop: 14 | http://tutorials-live.iq.harvard.edu:8000/user/zwD2ioESyGbS/notebooks/workshops/R/RProgramming/Rprogramming.ipynb 15 | 16 | # Better clean default ggplot 17 | install cowplot (https://cran.r-project.org/web/packages/cowplot/index.html) 18 | ```r 19 | library(cowplot) 20 | ``` 21 | 22 | # Nice looking log scales 23 | Example for x-axis 24 | ```r 25 | library(scales) 26 | p + scale_x_log10( 27 | breaks = scales::trans_breaks("log10", function(x) 10^x), 28 | labels = scales::trans_format("log10", scales::math_format(10^.x))) + 29 | annotation_logticks(sides='b') 30 | ``` 31 | 32 | # Read a bunch of files into one dataframe 33 | ```r 34 | library(tidyverse) 35 | read_files = function(files) { 36 | data_frame(filename = files) %>% 37 | mutate(contents = map(filename, ~ read_tsv(.))) %>% 38 | unnest() 39 | } 40 | ``` 41 | 42 | # remove a layer from a ggplot2 object with ggedit 43 | ``` 44 | plotGeneSaturation(bcb, interestingGroups=NULL) + 45 | ggrepel::geom_text_repel(aes(label=description, color=NULL)) 46 | p %>% 47 | ggedit::remove_geom('point', 1) + 48 | geom_point(aes(color=NULL)) 49 | ``` 50 | 51 | # [Link to information about count normalization methods](https://github.com/hbc/knowledgebase/wiki/Count-normalization-methods) 52 | The images currently break, but I will update when the course materials are in a more permanent state. 

# .Rprofile usefulness
```R
## don't ask for CRAN repository
options("repos" = c(CRAN = "http://cran.rstudio.com/"))
## for the love of god don't open up tcl/tk ever
options(menu.graphics=FALSE)
## set seed for reproducibility
set.seed(123456)
## don't print out more than 100 lines at once
options(max.print=100)
## helps with debugging Bioconductor/S4 code
options(showErrorCalls = TRUE, showWarnCalls = TRUE)

## Create a new invisible environment for all the functions to go in
## so it doesn't clutter your workspace.
.env <- new.env()

## ht==headtail, i.e., show the first and last 10 items of an object
.env$ht <- function(d, n=10) rbind(head(d, n), tail(d, n))

## copy from clipboard
.env$pbcopy = function(x) {
  capture.output(x, file=pipe("pbcopy"))
}

## update your local bcbioRNASeq and bcbioSingleCell installations
.env$update_bcbio = function(x) {
  devtools::install_github("steinbaugh/basejump")
  devtools::install_github("hbc/bcbioBase")
  devtools::install_github("hbc/bcbioRNASeq")
  devtools::install_github("hbc/bcbioSingleCell")
}

attach(.env)
```

# Make density plot without underline
```R
ggplot(colData(sce) %>%
         as.data.frame(), aes(log10GenesPerUMI)) +
  stat_density(geom="line") +
  facet_wrap(~period + intervention)
```

# Archive a file to Dropbox with a link to it
Run this inside an R Markdown chunk with `results='asis'`:
```R
dropbox_dir = "HSPH/eggan/hbc02067"
archive_data_with_link = function(data, filename, description, dropbox_dir) {
  readr::write_csv(data, filename)
  links = bcbioBase::copyToDropbox(filename, dropbox_dir)
  link = gsub("dl=0", "dl=1", links[[1]]$url)
  basejump::markdownLink(filename, link, paste0(" ", description))
}
archive_data_with_link(als, "dexseq-all.csv", "All DEXSeq results", dropbox_dir)
archive_data_with_link(als %>%
                         filter(padj < 0.1), "dexseq-sig.csv",
                       "All significant DEXSeq results", dropbox_dir)
```

# Novel operators from magrittr
The “%<>%” operator lets you pipe an object to a function and then back into the same object.
So:
`foo <- foo %>% bar()`
is the same as
`foo %<>% bar()`

# gghelp: Converts a natural language query into a 'ggplot2' command
This [package](https://rdrr.io/github/brandmaier/ggx/) allows users to issue natural language commands
related to theme-related styling of plots (colors, font size and such), which then are translated into
valid 'ggplot2' commands.
125 | 126 | ### Examples: 127 | ```R 128 | gghelp("rotate x-axis labels by 90 degrees") 129 | gghelp("increase font size on x-axis label") 130 | gghelp("set x-axis label to 'Length of Sepal'") 131 | ``` 132 | -------------------------------------------------------------------------------- /r/Shiny_images/Added_tabs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Added_tabs.png -------------------------------------------------------------------------------- /r/Shiny_images/Adding_panels.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Adding_panels.png -------------------------------------------------------------------------------- /r/Shiny_images/Adding_theme.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Adding_theme.png -------------------------------------------------------------------------------- /r/Shiny_images/Altered_action_button.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Altered_action_button.png -------------------------------------------------------------------------------- /r/Shiny_images/Check_boxes_with_action_button.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Check_boxes_with_action_button.png -------------------------------------------------------------------------------- /r/Shiny_images/R_Shiny_hello_world.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/R_Shiny_hello_world.gif -------------------------------------------------------------------------------- /r/Shiny_images/R_shiny_req_after.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/R_shiny_req_after.gif -------------------------------------------------------------------------------- /r/Shiny_images/R_shiny_req_initial.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/R_shiny_req_initial.gif -------------------------------------------------------------------------------- /r/Shiny_images/Return_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Return_table.png -------------------------------------------------------------------------------- /r/Shiny_images/Return_text_app_blank.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Return_text_app_blank.png 
-------------------------------------------------------------------------------- /r/Shiny_images/Return_text_app_hello.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Return_text_app_hello.png -------------------------------------------------------------------------------- /r/Shiny_images/Sample_size_hist_100.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Sample_size_hist_100.png -------------------------------------------------------------------------------- /r/Shiny_images/Sample_size_hist_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Sample_size_hist_5.png -------------------------------------------------------------------------------- /r/Shiny_images/Shiny_UI_server.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Shiny_UI_server.png -------------------------------------------------------------------------------- /r/Shiny_images/Shiny_process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Shiny_process.png -------------------------------------------------------------------------------- /r/Shiny_images/Squaring_number_app.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/Squaring_number_app.png -------------------------------------------------------------------------------- /r/Shiny_images/mtcars_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/r/Shiny_images/mtcars_table.png -------------------------------------------------------------------------------- /r/htmlwidgets: -------------------------------------------------------------------------------- 1 | (See http://gallery.htmlwidgets.org/ for more awesome widgets.) 2 | 3 | Using some basic R libraries, you can setup some interactive visualizations wihtout using Rshiny 4 | 5 | Here is some example code illustrating what I am thinking about, using the iris dataset from R 6 | 7 | `library(crosstalk)` 8 | `library(lineupjs)` 9 | `library(d3scatter)` 10 | 11 | `shared_iris = SharedData$new(iris)` 12 | `d3scatter(shared_iris, ~Petal.Length, ~Petal.Width, ~Species, width="100%")` 13 | `lineup(shared_iris, width="100%")` 14 | 15 | Similarly, the morpheus.js html widget makes for fantastic, interactive heatmaps. 
16 | `library(morpheus)` 17 | 18 | rowAnnotations <- data.frame(annotation1=1:32, annotation2=sample(LETTERS[1:3], nrow(mtcars), replace = TRUE))` 19 | `morpheus(mtcars, colorScheme=list(scalingMode="fixed", colors=heat.colors(3)), rowAnnotations=rowAnnotations, overrideRowDefaults=FALSE, rows=list(list(field='annotation2', highlightMatchingValues=TRUE, display=list('color'))))` 20 | -------------------------------------------------------------------------------- /rc/O2-tips.md: -------------------------------------------------------------------------------- 1 | # O2 tips 2 | 3 | ## Making conda not slow down your login 4 | If you have a complex base environment that gets loaded on login, you can end up having freezes of 30 seconds or more when 5 | logging into O2. It is ultra annoying. You can fix this by not running the `_conda_setup` script in your .bashrc, like this: 6 | 7 | ```bash 8 | # >>> conda initialize >>> 9 | # !! Contents within this block are managed by 'conda init' !! 10 | #__conda_setup="$('/home/rdk4/local/share/bcbio/anaconda/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" 11 | #if [ $? -eq 0 ]; then 12 | # eval "$__conda_setup" 13 | #else 14 | if [ -f "/home/rdk4/local/share/bcbio/anaconda/etc/profile.d/conda.sh" ]; then 15 | . "/home/rdk4/local/share/bcbio/anaconda/etc/profile.d/conda.sh" 16 | else 17 | export PATH="/home/rdk4/local/share/bcbio/anaconda/bin:$PATH" 18 | fi 19 | #fi 20 | #unset __conda_setup 21 | # <<< conda initialize <<< 22 | ``` 23 | 24 | ## Interactive function to request memory and hours 25 | 26 | Can be added to .bashrc (or if you don't want to clutter it, put it in .o2_aliases and then source it from .bashrc) 27 | 28 | Defaults: 4G mem, 8 hours. 29 | ``` 30 | function interactive() { 31 | mem=${1:-4} 32 | hours=${2:-8} 33 | 34 | srun --pty -p interactive --mem ${mem}G -t 0-${hours}:00 /bin/bash 35 | } 36 | ``` 37 | 38 | -------------------------------------------------------------------------------- /rc/O2_portal_errors.md: -------------------------------------------------------------------------------- 1 | # O2 Portal - R 2 | 3 | These are common errors found when running R on the O2 portal and ways to fix them. 4 | 5 | ## How to launch Rstudio 6 | 7 | - Besides your private R library, we have now platform shared R library, add this to *Shared R Personal Library* section 8 | : "/n/data1/cores/bcbio/R/library/4.3.1" (RNAseq and scRNAseq) 9 | - Minimum modules to load for R4.3.*: `cmake/3.22.2 gcc/9.2.0 R/4.3.1 ` 10 | - Minimum modules to load for 4.2.1 single cell analyses (some might be specific to trajectory analysis): 11 | `gcc/9.2.0 imageMagick/7.1.0 geos/3.10.2 cmake/3.22.2 R/4.2.1 fftw/3.3.10 gdal/3.1.4 udunits/2.2.28` 12 | - Sometimes specific nodes work better: under "Slurm Custom Arguments": `-x compute-f-17-[09-25]` 13 | 14 | # Issues 15 | 16 | ## Issue 1 - You can make a session and open Rstudio on O2 but cannot actually type. 17 | 18 | Potential solution: Make a new session and put the following under "Slurm Custom Arguments": 19 | ``` 20 | -x compute-f-17-[09-25] 21 | ``` 22 | 23 | ## Issue 2 - Everything was fine but then you lost connection. 24 | 25 | When you attempt to reload you see: 26 | 27 |
<em>[screenshot of the reconnect error message]</em>
</p>
30 | 31 | Potential solutions: Refresh your interactive sessions page first then refresh your R page. 32 | If that doesn't work close your R session and re-open from the interactive sessions page. 33 | If that doesn't work wait 5-10 min then repeat. 34 | 35 | ## Issue 3 - You made a session but cannot connect 36 | 37 | When you attempt to connect you see: 38 | 39 |
<em>[screenshot of the connection error message]</em>
</p>
42 | 43 | Potential solutions: This error indicates that either you did not load a gcc module or you loaded the incorrect one for the version of R you are running. 44 | Kill the current session and start a new one with the correct gcc loaded in the modules to be loaded tab. 45 | 46 | ## Issue 4 - When you finally refresh your environment is gone (THE WORST) 47 | 48 | What happened is you ran out of memory and R restarted itself behind the scenes. You will NOT get an error message for this of any kind. The best thing to do is quit your session and restart a new one with more memory. 49 | 50 | ## Issue 5 - Crashing 51 | 52 | Also, previous issues with O2portal RStudio crashing - “the compute-f architecture is not good enough and this part of the process fails because (maybe) it was built/installed on a newer node” . 53 | Solution: add the flag when you start the session to just exclude those nodes -x compute-f-17-[09-25] 54 | 55 | ## Issue 6 - commands using cores fail 56 | 57 | ``` 58 | Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length 59 | In addition: Warning message: 60 | In mclapply(X, function(...) { : 61 | scheduled cores 1, 2 did not deliver results, all values of the jobs will be affected 62 | ``` 63 | -------------------------------------------------------------------------------- /rc/arrays_in_slurm.md: -------------------------------------------------------------------------------- 1 | 2 | # Arrays in Slurm 3 | 4 | When I am working on large data sets my mind often drifts back to an old Simpsons episode. Bart is in France and being taught to pick grapes. They show him a detailed technique and he does it successfully. Then they say: 5 | 6 | 7 |
<em>[gif: the Simpsons grape-picking scene]</em>
</p>

<em>We've all been here</em>

A pipeline or process may seem easy or fast when you have 1-3 samples but totally daunting when you have 50. When scaling up you need to consider file overwriting, computational resources, and time.

One easy way to scale up is to use the array feature in slurm.

## What is a job array?

The O2 documentation (on Atlassian) says this about job arrays: "Job arrays can be leveraged to quickly submit a number of similar jobs. For example, you can use job arrays to start multiple instances of the same program on different input files, or with different input parameters. A job array is technically one job, but with multiple tasks." [link](https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Job-Arrays).

Array jobs run simultaneously rather than one at a time, which means they are very fast! Additionally, running a job array is very simple!

```bash
sbatch --array=1-10 my_script.sh
```

This will run my_script.sh 10 times with the job IDs 1,2,3,4,5,6,7,8,9,10

We can also put this directly into the bash script itself (although we will continue with the command line version here).
```bash
#SBATCH --array=1-10
```

We can specify any job IDs we want.

```bash
sbatch --array=1,7,12 my_script.sh
```
This will run my_script.sh 3 times with the job IDs 1,7,12

Of course we don't want to run the same job on the same input files over and over, that would be pointless. We can use the job IDs within our script to specify different input or output files. In bash the job ID is available in the special variable `${SLURM_ARRAY_TASK_ID}`.


## How can I use ${SLURM_ARRAY_TASK_ID}?

The value of `${SLURM_ARRAY_TASK_ID}` is simply the job ID. If I run

```bash
sbatch --array=1,7 my_script.sh
```
this will start two jobs, one where `${SLURM_ARRAY_TASK_ID}` is 1 and one where it is 7.

There are several ways we can use this. If we plan ahead and name our files with these numbers (e.g., sample_1.fastq, sample_2.fastq) we can directly refer to these files in our script: `sample_${SLURM_ARRAY_TASK_ID}.fastq`. However, using the ID for input files is often not a great idea as it means you need to strip away most of the information that you might put in these names.

Instead we can keep our sample names in a separate file and use [awk](awk.md) to pull the file names.

Here is our complete list of long sample names, which is found in our file `samples.txt`:

```
DMSO_control_day1_rep1
DMSO_control_day1_rep2
DMSO_control_day2_rep1
DMSO_control_day2_rep2
DMSO_KO_day1_rep1
DMSO_KO_day1_rep2
DMSO_KO_day2_rep1
DMSO_KO_day2_rep2
Drug_control_day1_rep1
Drug_control_day1_rep2
Drug_control_day2_rep1
Drug_control_day2_rep2
Drug_KO_day1_rep1
Drug_KO_day1_rep2
Drug_KO_day2_rep1
Drug_KO_day2_rep2
```

If we renamed all of these to 1-16 we would lose a lot of information that may be helpful to have on hand.
If these are all sam files and we want to convert them to bam files our script could look like this 81 | 82 | ```bash 83 | 84 | file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt) 85 | 86 | samtools view -S -b ${file}.sam > ${file}.bam 87 | 88 | ``` 89 | 90 | Since we have sixteen samples we would run this as 91 | 92 | ```bash 93 | sbatch --array=1-16 my_script.sh 94 | ``` 95 | 96 | So what is this script doing? `file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)` pulls the line of `samples.txt` that matched the job ID. Then we assign that to a variable called `${file}` and use that to run our command. 97 | 98 | Job IDs can also be helpful for output files or folders. We saw above how we used the job ID to help name our output bam file. But creating and naming folders is helpful in some instances as well. 99 | 100 | ```bash 101 | 102 | file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt) 103 | 104 | PREFIX="Folder_${SLURM_ARRAY_TASK_ID}" 105 | mkdir $PREFIX 106 | cd $PREFIX 107 | 108 | samtools view -S -b ../${file}.sam > ${file}.bam 109 | 110 | ``` 111 | 112 | This script differs from our previous one in that it makes a folder with the job ID (Folder_1 for job ID 1) then moves inside of it to execute the command. Instead of getting all 16 of our bam files output in a single folder each of them will be in its own folder labled Folder_1 to Folder_16. 113 | 114 | **NOTE** That we define `${file}` BEFORE we move into our new folder as samples.txt is only present in the main directory. 115 | 116 | 117 | 118 | -------------------------------------------------------------------------------- /rc/connection-to-hpc.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Connecting to hpc from local 3 | description: This code helps with connecting to hpc computers 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [ssh, hpc] 7 | --- 8 | 9 | 10 | # osx 11 | 12 | Use [Homebrew](http://brew.sh/) to get linux-like functionality on OSX 13 | 14 | Use [XQuartz](https://www.xquartz.org/) for X11 window functionality in OSX. 15 | 16 | # Odyssey with 2FA 17 | Enter one time password into the current window (https://github.com/jwm/os-x-otp-token-paster) 18 | 19 | # Fix 'Warning: No xauth data; using fake authentication data for X11 forwarding' 20 | Add this to your ~/.ssh/config on your OSX machine: 21 | 22 | ``` 23 | Host * 24 | XAuthLocation /opt/X11/bin/xauth 25 | ``` 26 | 27 | # Use ssh keys on remote server 28 | This will add your key to the OSX keychain, here your private key is assumed to be named "id_rsa": 29 | 30 | ``` 31 | ssh-add -K ~/.ssh/id_rsa 32 | ``` 33 | 34 | Now tell ssh to use the keychain. Add this to the ~/.ssh/config on your OSX machine: 35 | 36 | ``` 37 | Host * 38 | AddKeysToAgent yes 39 | UseKeychain yes 40 | IdentityFile ~/.ssh/id_rsa 41 | XAuthLocation /opt/X11/bin/xauth 42 | ``` 43 | -------------------------------------------------------------------------------- /rc/ipython-notebook-on-O2.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: IPython notebook on O2 3 | description: How to open up an ipython notebook running on O2 4 | category: computing 5 | subcategory: tips_tricsk 6 | tags: [python, ipython, singlecell] 7 | --- 8 | 9 | 1. First connect to O2 and open up an interactive session with all of the cores and memory you want to use. Here I'm connecting to the short queue so I can get more cores to use. 

```bash
srun -n 8 --pty -p short --mem 64G -t 0-12:00 --x11 /bin/bash
```

2. Note the name of the compute node you are on:

```bash
uname --nodename
```

3. Start a jupyter notebook server on a specific port:

```bash
jupyter notebook --no-browser --port=1234
```

This command will open up a notebook server on port 1234. You might have to pick
a different port if 1234 is being used. Note the token it provides for you, you
will need this token to use your notebook server.

4. Create an auto-closing SSH tunnel from your local machine to the jupyter notebook:

On your local machine do:

```bash
ssh -f -L 9999:localhost:9999 o2 -t 'ssh -f -L 9999:localhost:1234 compute-a-16-49 "sleep 60"'
```

This sets up two SSH tunnels. The first one connects port 9999 on your laptop to port 9999 on `login02` on o2 (134.174.159.22). The second connects port 9999 on `login02` to port 1234 on `compute-a-16-49` (the compute node you noted above). This script will auto-close the tunnel if you don't connect to it in 60 seconds, and will auto-close the tunnel when your session is closed.

5. Open a web browser and put `localhost:9999` as the address.

This should now connect you to the jupyter notebook server. It will ask you for
the token. If you put the token in, you can now log in and will be in your
home directory on O2.

You are now running a notebook server. This is just running using a single core now-- we want to hook up our computing that we reserved. We asked for 8 cores, so we'll set up a cluster with 8 cores. Click on "IPython Clusters", set the number of engines on
"default" to 8, and you will have your notebook connected to the 8 cores.

6. Start working!

You can open up a terminal by going to the Files tab and clicking on new and opening
the terminal. You can start a new notebook by going to the Files tab, clicking on
new and opening a python notebook.
-------------------------------------------------------------------------------- /rc/jupyter_notebooks.md: --------------------------------------------------------------------------------
# Jupyter notebooks

This post is for those who are interested in running notebooks seamlessly on O2. There is well-written documentation about running jupyter notebooks on O2; you can find it here: https://wiki.rc.hms.harvard.edu/display/O2/Jupyter+on+O2. However, this involves multiple steps, opening a bunch of terminals at times, and importantly finding an unused port every time. I found it quite cumbersome and annoying, so I spent some time solving it. It took me a while to nail it down, with the help of FAC RC, but they suggested a simpler solution. If you wish to run jupyter/R notebooks on O2 (where your data sits),
here is what you need to do:

Install https://github.com/aaronkollasch/jupyter-o2 by running `pip install jupyter-o2` in your local terminal.

Run `jupyter-o2 --generate-config` on the command line.
This will generate the configuration file and will tell you where it is located. Uncomment the fields that you need. Since the configuration file is the key, a template is attached for use; you will need to change your credentials though.

You are all set to run notebooks on O2 from your local machine now, without logging into the server.
Now, at your local terminal, run `jupyter-o2 notebook` for python notebooks.
Alternatively you can also do `jupyter-o2 lab` for R/python 13 | This will ask you a paraphrase, you should enter your ecommons password as paraphrase. 14 | Boom!!! you are good to go! Happy Pythoning :):) 15 | If you wish you run R notebooks on O2, refer this. https://docs.anaconda.com/anaconda/navigator/tutorials/r-lang/ 16 | 17 | 18 | # Example code 19 | 20 | Just to add, in the HMS-RC documentation they suggested any ports over 50000. To give examples of logging into a jupyter notebook session I have provided the code below. 21 | 22 | ## Creating a Jupyter notebook 23 | 24 | Log onto a login node 25 | 26 | ``` 27 | # Log onto O2 using a specific port - I used '50000' in this instance - you can choose a different port and just replace the 50000 with the number of your specific port 28 | ssh -Y -L 50000:127.0.0.1:50000 ecommons_id@o2.hms.harvard.edu 29 | ``` 30 | 31 | Once on the login node, you can start an interactive session specifying the port with `--tunnel` 32 | 33 | ``` 34 | # Create interactive session 35 | srun --pty -p interactive -t 0-12:00 --x11 --mem 128G --tunnel 50000:50000 /bin/bash 36 | ``` 37 | 38 | Load the modules that you will need 39 | 40 | ``` 41 | # Load modules 42 | module load gcc/9.2.0 python/3.8.12 43 | ``` 44 | 45 | Create environment for running analysis (example here is for velocity) 46 | 47 | ``` 48 | # Create virtual environment (only do this once) 49 | virtualenv velocyto --system-site-packages 50 | ``` 51 | 52 | Activate virtual environment 53 | 54 | ``` 55 | # Activate virtual environment 56 | source velocyto/bin/activate 57 | ``` 58 | 59 | Install Jupyter notebook and any other libraries (only need to do this once) 60 | 61 | ``` 62 | # Install juypter notebook 63 | pip3 install jupyter 64 | 65 | # Install any other libraries needed for analysis (this is for velocity) 66 | pip3 install numpy scipy cython numba matplotlib scikit-learn h5py click 67 | pip3 install velocyto 68 | pip3 install scvelo 69 | ``` 70 | 71 | To create a Jupyter notebook run the following (again instead of 50000, use your port #): 72 | 73 | ``` 74 | # Start jupyter notebook 75 | jupyter notebook --port=50000 --browser='none' 76 | ``` 77 | 78 | ## Logging onto an existing notebook 79 | 80 | ``` 81 | # Log onto O2 using a specific port - I used '50000' in this instance - you can choose a different port and just replace the 50000 with the number of your specific port 82 | ssh -Y -L 50000:127.0.0.1:50000 ecommons_id@o2.hms.harvard.edu 83 | 84 | # Create interactive session 85 | srun --pty -p interactive -t 0-12:00 --x11 --mem 128G --tunnel 50000:50000 /bin/bash 86 | 87 | # Load modules 88 | module load gcc/9.2.0 python/3.8.12 89 | 90 | # Activate virtual environment 91 | source velocyto/bin/activate 92 | 93 | # Open existing notebook 94 | jupyter notebook name_of_notebook.ipynb --port=50000 --browser='none' 95 | ``` 96 | 97 | ## Sharing your notebook 98 | To share the contents of your notebook, you can either upload the notebook directly to Github and add your client as a collaborator on the repo, or export the report as a markdown or PDF. 
99 | 100 | To export as a PDF, you need to have additional modules loaded and python packages installed: 101 | 102 | ``` 103 | module load texlive/2007 104 | 105 | pip3 install Pyppeteer 106 | pip3 install nbconvert 107 | ``` 108 | -------------------------------------------------------------------------------- /rc/keepalive.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Transfer files inside cluster 3 | description: This code helps with transfer files inside cluster 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [ssh, hpc] 7 | --- 8 | 9 | Useful for file transfers on O2's new transfer cluster (transfer.rc.hms.harvard.edu). 10 | 11 | The nohup command can be prepended to the bash command and the command will keep running after you logout (or have your connection interrupted). 12 | 13 | From HMS RC: 14 | 15 | `From one of the file transfer systems under transfer.rc.hms.harvard.edu , you can prefix your command with "nohup" to put it in the background and be able to log out without interrupting the process.` 16 | 17 | `For example, after logging in to e.g. the transfer01 host, run your command:` 18 | 19 | `nohup rsync -av /dir1 /dir2` 20 | 21 | `and then log out. rsync will keep running.` 22 | 23 | `To check in on the process later, just remember which machine you ran rsync and you can directly re-login to that system if you like.` 24 | 25 | `For example:` 26 | 27 | `1. ssh transfer.rc.hms.harvard.edu (let's say you land on transfer03), and then:` 28 | `2. ssh transfer01` 29 | `-- from there you can run the "ps" command or however you like to monitor the process.` 30 | 31 | 32 | ## Another option from John 33 | If you run tmux from the login node before you ssh to the transfer node to xfer files, you can drop your connection and then re-attach to your tmux session later. It should still be running your transfer. 34 | 35 | **General steps** 36 | 1) Login to O2 37 | 2) write down what login node your are on (usually something like login0#) 38 | *at login node* 39 | 3) Start a new tmux session 40 | `tmux new -s myname` 41 | 4) SSH to the transfer node 42 | `ssh user@transfer.rc.hms.harvard.edu` 43 | *on transfer node* 44 | 5) start transfer with rsync, scp etc. 45 | 6) close terminal window without logging out 46 | *time passes* 47 | 7) Login to O2 again 48 | 8) ssh to the login node you wrote down above 49 | `ssh user@login0#` 50 | 9) Reattach to your tmux session 51 | `tmux a -t myname` 52 | 10) Profit 53 | 54 | You can get around having to remember which node you logged into by alwasys logging into the same node. For example you can add this to your .bash_profile on OSX: 55 | `alias ssho2='ssh -XY -l user login05.o2.rc.hms.harvard.edu'` 56 | 57 | -------------------------------------------------------------------------------- /rc/manage-files.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Managing files 3 | description: This code helps with managing file names 4 | category: computing 5 | subcategory: tips_tricks 6 | tags: [bash, osx, linux] 7 | --- 8 | 9 | ## How to remove all files except the ones you want: 10 | 11 | First, expand rm capabilities by: 12 | `shopt -s extglob` 13 | 14 | Then use find and remove: 15 | `find . ! 
-name 'file.txt' -type f -exec rm -f {} +`


## Rename files

The `rename` utility has a lot of good options, but one pretty useful combination for removing whitespace and setting everything to lowercase is:

`rename -c --nows <files>`

## Use umask to restrict default permissions for users outside of the group

Set `umask 007` in your .bashrc. Then newly created directories will have 770 (rwxrwx---) permissions,
and files will have 660 (rw-rw----).
-------------------------------------------------------------------------------- /rc/openondemand.md: --------------------------------------------------------------------------------
## My initial notes on using the FAS-RC Open on Demand virtual desktop system

### Steps to get going
- Download and install the Cisco connect VPN

- Log in to the VPN using your FAS-RC credentials and two-factor authentication code

- Navigate to https://vdi.rc.fas.harvard.edu/pun/sys/dashboard

- Log in to the page using your FAS-RC user id and password

- Click on the Interactive Apps pulldown and select "Rstudio Server" under the Server heading. DO NOT select "RStudio Server (bioconductor + tidyverse)"

Here most of the settings are self-explanatory.
- I have tried out multiple cores (12) and while I am unsure it is using the full 48 cores parallel::detectCores finds, my simulation did run substantially faster (4 cores = 5% done when "12 cores" was at 75%). I would be interested to hear other people's experiences and how transparently/well it works with R.

- Maximum memory allocation is supposed to be 120GB. I haven't tried asking for more.

- I've been loading the R/4.02-fasrrc01 Core R versions. I tried the R/4.02-fasrc Core & gcc 9.3.0 first and ran into package compilation issues.

- You can set the R_LIBS_USER folder to use which will contain your library packages. Using this approach, I was able to install packages in session, delete the server and come back to the installed packages in a new session. You could theoretically also switch between R versions using this and the version selector.

- I haven't tried executing a script before starting Rstudio, but theoretically, I could see using this to launch a conda environment.

- I don't know about reservations but they sound interesting for getting a high mem machine.
-------------------------------------------------------------------------------- /rc/scheduler.md: --------------------------------------------------------------------------------
---
title: Alias for cluster jobs stats
description: This code helps with commands related to job submission
category: computing
subcategory: tips_tricks
tags: [bash, hpc]
---

# SLURM

* Useful aliases

```bash
alias bjobs='sacct -u ${USER} --format="JobID,JobName%25,NodeList,State,ncpus,start,elapsed" -s PD,R'
alias bjobs_all='sacct -u ${USER} --format="JobID,JobName%25,NodeList,State,ncpus,AveCPU,AveRSS,MaxRSS,MaxRSSTask,start,elapsed"'
```
-------------------------------------------------------------------------------- /rc/tmux.md: --------------------------------------------------------------------------------
Tmux is a great way to work on the server as it allows you to:
1) keep your session alive.
3) have multiple named sessions open to firewall tasks/projects.
4) run multiple windows/command lines from a single login (O2 allows a maximum of 2-3 logins to their system).
5 | 5) quickly spin off windows to do small commands (see 3). 6 | 7 | ### Useful resources 8 | 9 | #### [Tmux cheat sheet](https://tmuxcheatsheet.com/) 10 | 11 | 12 | #### Tmux configuration 13 | Tmux works great but can have some issues upon first use that make it challenging to use for those of us used to a GUI: 14 | a) the default command key is not great. 15 | b) it doesn't work well with a mouse. 16 | c) it doesn't let you copy text easily. 17 | d) it doesn't scroll your window easily. 18 | e) resizing the windows can be challenging. 19 | 20 | The confirguation file code below should make some of these issues easier: 21 | 22 | set -g default-terminal "screen-256color" 23 | set -g status-bg red 24 | set -g status-fg black 25 | 26 | # Use instead of the default as Tmux prefix 27 | set-option -g prefix C-a 28 | unbind-key C-b 29 | bind-key C-a send-prefix 30 | 31 | 32 | # Options enable mouse support in Tmux 33 | #set -g terminal-overrides 'xterm*:smcup@:rmcup@' 34 | # For Tmux >= 2.1 35 | #set -g mouse on 36 | # For Tmux <2.1 37 | # Make mouse useful in copy mode 38 | setw -g mode-mouse on 39 | # 40 | # # Allow mouse to select which pane to use 41 | set -g mouse-select-pane on 42 | # 43 | # # Allow mouse dragging to resize panes 44 | set -g mouse-resize-pane on 45 | # 46 | # # Allow mouse to select windows 47 | set -g mouse-select-window on 48 | 49 | 50 | # set colors for the active window 51 | # START:activewindowstatuscolor 52 | setw -g window-status-current-fg white 53 | setw -g window-status-current-bg red 54 | setw -g window-status-current-attr bright 55 | # END:activewindowstatuscolor 56 | 57 | 58 | ## Optional- act more like vim: 59 | #set-window-option -g mode-keys vi 60 | #bind h select-pane -L 61 | #bind j select-pane -D 62 | #bind k select-pane -U 63 | #bind l select-pane -R 64 | #unbind p 65 | #bind p paste-buffer 66 | #bind -t vi-copy v begin-selection 67 | #bind -t vi-copy y copy-selection 68 | 69 | 70 | # moving between panes 71 | # START:paneselect 72 | bind h select-pane -L 73 | bind j select-pane -D 74 | bind k select-pane -U 75 | bind l select-pane -R 76 | # END:paneselect 77 | 78 | 79 | # START:panecolors 80 | set -g pane-border-fg green 81 | set -g pane-border-bg black 82 | set -g pane-active-border-fg white 83 | set -g pane-active-border-bg yellow 84 | # END:panecolors 85 | 86 | # Command / message line 87 | # START:cmdlinecolors 88 | set -g message-fg white 89 | set -g message-bg black 90 | set -g message-attr bright 91 | # END:cmdlinecolorsd -g pane-active-border-bg yellow 92 | 93 | 94 | bind-key C-a last-window 95 | setw -g aggressive-resize on 96 | 97 | #### [Restoring your tmux session after reboot](https://andrewjamesjohnson.com/restoring-tmux-sessions/) 98 | -------------------------------------------------------------------------------- /rnaseq/RepEnrich2_guide.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to run repeat enrichment analysis 3 | description: This guide shows how to run RepEnrich2 4 | category: research 5 | subcategory: rnaseq 6 | tags: [annotation] 7 | --- 8 | 9 | RepEnrich2 tries to look at something that standard RNA-seq pipelines miss, the 10 | enrichment of repeats in NGS data. It is extremely slow and is a pain to get 11 | going. Below is a guide getting it working and has some links to a fork of 12 | RepEnrich2 I made that makes it more friendly to use. 13 | 14 | I have not actually validated the RepEnrich2 output, so caveat emptor. 
15 | 16 | # Preparing RepEnrich2 17 | 18 | ## Create isolated conda environment 19 | 20 | ```bash 21 | conda create -c bioconda -n repenrich2 python=2.7 biopython bedtools samtools bowtie2 bcbio-nextgen 22 | ``` 23 | 24 | ## Download my fork of RepEnrich2 25 | This has quality of life fixes such as memoization of outputs so if it fails you don't 26 | have to redo steps. 27 | 28 | ```bash 29 | git clone git@github.com:nerettilab/RepEnrich2.git 30 | ``` 31 | 32 | ## Download a pre-created index 33 | You can make your own, for example I made 34 | [hg38](https://www.dropbox.com/s/lefkk38q6bbj76b/Repenrich2_setup_hg38.tar.gz?dl=1) 35 | and the RepEnrich2 folks have mm9 and hg19 36 | [here](https://drive.google.com/drive/folders/0B8_2gE04f4QWNmdpWlhaWEYwaHM). But the RepeatMasker 37 | file it uses needs to be cleaned first and I'm not sure how they cleaned it. They had a hg38 one cleaned 38 | already from RepEnrich so I just used that. 39 | 40 | ## Download bcbio_RepEnrich2 41 | Download [bcbio_RepEnrich2](https://github.com/roryk/bcbio_RepEnrich2). This will need modification if you 42 | want to use it, but it is simple, I just didn't bother as I don't anticipate us running this again. 43 | 44 | # Running RepEnrich2 45 | `bcbio_RepEnrich2` is all you need to run it, the help should give you enough information to go on. 46 | annotation here is the file from RepeatMasker that was used to generate the RepEnrich setup. The 47 | bowtie index is a bowtie2 index of the genome you aligned to. Running RepEnrich2 takes FOREVER, so 48 | be sure to run it on the long queue. 49 | 50 | Example command: 51 | 52 | ```bash 53 | python bcbio_RepEnrich2.py --threads 16 ../human-dsrna/config/human-dsrna.yaml /n/app/bcbio/biodata/genomes/Hsapiens/hg38/bowtie2/hg38 metadata/hg38_repeatmasker_clean.txt metadata/RepEnrich2_setup_hg38/ 54 | ``` 55 | 56 | # RepEnrich2 outputs 57 | 58 | You will get three files for each sample, for example: 59 | 60 | ``` 61 | P1722_class_fraction_counts.txt 62 | P1722_family_fraction_counts.txt 63 | P1722_fraction_counts.txt 64 | ``` 65 | 66 | The `class` and `family` files are the counts in the `samplename_fraction_counts.txt` file aggregated by family or 67 | class. Those could be used as aggregate analyses, but the `fraciton_counts` looks at the different repeat 68 | types individually, so is more what folks are probably looking for. 69 | -------------------------------------------------------------------------------- /rnaseq/Volcano_Plots.md: -------------------------------------------------------------------------------- 1 | ## [Enhanced Volcano](https://bioconductor.org/packages/devel/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html) is a great and flexible way to creat volcano plots. 2 | 3 | Input is a dataframe of test statistics. It works well with the output of `lfcShrink()` 4 | Below is an example call and output: 5 | 6 | ``` 7 | library(EnhancedVolcano) 8 | 9 | EnhancedVolcano(shrunken_res_treatment, 10 | lab= NA, 11 | x = 'log2FoldChange', 12 | y = 'pvalue', title="Volcano Plot for Treatment", subtitle = "") 13 | ``` 14 | 15 |
<em>[example volcano plot output (see rnaseq/img/volcano.png)]</em>
</p>
18 | 19 | 20 | Almost every aspect is flexible and changable. 21 | -------------------------------------------------------------------------------- /rnaseq/ase.md: -------------------------------------------------------------------------------- 1 | # ASE = allele specific expression, allelic imbalance 2 | 3 | - https://stephanecastel.wordpress.com/2017/02/15/how-to-generate-ase-data-with-phaser/ 4 | 5 | 6 | ## Installation of Phaser: 7 | ``` 8 | git clone https://github.com/secastel/phaser.git 9 | module load gcc/6.2.0 10 | module load python/2.7.12 11 | cython/0.25.1 12 | pip install intervaltree --user 13 | cd phaser/phaser 14 | python setup.py build_ext --inplace 15 | ``` 16 | 17 | ASE QC: 18 | - https://github.com/gimelbrantlab/Qllelic 19 | -------------------------------------------------------------------------------- /rnaseq/bibliography.md: -------------------------------------------------------------------------------- 1 | # Normalization 2 | * [Comparing the normalization methods for 3 | the differential analysis of Illumina high- 4 | throughput RNA-Seq data](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0778-7). Paper that compares different normalization methods on RNASeq data. 5 | 6 | # Power 7 | * [Power in pairs: assessing the statistical value of paired samples in tests for differential expression](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302489/). Paper that looks at effect of paired-design on power in RNA-seq. 8 | 9 | # Functional analysis 10 | https://yulab-smu.github.io/clusterProfiler-book/index.html 11 | 12 | 13 | [RNA-seq qc](https://seqqc.wordpress.com/) 14 | -------------------------------------------------------------------------------- /rnaseq/failure_types: -------------------------------------------------------------------------------- 1 | Different ways that RNAseq can fail with examples. 2 | 3 | https://docs.google.com/presentation/d/1d5hyuTJMei0myG_vr7YR3vFewajwF9I9loS3I--kpnw/edit?usp=sharing 4 | -------------------------------------------------------------------------------- /rnaseq/img/test: -------------------------------------------------------------------------------- 1 | d 2 | -------------------------------------------------------------------------------- /rnaseq/img/volcano.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/rnaseq/img/volcano.png -------------------------------------------------------------------------------- /rnaseq/running_IRFinder.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to run intron retention analysis 3 | description: This code helps to run IRFinder in the cluster. 4 | category: research 5 | subcategory: rnaseq 6 | tags: [hpc, intro_retention] 7 | --- 8 | 9 | To run any of these commands, need to activate the bioconda IRFinder environment prior to running script. 10 | 11 | 1. 
First script creates reference build required for IRFinder 12 | 13 | ```bash 14 | #SBATCH -t 24:00:00 # Runtime in minutes 15 | #SBATCH -n 4 16 | #SBATCH -p medium # Partition (queue) to submit to 17 | #SBATCH --mem=128G # 128 GB memory needed (memory PER CORE) 18 | #SBATCH -o %j.out # Standard out goes to this file 19 | #SBATCH -e %j.err # Standard err goes to this file 20 | #SBATCH --mail-type=END # Mail when the job ends 21 | 22 | IRFinder -m BuildRefProcess -r reference_data/ 23 | ``` 24 | 25 | >**NOTE:** The files in the `reference_data` folder are sym links to the bcbio ref files and need to be named specifically `genome.fa` and `transcripts.gtf`: 26 | > 27 | >`genome.fa -> /n/app/bcbio/biodata/genomes/Hsapiens/hg19/seq/hg19.fa` 28 | > 29 | >`transcripts.gtf -> /n/app/bcbio/biodata/genomes/Hsapiens/hg19/rnaseq/ref-transcripts.gtf` 30 | 31 | 2. Second script (.sh) runs IRFinder and STAR on input file 32 | 33 | ```bash 34 | #!/bin/bash 35 | 36 | module load star/2.5.4a 37 | 38 | IRFinder -r /path/to/irfinder/reference_data \ 39 | -t 4 -d results \ 40 | $1 41 | ``` 42 | 43 | 3. Third script (.sh) runs a batch job for each input file in directory 44 | 45 | ```bash 46 | #!/bin/bash 47 | 48 | for fq in /path/to/*fastq 49 | do 50 | 51 | sbatch -p medium -t 0-48:00 -n 4 --job-name irfinder --mem=128G -o %j.out -e %j.err --wrap="sh /path/to/irfinder/irfinder_input_file.sh $fq" 52 | sleep 1 # wait 1 second between each job submission 53 | 54 | done 55 | ``` 56 | 57 | 4. Fourth script takes output (IRFinder-IR-dir.txt) and uses the replicates to determine differential expression using the Audic and Claverie test (# replicates < 4). analysisWithLowReplicates.pl script comes with the IRFinder github repo clone, so I cloned the repo at https://github.com/williamritchie/IRFinder/. Notes on the Audic and Claverie test can be found at: https://github.com/williamritchie/IRFinder/wiki/Small-Amounts-of-Replicates-via-Audic-and-Claverie-Test. 58 | 59 | ```bash 60 | #!/bin/bash 61 | 62 | #SBATCH -t 24:00:00 # Runtime in minutes 63 | #SBATCH -n 4 64 | #SBATCH -p medium # Partition (queue) to submit to 65 | #SBATCH --mem=128G # 8 GB memory needed (memory PER CORE) 66 | #SBATCH -o %j.out # Standard out goes to this file 67 | #SBATCH -e %j.err # Standard err goes to this file 68 | #SBATCH --mail-type=END # Mail when the job ends 69 | 70 | analysisWithLowReplicates.pl \ 71 | -A A_ctrl/Pooled/IRFinder-IR-dir.txt A_ctrl/AJ_1/IRFinder-IR-dir.txt A_ctrl/AJ_2/IRFinder-IR-dir.txt A_ctrl/AJ_3/IRFinder-IR-dir.txt \ 72 | -B B_nrde2/Pooled/IRFinder-IR-dir.txt B_nrde2/AJ_4/IRFinder-IR-dir.txt B_nrde2/AJ_5/IRFinder-IR-dir.txt B_nrde2/AJ_6/IRFinder-IR-dir.txt \ 73 | > KD_ctrl-v-nrde2.tab 74 | ``` 75 | 76 | 5. Output `KD_ctrl-v-nrde2.tab` file can be read directly into R for filtering and results exploration. 77 | 78 | 6. Rmarkdown workflow (included in report): IRFinder_report.md 79 | -------------------------------------------------------------------------------- /rnaseq/running_leafviz.md: -------------------------------------------------------------------------------- 1 | # Leafviz - Visualize Leafcutter Results 2 | 3 | 4 | **All scripts found in `/HBC Team Folder (1)/Resources/LeafCutter_2023`** 5 | 6 | ### Step 1 - Get annotation files 7 | 8 | Annotation files already prepared for hg38 are in the folder listed above in a folder named new_hg38. 
To make annotation files for another organism run 9 | 10 | ```bash 11 | ./gtf2leafcutter.pl -o /full/path/to/directory/annotation_directory_name/annotation_file_prefix \ 12 | /full/path/to/gencode/annotation.gtf 13 | ``` 14 | 15 | ### Step 2 - Make RData files from your results 16 | 17 | Use the `prepare_results.R` script in the linked folder. The one on the leafcutter github has not been updated. 18 | Your leafcutter should have output `leafcutter_ds_cluster_significance_XXXXX.txt`, `leafcutter_ds_effect_sizes_XXXXX.txt`, and `XXXX_perind_numers.counts.gz`. 19 | You will need all three of these files and the groups file you made. An example groups file for my PD comparison is below: 20 | 21 | ```bash 22 | NCIH1568_RB1_KO_DMSO_Replicate_3-ready.bam DMSO 23 | NCIH1568_RB1_KO_DMSO_Replicate_2-ready.bam DMSO 24 | NCIH1568_RB1_KO_DMSO_Replicate_1-ready.bam DMSO 25 | NCIH1568_RB1_KO_PD_Replicate_1-ready.bam PD 26 | NCIH1568_RB1_KO_PD_Replicate_3-ready.bam PD 27 | NCIH1568_RB1_KO_PD_Replicate_2-ready.bam PD 28 | ``` 29 | 30 | Below is an example to make the RData from the PD comparison. This code outputs `PD_new.RData`. 31 | Note that the annotation has the file path and the annotation file prefix. 32 | 33 | ```bash 34 | ./prepare_results.R -m PD_groups.txt NCIH_PD_perind_numers.counts.gz \ 35 | leafcutter_ds_cluster_significance_PD.txt leafcutter_ds_effect_sizes_PD.txt \ 36 | new_hg38/new_hg38 -o PD_new.RData 37 | ``` 38 | 39 | ### Step 3 - Visualize 40 | 41 | Once you have RDatas made for all of your contrasts it is time to actually run leafviz. 42 | The critical scripts to run leafviz are `run_leafviz.R `, `ui.r`, and `server.R`. 43 | Before you can run it you **must** change the path on line 41 of `run_leafviz.R`. 44 | This path needs to reflect the location of the `ui.R` and `server.R` files. 45 | To run leafviz with my PD data set I give 46 | 47 | ```bash 48 | ./run_leafviz.R PD_new.RData 49 | ``` 50 | 51 | This will open a new tab in your browswer with your results! 52 | 53 | For more on what is being shown check the leafviz [documentation](http://davidaknowles.github.io/leafcutter/articles/Visualization.html) 54 | -------------------------------------------------------------------------------- /rnaseq/running_rMATS.md: -------------------------------------------------------------------------------- 1 | ## rMATS for differential splicing analysis 2 | 3 | * Event-based analysis of splicing (e.g. skipped exon, retained intron, alternative 5' and 3' splice site) 4 | * rMATS handles replicate RNA-Seq data from both paired and unpaired study design 5 | * statistical model of rMATS calculates the P-value and false discovery rate that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold 6 | 7 | Software: https://rnaseq-mats.sourceforge.net/ 8 | 9 | Paper: https://www.pnas.org/doi/full/10.1073/pnas.1419161111 10 | 11 | GitHub: https://github.com/Xinglab/rmats-turbo 12 | 13 | ### Installation 14 | 15 | Issues with the conda build installation provided on the GitHub page `./build_rmats --conda`. Had problem with shared libraries (" "loading shared libraries" error ). 16 | 17 | Instead install from bioconda. 
Reference: https://groups.google.com/g/rmats-user-group/c/S1GFEqB9TE8/m/YV9R27CoCwAJ?pli=1 18 | 19 | ```bash 20 | 21 | # Need speicific python version 22 | conda create -n "rMATS_python3.7" python=3.7 23 | 24 | conda activate rMATS_python3.7 25 | 26 | conda install -c conda-forge -c bioconda rmats=4.1.0 27 | 28 | ``` 29 | 30 | If you are running as a script on O2: 31 | 32 | ```bash 33 | #! /bin/bash 34 | 35 | #SBATCH -t 0-24:00 # Runtime 36 | #SBATCH -p medium # Partition (queue) 37 | #SBATCH -J rmats # Job name 38 | #SBATCH -o rmats_frag.out # Standard out 39 | #SBATCH -e rmats_frag.err # Standard error 40 | #SBATCH --mem=50G # Memory needed per core 41 | #SBATCH -c 6 42 | 43 | 44 | # USAGE: For paired-end BAM files;run rMATS 45 | 46 | # Define the project path 47 | path=/n/data1/cores/bcbio/PIs/peter_sicinski/sicinski_inhibition_RNAseq_human_hbc04676 48 | 49 | # Change directories 50 | cd ${path}/rMATS 51 | 52 | # Activate conda env for rmats 53 | source ~/miniconda3/bin/activate 54 | conda init bash 55 | source ~/.bashrc 56 | conda activate rMATS_python3.7 57 | 58 | ``` 59 | -------------------------------------------------------------------------------- /rnaseq/strandedness.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Stranded RNA-seq libraries. 3 | description: Explains strandedness and where to find info in bcbio. 4 | category: research 5 | subcategory: rnaseq 6 | --- 7 | 8 | Bulk RNA-seq libraries retaining strand information (stranded) are useful to quantify expression with higher accuracy for opposite 9 | strand transcripts which overlap or have overlapping UTRs. 10 | https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1876-7. 11 | 12 | Bcbio RNA-seq pipeline has a 'strandedness' parameter: [unstranded|firststrand|secondstrand] 13 | https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html?highlight=strand#configuration. <- link not working* 14 | 15 | The terminology was inherited from Tophat, see the detailed description in the Salmon doc. 16 | https://salmon.readthedocs.io/en/latest/library_type.html 17 | Note, that firstrand = ISR for PE and SR for SE. 18 | 19 | If the strandedness is unknown, run a small subset of reads with 'unstranded' in bcbio and check out what Salmon reports in 20 | `bcbio_project/final/sample/salmon/lib_format_counts.json`: 21 | ``` 22 | { 23 | "read_files": [ 24 | "/dev/fd/63", 25 | "/dev/fd/62" 26 | ], 27 | "expected_format": "IU", 28 | "compatible_fragment_ratio": 1.0, 29 | "num_compatible_fragments": 721856, 30 | "num_assigned_fragments": 721856, 31 | "num_frags_with_concordant_consistent_mappings": 692049, 32 | "num_frags_with_inconsistent_or_orphan_mappings": 47441, 33 | "strand_mapping_bias": 0.9477291347866986, 34 | "MSF": 0, 35 | "OSF": 0, 36 | "ISF": 36174, 37 | "MSR": 0, 38 | "OSR": 0, 39 | "ISR": 655875, 40 | "SF": 37676, 41 | "SR": 9765, 42 | "MU": 0, 43 | "OU": 0, 44 | "IU": 0, 45 | "U": 0 46 | } 47 | ``` 48 | Here the majority of reads are ISR. 49 | 50 | Another way to check strand bias is 51 | `bcbio_project/final/sample/qc/qualimap_rnaseq/rnaseq_qc_results.txt`. 52 | It has `SSP estimation (fwd/rev) = 0.04 / 0.96` meaning strand bias (ISR, firststrand). 53 | 54 | Yet another way to confirm strand bias is seqc. 55 | http://rseqc.sourceforge.net/#infer-experiment-py. 
56 | It uses a small subset of the input bam file: 57 | `infer_experiment.py -r /bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.bed -i test.bam` 58 | 59 | ``` 60 | This is PairEnd Data 61 | Fraction of reads failed to determine: 0.1461 62 | Fraction of reads explained by "1++,1--,2+-,2-+": 0.0177 63 | Fraction of reads explained by "1+-,1-+,2++,2--": 0.8362 64 | ``` 65 | -------------------------------------------------------------------------------- /rnaseq/tools.md: -------------------------------------------------------------------------------- 1 | - [IsoformSwitchAnalyzer](https://bioconductor.org/packages/release/bioc/vignettes/IsoformSwitchAnalyzeR/inst/doc/IsoformSwitchAnalyzeR.html) 2 | - LP/VB?, 2019/02? 3 | - version#? 4 | - helps to detect alternative splicing 5 | - output very nice figures 6 | - what requirements are needed (e.g. R-3.5.1, etc.)? 7 | - no tutorials available 8 | - not incorporated into bcbio 9 | - I tried it and an example of a consults is here:https://code.harvard.edu/HSPH/hbc_RNAseq_christiani_RNAediting_on_lung_in_humna_hbc02307. This packages has very nice figures: https://www.dropbox.com/work/HBC%20Team%20Folder%20(1)/consults/david_christiani/RNAseq_christiani_RNAediting_on_lung_in_humna?preview=dtu.html (see at the end of the report). 10 | 11 | - [DEXseq](https://bioconductor.riken.jp/packages/3.0/bioc/html/DEXSeq.html) 12 | - LP/VB/RK?, date? 13 | - version#? 14 | - used to call isoform switching 15 | - not recommended - use DTU tool instead 16 | - what requirements are needed (e.g. R-3.5.1, etc.)? 17 | - no tutorials available 18 | - yes, DEXseq is incorporated in bcbio 19 | - Following this paper from MLove et al: https://f1000research.com/articles/7-952/v3 I used salmon and DEXseq to call isoform switching. This consult has an example: https://code.harvard.edu/HSPH/hbc_RNAseq_christiani_RNAediting_on_lung_in_humna_hbc02307. I found that normally one isomform changes a lot and another very little, but I found some examples were the switching is more evident. 20 | 21 | - [clusterProfiler](https://yulab-smu.github.io/clusterProfiler-book/index.html) 22 | -------------------------------------------------------------------------------- /scrnaseq/10XVisium.md: -------------------------------------------------------------------------------- 1 | ## Analysis of 10X Visium data 2 | 3 | > Download and install spaceranger: https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/installation 4 | 5 | Helpful resource: https://lmweber.org/OSTA-book/ 6 | 7 | ### Analysis software/packages 8 | * [scanpy](https://scanpy-tutorials.readthedocs.io/en/latest/spatial/basic-analysis.html) 9 | * [Spatial transcriptomics with Seurat](https://yu-tong-wang.github.io/talk/sc_st_data_analysis_R.html) 10 | * [Spatial single-cell quantification with alevin-fry](https://combine-lab.github.io/alevin-fry-tutorials/2021/af-spatial/) 11 | 12 | 13 | ### 1. BCL to FASTQ 14 | 15 | `spaceranger mkfastq` can be used here. Input is the flow cell directory. 16 | 17 | Note that **if your SampleSheet is formatted for BCL Convert**, which is Illumina's new demultiplexing software that is soon going to replace bcl2fastq, you will get an error. 18 | 19 | You will need to change the formatting slightly. 20 | 21 | https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/mkfastq#simple_csv 22 | 23 | If you are creating the simple csv samplesheet with specific oligo sequences for each sample index you may need to make edits. 
The I2 auto-orientation detector cannot be activated when supplying a simple csv. For example, the NextSeq instrument needs the index 2 in reverse complement. So if you had: 24 | 25 | ``` 26 | Lane,Sample,Index,Index2 27 | *,CD_3-GEX_03,GCGGGTAAGT,TAGCACTAAG 28 | ``` 29 | 30 | You can either: 31 | 32 | 1. Specify the reverse complement Index2 oligo 33 | 34 | ``` 35 | *,CD_3-GEX_03,GCGGGTAAGT,CTTAGTGCTA 36 | ``` 37 | 38 | 2. Use the 10x index names, e.g. SI-TT-G6 39 | 40 | ``` 41 | *,CD_3-GEX_03,SI-TT-G6 42 | ``` 43 | 44 | You can find more information [linked here on the 10X website](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/bcl2fastq-direct#sample-sheet) 45 | 46 | Additionally, if you find you have `AdapterRead1` or `AdapterRead2` under "Settings" in this file, you will want to remove that. For any 10x library, regardless of how the demultiplexing is being done, we do not recommend adapter trimming via the Illumina samplesheet settings -- **this will cause problems with reads in downstream analyses**. 47 | 48 | 49 | **To create FASTQ files:** 50 | 51 | ```bash 52 | spaceranger mkfastq --run data/11-7-2022-DeVries-10x-GEX-Visium/Files \ 53 | --simple-csv samplesheets/11-7-2022-DeVries-10x-GEX-Visium_Samplesheet.csv \ 54 | --output-dir fastq/11-7-2022-DeVries-10x-GEX-Visium 55 | 56 | ``` 57 | 58 | ### 2. Image files 59 | 60 | Each slide has 4 capture areas and therefore for a single slide you should have 4 image files. 61 | 62 | More on image types [from 10X docs here](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/image-recommendations) 63 | 64 | * Check what type of image you have (you will need to specify it in `spaceranger` with the correct flag) 65 | * Open up the image to make sure you have the fiducial border. It's probably done for you. If there are issues with the fiducial alignment (i.e. too tall, too wide) given to you, you may need to manually align using the Loupe browser 66 | 67 | 68 | ### 3. Counting expression data 69 | 70 | The next step is to quantify expression for each capture area. To do this we will use [`spaceranger count`](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/tutorials/count-ff-tutorial). This command will need to be run for each capture area. Below is the command for a single capture area (in this case Slide 1, capture area A1). You may find your files have not been named with A-D, so map them accordingly.
71 | 72 | A few things to note if you have samples that were run on multiple flow cells: 73 | 74 | * include the `--sample` argument to specify the samplename which corresponds to the capture area 75 | * for the `--fastqs` you can add multiple paths to the different flow cell folders and separate them by a comma 76 | 77 | ```bash 78 | 79 | spaceranger count --id="CD_Visium_01" \ 80 | --sample=CD_Visium_01 \ 81 | --description="Slide1_CaptureArea1" \ 82 | --transcriptome=refdata-gex-mm10-2020-A \ 83 | --fastqs=mkfastq/11-10-2022-DeVries-10x-GEX-Visium/AAAW33YHV/,mkfastq/11-4-2022-Devries-10x-GEX-Visium/AAAW3N3HV/,mkfastq/11-7-2022-DeVries-10x-GEX-Visium/AAAW352HV/,mkfastq/11-8-2022-DeVries-10x-GEX-Visium/AAAW3FCHV/,mkfastq/11-9-2022-DeVries-10x-GEX-Visium/AAAW3F3HV/ \ 84 | --image=images/100622_Walker_Slides1_and_2/V11S14-092_20221006_01_Field1.tif\ 85 | --slide=V11S14-092 \ 86 | --area=A1 \ 87 | --localcores=6 \ 88 | --localmem=20 89 | 90 | ``` 91 | -------------------------------------------------------------------------------- /scrnaseq/CellRanger.md: -------------------------------------------------------------------------------- 1 | # When running Cell Ranger on O2 2 | ## Shared by Victor 3 | > 10x documentation is the one he uses: 4 | 5 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger 6 | 7 | > For the custom genome: 8 | 9 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr 10 | 11 | > For GFP: 12 | 13 | https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr#marker 14 | 15 | ## Build custom genome 16 | > High level steps (based on the 10x tutorial): 17 | 1. Download the gtf and fasta files for the species of interest; 18 | 2. Filter gtf with `cellranger mkgtf` command; 19 | 3. Create the fasta file for the additional gene (for example, GFP); 20 | 4. Create the corresponding gtf file for the additional gene; 21 | 5. Append fasta file of the additional gene to the end of the fasta file for the genome; 22 | 6. Append gtf file of the additional gene to the end of the gtf file for the genome; 23 | 7. Make custom genome with `cellranger mkref` command; 24 | 25 | -------------------------------------------------------------------------------- /scrnaseq/Demuxafy_HowTo.md: -------------------------------------------------------------------------------- 1 | # How to run Demuxafy on O2 2 | 3 | For detailed instructions and updates on `demuxafy`, see the comprehensive [Read the Docs](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/index.html#) 4 | 5 | 6 | ## Installation 7 | 8 | I originally downloaded the `Demuxafy.sif` singularity image for use on O2 as instructed [here](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/Installation.html). However, **this singularity image did not pass O2's security checks**. The folks at HMS-RC were kind enough to amend the image for me so that it would pass the security checks. The working image is found at: `/n/app/singularity/containers/Demuxafy.sif` allowing anyone to use it. 9 | 10 | Of note, this singularity image includes a bunch of software, including popscle, demuxlet, freemuxlet, souporcell and other demultiplexing as well as doublet detection tools, so very useful to have installed! 
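As a quick check that the image works for your account, you can `exec` a bundled tool straight from the shared image (a minimal sketch; the `DEMUXAFY` variable is just a convenience used in the commands further down, and the exact Singularity module name on O2 may differ):

```bash
# Load Singularity on O2 first if it is not already on your PATH (module name/version may differ)
module load singularity

# Convenience variable pointing at the shared Demuxafy image
DEMUXAFY=/n/app/singularity/containers/Demuxafy.sif

# General pattern: run any bundled tool through the image;
# most of them (e.g. popscle) print their usage when run without arguments
singularity exec $DEMUXAFY popscle
```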
11 | 12 | 13 | ## Input data 14 | 15 | Each tool included in `demuxafy` requires slightly different input (see [Read the Docs](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/index.html#)). 16 | 17 | For the demultiplexing tools, in most cases, you will need: 18 | 19 | - A common SNP genotypes VCF file (pre-processed VCF files can be downloaded [here](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/DataPrep.html), which is what I did after repeatedly failing to re-generate my own VCF file from the 1000 genome dataset following the provided instructions...) 20 | - A Barcode file (`outs/raw_feature_bc_matrix/barcodes.tsv.gz` from a typical `cellranger count` run) 21 | - A BAM file of aligned single-cell reads (`outs/possorted_genome_bam.bam` from a typical `cellranger count` run) 22 | - Knowledge of the number of samples in the pool you're trying to demultiplex 23 | - Potentially, a FASTA file of the genome your sample was aligned to 24 | 25 | _NOTE_: When working from a multiplexed dataset (e.g. cell hashing experiment), you may have to re-run `cellranger count` instead of `cellranger multi` to generate the proper barcodes and BAM files. In addition, it may be necessary to use the `barcodes.tsv.gz` file from the `filtered_feature_bc_matrix` (instead of raw) in such cases (see for example this [issue](https://github.com/wheaton5/souporcell/issues/128) when running `souporcell`). 26 | 27 | 28 | ## Pre-processing steps 29 | 30 | Once you've collated those files, you need to make sure your VCF and BAM files are sorted in the same way. This can be achieved by running the following command after sourcing `sort_vcf_same_as_bam.sh` from the Aerts' Lab popscle helper tool GitHub repo (available [here](https://github.com/aertslab/popscle_helper_tools/blob/master/sort_vcf_same_as_bam.sh)): 31 | 32 | ``` 33 | # Sort VCF file in same order as BAM file 34 | sort_vcf_same_as_bam.sh $BAM $VCF > demuxafy/data/GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding_sorted.vcf 35 | ``` 36 | 37 | ### dsc pileup 38 | 39 | If you wish to run `freemuxlet` (and possibly other tools I haven't piloted), you will also need to run `dsc-pileup` (available within the singularity image) ahead of `freemuxlet` itself. For larger samples (>30k cells), it also helps (= significantly speeds up computational time, from several days to a couple of hours) to pre-filter the BAM file using another of the Aerts' Lab popscle helper tool scripts: `filter_bam_file_for_popscle_dsc_pileup.sh` (available [here](https://github.com/aertslab/popscle_helper_tools/blob/master/filter_bam_file_for_popscle_dsc_pileup.sh)) 40 | 41 | ``` 42 | # [OPTIONAL but recommended] 43 | module load gcc/9.2.0 samtools/1.14 44 | scripts/filter_bam_file_for_popscle_dsc_pileup.sh $BAM $BARCODES demuxafy/data/GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding_sorted.vcf demuxafy/data/possorted_genome_bam_filtered.bam 45 | 46 | # Run popscle pileup ahead of freemuxlet 47 | singularity exec $DEMUXAFY popscle dsc-pileup --sam demuxafy/data/possorted_genome_bam_filtered.bam --vcf demuxafy/data/GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding_sorted.vcf --group-list $BARCODES --out $FREEMUXLET_OUTDIR/pileup 48 | ``` 49 | 50 | _NOTE_: When running the dsc-pileup step on O2, at some point the job might get stalled despite no error message being issued. From my experience, this usually means that the requested memory needs to be increased (I used 48G-56G for most samples I processed, and encountered issues when lowering down to 32G). 
After filtering the BAM file and with the appropriate amount of memory available, the dsc-pileup step usually completes within 2-3 hours. 51 | 52 | 53 | ## Workflow 54 | 55 | After that, you should be set to run whichever demultiplexing tool you want! See sample scripts for a simple case (small 10X study) in the following [GitHub repo](https://github.com/hbc/neuhausser_scRNA-seq_human_embryo_hbc04528/tree/main/pilot_scRNA-seq/demuxafy/scripts); and for a more complex case (large study, multiplexed 10X data using cell hashing) [here](https://github.com/hbc/hbc_10xCITESeq_Pregizer-Visterra-_hbc04485/tree/main/demuxafy/scripts) 56 | 57 | You also have the option to generate combined results files to contrast results from different software more easily, as described [here](https://demultiplexing-doublet-detecting-docs.readthedocs.io/en/latest/CombineResults.html), and as implemented in the `combine_results.sbatch` script in the first GitHub repo linked above. 58 | -------------------------------------------------------------------------------- /scrnaseq/MDS_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hbc/knowledgebase/4c3ad97e61fef4f59d79ba60c05122ac3a4a282e/scrnaseq/MDS_plot.png -------------------------------------------------------------------------------- /scrnaseq/README.md: -------------------------------------------------------------------------------- 1 | # scRNA-seq 2 | 3 | * **[Tools for scRNA-seq analysis](tools.md):** This document lists the various tools that are currently being used for scRNA-seq analysis (and who has used/tested them) in addition to new tools that we are interested in but have yet to be tested. 4 | 5 | * **Tutorials for scRNA-seq analysis:** These documents are tutorials to help you with various types of scRNA-seq analysis. 6 | 7 | - **[Single-Cell-conda.md](https://github.com/hbc/knowledgebase/blob/master/research/scrnaseq/Single-Cell-conda.md):** installing tools for scRNA-seq analysis with conda. 8 | - **[Single-Cell.md](https://github.com/hbc/knowledgebase/blob/master/research/scrnaseq/Single-Cell.md):** installing tools and setting up docker for single cell rnaseq 9 | - **[rstudio_sc_docker.md](https://github.com/hbc/knowledgebase/blob/master/research/scrnaseq/rstudio_sc_docker.md):** This docker image contains an rstudio installation with some helpful packages for singlecell analysis. It also includes a conda environment to deal with necessary python packages (like umap-learn). 10 | - **[Single-cell analysis workflow](https://github.com/hbc/tutorials/tree/master/scRNAseq/scRNAseq_analysis_tutorial):** tutorials walking through the steps in a single-cell RNA-seq analysis, including differential expression analysis, power analysis, and creating a SPRING interface 11 | 12 | * **[Bibliography](bibliography.md):** This document lists relevant papers pertaining to scRNA-seq analysis 13 | -------------------------------------------------------------------------------- /scrnaseq/SNP_demultiplex.md: -------------------------------------------------------------------------------- 1 | # Demultiplexing SC data using SNP information 2 | 3 | ## Overview 4 | 5 | ## Methods: 6 | 7 | * scSplit: 8 | 9 | 10 | **References**: 11 | 12 | * [Paper](https://doi.org/10.1186/s13059-019-1852-7) 13 | 14 | * [Repo](https://github.com/jon-xu/scSplit) 15 | 16 | * Demuxlet/Freemuxlet/Popscle: 17 | 18 | Demuxlet is the first iteration of the software. 
Popscle is a suite that includes an improved version of demuxlet and also freemuxlet. It is recommended 19 | by the authors to use popscle. 20 | 21 | **Running it:** 22 | 23 | Installing demuxlet is not straightforward with very particular instructions. A similar situation might happen with popscle which it's not published yet. 24 | I recommend using Docker. The repo contains the [Dockerfile](https://github.com/statgen/popscle/blob/master/Dockerfile). You can use it to create your own 25 | docker image. One available is [here](https://hub.docker.com/repository/docker/vbarrerab/popscle). This image can also be used to create a singularity container 26 | on O2. 27 | 28 | _Running on O2: singularity_ 29 | 30 | 31 | 32 | singularity exec -B :/bam_files,:/vcf_files,:/results 33 | /n/app/singularity/containers// popscle dsc-pileup --sam /bam_files/ --vcf /vcf_files/ 34 | --out /results/ 35 | 36 | **Recommendations:** 37 | 38 | It is highly reccomended to reduce the number of reads and SNPs before running 39 | 40 | 41 | 42 | **References**: 43 | 44 | _Demuxlet_ 45 | 46 | * [Paper - Demuxlet](https://www.nature.com/articles/nbt.4042) 47 | 48 | * [Repo](https://github.com/statgen/demuxlet) 49 | 50 | _Popscle (Demuxlet/Freemuxlet)_ 51 | 52 | * [Repo](https://github.com/statgen/popscle) 53 | 54 | _popscle helper tools_ 55 | 56 | * [Repo](https://github.com/aertslab/popscle_helper_tools) 57 | -------------------------------------------------------------------------------- /scrnaseq/Single-Cell-conda.md: -------------------------------------------------------------------------------- 1 | *Single cell analyses require a lot of memory and often fail on the laptops. 2 | Having R + Seurat installed in a conda environment + interactive session or batch jobs with 50-100G RAM helps.* 3 | 4 | # 1. Use conda from bcbio 5 | ``` 6 | which conda 7 | /n/app/bcbio/dev/anaconda/bin/conda 8 | conda --version 9 | conda 4.6.14 10 | ``` 11 | 12 | # 2. Create and setup r conda environment 13 | ``` 14 | conda create -n r r-essentials r-base zlib pandoc 15 | conda init bash 16 | conda config --set auto_activate_base false 17 | . ~/.bashrc 18 | ``` 19 | 20 | # 3. Activate conda env 21 | ``` 22 | conda activate r 23 | which R 24 | 25 | ``` 26 | 27 | # 4. Install packages from within R 28 | 29 | ## 4.1 Install Seurat 30 | ``` 31 | R 32 | install.packages("Seurat") 33 | library(Seurat) 34 | q() 35 | ``` 36 | 37 | ## 4.2 Install Monocle 38 | ``` 39 | R 40 | install.packages(c("BiocManager", "remotes")) 41 | BiocManager::install("monocle") 42 | q() 43 | ``` 44 | 45 | ## 4.3 Install liger 46 | ``` 47 | R 48 | install.packages("devtools") 49 | library(devtools) 50 | install_github("MacoskoLab/liger") 51 | library(liger) 52 | q() 53 | ``` 54 | 55 | # 5. Install umap-learn for UMAP clustering 56 | ``` 57 | pip install umap-learn 58 | ``` 59 | 60 | # 6. Deactivate conda 61 | ``` 62 | conda deactivate 63 | ``` 64 | 65 | # 7. (Troubleshooting) 66 | - It may ask you to install github token - too many packages loaded from github. 67 | I generated token on my laptop and placed it in ~/.Renviron 68 | - BiocManager::install("slingshot") - I failed to install it due to gsl issues. 
69 | - when running a batch job, use source activate r/ source deactivate 70 | - if conda is trying to write in bcbio cache, check and set cache priority, your home cache should be first: 71 | `conda info`, 72 | ~/.condarc 73 | ``` 74 | pkgs_dirs: 75 | - /home/[UID]/.conda/pkgs 76 | - /n/app/bcbio/dev/anaconda/pkgs 77 | ``` 78 | -------------------------------------------------------------------------------- /scrnaseq/bcbio_indrops3.md: -------------------------------------------------------------------------------- 1 | # Counting cells with bcbio for inDrops3 data - proto-SOP 2 | 3 | ## Last use - 2020-03-17 4 | 5 | ## 1. Check reference genome and transcriptome - is it a mouse project? 6 | - mm10 reference genome: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10 7 | - transcriptome_fasta: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.fa 8 | - transcriptome_gtf: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf 9 | 10 | ## 2. Create bcbio project structure in /scratch 11 | ``` 12 | mkdir sc_mouse 13 | cd sc_mouse 14 | mkdir config input final work 15 | ``` 16 | 17 | ## 3. Prepare fastq input in sc_mouse/input 18 | - some FC come in 1..4 lanes, merge lanes for every read: 19 | ``` 20 | cat lane1_r1.fq.gz lane2_r1.fq.gz > project_1.fq.gz 21 | cat lane1_r2.fq.gz lane2_r2.fq.gz > project_2.fq.gz 22 | ``` 23 | - cat'ing gzip files sounds ridiculous, but works for the most part, for purists: 24 | ``` 25 | zcat KM_lane1_R1.fastq KM_lane2_R1.fastq.gz | gzip > KM_1.fq.gz 26 | ``` 27 | 28 | - some cores send bz2 files not gz 29 | ``` 30 | bunzip2 *.bz2 31 | cat *R1.fastq | gzip > sample_1.fq.gz 32 | ``` 33 | 34 | - some cores produce R1,R2,R3,R4, others R1,R2,I1,I2, rename them 35 | ``` 36 | bcbio_R1 = R1 = 86 or 64 bp transcript read 37 | bcbio_R2 = I1 = 8 bp part 1 of cell barcode 38 | bcbio_R3 = I2 = 8 bp sample (library) barcode 39 | bcbio_R4 = R2 = 14 bp = 8 bp part 2 of cell barcode + 6 bp of transcript UMI 40 | ``` 41 | - files in sc_mouse/input should be (KM here is project name): 42 | ``` 43 | KM_1.fq.gz 44 | KM_2.fq.gz 45 | KM_3.fq.gz 46 | KM_4.fq.gz 47 | ``` 48 | 49 | ## 4. Create `sc_mouse/config/sample_barcodes.csv` 50 | Check out if the sample barcodes provided match the actual barcodes in the data. 51 | ``` 52 | gunzip -c FC_X_3.fq.gz | awk '{if(NR%4 == 2) print $0}' | head -n 400000 | sort | uniq -c | sort -k1,1rn | awk '{print $2","$1}' | head 53 | 54 | AGGCTTAG,112303 55 | ATTAGACG,95212 56 | TACTCCTT,94906 57 | CGGAGAGA,62461 58 | CGGAGATA,1116 59 | CGGATAGA,944 60 | GGGGGGGG,852 61 | ATTAGACC,848 62 | ATTAGCCG,840 63 | ATTATACG,699 64 | ``` 65 | 66 | Sometimes you need to reverse complement sample barcodes: 67 | ``` 68 | cat barcodes_original.csv | awk -F ',' '{print $1}' | tr ACGTacgt TGCAtgca | rev 69 | ``` 70 | 71 | sample_barcodes.csv 72 | ``` 73 | TCTCTCCG,S01 74 | GCGTAAGA,S02 75 | CCTAGAGT,S03 76 | TCGACTAG,S04 77 | TTCTAGAG,S05 78 | ``` 79 | 80 | ## 5. 
Create `sc_mouse/config/sc-mouse.yaml` 81 | ``` 82 | details: 83 | - algorithm: 84 | cellular_barcode_correction: 1 85 | minimum_barcode_depth: 1000 86 | sample_barcodes: /full/path/sc_mouse/config/sample_barcodes.csv 87 | transcriptome_fasta: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.fa 88 | transcriptome_gtf: /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf 89 | umi_type: harvard-indrop-v3 90 | analysis: scRNA-seq 91 | description: PI_name 92 | files: 93 | - /full/path/sc_mouse/input/KM_1.fq.gz 94 | - /full/path/sc_mouse/input/KM_2.fq.gz 95 | - /full/path/sc_mouse/input/KM_3.fq.gz 96 | - /full/path/sc_mouse/input/KM_4.fq.gz 97 | genome_build: mm10 98 | metadata: {} 99 | fc_name: sc-mouse 100 | upload: 101 | dir: /full/path/sc_mouse/final 102 | ``` 103 | Use `cd sc_mouse/input; readlink -f *` to grab full path to each file and paste into yaml. 104 | 105 | ## 6. Create `sc_mouse/config/bcbio.sh` 106 | ``` 107 | #!/bin/bash 108 | 109 | # https://slurm.schedmd.com/sbatch.html 110 | 111 | #SBATCH --partition=priority # Partition (queue) 112 | #SBATCH --time=10-00:00 # Runtime in D-HH:MM format 113 | #SBATCH --job-name=km # Job name 114 | #SBATCH -c 20 115 | #SBATCH --mem-per-cpu=5G # Memory needed per CPU 116 | #SBATCH --output=project_%j.out # File to which STDOUT will be written, including job ID 117 | #SBATCH --error=project_%j.err # File to which STDERR will be written, including job ID 118 | #SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL) 119 | 120 | bcbio_nextgen.py ../config/sc-mouse.yaml -n 20 121 | ``` 122 | - most projects take < 5days, but some large 4 lane could take more, like 7-8 123 | 124 | ## 7. Run bcbio 125 | ``` 126 | cd sc_mouse_work 127 | sbatch ../config/bcbio.sh 128 | ``` 129 | 130 | ## 1a. (Optional). 131 | If you care, download fresh transcriptome annotation from Gencode (https://www.gencodegenes.org/mouse/) 132 | (it has chrom names with chr matching mm10 assembly). 133 | ``` 134 | cd sc_mouse/input 135 | wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz 136 | gunzip gencode.vM23.annotation.gtf.gz 137 | gffread -g /n/shared_db/bcbio/biodata/genomes/Mmusculus/mm10/seq/mm10.fa gencode.vM23.annotation.gtf -x gencode.vM23.annotation.cds.fa 138 | ``` 139 | update sc_mouse/config/sc_mouse.yaml: 140 | ``` 141 | transcriptome_fasta: gencode.vM23.annotation.cds.fa 142 | transcriptome_gtf: gencode.vM23.annotation.gtf 143 | ``` 144 | ## References 145 | - indrops3 library structure: https://singlecellcore.hms.harvard.edu/resources 146 | - [Even shorter guide](https://github.com/bcbio/bcbio-nextgen/blob/master/config/templates/indrop-singlecell.yaml) 147 | - [Much more comprehensive guide](https://github.com/hbc/tutorials/blob/master/scRNAseq/scRNAseq_analysis_tutorial/lessons/01_bcbio_run.md) 148 | -------------------------------------------------------------------------------- /scrnaseq/bibliography.md: -------------------------------------------------------------------------------- 1 | # Integration 2 | - [Seurat](https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8) 3 | - [Harmony](https://www.biorxiv.org/content/10.1101/461954v2) 4 | 5 | # More references 6 | 1. [A collection of resources from seandavi](https://github.com/seandavi/awesome-single-cell) 7 | 2. https://scrnaseq-course.cog.sanger.ac.uk/website/index.html 8 | 3. https://broadinstitute.github.io/2019_scWorkshop/index.html 9 | 4. 
https://github.com/SingleCellTranscriptomics/ISMB2018_SingleCellTranscriptomeTutorial 10 | 5. [Bibliography in bib](bcbio_sc.bib) 11 | -------------------------------------------------------------------------------- /scrnaseq/cite_seq.md: -------------------------------------------------------------------------------- 1 | 2 | - https://cite-seq.com 3 | - https://en.wikipedia.org/wiki/CITE-Seq 4 | - https://github.com/Hoohm/CITE-seq-Count 5 | - https://sites.google.com/site/fredsoftwares/products/cite-seq-counter 6 | -------------------------------------------------------------------------------- /scrnaseq/doublets.md: -------------------------------------------------------------------------------- 1 | # Doublet identification 2 | 3 | - gene number filter is not effective in identifying doublets (Scrublet2019 article). 4 | - there is no good unsupervised doublets detection method for now 5 | - DoubletIdentification works for a group of cells we suspect they might be doublets (a cluster or a group of clusters) - if we see mixed marker signature and we know that these cells are not in transitional state, i.e. expert review of clusters is needed before doublet deconvolution 6 | - dump counts from suspected counts from Seurat 7 | - identify doublets with Scrublet 8 | - get back to Seurat 9 | 10 | R based DoubletFinder and DoubletDecon have issues 11 | - https://github.com/chris-mcginnis-ucsf/DoubletFinder/issues/64 12 | - https://github.com/EDePasquale/DoubletDecon/issues/21 13 | 14 | -------------------------------------------------------------------------------- /scrnaseq/pub_quality_umaps.md: -------------------------------------------------------------------------------- 1 | # Here is a collaction of code for nice looking umaps from people in the core. Please add! 2 | 3 | 4 | ## Zhu's pretty white boxes 5 | 6 | 7 | 8 | ### Code 9 | 10 | **Note: Zhu says "The gist is to add cluster numbers to the ggplot data, then using LableClusters to plot."** 11 | 12 | ```R 13 | Idents(seurat_stroma_SCT) <- "celltype" 14 | 15 | p1 <- DimPlot(object = seurat_stroma_SCT, 16 | reduction = "umap", 17 | label = FALSE, 18 | label.size = 4, 19 | repel = TRUE) + xlab("UMAP 1") + ylab("UMAP 2") + labs(title="UMAP") 20 | 21 | # add a new column of clusterNo to ggplot data 22 | p1$data$clusterNo <- as.factor(sapply(strsplit(as.character(p1$data$ident), " "), "[", 1)) 23 | 24 | LabelClusters(plot = p1, id = "clusterNo", box = T, repel = F, fill = "white") 25 | ``` 26 | 27 | ## Noor's embedded labels 28 | 29 | 30 | 31 | ### Code 32 | 33 | 34 | ```R 35 | LabelClusters(p, id = "ident", fontface = "bold", size = 3, bg.colour = "white", bg.r = .2, force = 0) 36 | ``` 37 | 38 | 39 | -------------------------------------------------------------------------------- /scrnaseq/rstudio_sc_docker.md: -------------------------------------------------------------------------------- 1 | # Docker image with rstudio for single cell analysis 2 | 3 | ## Description 4 | 5 | Docker images for single cell analysis. 6 | 7 | All docker images contain an rstudio installation with some helpful packages for singlecell analysis. It also includes a conda environment to deal with necessary python packages (like umap-learn). 8 | 9 | Docker Rstudio images are obtained from [rocker/rstudio](https://hub.docker.com/r/rocker/rstudio). 
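For example, to grab one of the published single-cell images ahead of time (the tag shown is one of the available images listed further down):

```bash
# Pull a specific image/tag from Docker Hub before creating the container
docker pull vbarrerab/singlecell-base:R.4.0.3-BioC.3.11-ubuntu_20.04
```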
10 | 11 | ## R version and Bioconductor 12 | 13 | The R and Bioconductor versions are specified in the image name (along with the OS version): 14 | 15 | Example: 16 | `singlecell-base:R.4.0.3-BioC.3.11-ubuntu_20.04` 17 | 18 | ## Use 19 | 20 | `docker run -d -p 8787:8787 --name -e USER='rstudio' -e PASSWORD='rstudioSC' -e ROOT=TRUE -v :/home/rstudio/projects vbarrerab/)` 21 | 22 | `-e DISABLE_AUTH=true` option can be added to avoid Rstudio login prompt. Only use on local machine. 23 | 24 | This instruction will download and launch a container using the singlecell image. Once launch, it can be access through a web browser with the URL 8787:8787 or localhost:8787. 25 | 26 | ### Important parameters 27 | 28 | * -v option is mounting a folder from the host in the container. This allows for data transfer between the host and the container. **This can only be done when creating the container!** 29 | 30 | * --name assigns a name to the container. Helpful to keep thins tidy. 31 | * -e ROOT=TRUE options provides root access, in case more tweaking inside the container is necessary. 32 | * -p 8787: Change the local port to access the container. **This can only be done when creating the container!** 33 | * FYI: The working directory will be set as /home/rstudio, not /home/rstudio/projects as default behavior. 34 | 35 | ## Resources 36 | 37 | The dockerfile and other configuration files can be found on: 38 | 39 | https://github.com/vbarrera/docker_configuration 40 | 41 | The docker images: 42 | 43 | vbarrerab/singlecell-base 44 | 45 | ## Available images: 46 | 47 | - R.4.0.2-BioC.3.11-ubuntu_20.04 48 | - R.4.0.3-BioC.3.11-ubuntu_20.04 49 | 50 | **Important:** 51 | 52 | Docker changed its policies to only keep images that have been modified in the last 6 months. This means that previous images will eventually disappear. For previous versions. Check with availability with @vbarrera. 53 | 54 | # Bibliography 55 | 56 | Inspired by: 57 | 58 | https://www.r-bloggers.com/running-your-r-script-in-docker/ 59 | 60 | # Other resources 61 | Using Singularity Containers on the Cluster: https://docs.rc.fas.harvard.edu/kb/singularity-on-the-cluster/ 62 | -------------------------------------------------------------------------------- /scrnaseq/running_MAST.md: -------------------------------------------------------------------------------- 1 | # Running MAST 2 | 3 | [MAST](https://github.com/RGLab/MAST) analyzes differential expression using the cell as the unit of replication rather than the sample (as is done for pseduobulk) 4 | 5 | **NOTES** 6 | 7 | -MAST uses a hurdle model designed for zero heavy data. 8 | -MAST "expects" log transformed count data. 9 | -A [recent paper](https://doi.org/10.1038/s41467-021-21038-1) advises the use of sample id as a random factor to prevent pseudoreplication 10 | -Most MAST models include the total number of genes expressed in the cell. 11 | 12 | ## Where to run MAST? 13 | 14 | MAST can be run directly in Seurat (via [FindMarkers](https://satijalab.org/seurat/reference/findmarkers) or by itself. 15 | 16 | While MAST is easier to run with Seurat there are two big downsides: 17 | 18 | 1. Seurat does not log transform the data 19 | 2. You cannot edit the model with seurat, meaning that you cannot add sample ID or number of genes expressed. 
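For reference, the Seurat route mentioned above looks roughly like this (a sketch with placeholder idents; `seurat_obj` stands in for whatever clustered object you are working with). Because the hurdle model is fixed there, the rest of this doc runs MAST directly instead:

```r
library(Seurat)

# Seurat's wrapper around MAST: convenient, but you cannot add a random
# effect for sample ID or the ngeneson covariate to the model here
mast_de <- FindMarkers(
  object   = seurat_obj,      # placeholder: your Seurat object
  ident.1  = "condition_A",   # placeholder group labels
  ident.2  = "condition_B",
  test.use = "MAST"
)
```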
20 | 21 | ## Running MAST (1) - from seurat to a SCA object 22 | 23 | 24 | ```r 25 | # Seurat to SCE 26 | sce <- as.SingleCellExperiment(seurat_obj) 27 | 28 | # Add log counts 29 | assay(sce, "log") = log2(counts(sce) + 1) 30 | 31 | # Create new sce object (only 'log' count data) 32 | sce.1 = SingleCellExperiment(assays = list(log = assay(sce, "log"))) 33 | colData(sce.1) = colData(sce) 34 | 35 | # Change to SCA 36 | sca = SceToSingleCellAssay(sce.1) 37 | 38 | ``` 39 | 40 | ## Running MAST (2) - Filter SCA object 41 | 42 | Here we are only filtering for genes expressed in 10% of cells but this can be altered and other filters can be added. 43 | 44 | ```r 45 | expressed_genes <- freq(sca) > 0.1 46 | sca_filtered <- sca[expressed_genes, ] 47 | 48 | ``` 49 | 50 | ## Format SCA metadata 51 | 52 | We add the total number of genes expressed per cell as well as setting factors as factors and scaling all continuous variables as suggested by MAST. 53 | 54 | ```r 55 | cdr2 <- colSums(SummarizedExperiment::assay(sca_filtered)>0) 56 | 57 | SummarizedExperiment::colData(sca_filtered)$ngeneson <- scale(cdr2) 58 | SummarizedExperiment::colData(sca_filtered)$orig.ident <- factor(SummarizedExperiment::colData(sca_filtered)$orig.ident) 59 | SummarizedExperiment::colData(sca_filtered)$Gestational_age_scaled <- scale(SummarizedExperiment::colData(sca_filtered)$Gestational_age) 60 | ``` 61 | 62 | ## Run MAST 63 | 64 | This is the most computationally instensive step and takes the longest. 65 | Here our model includes the number of genes epxressed (ngeneson), sample id as a random variable ((1 | orig.ident)), Gender, and Gestational age scaled. 66 | 67 | We extract the results from our model for our factor of interest (Gestational_age_scaled) 68 | 69 | 70 | ```r 71 | zlmCond <- suppressMessages(MAST::zlm(~ ngeneson + Gestational_age_scaled + Gender + (1 | orig.ident), sca_filtered, method='glmer',ebayes = F,strictConvergence = FALSE)) 72 | summaryCond <- suppressMessages(MAST::summary(zlmCond,doLRT='Gestational_age_scaled')) 73 | ``` 74 | 75 | ## Format Results 76 | 77 | MAST results look quite different than DESeq2 results so we need to apply a bit of formatting to make them readable. 78 | 79 | After formatting outputs can be written directly to csv files. 80 | ```r 81 | summaryDt <- summaryCond$datatable 82 | 83 | # Create reable results table for all genes tested 84 | fcHurdle <- merge(summaryDt[contrast == "Gestational_age_scaled" 85 | & component == 'H', .(primerid, `Pr(>Chisq)`)], # This extracts hurdle p-values 86 | summaryDt[contrast == "Gestational_age_scaled" & component == 'logFC', 87 | .(primerid, coef, ci.hi, ci.lo)], 88 | by = 'primerid') # This extract LogFC data 89 | 90 | fcHurdle <- stats::na.omit(as.data.frame(fcHurdle)) 91 | 92 | fcHurdle$fdr <- p.adjust(fcHurdle$`Pr(>Chisq)`, 'fdr') 93 | 94 | 95 | # Create reable results table for significant genes 96 | fcHurdleSig <- merge(fcHurdle[fcHurdle$fdr < .05,], 97 | as.data.table(mcols(sca_filtered)), by = 'primerid') 98 | setorder(fcHurdleSig, fdr) 99 | 100 | ``` 101 | -------------------------------------------------------------------------------- /scrnaseq/running_doubletfinder.md: -------------------------------------------------------------------------------- 1 | # Running DoubletFinder 2 | 3 | 4 | [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) is one of the most popular doublet finding methods with over 1200 citations since 2019 (as of Sept 2023). 
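If you do not have DoubletFinder installed yet, it comes from the GitHub repo linked above (a minimal sketch):

```r
# DoubletFinder is installed from GitHub (not CRAN/Bioconductor)
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("chris-mcginnis-ucsf/DoubletFinder")
library(DoubletFinder)
```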
5 | 6 | ## Preparing to run DoubletFinder 7 | 8 | The key notes for running DoubletFinder are: 9 | - Each sample MUST be run separately 10 | - Various parameters can be tweaked in the run (see the DoubletFinder website for details), but the most critical is the prior value for the expected percentage of doublets. 11 | - DoubletFinder is not fast, so it is best to run it on O2 and save the output as an RDS file. 12 | 13 | 14 | ## Step 1 - Generate subsets 15 | 16 | Starting with your post-QC Seurat object, separate out each sample. Then make a list of these new objects and a vector of object names. 17 | 18 | ```r 19 | sR01 <- subset(x = seurat_qc, subset = orig.ident %in% c("R01")) 20 | sW01 <- subset(x = seurat_qc, subset = orig.ident %in% c("W01")) 21 | s3N00 <- subset(x = seurat_qc, subset = orig.ident %in% c("3N00")) 22 | subsets = list(sR01,sW01,s3N00) 23 | names = c('sR01','sW01',"s3N00") 24 | ``` 25 | 26 | ## Step 2 - Run loop 27 | 28 | This is the most computationally intensive step. Here we will loop through the list we created and run DoubletFinder. 29 | 30 | ```r 31 | for (i in seq_along(subsets)) { 32 | 33 | # SCT Transform and Run UMAP 34 | obj <- subsets[[i]] 35 | obj <- SCTransform(obj) 36 | obj <- RunPCA(obj) 37 | obj <- RunUMAP(obj, dims = 1:10) 38 | 39 | # Run DoubletFinder 40 | sweep.res.list_obj <- paramSweep_v3(obj, PCs = 1:10, sct = TRUE) 41 | sweep.stats_obj <- summarizeSweep(sweep.res.list_obj, GT = FALSE) 42 | bcmvn_obj <- find.pK(sweep.stats_obj) 43 | nExp_poi <- round(0.09*nrow(obj@meta.data)) ## Assuming a 9% doublet formation rate; can be changed. 44 | obj <- doubletFinder_v3(obj, PCs = 1:10, pN = 0.25, pK = 0.1, nExp = nExp_poi, reuse.pANN = FALSE, sct = TRUE) 45 | 46 | # Rename output columns from DoubletFinder to be consistent across samples for easy merging 47 | colnames(obj@meta.data)[c(22,23)] <- c("pANN","doublet_class") ## change column indices based on your own metadata size 48 | assign(names[i], obj) 49 | } 50 | ``` 51 | 52 | ## Step 3 - Merge DoubletFinder output and save 53 | 54 | After DoubletFinder is run, merge the processed objects into a single Seurat object and save that as an RDS. 55 | 56 | ```r 57 | obj_list <- mget(names) # the DoubletFinder-processed objects created with assign() in the loop above 58 | seurat_doublet <- merge(x = obj_list[[1]], y = obj_list[2:length(obj_list)]) 59 | saveRDS(seurat_doublet, file = "seurat_postQC_doubletFinder.rds") 60 | ``` 61 | 62 | ## Step 4 (optional) - Add doublet info to a pre-existing Seurat object for plotting 63 | 64 | If you have gone ahead and run most of the Seurat pipeline before running DoubletFinder, you can add the doublet information to any object for plotting on a UMAP 65 | 66 | ```r 67 | doublet_info <- seurat_doublet@meta.data$doublet_class 68 | names(doublet_info) <- colnames(x = seurat_doublet) 69 | seurat_norm <- AddMetaData(seurat_norm, metadata=doublet_info, col.name="doublet") 70 | ``` 71 | 72 | ## Step 5 - Remove doublets 73 | 74 | You can remove doublets from any Seurat object that has the doublet info (the column is `doublet_class` here, or `doublet` if you added it via Step 4). 75 | 76 | ```r 77 | seurat_qc_nodub <- subset(x = seurat_doublet, subset = doublet_class == "Singlet") 78 | saveRDS(seurat_qc_nodub, file = "seurat_qc_nodoublets.rds") 79 | ``` 80 | 81 | -------------------------------------------------------------------------------- /scrnaseq/saturation_qc.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Single Cell Quality control with saturation 3 | category: Single Cell 4 | --- 5 | 6 | People often ask how many cells they need to sequence in their next experiment.
7 | Saturation analysis helps to answer that question looking at the current experiment. 8 | Would adding more coverage to the current experiment result in getting more transcripts, 9 | genes, or in just more duplicated reads? 10 | 11 | First, use [from bcbio to single cell script](https://github.com/hbc/hbcABC/blob/master/inst/rmarkdown/Rscripts/singlecell/from_bcbio_to_singlecell.R) 12 | to load data from bcbio into SingleCellExperiment object. 13 | 14 | Second, use [this Rmd template](https://github.com/hbc/hbcABC/blob/master/inst/rmarkdown/templates/simple_qc_single_cell/skeleton/skeleton.rmd) 15 | to create report. 16 | -------------------------------------------------------------------------------- /scrnaseq/seurat_markers.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Seurat Markers 3 | description: This code is for finding Seurat markers 4 | category: research 5 | subcategory: scrnaseq 6 | tags: [differential_analysis] 7 | --- 8 | 9 | ```bash 10 | ssh -XY username@o2.hms.harvard.edu 11 | 12 | srun --pty -p interactive -t 0-12:00 --x11 --mem 128G /bin/bash 13 | 14 | module load gcc/6.2.0 R/3.4.1 hdf5/1.10.1 15 | 16 | R 17 | ``` 18 | 19 | ```r 20 | library(Seurat) 21 | library(tidyverse) 22 | 23 | set.seed(1454944673L) 24 | data_dir <- "data" 25 | seurat <- readRDS(file.path(data_dir, "seurat_tsne_all_res0.6.rds")) 26 | ``` 27 | 28 | Make sure the TSNEPlot looks as expected 29 | 30 | ```r 31 | TSNEPlot(seurat) 32 | ``` 33 | 34 | Check markers for any particular cluster against all others 35 | 36 | ```r 37 | cluster14_markers <- FindMarkers(object = seurat, ident.1 = 14, min.pct = 0.25) 38 | ``` 39 | 40 | Or look for markers of every cluster against all others 41 | 42 | ```r 43 | seurat_markers <- FindAllMarkers(object = seurat, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25) 44 | ``` 45 | 46 | >**NOTE:** The `seurat_markers` object with be a dataframe with the row names as Ensembl IDs; however, since row names need to be unique, if a gene is a marker for more than one cluster, then Seurat will add a number to the end of the Ensembl ID. Therefore, do not use the row names as the gene identifiers. Use the `gene` column. 
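For example, to pull the top few markers per cluster out of this data frame before saving (a sketch; the fold-change column is `avg_logFC` in older Seurat versions and `avg_log2FC` in Seurat >= 4):

```r
# dplyr is already loaded via tidyverse above
top5_markers <- seurat_markers %>%
  group_by(cluster) %>%
  top_n(n = 5, wt = avg_logFC)   # use wt = avg_log2FC with Seurat >= 4
```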
47 | 48 | Save the markers for report generation 49 | 50 | ```r 51 | saveRDS(seurat_markers, "data/seurat_markers_all_res0.6.rds") 52 | ``` 53 | -------------------------------------------------------------------------------- /scrnaseq/tinyatlas.md: -------------------------------------------------------------------------------- 1 | - [Mouse brain markers](https://www.brainrnaseq.org/) 2 | - [Mouse and human markers](https://panglaodb.se) 3 | - https://bioconductor.org/packages/release/bioc/html/SingleR.html 4 | - [tinyatlas](https://github.com/hbc/tinyatlas) 5 | - https://www.flyrnai.org/tools/biolitmine/web/ 6 | - [human blood](http://scrna.sklehabc.com/), [article](https://academic.oup.com/nsr/article/8/3/nwaa180/5896476) 7 | - [human skin](https://www.nature.com/articles/s42003-020-0922-4/figures/1) 8 | - [A big atlas from Hemberg's lab](https://scfind.sanger.ac.uk/) 9 | -------------------------------------------------------------------------------- /scrnaseq/tutorials.md: -------------------------------------------------------------------------------- 1 | - [HSPH](https://github.com/hbc/tutorials/blob/master/scRNAseq/scRNAseq_analysis_tutorial/README.md) 2 | - [HBC's tutorials for scRNA-seq analysis workflows](https://github.com/hbc/tutorials/tree/master/scRNAseq/scRNAseq_analysis_tutorial) 3 | - [Seurat, Satija lab](https://satijalab.org/seurat/vignettes.html) 4 | - [Hemberg lab, Cambridge](https://scrnaseq-course.cog.sanger.ac.uk/website/index.html) 5 | - [Broad](https://broadinstitute.github.io/2019_scWorkshop/) 6 | - [DS pseudobulk edgeR](http://biocworkshops2019.bioconductor.org.s3-website-us-east-1.amazonaws.com/page/muscWorkshop__vignette/) 7 | - [MIG2019](https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/public/index.html) 8 | - [OSCA, Bioconductor](https://bioconductor.org/books/release/OSCA/) 9 | -------------------------------------------------------------------------------- /scrnaseq/velocity.md: -------------------------------------------------------------------------------- 1 | RNA Velocity analysis is a trajectory analysis based on spliced/unspliced RNA ratio. 2 | 3 | It is quite popular https://www.nature.com/articles/s41586-018-0414-6, 4 | however, the original pipeline is not well supported: 5 | https://github.com/velocyto-team/velocyto.R/issues 6 | 7 | There is a new one from kallisto team: 8 | https://bustools.github.io/BUS_notebooks_R/velocity.html 9 | 10 | # 1. Install R4.0 (development version) on O2 11 | - module load gcc/6.2.0 12 | - installed R-devel: https://www.r-bloggers.com/r-devel-in-parallel-to-regular-r-installation/ 13 | because one of the packages wanted R4.0 14 | - configure R with `./configure --enable-R-shlib` for rstudio 15 | - remove conda from PATH to avoid using its libcurl 16 | - module load boost/1.62.0 17 | - module load hdf5/1.10.1 18 | - installing velocyto.R: https://github.com/velocyto-team/velocyto.R/issues/86 19 | 20 | # 2. Install velocyto.R with R3.6.3 (Fedora 30 example) 21 | bash: 22 | ``` 23 | sudo dnf update R 24 | sudo dnf install boost boost-devel hdf5 hdf5-devel 25 | git clone https://github.com/velocyto-team/velocyto.R 26 | ``` 27 | rstudio/R: 28 | ``` 29 | BiocManager::install("pcaMethods") 30 | setwd("/where/you/cloned/velocyto.R") 31 | devtools::install_local("velocyto.R") 32 | ``` 33 | 34 | # 3. 
Generate reference files 35 | - `Rscriptdev `[01_get_velocity_files.R](https://github.com/naumenko-sa/crt/blob/master/velocity/01_get_velocity_files.R) 36 | - output: 37 | ``` 38 | cDNA_introns.fa 39 | cDNA_tx_to_capture.txt 40 | introns_tx_to_capture.txt 41 | tr2g.tsv 42 | ``` 43 | 44 | # 4. Index reference 45 | This step takes ~1-2h and 100G or RAM: 46 | `sbatch `[02_kallisto_index.sh](https://github.com/naumenko-sa/crt/blob/master/velocity/02_kallisto_index.sh) 47 | 48 | - inDrops3 support: https://github.com/BUStools/bustools/issues/4 49 | 50 | # 5. Split reads by sample with barcode_splitter 51 | 52 | - merge reads from multiple flowcells first 53 | - https://pypi.org/project/barcode-splitter/ 54 | ``` 55 | barcode_splitter --bcfile samples.tsv Undetermined_S0_L001_R1.fastq Undetermined_S0_L001_R2.fastq Undetermined_S0_L001_R3.fastq Undetermined_S0_L001_R4.fastq --idxread 3 --suffix .fq 56 | ``` 57 | 58 | kallisto bus counting procedure works on per sample basis, so we need to split samples to separate fastq files, and merge samples across lanes. 59 | 60 | - [split_barcodes.sh](https://github.com/naumenko-sa/crt/blob/master/velocity/03_split_barcodes.sh) 61 | 62 | # 6. Count spliced and unspliced transcripts 63 | - [kallisto_count](https://github.com/naumenko-sa/crt/blob/master/velocity/04_kallisto_count.sh) 64 | - output: 65 | ``` 66 | spliced.barcodes.txt 67 | spliced.genes.txt 68 | spliced.mtx 69 | unspliced.barcodes.txt 70 | unspliced.genes.txt 71 | unspliced.mtx 72 | ``` 73 | 74 | # 7. Create Seurat objects for every sample 75 | - [create_seurat_sample.Rmd](https://github.com/naumenko-sa/crt/blob/master/velocity/05.create_seurat_sample.Rmd) 76 | - also removes empty droplets 77 | 78 | # 8. Merge seurat objects 79 | - [merge_seurats](https://github.com/naumenko-sa/crt/blob/master/velocity/06.merge_seurats.Rmd) 80 | 81 | # 9. Velocity analysis 82 | - [velocity_analysis](https://github.com/naumenko-sa/crt/blob/master/velocity/07.velocity_analysis.Rmd) 83 | 84 | # 10. Plot velocity picture 85 | - [plot_velocity](https://github.com/naumenko-sa/crt/blob/master/velocity/08.plot_velocity.Rmd) 86 | 87 | # 11. Repeat marker analysis 88 | - [velocity_markers](https://github.com/naumenko-sa/crt/blob/master/velocity/09.velocity_markers.Rmd) 89 | 90 | # 11. References 91 | - https://www.kallistobus.tools/tutorials 92 | - https://github.com/satijalab/seurat-wrappers/blob/master/docs/velocity.md 93 | - [preprocessing influences velociy analysis](https://www.biorxiv.org/content/10.1101/2020.03.13.990069v1) 94 | 95 | # Velocity analysis in Python: 96 | - http://velocyto.org/ 97 | - https://github.com/pachterlab/MBGBLHGP_2019/blob/master/Figure_3_Supplementary_Figure_8/supp_figure_9.ipynb 98 | -------------------------------------------------------------------------------- /scrnaseq/write10Xcounts.md: -------------------------------------------------------------------------------- 1 | ## Need to go from counts in a Seurat object to the 10X format? 2 | 3 | Recently found that some tools like Scrublet (doublet detection), require scRNA-seq counts to be in 10X format. 4 | * barcodes.tsv 5 | * genes.tsv 6 | * matrix.mtx 7 | 8 | How to do this easily? 
9 | 10 | ```r 11 | library(Seurat) 12 | library(tidyverse) 13 | library(rio) 14 | library(DropletUtils) 15 | 16 | # Read in Seurat object 17 | seurat_stroma <- readRDS("./seurat_stroma_replicatePaper.rds") 18 | 19 | # Output data 20 | write10xCounts(x = seurat_stroma@assays$RNA@counts, path = "./cell_ranger_data_format_test") 21 | 22 | ``` 23 | -------------------------------------------------------------------------------- /scrnaseq/zinbwaver.md: -------------------------------------------------------------------------------- 1 | # When asked why don't you use zinbwave 2 | 3 | - [zinbwave-deseq2-comparison.Rmd](https://github.com/roryk/zinbwave-deseq2-indrop/blob/master/zinbwave-deseq2-comparison.Rmd) 4 | - [Soneson2018](https://experiments.springernature.com/articles/10.1038/nmeth.4612) 5 | - [Original ZINB](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1406-4) 6 | - [Crowell2020](https://www.biorxiv.org/content/10.1101/713412v2) 7 | - [Droplet scRNA-seq is not zero-inflated](https://www.nature.com/articles/s41587-019-0379-5) 8 | 9 | -------------------------------------------------------------------------------- /training/mkdocs.md: -------------------------------------------------------------------------------- 1 | # MK-Docs 2 | 3 | Basic prerequisites 4 | -Python 5 | -Github CLI 6 | -VS Code 7 | 8 | 9 | Download Visual Studio Code 10 | 11 | https://code.visualstudio.com/download 12 | 13 | 1. Open new terminal within VS-Code 14 | 2. Navigate to where you want to host the GitHub repo. 15 | 3. Create a new GitHub repo you want to work on or clone an existing one: git clone link_to_repo.git 16 | 4. Cd to the GitHub repo folder 17 | 5. Lets make a virtual python environment: python -m venv venv 18 | 6. Source the python environment you just created: source venv/bin/activate 19 | 7. Need pip installed check: pip —version 20 | 8. Install mkdocs: pip install mkdocs-material 21 | 9. Open visual code from here: code . 22 | 10. Open terminal within vscode 23 | 11. To open up the website mkdocs serve 24 | 12. To change to the “material theme” open mkdocs.yml file and below site_name: My Docs, type 25 | site_name: My Docs 26 | theme: 27 | name: material 28 | 13. Save it 29 | 14. To deploy: type mkdocs serve on terminal. It will restart in the same host. 30 | 15. We can change the appearance of the site by editing and adding plugins to mkdocs.yml (google for common settings) 31 | 16. To add another page. Go inside docs add another nage: eg page2.md 32 | 17. Add 33 | # Page 2 34 | 35 | ## sub heading 36 | 37 | Text inside 38 | 18. Save it 39 | 19. In the terminal type: git add . 40 | 20. 
Git commit -m $’updating instructor’ 41 | git push origin main 42 | git config http.postBuffer 524288000 #(if error comes up related to html) 43 | git pull && git push 44 | -------------------------------------------------------------------------------- /variants/clonal_evolution.md: -------------------------------------------------------------------------------- 1 | # Variant analysis 2 | 3 | - [SciClone](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003665) 4 | - [FishPlot](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3195-z) 5 | - [PhylogicNDT](https://github.com/broadinstitute/PhylogicNDT) 6 | -------------------------------------------------------------------------------- /wgs/crispr-offtarget.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | This guide is how to call offtarget edits in a CRISPR edited genome. This is 3 | pretty easy to and only takes a few steps. First, we need to figure out what is 4 | different between the CRISPR edited samples and the (hopefully they gave you 5 | these) control samples. Then we need to find a set of predicted off-target 6 | CRISPR sites. Finally, once we know what is different, we need to overlap the 7 | differences with predicted off-target sites, allowing some mismatches. Then we 8 | can report the overall differences and differences that could be due to 9 | offtarget edits. 10 | 11 | # Call CRISPR-edited specific variants 12 | You want to call edits that are in the CRISPRed sample but not the unedited 13 | sample. You can do that by plugging into the tumor-normal calling part of bcbio 14 | and pretending the CRISPR-edited sample is a tumor sample and the non-edited 15 | sample is a normal sample. 16 | 17 | To get tumor-normal calling to work you need to use a variant caller that 18 | can handle that, I recommend mutect2. 19 | 20 | To tell bcbio that a pair of samples is a tumor-normal pair you need to 21 | 22 | 1. Put the tumor and normal sample in the same **batch** by setting **batch** in the metadata to the same batch. 23 | 2. Set **phenotype** of the CRISPR-edited sample to **tumor**. 24 | 3. Set the **phenotype** of the non-edited sample to **normal**. 25 | 26 | And kick off the **variant2** pipeline, the normal whole genome sequencing pipeline. An example YAML template is below: 27 | 28 | ```yaml 29 | --- 30 | details: 31 | - analysis: variant2 32 | genome_build: hg38 33 | algorithm: 34 | aligner: bwa 35 | variantcaller: mutect2 36 | tools_on: [gemini] 37 | ``` 38 | 39 | And an example metadata file: 40 | 41 | ```csv 42 | samplename,description,batch,phenotype,sex,cas9,gRNA 43 | Hs27_HSV1.cram,Hs27_HSV1,noCas9_nogRNA,normal,male,no,yes 44 | Hs27_HSV1_Cas9.cram,Hs27_HSV1_Cas9,noCas9,normal,male,yes,no 45 | Hs27_HSV1_UL30_5.cram,Hs27_HSV1_UL30_5,noCas9_nogRNA,tumor,male,yes,yes 46 | Hs27_HSV1_UL30_5_repeat.cram,Hs27_HSV1_UL30_5_repeat,noCas9,tumor,male,yes,yes 47 | ``` 48 | 49 | # Find predicted off-target sites 50 | There are several tools to do this, a common one folks use is cas-offinder, so 51 | that is what we will use. There is a [web app](http://www.rgenome.net/cas-offinder/) but it will only return 1,000 events per 52 | class. Usually this is fine, but if you allow bulges you can get a lot more offtarget sites so you might bump into this limit. 
# Find predicted off-target sites
There are several tools to do this; a common one folks use is cas-offinder, so
that is what we will use. There is a [web app](http://www.rgenome.net/cas-offinder/) but it will only return 1,000 events per
class. Usually this is fine, but if you allow bulges you can get many more off-target sites and might bump into this limit.

First install cas-offinder; there is a conda package, so this is easy:

```bash
conda create -n crispr -c bioconda cas-offinder
```

There is a companion python wrapper, cas-offinder-bulge, that can also predict
off-target sites taking bulges into account. You can download it
[here](https://raw.githubusercontent.com/hyugel/cas-offinder-bulge/master/cas-offinder-bulge) if
you need to do that.

You'll need to know the sequence of one or more guides you want to check. You will also need to know
the PAM sequence for the endonuclease that is being used.

You can run cas-offinder like this (the `C` tells it to run on the CPU):

```bash
cas-offinder input.txt C output.txt
```

where input.txt has this format:

```
hg38.fa
NNNNNNNNNNNNNNNNNNNNNNNGRRT
ACACGTGAAAGACGGTGACGGNNGRRT 6
```

`hg38.fa` is the path to the FASTA file of the hg38 genome. The second line is a run of `N`s the length of the guide sequence you are interested in, with the PAM sequence tacked on the end.
The third line is the guide sequence itself, again with the PAM sequence tacked on the end; the `6` is the number of mismatches you are allowing, so cas-offinder will report sites with that many
or fewer mismatches.

If you want to look for bulges, use cas-offinder-bulge with this format:

```
hg38.fa
NNNNNNNNNNNNNNNNNNNNNNNGRRT 2 1
ACACGTGAAAGACGGTGACGGNNGRRT 6
```

Here the `2` sets the allowed DNA bulge and the `1` the allowed RNA bulge. You can look for one or the other, neither, or both.

After you run cas-offinder you can convert the output to a sorted BED file for intersecting with your variants:

```bash
cat output.txt | sed 1d | awk '{printf("%s\t%s\t%s\n",$4, $5-10,$5+10)}' | sort -V -k 1,1 -k2,2n > output.bed
```

# Overlap variants
Finally, use the BED file of predicted off-target sites to pull out possible off-target variant calls:

```bash
bedtools intersect -header -u -a noCas9_nogRNA-mutect2-annotated.vcf.gz -b output.bed
```

And you are done!

# More tools
[CRISPResso2](https://github.com/pinellolab/CRISPResso2)

-------------------------------------------------------------------------------- /wgs/pacbio_genome_assembly.md: --------------------------------------------------------------------------------

---
tags:
title: Genome Assembly Using PacBio Reads Only
author: Zhu Zhuo
created: '2019-09-13'
---

# Genome Assembly Using PacBio Reads Only

This tutorial is based on a bacterial genome assembly project using PacBio sequencing reads only, but it can be followed for genome assembly of other species, or with other types of long reads, with little or no modification.

## Demultiplex

If the sequencing core hasn't demultiplexed the data, [`lima`](https://github.com/PacificBiosciences/barcoding) can be used for demultiplexing; a minimal invocation is sketched below.
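A minimal `lima` run could look like the sketch below. The file names are placeholders, and the right options depend on your barcoding design (for example, whether both ends carry the same barcode), so check the lima documentation before running it.

```bash
# Demultiplex subreads by barcode -- file names are placeholders
# (add the barcode-pairing / output-splitting options appropriate for your design)
lima movie.subreads.bam barcodes.fasta demux.subreads.bam
```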
## Convert `bam` file to `fastq` file

`subreads.bam` files contain the subread data, and we will convert them from `bam` format to `fastq` format, as most assemblers take `fastq` as input.

[A note on the output from PacBio:](https://pacbiofileformats.readthedocs.io/en/5.1/Primer.html)
> Unaligned BAM files representing the subreads will be produced natively by the PacBio instrument. The subreads BAM will be the starting point for secondary analysis. In addition, the scraps arising from cutting out adapter and barcode sequences will be retained in a `scraps.bam` file, to enable reconstruction of HQ regions of the ZMW reads, in case the customer needs to rerun barcode finding with a different option.

Below is an example slurm script using `bedtools bamtofastq` to convert `bam` to `fastq`.

```
#!/bin/sh
#SBATCH -p medium
#SBATCH -J bam2fq
#SBATCH -o %x_%j.o
#SBATCH -e %x_%j.e
#SBATCH -t 00-23:59:00
#SBATCH -c 2
#SBATCH --mem=4G
#SBATCH --array=1-n%5  # change n to the number of subreads.bam files

module load bedtools/2.27.1

files=(/path/to/*subreads.bam)  # all subreads.bam files to convert
file=${files[$SLURM_ARRAY_TASK_ID-1]}
sample=`basename $file .bam`

echo $file
echo $sample

bedtools bamtofastq -i $file -fq $sample".fq"
```

The PacBio Sequel sequencer reports all base qualities as PHRED 0 (ASCII `!`), so the quality scores for Sequel data are all `!` in the generated `fastq` files.

## Genome assembly

### Using Canu for genome assembly

[Canu](https://github.com/marbl/canu) does correction, trimming and assembly in a single command. Follow its GitHub page to install the software to a location of your preference.

An example slurm script for a single sample:
```
#!/bin/sh
#SBATCH -p priority
#SBATCH -J canu
#SBATCH -o %x_%j.o
#SBATCH -e %x_%j.e
#SBATCH -t 0-23:59:00
#SBATCH -c 1
#SBATCH --mem=1G

module load java/jdk-1.8u112

export PATH=/path/to/canu:$PATH

canu -p sampleName -d sampleName genomeSize=7m \
    gridOptions="--time=1-23:59:00 --partition=medium" \
    -pacbio-raw /path/to/converted.fq
```
This script is only for the 'master' job, so 1 CPU and 1 GB of memory should be sufficient. Canu will evaluate the resources available and automatically submit jobs to the queue given in `gridOptions`.

_Note_: An alternative is to install the bioconda recipe, but the conda version is not up to date and some additional parameters may need to be specified in the command.

### Using Unicycler for bacterial genome assembly

Follow the [Unicycler](https://github.com/rrwick/Unicycler#method-long-read-only-assembly) instructions. [Racon](https://github.com/isovic/racon) is also required and should be installed before running Unicycler.

```
module load gcc/6.2.0 python/3.6.0
git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
python3 setup.py install --user
```
An example slurm script:
```
#!/bin/sh
#SBATCH -p priority
#SBATCH -J unicycler
#SBATCH -o %x_%j.o
#SBATCH -e %x_%j.e
#SBATCH -t 29-23:59:00
#SBATCH -c 20
#SBATCH --mem=200G

module load gcc/6.2.0 python/3.6.0 bowtie2/2.3.4.3 samtools/1.9 blast/2.6.0+
export PATH=/path/to/racon:$PATH

/path/to/unicycler-runner.py -l /path/to/fastq -t 20 -o sample
```

## Assembly quality

### Basic assembly metrics

Download [Quast](http://bioinf.spbau.ru/quast) for basic assembly metrics, such as total length, number of contigs, and N50.

`/path/to/quast.py -o output_folder -t 6 assembly.fa`
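Quast can also take several assemblies at once and report their metrics side by side, which is a convenient way to compare the Canu and Unicycler results. The file names below (`sampleName.contigs.fasta` from Canu, `assembly.fasta` from Unicycler) are what the two assemblers typically produce with the commands above, but verify them against your actual output directories.

```bash
# Compare the Canu and Unicycler assemblies in a single Quast report
# (output paths are assumptions -- check what your runs actually produced)
/path/to/quast.py -o quast_comparison -t 6 \
    sampleName/sampleName.contigs.fasta \
    sample/assembly.fasta
```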
### Assembly completeness

Use [BUSCO](https://busco.ezlab.org/) for evaluating the completeness of the genome assembly. BUSCO has a lot of dependencies, so it is better to install it via conda.

```
source activate conda-env  # activate a conda environment; if you don't have one, you may need to create it first
conda install -c bioconda busco
conda deactivate  # deactivate the conda environment
```
Download the BUSCO database for the species and run BUSCO:

`run_busco -i assembly.fa -o output_folder -l species_odb -m geno`

BUSCO is also the abbreviation for Benchmarking Universal Single-Copy Orthologs, which are single-copy orthologs found in >90% of species. The more BUSCOs are present, the more complete the genome assembly is.
--------------------------------------------------------------------------------