├── .Rbuildignore ├── .github ├── .gitignore └── workflows │ └── publish.yml ├── .gitignore ├── 01-setup-course.qmd ├── 02-scripting-basics.qmd ├── 03-cloud-computing-basics.qmd ├── 04-doing-work-with-dx-run.qmd ├── 05-batch-processing.qmd ├── 06-JSON.qmd ├── 07-containers.qmd ├── 09-from-hpc.qmd ├── README.md ├── _quarto.yml ├── appendix.qmd ├── book.bib ├── building-apps.qmd ├── cover.png ├── data ├── batch-on-worker.sh ├── dx-find-data-class.sh ├── dx-find-data-field.sh ├── dx-find-data-name.sh ├── dx-find-path.sh └── dx-find-xargs.sh ├── docs ├── 404.html ├── cloud-computing-basics.html ├── containers-and-reproducibility.html ├── index.html ├── lets-go-on-the-cloud.html ├── libs │ ├── anchor-sections-1.1.0 │ │ ├── anchor-sections-hash.css │ │ ├── anchor-sections.css │ │ └── anchor-sections.js │ ├── gitbook-2.6.7 │ │ ├── css │ │ │ ├── fontawesome │ │ │ │ └── fontawesome-webfont.ttf │ │ │ ├── plugin-bookdown.css │ │ │ ├── plugin-clipboard.css │ │ │ ├── plugin-fontsettings.css │ │ │ ├── plugin-highlight.css │ │ │ ├── plugin-search.css │ │ │ ├── plugin-table.css │ │ │ └── style.css │ │ └── js │ │ │ ├── app.min.js │ │ │ ├── clipboard.min.js │ │ │ ├── jquery.highlight.js │ │ │ ├── plugin-bookdown.js │ │ │ ├── plugin-clipboard.js │ │ │ ├── plugin-fontsettings.js │ │ │ ├── plugin-search.js │ │ │ └── plugin-sharing.js │ ├── header-attrs-2.12 │ │ └── header-attrs.js │ └── jquery-3.6.0 │ │ └── jquery-3.6.0.min.js ├── reference-keys.txt ├── search_index.json ├── shell-scripting-basics.html ├── shell-scripting-for-dnanexus-applets.html └── style.css ├── images ├── app_structure.png ├── applet-build-process.png ├── binder2.png ├── binder3.png ├── binder_screen.png ├── cog-sci.png ├── dnanexus_players.png ├── docker_terminology.png ├── hpc_players.png ├── indirect.png ├── instance_checklist.png ├── json_visualizer.png ├── local_compute.png ├── mybinder_launch.png ├── order_operations.png ├── storage.png ├── terminal.png ├── two_scripts.png └── ukb_application.png ├── index.qmd ├── index_files └── libs │ ├── bootstrap │ ├── bootstrap-icons.css │ ├── bootstrap-icons.woff │ ├── bootstrap.min.css │ └── bootstrap.min.js │ ├── clipboard │ └── clipboard.min.js │ └── quarto-html │ ├── anchor.min.js │ ├── popper.min.js │ ├── quarto-syntax-highlighting.css │ ├── quarto.js │ ├── tippy.css │ └── tippy.umd.min.js ├── json_data ├── dxapp.json ├── example.json ├── fastqc.json ├── job.json └── rap-jobs.json ├── preamble.tex ├── references.bib ├── references.qmd ├── renv.lock ├── renv ├── .gitignore ├── activate.R └── settings.dcf ├── shell_for_bioinformatics.Rproj ├── style.css ├── summary.qmd └── using-jupyterlabs.qmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^\.github$ 2 | -------------------------------------------------------------------------------- /.github/.gitignore: -------------------------------------------------------------------------------- 1 | *.html 2 | -------------------------------------------------------------------------------- /.github/workflows/publish.yml: -------------------------------------------------------------------------------- 1 | name: Publish Website 2 | 3 | # Allow one concurrent deployment 4 | concurrency: 5 | group: "pages" 6 | cancel-in-progress: true 7 | 8 | on: 9 | push: 10 | branches: ['main'] 11 | 12 | jobs: 13 | quarto-publish: 14 | name: Publish with Quarto 15 | runs-on: ubuntu-latest 16 | steps: 17 | - name: Checkout repository 18 | uses: actions/checkout@v3 19 | - run: sudo apt install libcurl4-openssl-dev 
libssl-dev 20 | - uses: r-lib/actions/setup-r@v2 21 | - uses: r-lib/actions/setup-renv@v2 22 | with: 23 | cache-version: 1 24 | - name: Install Quarto 25 | uses: quarto-dev/quarto-actions/setup@v2 26 | - name: install jupyter 27 | uses: actions/setup-python@v4 28 | with: 29 | python-version: '3.9' 30 | - run: pip install jupyter 31 | - name: Publish to GitHub Pages 32 | uses: quarto-dev/quarto-actions/publish@v2 33 | with: 34 | target: gh-pages 35 | 36 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | 6 | /.quarto/ 7 | _book/ 8 | _freeze/ -------------------------------------------------------------------------------- /01-setup-course.qmd: -------------------------------------------------------------------------------- 1 | # Setup for the Course / dx-toolkit basics 2 | 3 | In this chapter, we'll setup our DNAnexus account, start a project, get files into it, and run a job with those files. 4 | 5 | This is meant to be a whirlwind tour - we'll expand on more information about each of these steps in further chapters. 6 | 7 | ## Setup your DNAnexus Account 8 | 9 | First, create an account at . You'll need your login and password to interact with the platform. 10 | 11 | If you are not a registered customer with DNAnexus, you will have to [set up your billing by adding a credit card](https://documentation.dnanexus.com/admin/billing-and-account-management). 12 | 13 | I know that money is tight for everyone, but everything we'll do in this course should cost no more than $5-10 in compute time. 14 | 15 | ## Terminal setup / dx-toolkit setup 16 | 17 | We'll be running all of these scripts on our own machine. We'll be using the command-line for most of these. 18 | 19 | If you are on Linux/Mac, you'll be working with the terminal. 20 | If you are on Windows, I recommend you install [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/install), and specifically the Ubuntu distribution. That will give you a command-line shell that you can use to interact with the DNAnexus platform. 21 | 22 | On your machine, I recommend using a text editor to edit the scripts in your terminal. Good ones include [VS Code](https://code.visualstudio.com/), or built in ones such as `nano`. 23 | 24 | Now that we have a terminal and code editor, we can install the [dx-toolkit](https://documentation.dnanexus.com/downloads) onto our machine. In your terminal, you'll first need to make sure that python 3 is installed, and the `pip` installer is installed as well. 25 | 26 | ## On Ubuntu 27 | 28 | You should have a python3 installation - check by typing in: 29 | 30 | `which python3` 31 | 32 | If you get a blank response, then you'll need to install it. 33 | 34 | ``` 35 | sudo apt-get install python3 ## If python is not yet installed 36 | sudo apt-get install pip3 37 | sudo apt-get install git 38 | sudo apt-get install jq 39 | ``` 40 | 41 | :::{.callout-note} 42 | ## What about `miniconda`? 43 | 44 | Installing `miniconda` (the minimal [Anaconda installer](https://docs.conda.io/en/latest/miniconda.html)) has some advantages, but does require some learning, especially about conda environments. 
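If you do go the miniconda route, a minimal sketch of installing the dx-toolkit inside a dedicated environment looks something like this (the environment name and Python version are arbitrary examples, not requirements):

```
conda create -n dnanexus python=3.9   # create an isolated environment
conda activate dnanexus               # switch into it
pip install dxpy                      # install the dx-toolkit inside that environment
```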
45 | 46 | Here's a link that discusses using conda environments and how you can use them to install software on your computer: 47 | 48 | The short of it is that you define a *conda environment*, which is a space that lets you install R or Python packages and other software dependencies. For example, I could have an environment that uses Python 2.8, and another one that uses Python 3.9, and I could switch between them using a command called `conda activate`. 49 | ::: 50 | 51 | 52 | 53 | ## On Macs 54 | 55 | Install [homebrew](https://brew.sh/) to your mac and install `python3` and `pip3`: 56 | 57 | ``` 58 | brew install python3 59 | brew install pip3 60 | brew install git 61 | brew install jq 62 | ``` 63 | 64 | ## For both machines 65 | 66 | Once you have access to Python 3 and `pip3`, you can install the dx-toolkit using the following command: 67 | 68 | ``` 69 | pip3 install dxpy 70 | ``` 71 | 72 | That last command will install the `dx-toolkit` to your machine, which are the command line tools you'll need to work on the DNAnexus cloud. 73 | 74 | Test it out by typing: 75 | 76 | ``` 77 | dx -h 78 | ``` 79 | 80 | You should get similar output: 81 | 82 | ``` 83 | usage: dx [-h] [--version] command ... 84 | 85 | DNAnexus Command-Line Client, API v1.0.0, client v0.330.0 86 | 87 | dx is a command-line client for interacting with the DNAnexus platform. You can log in, 88 | navigate, upload, organize and share your data, launch analyses, and more. For a quick tour 89 | of what the tool can do, see 90 | 91 | https://documentation.dnanexus.com/getting-started/tutorials/cli-quickstart#quickstart-for-> 92 | 93 | For a breakdown of dx commands by category, run "dx help". 94 | ``` 95 | 96 | ## Alternative Setup: binder.org 97 | 98 | If you aren't able to install the dx-toolkit to your machine, you can use this Binder link to try out the commands. Binder opens a preinstalled image with a shell that has `dxpy` preinstalled on one of the servers. 99 | 100 | 101 | 102 | When you launch it, you will first see this page: 103 | 104 | ![Launch Binder Screen](images/binder_screen.png) 105 | 106 | Then after a few moments (hopefully no more than 5 minutes), the JupyterLab interface should launch. Select the "Terminal" from the bottom of the launcher: 107 | 108 | ![JupyterLab Interface](images/binder2.png) 109 | 110 | Once you select the terminal, this is what you should see: 111 | 112 | ![JupyterLab Terminal](images/binder3.png) 113 | 114 | Now you're ready to get started. Login (@sec-login) and proceed from there. 115 | 116 | Just keep in mind that this shell is ephemeral - it will disappear. So make sure that any files you create that you want to save are either uploaded back to your project with `dx upload` or you've downloaded them using the file explorer. 117 | 118 | This shell includes the following utilities: 119 | 120 | - `git` (needed to download course materials) 121 | - `nano` (needed to edit files) 122 | - `dxpy` (the dx-toolkit) 123 | - `python`/`pip` (needed to install dx-toolkit) 124 | - `jq` (needed to work with JSON files) 125 | 126 | ## Clone the files and scripts using git 127 | 128 | ### On Your Own Computer 129 | 130 | On your own computer, clone the repo (it will already be cloned if you're in the binder version). 131 | 132 | ```{bash} 133 | #| eval: false 134 | 135 | git clone https://github.com/laderast/bash_bioinfo_scripts/ 136 | ``` 137 | 138 | 139 | This will create a folder called `/bash_bioinfo_scripts/` in your current directory. 
Change to it: 140 | 141 | ``` 142 | cd bash_bioinfo_scripts 143 | ``` 144 | 145 | ### In the binder 146 | 147 | You will already be in the `bash_bioinfo_scripts/` folder. 148 | 149 | ## Try logging in {#sec-login} 150 | 151 | Now that you have an account and the `dx-toolkit` installed, try logging in with `dx login`: 152 | 153 | ```{bash} 154 | #| eval: false 155 | dx login 156 | ``` 157 | 158 | The platform will then ask you for your username and password. Enter them. 159 | 160 | If you are successful, you will see either the select screen or, if you only have one project, that project will be selected for you. 161 | 162 | ## Super Quick Intro to dx-toolkit 163 | 164 | The dx-toolkit is our main tool for interacting with the DNAnexus platform on the command-line. It handles the following: 165 | 166 | - Creating a project and managing membership (`dx new project`/`dx invite`/`dx uninvite`). 167 | - File transfer to and from project storage (`dx upload`/`dx download`) 168 | - Starting up computational jobs on the platform with apps and workflows (`dx run`) 169 | - Monitoring/Terminating jobs on the platform (`dx watch`/`dx describe`/`dx terminate`) 170 | - Building Apps, which are executables on the platform (`dx-app-wizard`, `dx build`) 171 | - Building Workflows, which string together apps on the platform (`dx build`) 172 | 173 | How do you know a command belongs to the dx-toolkit? They all begin with `dx`. For example, to list the contents of your current project, you'll use something like `ls`: 174 | 175 | ```{bash} 176 | #| eval: false 177 | dx ls 178 | ``` 179 | 180 | ## Create Project for Course 181 | 182 | Let's create a project on the platform, and then we will get files into it to prepare for the online work with the platform. 183 | 184 | The first command we'll run is `dx new project` in order to create our project. 185 | 186 | ```{bash} 187 | #| eval: false 188 | dx new project -y my_project 189 | ``` 190 | 191 | :::{.callout-note} 192 | ## What's happening here? 193 | 194 | When you call `dx new project`, that creates a project on the platform with the name `my_project`. This project lives in the cloud, within the DNAnexus platform. 195 | 196 | The `-y` option switches you over into that new project. 197 | ::: 198 | 199 | ## Copying Files to Project Storage 200 | 201 | Our worker scripts live in the the `bash_for_bioinformatics/` folder on our machine. 202 | 203 | Let's copy our files from the public project into our newly created DNAnexus project. 204 | 205 | ```{bash} 206 | #| eval: false 207 | dx cp -r "project-BQbJpBj0bvygyQxgQ1800Jkk:/Developer Quickstart" . 
208 | dx mv "Developer Quickstart/" "data/" 209 | ``` 210 | 211 | Confirm that your file system on your machine is similiar to the following output: 212 | 213 | ```{bash} 214 | #| eval: false 215 | tree 216 | ``` 217 | 218 | ``` 219 | ├── JSON 220 | │   └── json_data 221 | │   ├── dxapp.json 222 | │   ├── example.json 223 | │   ├── fastqc.json 224 | │   ├── job.json 225 | │   └── rap-jobs.json 226 | ├── batch-processing 227 | │   ├── batch-on-worker.sh 228 | │   ├── dx-find-data-class.sh 229 | │   ├── dx-find-data-field.sh 230 | │   ├── dx-find-data-name.sh 231 | │   ├── dx-find-path.sh 232 | │   └── dx-find-xargs.sh 233 | ``` 234 | 235 | And the file system in your project should look similar to this when you run `dx tree`: 236 | 237 | ```{bash} 238 | #| eval: false 239 | 240 | dx tree 241 | ``` 242 | 243 | 244 | ``` 245 | ├── data 246 | │ ├── NA12878.bai 247 | │ ├── NA12878.bam 248 | │ ├── NC_000868.fasta 249 | │ ├── NC_001422.fasta 250 | │ ├── small-celegans-sample.fastq 251 | │ ├── SRR100022_chrom20_mapped_to_b37.bam 252 | │ ├── SRR100022_chrom21_mapped_to_b37.bam 253 | │ └── SRR100022_chrom22_mapped_to_b37.bam 254 | ``` 255 | 256 | ::: {.callout-note} 257 | ## Why are there two file systems? 258 | 259 | When we do cloud computing, there are two file systems we'll have to be familiar with: 260 | 261 | 1. The file system on our own machine 262 | 2. The file system in our DNAnexus project 263 | 264 | The `tree` command shows you the contents of your local machine. 265 | 266 | `dx tree` on the other hand, shows you the contents of your project storage. Remember that anything that begins with `dx` is a command that interacts with the platform! 267 | 268 | Ok, I lied. There are technically 3 filesystems we need to be familar with. The last one is the working directory on the worker machine. We'll talk more about this in the cloud computing basics (@sec-cloud) section. 269 | ::: 270 | 271 | ## Full Script 272 | 273 | The full script for setting up is in the `setup-course/project_setup.sh` 274 | 275 | ```{bash} 276 | #| eval: false 277 | #| filename: setup-course/project_setup.sh 278 | ## Login 279 | dx login 280 | 281 | ## Create your new project 282 | dx new project -y my_project 283 | 284 | ## Clone this repository onto your computer 285 | git clone https://github.com/laderast/bash_bioinfo_scripts 286 | 287 | ## Upload the scripts into the scripts folder 288 | cd bash_bioinfo_scripts 289 | dx upload -r worker_scripts/ 290 | 291 | ## Copy the data over 292 | dx cp -r "project-BQbJpBj0bvygyQxgQ1800Jkk:/Developer Quickstart" . 293 | dx mv "Developer Quickstart/" "data/" 294 | 295 | ## Confirm your project matches 296 | dx tree 297 | tree 298 | ``` 299 | -------------------------------------------------------------------------------- /03-cloud-computing-basics.qmd: -------------------------------------------------------------------------------- 1 | # Cloud Computing Basics {#sec-cloud} 2 | 3 | We all need to start somewhere when we work with cloud computing. 4 | 5 | This chapter is a review of how cloud computing works on the DNAnexus platform. If you haven't used cloud computing before, no worries! This chapter will get you up to speed. 6 | 7 | Also, if you are coming from using an on-premise high performance computing (HPC) cluster, you can skip ahead. 8 | 9 | ## Learning Objectives 10 | 11 | 1. **Define** key players in both local computing and cloud computing 12 | 1. **Articulate** key differences between local computing and cloud computing 13 | 1. 
**Describe** the sequence of events in launching jobs in the DNAnexus cloud 14 | 1. **Differentiate** local storage from cloud-based project storage 15 | 1. **Describe** instance types and how to use them. 16 | 17 | ## Important Terminology 18 | 19 | Let's establish the terminology we need to talk about cloud computing. 20 | 21 | - **DNAnexus Project** - contains files, executables (apps/applets), and logs associated with analysis. 22 | - **Software Environment** - everything needed to run a piece of software on a brand new computer. For example, this would include installing R, but also all of its dependencies as well. 23 | - **App/Applet** - Executable on the DNAnexus platform. 24 | - **Project Storage** - Part of the platform that stores our files and other objects. We'll see that these other objects include applets, databases, and other object types. 25 | 26 | ## Understanding the key players 27 | 28 | In order to understand what's going on with Cloud Computing, we will have to change our mental model of computing. 29 | 30 | Let's contrast the key players in local computing with the key players in cloud computing. 31 | 32 | ### Key Players in Local Computing 33 | 34 | ![Local Computing](images/local_compute.png){#fig-local} 35 | 36 | - Our Machine 37 | 38 | When we run an analysis or process files on our computer, we are in control of all aspects of our computer. We are able to install a software environment, such as R or Python, and then execute scripts/notebooks that reside on our computer on data that's on our computer. 39 | 40 | Our main point of access to either the HPC cluster or to the DNAnexus cloud is going to be our computer. 41 | 42 | 43 | ### Key Players in Cloud Computing 44 | 45 | Let's contrast our view of local computing with the key players in the DNAnexus platform (@fig-dnanexus-players). 46 | 47 | ![Key Players in DNAnexus Storage](images/dnanexus_players.png){#fig-dnanexus-players} 48 | 49 | - **Our Machine** - We interact with the platform via the dx-toolkit installed on our machine. When we utilize cloud resources, we request them from our own computer using commands from the dx toolkit. 50 | - **DNAnexus Platform** - Although there are many parts, we can treat the DNAnexus platform as a single entity that we interact with. Our request gets sent to the platform, and given availability, it will grant access to a temporary DNAnexus Worker. Also contains **project storage**. 51 | - **DNAnexus Worker** - A temporary machine that comes from a pool of available machines. We'll see that it starts out as a blank slate. 52 | 53 | ## Sequence of Events of Running a Job 54 | 55 | Let's run through the order of operations of running a job on the platform. Let's focus on running an aligner (BWA-MEM) on a FASTQ file. Our output will be a .BAM (aligned reads) file. 56 | 57 | Let's go over the order of operations needed to execute our job on the DNAnexus platform (@fig-dnanexus). 58 | 59 | ![Order of Operations](images/order_operations.png){#fig-dnanexus} 60 | 61 | 1. **Start a job using `dx run` to send a request to the platform.** In order to start a job, we will need two things: an app (`app-bwa-mem`), and a file to process on the platform (not shown). We specify this information using `dx run`. When we use `dx run`, a request is sent to the platform. 62 | 2. **Platform requests for a worker from available workers; worker made available on platform.** In this step, the DNAnexus platform looks for a worker instance that can meet our needs. 
The platform handles installing the app and its software environment to the worker as well. We'll see that apps have a default instance type that are suggested by the authors. 63 | 3. **Input files transferred from project storage.** We're going to process a FASTQ file (`53525342.fq.gz`). This needs to be transferred from the project storage to the worker storage on the machine. 64 | 4. **Computations run on worker; output files are generated.** Once our app is ready and our file is transferred, we can run the computation on the worker. 65 | 5. **Output files transferred back to project storage.** Any files that we generate during our computation (`53525342.bam`) must be transferred back into project storage. 66 | 6. **Response from DNAnexus platform to User.** If our job was successful, we will receive a response from the platform. This can be an email, or the output from `dx find jobs`. If our job was unable to run, we will recieve a "failed" response. 67 | 68 | When you are working on the platform, especially with batch jobs, keep in mind this order of execution. Being familiar with how the key players interact on the platform is key to running efficient jobs. 69 | 70 | ### A Common Pattern: Scripts on your computer, scripts on the worker 71 | 72 | A very common pattern we'll use is having two sets of scripts (@fig-scripts). The batch script that generates the separate jobs run on separate workers (`batch_RUN.sh`), and a script that is run on each worker (`plink_script.sh`). 73 | 74 | The batch script will specify file inputs as paths from project storage. For example, a project storage path might be `data/chr1.vcf.gz`. 75 | 76 | The trick with the worker script is being able to visualize the location of the files on the worker after they're transferred. In most cases, the files will be transferred to the working directory of the worker. 77 | 78 | ![Two kinds of scripts.](images/two_scripts.png){#fig-scripts} 79 | 80 | ## Key Differences with local computing 81 | 82 | As you might have surmised, running a job on the DNAnexus platform is very different from computing on your local computer. 83 | 84 | 1. We don't own the worker machine, we only have temporary access to it. A lot of the complications of running cloud computations comes from this. 85 | 2. We have to be explicit what kind of machine we want. We'll talk much more about this in terms of instance types (@sec-instance) 86 | 3. We need to transfer files to and from our temporary worker. 87 | 88 | 89 | ## Project Storage vs Worker Storage 90 | 91 | ![Project Storage versus Worker storage](images/storage.png){#fig-storage} 92 | 93 | You might have noticed that the worker is a blank slate after we request it. So any files we need to process need to be transferred over to the temporary worker storage. 94 | 95 | Fortunately, when we use apps, the file transfer process is handled by the app. This also means that when you build your own apps on the platform, you will need to specify inputs (what files the app will process) and outputs (the resulting files from the app). 96 | 97 | ### Running Scripts on a Worker is Indirect 98 | 99 | Because this file transfer occurs from project storage to the worker, when we run scripts on a worker, we have to think about the location of the files on the worker (@fig-indirect). 100 | 101 | ![An indirect process](images/indirect.png){#fig-indirect} 102 | 103 | Let's look at an example of that nestedness. Say we have two bed files and we want to combine them on a worker using PLINK. 104 | 105 | 1. 
Transfer data from project to the working directory on the worker. The first thing we need to do is transfer them from project storage to worker storage. Notice that even though in project storage they are in a data folder, they are in the base directory of the worker. 106 | 2. Run the computation on the worker, and generate results file. The files we need to process are in the base directory of the worker, so we can refer to them without the folder path (`plink chr1.bed`). We generate a results file called `chr_combined.bed` (the combined `.bed` file). 107 | 3. Transfer results file back to the project storage. 108 | 109 | 110 | ## Instance Types {#sec-instance} 111 | 112 | One of the major concerns when you're getting started on the platform is cost. 113 | 114 | What impacts cost? The number of workers and your priority for those workers matters. Most importantly, the instance type matters (@fig-instance). Let's review how to pick an instance type on the platform. 115 | 116 | First off, all apps (including Swiss Army Knife) have a default instance selected. 117 | 118 | :::{#fig-instance} 119 | ```{mermaid} 120 | graph LR 121 | A["mem1_
(Memory)"] --> C["ssd1_
(Disk Size)"] 122 | C --> E["v2_
(Version)"] 123 | E --> G["x4
(Cores)"] 124 | ``` 125 | Anatomy of an instance type. 126 | ::: 127 | 128 | This is an example of an instance type. We can see that it has four sections: a memory class (`mem1`), a disk space class (`ssd1`), a version (`v2`), and the number of CPUs (`x4`). Together, this combination of classes forms an [instance type](https://documentation.dnanexus.com/developer/api/running-analyses/instance-types). 129 | 130 | ![An Instance Type Checklist](images/instance_checklist.png){#fig-instances} 131 | 132 | Let's talk about what aspects to concentrate on when choosing your instance type. 133 | 134 | :::{.callout-note} 135 | ## Read the Swiss Army Knife documentation 136 | 137 | When you run Swiss Army Knife from the command line, the default instance type is `mem1_ssd1_v2_x4` - how many cores are available in this instance? 138 | ::: 139 | 140 | :::{.callout collapse="true"} 141 | ## Answer 142 | 143 | 4 Cores (that's what the `x4` suffix means.) 144 | 145 | ::: 146 | 147 | ### When to Scale Up or Down? 148 | 149 | One question we often get is about when to scale up and down instance types in terms of resource usage. It's important when you're starting out to do profiling of the app on a data file or dataset that you know well. Start with the default instance type for an app on your test files at first. 150 | 151 | Once you have run an app as a job, you can look at the job log to understand how the compute resources were utilized. 152 | 153 | In this case, our job under utilized the compute resources on our instance type, so we might want to scale to a lower instance type. 154 | 155 | If our job crashed due to a lack of resources, or is running slow, we may want to scale up our resource type. 156 | 157 | -------------------------------------------------------------------------------- /05-batch-processing.qmd: -------------------------------------------------------------------------------- 1 | # Batch Processing on the Cloud {#sec-batch} 2 | 3 | Now we're prepared for the big one: batch processing on the DNAnexus platform. All of the shell and DNAnexus skills we've learned will be leveraged in this chapter. 4 | 5 | :::{.callout-note} 6 | ## Prep for Exercises 7 | 8 | Make sure you are logged into the platform using `dx login` and that your course project is selected with `dx select`. 9 | 10 | In your shell (either on your machine or in binder), make sure you're in the `bash_bioinfo_scripts/batch-processing/` folder: 11 | 12 | ``` 13 | cd batch-processing/ 14 | ``` 15 | ::: 16 | 17 | 18 | ## Learning Objectives 19 | 20 | 1. **Utilize** `dx find data` to find data files on the platform to batch process. 21 | 1. **Iterate** over files using Bash scripting and `xargs` on the platform to batch process them within a DNAnexus project. 22 | 1. **Leverage** dxFUSE to simplify your bash scripts 23 | 1. **Utilize** `dx generate-batch-inputs`/`dx run --batch-tsv` to batch process files 24 | 1. **Utilize** Python to batch process multiple files per worker. 25 | 26 | ## Two Ways of Batching 27 | 28 | :::{#fig-batch1} 29 | ```{mermaid} 30 | graph LR; 31 | A[List files
using `dx find data`] --> F{"|"} 32 | F --> E[`xargs` sh -c] 33 | E --> B[`dx run`
on file1]; 34 | E --> C[`dx run`
on file2]; 35 | E --> D[`dx run`
on file3]; 36 | ``` 37 | Batch method 1. We list files and then pipe them into `xargs`, which generates individual dx-run statements. 38 | ::: 39 | 40 | :::{#fig-batch2} 41 | ```{mermaid} 42 | graph LR; 43 | A[Submit array
of files
in `dx run`] --> B[Loop over array
of files
in worker]; 44 | ``` 45 | Batch method 2. We first get our files onto the worker through a single dx run command, and then use `xargs` on the worker to cycle through them. 46 | ::: 47 | 48 | We actually have two methods of batching jobs using Swiss Army Knife: 49 | 50 | 1. Use `xargs` on our home system to run `dx run` statements for each file (@fig-batch1). 51 | 1. Submit an array of files as an input to Swiss Army Knife. Then process each file using the `icmd` input (@fig-batch2) 52 | 53 | Both of these methods can potentially be useful. 54 | 55 | 56 | ## Finding files using `dx find data` {#sec-dx-find} 57 | 58 | `dx find data` is a command that is extremely helpful on the DNAnexus platform. Based on metadata and folder paths, `dx find data` will return a list of files that meet the criteria. 59 | 60 | `dx find data` lets you search on the following types of metadata: 61 | 62 | - tags `--tag` 63 | - properties `--property` 64 | - name `--name` 65 | - type `--type` 66 | 67 | It can output in a number of different formats. Including: 68 | 69 | - `--brief` - return only the file-ids 70 | - `--json` - return file information in JSON format 71 | - `--verbose` - this is the default setting 72 | - `--delimited` - return as a delimited text file 73 | 74 | Of all of these, `--brief` and `--json` are the most useful for automation. `--delimited` is also helpful, but there is also a utility called `dx generate-batch-inputs` that will let us specify multiple inputs to process line by line. 75 | 76 | ## Helpful `dx find data` examples 77 | 78 | As we're starting off in our batch processing journey, I wanted to provide some helpful recipes for selecting files. 79 | 80 | ### Find all *.bam files in a project 81 | 82 | You can use wildcard characters with the `--name` flag. Here, we're looking for anything with the suffix "*.bam". 83 | 84 | ```{bash} 85 | #| eval: false 86 | #| filename: batch-processing/dx-find-data-name.sh 87 | dx find data --name "*.bam" --brief 88 | ``` 89 | 90 | ### Searching within a folder 91 | 92 | You can add the `--path` command to search in a specific folder. 93 | 94 | ```{bash} 95 | #| eval: false 96 | #| filename: batch-processing/dx-find-path.sh 97 | dx find data --name "*.bam" --path "data/" 98 | ``` 99 | 100 | ### Find all files with a field id 101 | 102 | Take advantage of metadata associated with files when you can. If you are on UKB RAP, one of the most helpful properties to search is `field_id`. 103 | 104 | Note: be careful with this one, especially if you are working on UK Biobank RAP. You don't want to return 500,000 file ids. I would concentrate on the field ids that are aggregated on the population level, such as the pVCF files. 105 | 106 | ```{bash} 107 | #| eval: false 108 | #| filename: batch-processing/dx-find-data-field.sh 109 | dx find data --property field_id="23148" --brief 110 | ``` 111 | 112 | ### Find all files that are of class `file` 113 | 114 | There are a number of different object classes on the platform, such as `file` or `applet` 115 | 116 | Search for all files in your project that have a `file` class. 117 | 118 | ```{bash} 119 | #| eval: false 120 | #| filename: batch-processing/dx-find-data-class.sh 121 | dx find data --class file --brief 122 | ``` 123 | 124 | ### In General: Think about leveraging metadata 125 | 126 | In general, think about leveraging metadata that is attached to your files. 
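These filters can also be combined in a single search. As a rough sketch (the tag value below is made up for illustration; `--name`, `--path`, and `--tag` are the same flags described above):

```
# Find BAM files in the data/ folder that also carry a particular tag
dx find data --name "*.bam" --path "data/" --tag "qc-passed" --brief
```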
127 | 128 | For example, for the UKB Research Analysis Platform, data files in the `Bulk/` folder in your project have multiple properties: `field_id` (the data field as specified by UK Biobank) and `eid`. 129 | 130 | ## Using `xargs` to Batch Multiple Files {#sec-xargs2} 131 | 132 | Ok, now we have a list of files from `dx find data` that meet our criteria. How can we process them one by one? 133 | 134 | Remember our discussion of `xargs`? (@sec-xargs) This is where `xargs` shines, when you provide it a list of files. 135 | 136 | Remember, a really useful pattern for `xargs` is using it for variable expansion and starting a subshell to process individual files. 137 | 138 | ```{bash} 139 | #| eval: false 140 | #| filename: batch-processing/dx-find-xargs.sh 141 | dx find data --name "*.bam" --brief | \ 142 | xargs -I % sh -c "dx run app-swiss-army-knife -y -iin="%" \ 143 | -icmd='samtools view -c \${in_name} > \${in_prefix-counts.txt}' \ 144 | --tag samjob --destination results/' 145 | ``` 146 | 147 | The key piece of code we're doing the variable expansion in is here: 148 | 149 | ```{bash} 150 | #| eval: false 151 | sh -c 'dx run app-swiss-army-knife -iin="%" \ 152 | -icmd="samtools view -c \${in_name} > \${in_prefix}-counts.txt" \ 153 | --tag samjob --destination results/' 154 | ``` 155 | 156 | We're using `sh -c` to run a script as a *subshell* to execute the `dx run` statement. 157 | 158 | Note that we're specifying the helper variables here with a `\`: 159 | 160 | `\${in_name}` 161 | 162 | This escaping (`\$`) of the dollar sign is to prevent the variable expansion from happening in the top-level shell - the helper variable names need to be passed in to the subshell which needs to pass it onto the worker. Figuring this out took time and made my brain hurt. 163 | 164 | This escaping is only necessary because we're using `xargs` and passing our `-icmd` input into the worker. For the most part, you won't need to escape the `$`. This is also a reason to write shell scripts that run on the worker. 165 | 166 | When we run this command, we get the following screen output: 167 | 168 | ``` 169 | Using input JSON: 170 | { 171 | "cmd": "samtools view -c $in_name > $in_prefix-counts.txt", 172 | "in": [ 173 | { 174 | "$dnanexus_link": { 175 | "project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q", 176 | "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY" 177 | } 178 | } 179 | ] 180 | } 181 | 182 | Calling app-GFxJgVj9Q0qQFykQ8X27768Y with output destination 183 | project-GGyyqvj0yp6B82ZZ9y23Zf6q:/results 184 | 185 | Job ID: job-GJ2xVZ80yp62X5Z51qp191Y8 186 | 187 | [more job info] 188 | ``` 189 | 190 | if we do a `dx find jobs`, we'll see our jobs listed. Hopefully they are running: 191 | 192 | ``` 193 | dx find jobs --tag samjob 194 | * Swiss Army Knife (swiss-army-knife:main) (running) job-GJ2xVf00yp62kx9Z8VK10vpQ 195 | tladeras 2022-10-11 13:57:59 (runtime 0:01:49) 196 | * Swiss Army Knife (swiss-army-knife:main) (running) job-GJ2xVb80yp6KjQpxFJJBzv5k 197 | tladeras 2022-10-11 13:57:57 (runtime 0:00:52) 198 | * Swiss Army Knife (swiss-army-knife:main) (runnable) job-GJ2xVZj0yp6FFFXG11j6YJ9V 199 | tladeras 2022-10-11 13:57:55 (runtime 0:01:15) 200 | * Swiss Army Knife (swiss-army-knife:main) (runnable) job-GJ2xVZ80yp62X5Z51qp191Y8 201 | tladeras 2022-10-11 13:57:53 (runtime 0:00:56) 202 | ``` 203 | 204 | ### When batching, tag your jobs 205 | 206 | It is critical that you tag your jobs in your `dx run` code with the `--tag` argument. 207 | 208 | Why? 
You will at some point start up a bunch of batch jobs that might have some settings/parameters that were set wrong. That's when you need the tag. 209 | 210 | ```{bash} 211 | #| eval: false 212 | dx find jobs --tag "samjob" 213 | ``` 214 | 215 | ### Using tags to `dx terminate` jobs {#sec-terminate} 216 | 217 | `dx terminate ` will terminate a running job with that job id. It doesn't take a tag as input. 218 | 219 | But again, `xargs` to the rescue. We can find our job ids with the tag `samjob` using `dx find jobs` and then pipe the `--brief` output into `xargs` to terminate each job id. 220 | 221 | ```{bash} 222 | #| eval: false 223 | dx find jobs --tag samjob --brief | xargs -I% sh -c "dx terminate %" 224 | ``` 225 | 226 | ## Submitting Multiple Files to a Single Worker {#sec-mult-worker} 227 | 228 | We talked about another method to batch process files on a worker (@fig-batch2). We can submit an array of files to a worker, and then process them one at a time on the worker. 229 | 230 | The key is that we're running `xargs` on the worker, not on our own machine to process each file. 231 | 232 | ```{bash} 233 | #| eval: false 234 | #| filename: batch-processing/batch-on-worker.sh 235 | cmd_to_run="ls *.vcf.gz | xargs -I% sh -c 'bcftools stats % > \$(basename %).stats.txt'" 236 | 237 | dx run swiss-army-knife \ 238 | -iin="data/chr1.vcf.gz" \ 239 | -iin="data/chr2.vcf.gz" \ 240 | -iin="data/chr3.vcf.gz" \ 241 | -icmd=${cmd_to_run} 242 | ``` 243 | 244 | In the variable `$cmd_to_run`, we're putting a command that we'll run on the worker. That command is: 245 | 246 | ```{bash} 247 | #| eval: false 248 | ls *.vcf.gz | xargs -I% sh -c "bcftools stats % > \$(basename %).stats.txt 249 | ``` 250 | 251 | We submitted an array of files in our `dx run` statement. So now they are transferred into our working directory on the worker. So we can list the files using `ls *.vcf.gz` and pipe that list into `xargs`. 252 | 253 | Note that we lose the ability to use helper variables in our script when we process a list of files on the worker. So here we have to use `\$(basename %)`, because we use `()` to expand a variable in a subshell, and we escape the `$` here so that bash will execute the variable expansion on the worker. 254 | 255 | Again, this is possible, but it may be easier to have a separate script that contains our commands, transfer that as an input to Swiss Army Knife, and run that script by specifying `bash myscript.sh` in our command. 256 | 257 | ## Batching multiple inputs: `dx generate_batch_inputs` 258 | 259 | What if you have multiple inputs that you need to batch with? This is where the [`dx generate_batch_inputs`](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs) comes in. 260 | 261 | For each input for an app, we can specify it using wildcard characters with regular expressions. 262 | 263 | ```{bash} 264 | # | eval: false 265 | dx generate_batch_inputs \ 266 | --path "data/"\ 267 | -iin="(.*)\.bam$" 268 | ``` 269 | 270 | Here we're specifying a single input `in`, and we've supplied a wildcard search. It's going to look in `data/` for this particular pattern (we're looking for bam files). 271 | 272 | If we do this, we'll get the following response: 273 | 274 | ``` 275 | Found 4 valid batch IDs matching desired pattern. 276 | Created batch file dx_batch.0000.tsv 277 | ``` 278 | 279 | So, there is 1 `.tsv` file that was generated by `dx generate_batch_inputs` on our machine. 
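Before launching anything with it, it's worth a quick look at the generated file to confirm it matched the inputs you expected:

```
# Inspect the generated batch file before submitting jobs with it
head dx_batch.0000.tsv
```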
280 | 281 | If we have many more input files, say 3000 files, it would generate 3 `.tsv` files. Each of these `.tsv` files contains about 1000 files per line. We can run these individual jobs with: 282 | 283 | ```{bash} 284 | #| eval: false 285 | dx run swiss-army-knife --batch-tsv dx_batch.0000.tsv \ 286 | -icmd='samtools stats ${in_name} > ${in_prefix}.stats.txt ' \ 287 | --destination "/Results/" \ 288 | --detach --allow-ssh \ 289 | --tag bigjob 290 | ``` 291 | 292 | This will generate 4 jobs from the `dx_batch.0000` file to process the individual files. Each `tsv` file will generate up to 1000 jobs. 293 | 294 | ### Drawbacks to `dx generate_batch_inputs`/`dx run --batch-tsv` 295 | 296 | The largest drawback to using `dx generate_batch_inputs` is that each column must correspond to an individual input name - you can't submit an array of files to a job this way. 297 | 298 | ### For More Information 299 | 300 | The Batch Jobs documentation page has some good code examples for `dx generate_batch_inputs` here: 301 | 302 | ## Programatically Submitting Arrays of Files for a job 303 | 304 | You can also use Python to build `dx run` statements, which is especially helpful when you want to submit arrays of 100+ files to a worker. 305 | 306 | See for more info. 307 | 308 | ## What you learned in this chapter 309 | 310 | This was a big chapter, and built on everything you've learned in the previous chapters. 311 | 312 | We put together the output of `dx find data --brief` (@sec-dx-find) with a pipe (`|`), and used `xargs` (@sec-xargs2) to spawn jobs per set of files. 313 | 314 | Another way to process files is to upload them onto a worker and process them (@sec-mult-worker). 315 | 316 | We also learned of alternative approaches using `dx generate_batch_inputs`/`dx run --batch-tsv` and using Python to build the `dx run` statements. 317 | -------------------------------------------------------------------------------- /06-JSON.qmd: -------------------------------------------------------------------------------- 1 | # Working with JSON on the DNAnexus Platform {#sec-json} 2 | 3 | :::{.callout-note} 4 | ## Preparing for this Chapter 5 | 6 | You will not need to login to the platform for this chapter. 7 | 8 | You'll want to `cd` into the `JSON` folder in your project. 9 | 10 | ``` 11 | cd JSON/ 12 | ``` 13 | 14 | You'll also need to [install `jq`](https://stedolan.github.io/jq/download/) if it's not yet on your system. If you're on Ubuntu/WSL, I recommend installing via `apt install`. If you're on Mac, I recommend installing via `brew install`. 15 | 16 | You can check if `jq` is already installed by typing 17 | 18 | `which jq` 19 | 20 | ::: 21 | 22 | ## Learning Objectives 23 | 24 | By the end of this chapter, you should be able to: 25 | 26 | - **Define** and **Explain** what JSON is and its elements and structures 27 | - **Explain** how JSON is used on the DNAnexus platform 28 | - **Explain** the basic structure of a JSON file 29 | - **Generate** JSON output from `dx find data` and `dx find jobs` 30 | - **Execute** simple `jq` commands to extract information from a JSON file 31 | - **Execute** advanced `jq` filters using conditionals to process output from `dx find files` or `dx find jobs`. 32 | 33 | ## What is JSON? 34 | 35 | JSON is short for **J**ava**S**cript **O**bject **N**otation. It is a format used for storing information on the web and for interacting with APIs. 36 | 37 | ## How is JSON used on the DNAnexus Platform? 
38 | 39 | JSON is used in multiple ways on the DNAnexus Platform, including: 40 | 41 | - Submitting Jobs with complex parameters/inputs 42 | - Specifying parameters of an app or workflow (`dxapp.json` and `dxworkflow.json`) 43 | - Output of commands such as `dx find data` or `dx find jobs` with the `--json` flag 44 | - Extracting environment variables from `dx env` 45 | 46 | Underneath it all, all interactions with the DNAnexus API server are JSON submissions. 47 | 48 | You can see that JSON is used in many places on the DNAnexus platforms, and for many purposes. So having basic knowledge of JSON can be really helpful. 49 | 50 | ## Elements of a JSON file 51 | 52 | Here are the main elements of a JSON file: 53 | 54 | - **Key:Value Pair**. Example: `"name": "Ted Laderas"`. In this example, our key is "name" and our value is "Ted Laderas" 55 | - **List `[]`** - a collection of values. All values have to be the same data type. Example: `["mom", "dad"]` 56 | - **Object** `{}` - A collection of key/value pairs, enclosed with curly brackets (`{}`) 57 | 58 | Here's the example we're going to use. We'll do most of our processing of JSON on our own machine. 59 | 60 | ```{bash} 61 | #| eval: false 62 | #| filename: "json_data/example.json" 63 | { 64 | "report_html": { 65 | "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY" 66 | }, 67 | "stats_txt": { 68 | "dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B" 69 | }, 70 | "users": ["laderast", "ted", "tladeras"] 71 | } 72 | 73 | ``` 74 | 75 | :::{.callout-note} 76 | ## Check Yourself 77 | 78 | What does the `names` value contain in the following JSON? Is it a list, object or key:value pair? 79 | 80 | ``` 81 | { 82 | "names": ["Ted", "Lisa", "George"] 83 | } 84 | ``` 85 | ::: 86 | 87 | :::{.callout-note collapse="true"} 88 | ## Answer 89 | 90 | It is a list. We know this because the value contains a `[]`. 91 | 92 | ``` 93 | { 94 | "names": ["Ted", "Lisa", "George"] 95 | } 96 | ``` 97 | ::: 98 | 99 | ## Nestedness 100 | 101 | JSON wouldn't be helpful if it were only limited to a single level or key:values. Values can be lists or objects as well. For example, in our example JSON, we can see that the value of `report_html` is a JSON object: 102 | 103 | ``` 104 | "report_html": { 105 | "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY" 106 | } 107 | ``` 108 | 109 | The object is: 110 | 111 | ``` 112 | { 113 | "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY" 114 | } 115 | ``` 116 | 117 | When we work with extracting information, we'll have to take this nested structure in mind. 118 | 119 | ## Outputting JSON with `dx find` commands 120 | 121 | We already encountered the `dx find data` command, which we used in the batch processing chapter. 122 | 123 | If we use the `--json` option, then the file information will be outputted in json format. This command will return a list of JSON file objects. 
124 | 125 | For example: 126 | 127 | ```{bash} 128 | #| eval: false 129 | #| filename: 05-JSON/dx-find-data-json.sh 130 | dx find data --path ted_demo:data/ --json 131 | ``` 132 | 133 | The output will look like this: 134 | 135 | ``` 136 | [ 137 | { 138 | "project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q", 139 | "id": "file-FvQGZb00bvyQXzG3250XGbgz", 140 | "describe": { 141 | "id": "file-FvQGZb00bvyQXzG3250XGbgz", 142 | "project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q", 143 | "class": "file", 144 | "name": "small-celegans-sample.fastq", 145 | "state": "closed", 146 | "folder": "/json_data", 147 | "modified": 1665003035646, 148 | "size": 16801690 149 | } 150 | }, 151 | { 152 | "project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q", 153 | "id": "file-B5Q8z8V5g3bX5qQ9y9YQ006k", 154 | "describe": { 155 | "id": "file-B5Q8z8V5g3bX5qQ9y9YQ006k", 156 | "project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q", 157 | "class": "file", 158 | "name": "NC_001422.fasta", 159 | "state": "closed", 160 | "folder": "/json_data", 161 | "modified": 1665003035645, 162 | "size": 5539 163 | } 164 | } 165 | ] 166 | 167 | ``` 168 | :::{.callout-note} 169 | ### Test your knowledge 170 | 171 | What is returned when we run this code? Is it a JSON object, or a list of JSON objects? 172 | 173 | ```{bash} 174 | #| eval: false 175 | #| filename: 05-JSON/dx-find-jobs-json.sh 176 | dx find jobs --json 177 | ``` 178 | ::: 179 | 180 | :::{.callout-note collapse="true"} 181 | ## Answer 182 | 183 | It's hard to tell at first, but We are returning a list of JSON objects, each of which corresponds to a single job run within our project. 184 | ::: 185 | 186 | 187 | 188 | 189 | ## Learning `jq` gradually 190 | 191 | As you can see, JSON can be very complicated to process and extract information from, depending on how many levels you go deep in a JSON document. That's why `jq` exists 192 | 193 | `jq` is a utility that is made to process JSON. All `jq` commands have this format: 194 | 195 | ```{bash} 196 | #| eval: false 197 | jq '' 198 | ``` 199 | 200 | Filters are the heart of processing data using `jq`. They let you extract JSON values or keys and process them with conditionals to filter data down. For example, you can do something like the following: 201 | 202 | 1. Select all elements where the job status is failed 203 | 2. For each of these elements, output the job-status id 204 | 205 | You can see how `jq` can be extremely powerful. 206 | 207 | You can also pipe JSON from standard output into `jq`. This will be really helpful for us when we start using pipes of data files from `dx find data`. 208 | 209 | ## Our simplest filter: `.` 210 | 211 | One of the biggest uses for `jq` is for more readable formatting. Oftentimes, the JSON returned by an API call is really hard to read. It can be returned as a single line of text, and it is really hard for humans to see the actual structure of the JSON response. 212 | 213 | If we run `jq .` on a JSON file, we'll see that it makes it much more readable. 214 | 215 | ```{bash} 216 | #| eval: false 217 | #| filename: JSON/jq-simple.sh 218 | jq '.' json_data/example.json 219 | ``` 220 | 221 | ## Getting the keys 222 | 223 | We can extract the keys from the top level JSON by using `'keys'` as our filter. 224 | 225 | ```{bash} 226 | #| eval: false 227 | #| filename: JSON/jq-keys.sh 228 | jq 'keys' json_data/example.json 229 | ``` 230 | 231 | ## Extracting a value from a container: `jq .report_html` 232 | 233 | So, say we want to extract the value from the `report_html` key in the above. 
234 | 235 | We can specify the key that we're interested in to extract the value from that key. 236 | 237 | ```{bash} 238 | #| eval: false 239 | #| filename: JSON/jq-report.sh 240 | jq '.report_html' json_data/example.json 241 | ``` 242 | 243 | :::{.callout-note} 244 | ## Try it out 245 | 246 | This is the JSON file we're going to be working with, in `json_data/example.json`. 247 | 248 | ```{bash} 249 | #| eval: false 250 | #| filename: "json_data/example.json" 251 | { 252 | "report_html": { 253 | "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY" 254 | }, 255 | "stats_txt": { 256 | "dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B" 257 | }, 258 | "users": ["laderast", "ted", "tladeras"] 259 | } 260 | 261 | ``` 262 | 263 | In your terminal, try out: 264 | 265 | ``` 266 | jq '.stats_txt' json_data/example.json 267 | ``` 268 | 269 | What do you return? 270 | ::: 271 | 272 | :::{.callout-note collapse="true"} 273 | ## Answer 274 | 275 | ```{bash} 276 | #| eval: false 277 | #| filename: JSON/jq-stats-txt.sh 278 | jq '.stats_txt' json_data/example.json 279 | ``` 280 | 281 | We'll return the following JSON object, which contains a single key-value pair. 282 | 283 | ``` 284 | { 285 | "dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B" 286 | } 287 | ``` 288 | ::: 289 | 290 | ### Going one level deeper 291 | 292 | We can extract the actual value associated with the `dnanexus_link` key within `report_html` by chaining onto our filter: 293 | 294 | ```{bash} 295 | #| eval: false 296 | #| filename: JSON/jq-nested.sh 297 | jq '.report_html.dnanexus_link' json_data/example.json 298 | ``` 299 | 300 | :::{.callout-note} 301 | ## Try It Out 302 | 303 | What is returned when you run this code? 304 | 305 | ```{bash} 306 | #| eval: false 307 | #| filename: JSON/jq-nested.sh 308 | jq '.report_html.dnanexus_link' json_data/example.json 309 | ``` 310 | 311 | ::: 312 | 313 | :::{.callout-note collapse="true"} 314 | ## Answer 315 | 316 | Running this command should return the value of `dnanexus-link` within `report_html`: 317 | 318 | ``` 319 | "file-G4x7GX80VBzQy64k4jzgjqgY" 320 | ``` 321 | 322 | ::: 323 | 324 | ## Conditional Filters using `jq` 325 | 326 | One natural use case for using `jq` on the DNAnexus platform is to rerun failed jobs. 327 | 328 | Failed jobs can occur when using normal priority, which focuses on using spot instances. So, if we ran a series of jobs, we would want to restart these failed jobs. 329 | 330 | This is a bit of code that would allow us to select those jobs that have failed. 331 | 332 | ```{bash} 333 | #| eval: false 334 | #| filename: JSON/dx-find-jobs-jq-clone.sh 335 | dx find jobs --json |\ 336 | jq '.[] | select (.state | contains("failed")) | .id' |\ 337 | xargs -I% sh -c "dx run --clone %" 338 | ``` 339 | 340 | The second line contains the `jq` filter that does the magic. Remember, the filter is contained within the single quotes (`''`). 341 | 342 | The last line contains `"dx run --clone %"`. 343 | 344 | Let's take apart the different parts of the `jq` filter (@fig-jq-filter): 345 | 346 | :::{#fig-jq-filter} 347 | ```{mermaid} 348 | graph LR; 349 | A[".[]"] --> B{"|"} 350 | B --> C["select (.state | contains('failed'))"] 351 | C --> D{"|"} 352 | D --> E[".id"] 353 | ``` 354 | Taking apart the `jq` filter. 355 | ::: 356 | 357 | Note that the pipes in this filter apply only to the `jq` filter, so don't mix them up with the other pipes in our overall Bash statement. 
358 | 359 | The first part of the filter, `.[]`, says that we want to process the list (remember, `dx find jobs` returns a list of objects). 360 | 361 | The second part of the filter, `select (.state | contains('failed'))` will let us select objects in the list that have a `state` of `failed`. This list of objects is then passed on the next part of the filter. 362 | 363 | The last part of the filter, `.id`, returns the the file ids for our failed jobs. 364 | 365 | This is a basic pattern for selecting objects that meet a criteria, and can be really helpful when you want more control of your batch processing. 366 | 367 | :::{.callout-note} 368 | ## Check Yourself 369 | 370 | How would you modify the code below to terminate all jobs that had `state` `running` using `dx terminate`? 371 | 372 | ```{bash} 373 | #| eval: false 374 | dx find jobs --json |\ 375 | jq '.[] | select (.state | contains("failed")) | .id' |\ 376 | xargs -I% sh -c "dx run --clone %" 377 | ``` 378 | 379 | ::: 380 | 381 | :::{.callout-note collapse="true"} 382 | ## Answer 383 | 384 | ```{bash} 385 | #| eval: false 386 | dx find jobs --json | \ 387 | jq '.[] | select (.state | contains("running")) | .id' | \ 388 | xargs -I% sh -c "dx terminate %" 389 | ``` 390 | 391 | ::: 392 | 393 | ## Using JSON as an Input 394 | 395 | This section is made to help you in writing JSON files. If you build an app or a workflow, you will need to edit the `dxapp.json` or `dxworkflow.json` files to enable your executables to be runnable. 396 | 397 | ### Writing and modifying JSON 398 | 399 | I know that JSON is supposed to be human readable. However, there are a lot of little quibbles that don't make it easily human writable. 400 | 401 | I highly recommend using an editor such as [VS Code](https://code.visualstudio.com/), with the appropriate [JSON plugin](https://code.visualstudio.com/docs/languages/json). A JSON Visualizer such as the [JSON Crack Extension](https://marketplace.visualstudio.com/items?itemName=AykutSarac.jsoncrack-vscode) will be extremely helpful as well. 402 | 403 | ![JSON Visualizer Plugin](images/json_visualizer.png) 404 | 405 | Using the visualizer plugin and this tutorial will help you write well formed JSON, and point out any issues you might have. It's easy to misplace a comma, or a bracket, and this tool helps you write well-formed JSON. 406 | 407 | 408 | 409 | -------------------------------------------------------------------------------- /07-containers.qmd: -------------------------------------------------------------------------------- 1 | # Containers and Reproducibility {#sec-containers} 2 | 3 | :::{.callout-note} 4 | ## Prep for Exercises 5 | 6 | Make sure you are logged into the platform using `dx login` and that your course project is selected with `dx select`. 7 | 8 | In your shell (either on your machine or in binder), make sure you're in the `bash_bioinfo_scripts/containers/` folder: 9 | 10 | ``` 11 | cd containers/ 12 | ``` 13 | ::: 14 | 15 | 16 | 17 | ## Learning Objectives 18 | 19 | 1. **Explain** the benefits of using containers on DNAnexus for reproducibility and for batch processing 20 | 1. **Define** the terms *image*, *container*, and *snapshot* in the context of Docker 21 | 1. **Create** snapshots on RAP using `docker pull` and `docker save` with the ttyd app 22 | 1. **Utilize** containers to batch process files on RAP 23 | 1. **Extend** a docker image by installing within interactive mode 24 | 1. **Build** a docker image using Dockerfiles 25 | 26 | ## Why Containers? 
27 | 
28 | There is a replication crisis out there. Even given a script and the raw data, it is often difficult to replicate the results generated by a study.
29 | 
30 | Why is this difficult? Many others have talked about this, but one simple reason is that the results are tied to specific software and database versions.
31 | 
32 | This is the motivation for using *containers* - they are a way of packaging software that 'freezes' the software versions. If you provide the container that you used to generate the results, other people should be able to replicate your results even if they're on a different operating system.
33 | 
34 | ## Terminology
35 | 
36 | In order to be unambiguous with our language, we'll use the following definitions:
37 | 
38 | ![Docker Terms 1](images/docker_terminology.png){#fig-docker1}
39 | 
40 | - **Registry** - a collection of repositories that you pull docker images from. Example registries include DockerHub and Quay.io.
41 | - **Docker Image** - what you download from a registry - the "recipe" for building the software environment. Stored in a registry. Use `docker pull` to get an image and `docker push` to send one to a registry; you can also build an image from a Dockerfile, or create one from a container with `docker commit`.
42 | - **Docker Container** - the executable software environment installed on a machine. Runnable. Created from an image with `docker run`.
43 | - **Snapshot File** - a single archive file (`.tar.gz`) that contains a Docker image. Generated using `docker save` on an image. Also known as an *image file* on the platform.
44 | 
45 | ## Building Docker Snapshot Files on the DNAnexus platform
46 | 
47 | ### The Golden Rule of Docker and Batch Analysis
48 | 
49 | DockerHub has a pull limit of 200 pulls/day/user. You will hit this limit quickly if you just use the image url in every job.
50 | 
51 | So, if you are processing more than 200 files (or jobs), you should save the docker image into platform storage as a snapshot file.
52 | 
53 | Let's talk about the basic snapshot building process.
54 | 
55 | ### Be Secure
56 | 
57 | Before we get started, security is always a concern when running Docker images. The `docker` group has elevated status on a system, so we need to be careful that the containers we run aren't introducing any system vulnerabilities.
58 | 
59 | These concerns matter most when running containers that are web servers or part of a web stack, but they are also important to think about when running jobs on the cloud.
60 | 
61 | Here are some guidelines to think about when you are working with a container.
62 | 
63 | - **Use vendor-specific Docker Images when possible**.
64 | - **Use container scanners to spot potential vulnerabilities**. DockerHub, for example, has a scanner that checks your Docker images for known vulnerabilities.
65 | - **Avoid kitchen-sink images**. One issue is when an image is built on top of many other images, which makes it really difficult to plug vulnerabilities. When in doubt, use images from trusted people and organizations.
66 | 
67 | ### The Basic Snapshot Building Process
68 | 
69 | ::: {#fig-snapshot-building}
70 | ```{mermaid}
71 | flowchart TD
72 |     A[start ttyd] --> B[docker pull<br>from registry]
73 |     B --> C[docker save to<br>snapshot file]
74 |     C --> D[dx upload<br>snapshot to<br>project storage]
75 |     D --> E[terminate ttyd]
76 | ```
77 | Building a docker snapshot on the DNAnexus platform.
78 | :::
79 | 
80 | ### Building Snapshot Files in `ttyd` {#sec-ttyd}
81 | 
82 | Up until now, we have been using our own machine or the binder shell for doing our work.
83 | 
84 | We're going to pull up a web-enabled shell on a DNAnexus worker with the `ttyd` app. `ttyd` is useful because:
85 | 
86 | 1. `docker` is already installed, so we can `docker pull` our image and `docker save` our snapshot on the ttyd instance.
87 | 1. It's much faster to transfer our snapshot file back into project storage with `dx upload` from a worker than from our own machine.
88 | 
89 | To open ttyd, open the **Tool Library** under **Tools** and select your project.
90 | 
91 | ![Opening ttyd]()
92 | 
93 | ### Pull your image from a registry
94 | 
95 | ```{bash}
96 | #| eval: false
97 | docker pull quay.io/biocontainers/samtools:1.15.1--h1170115_0
98 | ```
99 | 
100 | On your `ttyd` instance, do a `docker pull` to pull your image from the registry. Note that we're pulling `samtools` from `quay.io` here, from the `biocontainers` organization.
101 | 
102 | We're also specifying a *version tag* - `1.15.1--h1170115_0` - to tie our `samtools` to a specific version. This is important - by default, `docker pull` pulls the `latest` tag, which is not tied to a specific version. So make sure to tie your image to a specific version.
103 | 
104 | When you're done pulling the docker image, try out the `docker images` command.
105 | 
106 | ```
107 | docker images
108 | ```
109 | 
110 | ### Try your docker image out
111 | 
112 | Now that we have our docker image downloaded, we can test it out by running `samtools --help` inside it. This should give us the help message.
113 | 
114 | ```{bash}
115 | #| eval: false
116 | docker run quay.io/biocontainers/samtools:1.15.1--h1170115_0 samtools --help
117 | ```
118 | 
119 | 
120 | ### Save your docker image as a snapshot
121 | 
122 | Now that we've pulled the image, we're going to save it as a snapshot file using `docker save`. We pipe the output of `docker save` into `gzip` to save it as `samtools_image.tar.gz`.
123 | 
124 | ```{bash}
125 | #| eval: false
126 | 
127 | docker save quay.io/biocontainers/samtools:1.15.1--h1170115_0 | gzip > samtools_image.tar.gz
128 | ```
129 | 
130 | ### Upload your snapshot
131 | 
132 | Now we can get our snapshot back into project storage. We'll create a folder called `images/` with `dx mkdir` and then use `dx upload` to get our snapshot file into the `images/` folder.
133 | 
134 | ```{bash}
135 | #| eval: false
136 | dx mkdir images/
137 | dx upload samtools_image.tar.gz --destination images/
138 | ```
139 | 
140 | ### Important: make sure to terminate your ttyd instance!
141 | 
142 | One thing to remember is that there is no timeout associated with `ttyd`. You will get a reminder email after it's been open for 24 hours, but you will get no warning after that.
143 | 
144 | So make sure to use `dx terminate` or terminate the ttyd job under the `Manage` tab.
145 | 
146 | ## Using Docker with Swiss Army Knife {#sec-docker-sak}
147 | 
148 | Now that we've built our Docker snapshot, let's use it in Swiss Army Knife.
149 | 
150 | Swiss Army Knife has two separate inputs associated with Docker:
151 | 
152 | - `-iimage_file` - This is where you put the snapshot file (such as our `samtools_image.tar.gz`)
153 | - `-iimage` - This is where you'd put the Docker image URL (such as `quay.io/ucsc_cgl/samtools`) - see the sketch after this list
154 | 
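For reference, here is a minimal sketch of what the `-iimage` route could look like, pulling the image by URL at job runtime instead of from a snapshot file. The exact image URL, file, and command here are only illustrative, and the sketch assumes Swiss Army Knife runs the command inside whichever image you point it at. Remember the Golden Rule above: pulling by URL in every job is what runs you into registry pull limits on large batches, which is why we prefer snapshot files for batch work.

```{bash}
#| eval: false
# Illustrative sketch only: use -iimage with a registry URL instead of a snapshot.
# Each job pulls the image from the registry, so large batches can hit pull limits.
dx run app-swiss-army-knife \
  -iimage="quay.io/biocontainers/samtools:1.15.1--h1170115_0" \
  -iin="data/NA12878.bam" \
  -icmd="samtools stats * > \${in_prefix}.stats.txt"
```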
155 | So, let's run a `samtools` job using our Docker snapshot.
156 | 
157 | ```{bash}
158 | #| eval: false
159 | dx run app-swiss-army-knife \
160 |   -iimage_file="images/samtools_image.tar.gz" \
161 |   -iin="data/NA12878.bam" \
162 |   -icmd="samtools stats * > \${in_prefix}.stats.txt"
163 | ```
164 | 
165 | The main thing that has changed here is that we've added the `-iimage_file` input to our `dx run` statement; the command itself now runs inside the container provided by that snapshot.
166 | 
167 | ## Extending a Docker Image
168 | 
169 | One thing that you might do is extend a Docker image by adding additional software. You can do this by opening up interactive mode and installing software within the container.
170 | 
171 | What is interactive mode? When you pull a docker image in your `ttyd` session (@sec-ttyd), you can issue a `docker run` command with these options:
172 | 
173 | ```
174 | docker run -it ubuntu:18.04 /bin/bash
175 | ```
176 | 
177 | This will open up a bash shell in the container.
178 | 
179 | ### Pulling a Base Image
180 | 
181 | We'll start out with the official ubuntu 18.04 image in our ttyd session:
182 | 
183 | ```{bash}
184 | #| eval: false
185 | docker pull ubuntu:18.04
186 | docker images
187 | ```
188 | 
189 | ### Open up interactive mode
190 | 
191 | In ttyd, now enter an interactive session:
192 | 
193 | ```
194 | docker run -it ubuntu:18.04 /bin/bash
195 | ```
196 | 
197 | If it works, you will open up a `bash` prompt in the container.
198 | 
199 | You'll know you're in the container if you do an `ls` and your filesystem looks different.
200 | 
201 | ### Install Software
202 | 
203 | Now, let's install [EMBOSS](https://emboss.sourceforge.net/) (European Molecular Biology Open Software Suite), which is a suite of sequence utilities for working with genomic data. If you look at the EMBOSS link, you will see that you can install it via `apt install`, which is available by default in the `ubuntu` container.
204 | 
205 | ```{bash}
206 | #| eval: false
207 | apt update && apt upgrade
208 | apt install emboss gzip -y
209 | ```
210 | 
211 | ### Exit Container
212 | 
213 | Now exit from your container's interactive mode:
214 | 
215 | ```{bash}
216 | #| eval: false
217 | exit
218 | ```
219 | 
220 | You'll be back at the normal ttyd prompt.
221 | 
222 | ### `docker commit`/`docker save` your new snapshot file
223 | 
224 | We created a new container when we installed everything. We'll need to find its ID in ttyd.
225 | 
226 | ```{bash}
227 | #| eval: false
228 | docker ps -a
229 | ```
230 | 
231 | The output lists the ID of our new container. We can use that ID (in place of `<container-id>` below) with `docker commit` to save the container as a new image, and then save the snapshot file using `docker save`:
232 | 
233 | ```{bash}
234 | #| eval: false
235 | docker commit <container-id> emboss:6.6.0
236 | docker save emboss:6.6.0 | gzip > emboss.tar.gz
237 | dx upload emboss.tar.gz --destination images/
238 | ```
239 | 
240 | ### Other uses of Interactive Mode
241 | 
242 | Docker's interactive mode is really helpful for testing out scripts and making sure they are reproducible.
243 | 
244 | If I have a one-off analysis, it may be faster for me to just open up `ttyd`, use `docker run` to open up interactive mode, and do the work within a container - see the sketch below.
245 | 
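For example, a quick one-off session with the `emboss:6.6.0` image we just committed might look like the following. This is only a sketch - the mounted path and the test command are assumptions rather than part of the course materials - but it shows the general pattern of mounting the worker's files into the container so you can try commands interactively.

```{bash}
#| eval: false
# Illustrative sketch only: mount the current directory into the container
# at /data and open an interactive shell there.
docker run -it --rm -v "$PWD":/data -w /data emboss:6.6.0 /bin/bash

# Inside the container, you can test commands against the mounted files
# (for example, an EMBOSS tool such as `seqret -help`), then type `exit`
# when you're done.
```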
246 | ## Making Dockerfiles
247 | 
248 | The other way to build image files is to use a Dockerfile. A Dockerfile is a recipe for installing software and its dependencies.
249 | 
250 | Let's take a look at a Dockerfile. By default, it is contained within a folder and is called `Dockerfile`:
251 | 
252 | ```
253 | FROM ubuntu:18.04
254 | 
255 | RUN apt-get update && \
256 |     apt-get install -y build-essential && \
257 |     apt-get install -y wget && \
258 |     apt-get clean && \
259 |     rm -rf /var/lib/apt/lists/*
260 | 
261 | ENV CONDA_DIR /opt/conda
262 | RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
263 |     /bin/bash ~/miniconda.sh -b -p /opt/conda
264 | 
265 | ENV PATH=$CONDA_DIR/bin:$PATH
266 | 
267 | # install plink and samtools with conda
268 | RUN conda install -c "bioconda/label/cf201901" plink
269 | RUN conda install -c "bioconda/label/cf201901" samtools
270 | ```
271 | 
272 | We can build the Docker image in our directory using:
273 | 
274 | ```
275 | docker build . -t gatk_sam_plink:0.0.1
276 | ```
277 | 
278 | When it's done, we can make sure it's been built by using:
279 | 
280 | ```
281 | docker images
282 | ```
283 | 
284 | And we can use it like any other image.
285 | 
286 | ## Going Further with Docker
287 | 
288 | Now that you know how to build a snapshot file, you've also learned another step in building apps: specifying software dependencies. You can use these snapshot files to specify executables in your app.
289 | 
290 | You can also use these snapshot files in your WDL workflows.
291 | 
292 | ## What you learned in this chapter
293 | 
294 | - How containers enable reproducibility
295 | - The specific terminology for working with containers
296 | - How to create snapshot files using `ttyd`
297 | - How to use these snapshot files with Swiss Army Knife
298 | - How to extend a docker image by installing new software
--------------------------------------------------------------------------------
/09-from-hpc.qmd:
--------------------------------------------------------------------------------
1 | # From HPC to Cloud
2 | 
3 | ## Key Players in High Performance Computing
4 | 
5 | ![Key Players in HPC](images/hpc_players.png){#fig-hpc-players}
6 | 
7 | - Our Machine
8 | - Head Node
9 | - HPC Worker
10 | - Shared Storage
11 | 
12 | ## Analogies with HPC
13 | 
14 | |Component|HPC|DNAnexus|
15 | |---------|---|--------|
16 | |**Driver/Requestor**|Head Node of Cluster|API Server|
17 | |**Submission Script Language**|PBS/SLURM|dx-toolkit|
18 | |**Worker**|Requested from private pool of machines in cluster|Requested from Pool of Machines from AWS/Azure|
19 | |**Shared Storage**|Shared File System (Lustre, GPFS)|Project Storage|
20 | |**Worker I/O**|Handled by Shared File System|Transferred to/from Project Storage by Worker|
21 | 
22 | ## HPC vs. DNAnexus Commands
23 | 
24 | |Task |dx-toolkit |PBS |SLURM |
25 | |-------|-----------|------|---------|
26 | |**Run Job** |`dx run