-
Notifications
You must be signed in to change notification settings - Fork 5
/
01-setup-course.qmd
298 lines (195 loc) · 9.86 KB
/
01-setup-course.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
# Setup for the Course / dx-toolkit basics
In this chapter, we'll setup our DNAnexus account, start a project, get files into it, and run a job with those files.
This is meant to be a whirlwind tour - we'll expand on more information about each of these steps in further chapters.
## Setup your DNAnexus Account
First, create an account at <https://platform.dnanexus.com>. You'll need your login and password to interact with the platform.
If you are not a registered customer with DNAnexus, you will have to [set up your billing by adding a credit card](https://documentation.dnanexus.com/admin/billing-and-account-management).
I know that money is tight for everyone, but everything we'll do in this course should cost no more than $5-10 in compute time.
## Terminal setup / dx-toolkit setup
We'll be running all of these scripts on our own machine. We'll be using the command-line for most of these.
If you are on Linux/Mac, you'll be working with the terminal.
If you are on Windows, I recommend you install [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/install), and specifically the Ubuntu distribution. That will give you a command-line shell that you can use to interact with the DNAnexus platform.
On your machine, I recommend using a text editor to edit the scripts in your terminal. Good ones include [VS Code](https://code.visualstudio.com/), or built in ones such as `nano`.
Now that we have a terminal and code editor, we can install the [dx-toolkit](https://documentation.dnanexus.com/downloads) onto our machine. In your terminal, you'll first need to make sure that python 3 is installed, and the `pip` installer is installed as well.
## On Ubuntu
You should have a python3 installation - check by typing in:
`which python3`
If you get a blank response, then you'll need to install it.
```
sudo apt-get install python3 ## If python is not yet installed
sudo apt-get install pip3
sudo apt-get install git
sudo apt-get install jq
```
:::{.callout-note}
## What about `miniconda`?
Installing `miniconda` (the minimal [Anaconda installer](https://docs.conda.io/en/latest/miniconda.html)) has some advantages, but does require some learning, especially about conda environments.
Here's a link that discusses using conda environments and how you can use them to install software on your computer: <https://towardsdatascience.com/a-guide-to-conda-environments-bc6180fc533>
The short of it is that you define a *conda environment*, which is a space that lets you install R or Python packages and other software dependencies. For example, I could have an environment that uses Python 2.8, and another one that uses Python 3.9, and I could switch between them using a command called `conda activate`.
:::
## On Macs
Install [homebrew](https://brew.sh/) to your mac and install `python3` and `pip3`:
```
brew install python3
brew install pip3
brew install git
brew install jq
```
## For both machines
Once you have access to Python 3 and `pip3`, you can install the dx-toolkit using the following command:
```
pip3 install dxpy
```
That last command will install the `dx-toolkit` to your machine, which are the command line tools you'll need to work on the DNAnexus cloud.
Test it out by typing:
```
dx -h
```
You should get similar output:
```
usage: dx [-h] [--version] command ...
DNAnexus Command-Line Client, API v1.0.0, client v0.330.0
dx is a command-line client for interacting with the DNAnexus platform. You can log in,
navigate, upload, organize and share your data, launch analyses, and more. For a quick tour
of what the tool can do, see
https://documentation.dnanexus.com/getting-started/tutorials/cli-quickstart#quickstart-for->
For a breakdown of dx commands by category, run "dx help".
```
## Alternative Setup: binder.org
If you aren't able to install the dx-toolkit to your machine, you can use this Binder link to try out the commands. Binder opens a preinstalled image with a shell that has `dxpy` preinstalled on one of the <https://mybinder.org> servers.
<https://mybinder.org/v2/gh/laderast/bash_bioinfo_scripts/HEAD?urlpath=lab>
When you launch it, you will first see this page:
![Launch Binder Screen](images/binder_screen.png)
Then after a few moments (hopefully no more than 5 minutes), the JupyterLab interface should launch. Select the "Terminal" from the bottom of the launcher:
![JupyterLab Interface](images/binder2.png)
Once you select the terminal, this is what you should see:
![JupyterLab Terminal](images/binder3.png)
Now you're ready to get started. Login (@sec-login) and proceed from there.
Just keep in mind that this shell is ephemeral - it will disappear. So make sure that any files you create that you want to save are either uploaded back to your project with `dx upload` or you've downloaded them using the file explorer.
This shell includes the following utilities:
- `git` (needed to download course materials)
- `nano` (needed to edit files)
- `dxpy` (the dx-toolkit)
- `python`/`pip` (needed to install dx-toolkit)
- `jq` (needed to work with JSON files)
## Clone the files and scripts using git
### On Your Own Computer
On your own computer, clone the repo (it will already be cloned if you're in the binder version).
```{bash}
#| eval: false
git clone https://github.com/laderast/bash_bioinfo_scripts/
```
This will create a folder called `/bash_bioinfo_scripts/` in your current directory. Change to it:
```
cd bash_bioinfo_scripts
```
### In the binder
You will already be in the `bash_bioinfo_scripts/` folder.
## Try logging in {#sec-login}
Now that you have an account and the `dx-toolkit` installed, try logging in with `dx login`:
```{bash}
#| eval: false
dx login
```
The platform will then ask you for your username and password. Enter them.
If you are successful, you will see either the select screen or, if you only have one project, that project will be selected for you.
## Super Quick Intro to dx-toolkit
The dx-toolkit is our main tool for interacting with the DNAnexus platform on the command-line. It handles the following:
- Creating a project and managing membership (`dx new project`/`dx invite`/`dx uninvite`).
- File transfer to and from project storage (`dx upload`/`dx download`)
- Starting up computational jobs on the platform with apps and workflows (`dx run`)
- Monitoring/Terminating jobs on the platform (`dx watch`/`dx describe`/`dx terminate`)
- Building Apps, which are executables on the platform (`dx-app-wizard`, `dx build`)
- Building Workflows, which string together apps on the platform (`dx build`)
How do you know a command belongs to the dx-toolkit? They all begin with `dx`. For example, to list the contents of your current project, you'll use something like `ls`:
```{bash}
#| eval: false
dx ls
```
## Create Project for Course
Let's create a project on the platform, and then we will get files into it to prepare for the online work with the platform.
The first command we'll run is `dx new project` in order to create our project.
```{bash}
#| eval: false
dx new project -y my_project
```
:::{.callout-note}
## What's happening here?
When you call `dx new project`, that creates a project on the platform with the name `my_project`. This project lives in the cloud, within the DNAnexus platform.
The `-y` option switches you over into that new project.
:::
## Copying Files to Project Storage
Our worker scripts live in the the `bash_for_bioinformatics/` folder on our machine.
Let's copy our files from the public project into our newly created DNAnexus project.
```{bash}
#| eval: false
dx cp -r "project-BQbJpBj0bvygyQxgQ1800Jkk:/Developer Quickstart" .
dx mv "Developer Quickstart/" "data/"
```
Confirm that your file system on your machine is similiar to the following output:
```{bash}
#| eval: false
tree
```
```
├── JSON
│ └── json_data
│ ├── dxapp.json
│ ├── example.json
│ ├── fastqc.json
│ ├── job.json
│ └── rap-jobs.json
├── batch-processing
│ ├── batch-on-worker.sh
│ ├── dx-find-data-class.sh
│ ├── dx-find-data-field.sh
│ ├── dx-find-data-name.sh
│ ├── dx-find-path.sh
│ └── dx-find-xargs.sh
```
And the file system in your project should look similar to this when you run `dx tree`:
```{bash}
#| eval: false
dx tree
```
```
├── data
│ ├── NA12878.bai
│ ├── NA12878.bam
│ ├── NC_000868.fasta
│ ├── NC_001422.fasta
│ ├── small-celegans-sample.fastq
│ ├── SRR100022_chrom20_mapped_to_b37.bam
│ ├── SRR100022_chrom21_mapped_to_b37.bam
│ └── SRR100022_chrom22_mapped_to_b37.bam
```
::: {.callout-note}
## Why are there two file systems?
When we do cloud computing, there are two file systems we'll have to be familiar with:
1. The file system on our own machine
2. The file system in our DNAnexus project
The `tree` command shows you the contents of your local machine.
`dx tree` on the other hand, shows you the contents of your project storage. Remember that anything that begins with `dx` is a command that interacts with the platform!
Ok, I lied. There are technically 3 filesystems we need to be familar with. The last one is the working directory on the worker machine. We'll talk more about this in the cloud computing basics (@sec-cloud) section.
:::
## Full Script
The full script for setting up is in the `setup-course/project_setup.sh`
```{bash}
#| eval: false
#| filename: setup-course/project_setup.sh
## Login
dx login
## Create your new project
dx new project -y my_project
## Clone this repository onto your computer
git clone https://github.com/laderast/bash_bioinfo_scripts
## Upload the scripts into the scripts folder
cd bash_bioinfo_scripts
dx upload -r worker_scripts/
## Copy the data over
dx cp -r "project-BQbJpBj0bvygyQxgQ1800Jkk:/Developer Quickstart" .
dx mv "Developer Quickstart/" "data/"
## Confirm your project matches
dx tree
tree
```