-
Notifications
You must be signed in to change notification settings - Fork 5
/
index.qmd
103 lines (63 loc) · 6.1 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
# Introduction
Bash scripting is an essential skill in bioinformatics that we often expect bioinformaticians to have automatically learned. I think that this underestimates the difficulty of learning and applying Bash scripting.
This is a book that is meant to bring you (a budding bioinformaticist) beyond the foundational shell scripting skills learned from a shell scripting course such as [the Software Carpentries Shell Course](https://swcarpentry.github.io/shell-novice/).
You might also be savvy with an on-premise High Performance Computing (HPC) cluster and are wondering how to transition to working in the cloud. We have an abbreviated path for you that can get you to running jobs in the cloud as quickly as possible.
Specifically, this book shows you a path to get started with reproducible cloud computing on the DNAnexus platform.
Our goal is to showcase the "glue" skills that help you do bioinformatics reproducibly in the cloud.
## Learning Objectives for this Book
After reading and doing the exercises in this book, you should be able to:
- **Apply** bash scripting to your own work
- **Articulate** basic Cloud Computing concepts that apply to the DNAnexus platform
- **Leverage** bash scripting and the dx-toolkit to execute jobs on the DNAnexus platform
- **Execute** batch processing of multiple files in a project on the DNAnexus platform
- **Monitor**, **profile**, **terminate** and **retry** jobs to optimize costs
- **Manage** software dependencies reproducibly using container-based technologies such as Docker
## Four Levels of Using DNAnexus
One way to approach learning DNAnexus is to think about the skills you need to process a number of files. Ben Busby has noted there are 4 main skill levels in processing files on the DNAnexus platform:
| Level | \# of Files | Skill |
|-------|---------------|-------------------------------------------|
| 1 | 1 | Interactive Analysis (Cloud Workstation, JupyterLab) |
| 2 | 1-50 Files | `dx run`, Swiss Army Knife |
| 3 | 50-1000 Files | Building your own apps |
| 4 | 1000+ Files, multiple steps | Using WDL (Workflow Description Language) |
We'll be covering mostly level 2, but you will have the skills to move on to Level 3.
The key is to gradually build on your skills.
## What is not covered
- Using Bash scripting in DNAnexus Apps and Workflows
- Using Bash Scripting in Workflow Description Language (WDL)
As mentioned, these are advanced level topics. However, this book will provide an excellent foundation to effectively building apps and workflows on the DNAnexus platform.
This book is not meant to be a substitute for excellent books such as [Data Science on the Command Line](https://datascienceatthecommandline.com/2e/). This book focuses on the essential Bash shell skills that will help you on the DNAnexus platform.
## Notes
This is a very opinionated journey through Bash shell scripting, workflow languages, and reproduciblity. This is written from the perspective of a user, and should not be considered as official DNAnexus documentation.
It is designed to build on each of the concepts in a gradual manner. Where possible, we link to the official DNAnexus documentation. It is not meant to be a replacement for the DNAnexus documentation.
At each step, you'll be able to do useful things with your data. We will focus on skills and programming patterns that are useful.
## Prerequisites
Before you tackle this book, you should be able to accomplish the following:
- Open and utilize a shell. This section from the Missing Semester of your CS Education is very helpful: <https://missing.csail.mit.edu/2020/course-shell/>
- [**Utilize** and **navigate**](https://swcarpentry.github.io/shell-novice/02-filedir/index.html) File Paths (both absolute and relative) in a Unix system
We recommend reviewing a course such as the [Software Carpentry course for Shell Scripting](https://swcarpentry.github.io/shell-novice/) before getting started with this book. [The Missing Semester of your CS Education](https://missing.csail.mit.edu/) is another great introduction/resource.
## Contributors
No one writes a book alone. This book comes from a lot of conversations with everyone at DNAnexus, including:
- Allison Regier
- Ben Busby
- Anastazie Sedlakova
- Scott Funkhouser
- Stanley Lan
- Ondrej Klempir
- Branislav Slavik
- David Stanek
- Chai Fungtammasan
Thanks to the following readers for their corrections:
- Joshua Shapiro (found errors in variable expansion text)
- Alexander Moersburg (found error in cloud computing section)
## Want to be a Contributor?
This is the first draft of this book. It's not going to be perfect, and we need help. Specifically, we need help with testing the setup and the exercises.
If you have an problem, you can file it as an issue using [this link](https://github.com/laderast/bash_for_bioinformatics/issues/new/choose).
In your issue, please note the following:
- Your Name
- What your issue was
- Which section, and line you found problematic or wouldn't run
If you're quarto/GitHub savvy, you can fork and file a pull request for typos/edits. If you're not, you can file an issue.
Just be aware that this is not my primary job - I'll try to be as responsive as I can.
## License
<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img src="https://i.creativecommons.org/l/by/4.0/88x31.png" alt="Creative Commons License" style="border-width:0"/></a><br />[Bash for Bioinformatics]{xmlns:dct="http://purl.org/dc/terms/" property="dct:title"} by <a xmlns:cc="http://creativecommons.org/ns#" href="https://laderast.github.io/bash_for_bioinformatics" property="cc:attributionName" rel="cc:attributionURL">Ted Laderas</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br />Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/laderast/bash_for_bioinformatics" rel="dct:source">https://github.com/laderast/bash_for_bioinformatics</a>.