# Mini project

We will work with two input files:

* **annotation.tsv**: contains the annotation (from Gencode) of 29244 human genes.
* **normalized_counts.csv**: contains the log2-transformed normalized counts from an RNA-seq project. It assesses the expression of 14769 human genes in **8 samples** from **4 experimental groups**:
    * Treatment 1, time 0 (2 samples)
    * Treatment 1, time 20 (2 samples)
    * Treatment 2, time 0 (2 samples)
    * Treatment 2, time 20 (2 samples)

We will work in **break-out rooms** of 3-4 people. We encourage one of you to **share the RStudio screen** so that you can discuss together how to proceed.

<br>

As you go through the different questions of the exercise, please **write your code in the corresponding column of this** [**padlet**](https://padlet.com/sarahbonnin/r29zq2fq3bhzx81m)!
***
1. Download the two files found [here](https://public-docs.crg.es/biocore/projects/training/R_tidyverse_2021/mini_project/) and read them into two tibbles.
***
2. **Tidy** each tibble individually.
<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*Some tips!*</h6>
</summary>
* For the **annotation**: something needs to be separated.
* For the **normalized data**: it's important to pivot some columns... And to separate one created by the pivoting!
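
As a rough illustration of the second tip, here is a minimal sketch on a made-up wide table. The gene IDs, the values, and the new column names (`sample`, `expression`, `treatment`, `replicate`, `time`) are assumptions chosen for the example; only the sample naming pattern follows the one used in this exercise.

```r
library(tidyr)
library(tibble)

# Toy wide table: one row per gene, one column per sample (IDs and values are made up).
wide <- tibble(
  gencode_id = c("ENSG00000000001.1", "ENSG00000000002.2"),
  Treatment1_rep1_t0 = c(5.1, 2.3),
  Treatment1_rep1_t20 = c(6.0, 2.9)
)

# Pivot the sample columns: one row per gene and per sample.
long <- pivot_longer(
  wide,
  cols = -gencode_id,
  names_to = "sample",
  values_to = "expression"
)

# Split the column created by the pivoting into its three components.
tidy <- separate(
  long,
  col = sample,
  into = c("treatment", "replicate", "time"),
  sep = "_"
)
```

The real file has more samples, but the same two steps should apply.
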
</details>
***
3. **Join** both datasets so as to obtain one tibble (keep the intersection).
<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*Don't know which columns to use for the joining? Click here for help!*</h6>
</summary>
Work on the `gencode_id` column of the normalized data: Gencode IDs differ from Ensembl IDs only by a version suffix (a dot followed by a number).
For example, ENSG00000140853.15 in Gencode is ENSG00000140853 in Ensembl. Perhaps `str_sub` can help?
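
One possible way to apply this hint, sketched on toy tibbles. The column name `ensembl_id`, the second and third IDs, the expression values, and the gene types assigned to each ID are all made up for illustration; the real annotation may use different names.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-ins for the two tidied tibbles.
counts <- tibble(
  gencode_id = c("ENSG00000140853.15", "ENSG00000000001.2"),
  expression = c(7.2, 3.4)
)
annotation <- tibble(
  ensembl_id = c("ENSG00000140853", "ENSG00000000003"),  # hypothetical column name
  gene_type = c("protein_coding", "lincRNA")             # assigned arbitrarily
)

# Human Ensembl gene IDs are 15 characters long ("ENSG" + 11 digits),
# so keeping characters 1 to 15 drops the version suffix.
counts <- mutate(counts, ensembl_id = str_sub(gencode_id, 1, 15))

# inner_join() keeps only the genes present in both tibbles (the intersection).
joined <- inner_join(counts, annotation, by = "ensembl_id")
```

An alternative that does not rely on the fixed ID length would be `str_remove(gencode_id, "\\.[0-9]+$")`.
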
</details>
***
4. What is the **average expression** of the different **types of genes** (`gene_type`)?
    * According to this data, **which 2 gene types have the highest average expression**?
    * Remove all rows corresponding to these 2 gene types from the dataset.
    * What is the size of the dataset now?
<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*A couple of tips...*</h6>
</summary>
* Remember `slice_max`? <br>
* `pull` could also be useful!
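
A small sketch of how the two functions can be combined, using a made-up tibble. The gene type labels `A` to `D`, the values, and the column name `expression` are assumptions for the example.

```r
library(dplyr)
library(tibble)

# Toy data: 4 gene types, 2 measurements each (values are made up).
expr <- tibble(
  gene_type = c("A", "A", "B", "B", "C", "C", "D", "D"),
  expression = c(10, 12, 1, 3, 8, 6, 2, 2)
)

# Average expression per gene type.
avg <- expr %>%
  group_by(gene_type) %>%
  summarise(mean_expression = mean(expression))

# The 2 gene types with the highest average, extracted as a character vector.
top_types <- avg %>%
  slice_max(mean_expression, n = 2) %>%
  pull(gene_type)

# Drop every row that belongs to one of those gene types.
filtered <- filter(expr, !gene_type %in% top_types)
```

Note that `slice_max()` keeps ties by default, so it can return more than `n` rows when several groups share the same value.
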
</details>
***
5. Create a new column that contains **the median expression per gene, per experimental group and per gene type.**<br>
*By experimental group, we mean Treatment + time (for example, samples "Treatment1_rep1_t0" and "Treatment1_rep2_t0" are part of the same experimental group: Treatment1_t0)*
<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*Help!*</h6>
</summary>
* If you don't have it already, create a column `experimental_group`.
* The grouping should be done using 3 variables.
* Remember that, if the data is grouped, the newly created columns will take into account the groups...
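
To illustrate the last point, here is a minimal sketch on made-up data. The column names `gene`, `treatment`, `time`, `expression` and `median_expression` are assumptions; adapt them to your own tibble.

```r
library(dplyr)
library(tibble)

# Toy long table: 2 genes, 2 treatments, 2 replicates each, a single time point.
toy <- tibble(
  gene = rep(c("geneA", "geneB"), each = 4),
  gene_type = rep(c("lincRNA", "protein_coding"), each = 4),
  treatment = rep(c("Treatment1", "Treatment1", "Treatment2", "Treatment2"), times = 2),
  time = "t0",
  expression = c(1, 3, 5, 9, 2, 4, 6, 8)
)

with_median <- toy %>%
  # Build the experimental group from treatment and time.
  mutate(experimental_group = paste(treatment, time, sep = "_")) %>%
  # Group by the 3 variables, so that mutate() computes one median per group.
  group_by(gene, experimental_group, gene_type) %>%
  mutate(median_expression = median(expression)) %>%
  # Remove the grouping so that later steps are not affected by it.
  ungroup()
```

Because `mutate()` is used rather than `summarise()`, the number of rows is unchanged and each row carries the median of its own group.
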
</details>
***
6. For each experimental group, retrieve the **lincRNA** that has the **highest median expression**.
    * Is it the same lincRNA gene for all 4 experimental groups?
<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*Stuck? Click here.*</h6>
</summary>
Check this [Stack Overflow post](https://stackoverflow.com/questions/24237399/how-to-select-the-rows-with-maximum-values-in-each-group-with-dplyr) for inspiration.
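
One approach along those lines, sketched with `slice_max()` on a made-up table of medians. Gene names, values and the column name `median_expression` are assumptions for the example, and only two experimental groups are shown.

```r
library(dplyr)
library(tibble)

# Toy table with one median per gene and experimental group (values are made up).
medians <- tibble(
  gene = c("lincA", "lincB", "lincA", "lincB", "codingC", "codingC"),
  gene_type = c("lincRNA", "lincRNA", "lincRNA", "lincRNA",
                "protein_coding", "protein_coding"),
  experimental_group = c("Treatment1_t0", "Treatment1_t0",
                         "Treatment2_t0", "Treatment2_t0",
                         "Treatment1_t0", "Treatment2_t0"),
  median_expression = c(4, 6, 9, 5, 12, 11)
)

top_linc <- medians %>%
  filter(gene_type == "lincRNA") %>%
  # With grouped data, slice_max() picks the top row within each group.
  group_by(experimental_group) %>%
  slice_max(median_expression, n = 1) %>%
  ungroup()

# TRUE if the same gene comes out on top in every experimental group.
same_gene <- n_distinct(top_linc$gene) == 1
```

If your tibble still has one row per sample, the same gene can appear several times per group; `distinct()` on the relevant columns before slicing avoids duplicated results.
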
</details>