
# Mini project

We will work with two input files:

* **annotation.tsv**: contains the annotation (from Gencode) of 29244 human genes.
* **normalized_counts.csv**: contains the log2-transformed normalized counts from an RNA-seq project. It reports the expression of 14769 human genes in **8 samples** from **4 experimental groups**:
  * Treatment 1, time 0 (2 samples)
  * Treatment 1, time 20 (2 samples)
  * Treatment 2, time 0 (2 samples)
  * Treatment 2, time 20 (2 samples)

We will work in **break-out rooms** of 3-4 people. We encourage one of you to **share the RStudio screen** and discuss together how to proceed.
<br>
As you go through the different questions of the exercise, please **write your code in the corresponding column of this** [**padlet**](https://padlet.com/sarahbonnin/r29zq2fq3bhzx81m)!

***

1. Download / read in the two files that are found [here](https://public-docs.crg.es/biocore/projects/training/R_tidyverse_2021/mini_project/) into two tibbles.
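
As a reminder of how reading works, here is a minimal sketch using `readr`. The toy files, their columns and their values are hypothetical stand-ins written to a temporary directory so that the example runs on its own; they do not reflect the content of the real files.

```r
library(readr)

# Hypothetical toy tables, written to a temporary directory.
tmp <- tempdir()
write_tsv(tibble::tibble(id = c("g1", "g2"), type = c("a", "b")),
          file.path(tmp, "toy.tsv"))
write_csv(tibble::tibble(id = c("g1", "g2"), value = c(1.5, 2.5)),
          file.path(tmp, "toy.csv"))

# read_tsv() expects tab-separated values, read_csv() comma-separated values.
# Both return a tibble.
toy_tsv <- read_tsv(file.path(tmp, "toy.tsv"), show_col_types = FALSE)
toy_csv <- read_csv(file.path(tmp, "toy.csv"), show_col_types = FALSE)
```

For the exercise, the same two functions can be pointed at your downloaded copies of `annotation.tsv` and `normalized_counts.csv`.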

***

2. **Tidy** each tibble individually.

<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*Some tips!*</h6>
</summary>

* For the **annotation**: something needs to be separated.
* For the **normalized data**: it's important to pivot some columns... And to separate one created by the pivoting!
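
The two tips can be illustrated on toy data. This is only a sketch: the tables, the column names (`gene`, `ensembl_id`, `sample`, `expression`) and the `;` separator are assumptions made for the example, so check what the real files look like before adapting it.

```r
library(tidyr)
library(tibble)

# Hypothetical annotation-like table: two pieces of information share one column.
toy_annot <- tibble(gene = c("ENSG01;type_a", "ENSG02;type_b"))
toy_annot_tidy <- separate(toy_annot, gene,
                           into = c("ensembl_id", "gene_type"), sep = ";")

# Hypothetical wide expression table: one column per sample.
toy_wide <- tibble(
  gencode_id = c("ENSG01.1", "ENSG02.3"),
  Treatment1_rep1_t0 = c(5.1, 2.0),
  Treatment1_rep2_t0 = c(4.9, 2.2)
)

# Pivot the sample columns into rows, then split the sample name into its parts.
toy_long <- toy_wide %>%
  pivot_longer(-gencode_id, names_to = "sample", values_to = "expression") %>%
  separate(sample, into = c("treatment", "replicate", "time"), sep = "_")
```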

</details>

***

3. **Join** both datasets to obtain a single tibble, keeping only the genes present in both (the intersection).

<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*Don't know which columns to use for the joining? Click here for help!*</h6>
</summary>

Work on the `gencode_id` column of the normalized data: Gencode IDs differ from Ensembl IDs only by their suffix (a dot followed by a version number).
For example, `ENSG00000140853.15` in Gencode is `ENSG00000140853` in Ensembl. Perhaps `str_sub` can help?
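
A minimal sketch of the idea on toy data. The column name `ensembl_id`, the placeholder gene types and all IDs except the one quoted above are hypothetical. It relies on the Ensembl gene ID being the first 15 characters of the Gencode ID.

```r
library(dplyr)
library(stringr)

# Hypothetical expression table keyed by Gencode ID.
toy_counts <- tibble(
  gencode_id = c("ENSG00000140853.15", "ENSG00000000003.14"),
  expression = c(3.2, 7.5)
)

# Hypothetical annotation table keyed by Ensembl ID.
toy_annot <- tibble(
  ensembl_id = c("ENSG00000140853", "ENSG00000999999"),
  gene_type = c("type_a", "type_b")
)

# Keep characters 1 to 15 (the part before the dot), then keep only the
# rows whose ID is found in both tables.
toy_joined <- toy_counts %>%
  mutate(ensembl_id = str_sub(gencode_id, 1, 15)) %>%
  inner_join(toy_annot, by = "ensembl_id")
```

Only `ENSG00000140853` appears in both toy tables, so `toy_joined` has a single row.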

</details>

***

4. What is the **average expression** of the different **types of genes** (`gene_type`)? 
    * According to this data, **which 2 gene types have the highest average expression**? 
    * Remove all rows which correspond to these 2 gene types from the dataset. 
    * What is the size of the dataset now?

<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*A couple of tips...*</h6>
</summary>

* Remember `slice_max`? <br>
* `pull` could also be useful!
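
Here is how these two functions can work together, sketched on a hypothetical toy tibble (the gene types `a`, `b`, `c` and the values are made up):

```r
library(dplyr)

toy <- tibble(
  gene_type = c("a", "a", "b", "b", "c", "c"),
  expression = c(1, 3, 10, 12, 6, 8)
)

# Average per gene type, keep the 2 highest averages,
# and extract the gene type names as a plain character vector.
top_types <- toy %>%
  group_by(gene_type) %>%
  summarise(mean_expression = mean(expression)) %>%
  slice_max(mean_expression, n = 2) %>%
  pull(gene_type)

# Drop every row that belongs to one of those gene types.
toy_filtered <- toy %>%
  filter(!gene_type %in% top_types)

dim(toy_filtered)
```

In this toy case the averages are 2, 11 and 7, so types `b` and `c` are removed and only the two rows of type `a` remain.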

</details>

***

5. Create a new column that contains **the median expression per gene, per experimental group and per gene type.**<br>
*By experimental group, we mean Treatment + time (for example, samples "Treatment1_rep1_t0" and "Treatment1_rep2_t0" are part of the same experimental group: Treatment1_t0)*

<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*Help!*</h6>
</summary>

* If you don't have it already, create a column `experimental_group`.
* The grouping should be done using 3 variables.
* Remember that, if the data is grouped, newly created columns are computed within each group.
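
A small sketch of a grouped `mutate` on toy data. The column names `gene`, `sample` and `expression`, the `t20` sample names and the values are assumptions made for the example; only the `Treatment1_rep1_t0` naming pattern comes from the exercise text.

```r
library(dplyr)
library(stringr)

toy <- tibble(
  gene = c("g1", "g1", "g1", "g1"),
  gene_type = "a",
  sample = c("Treatment1_rep1_t0", "Treatment1_rep2_t0",
             "Treatment1_rep1_t20", "Treatment1_rep2_t20"),
  expression = c(2, 4, 10, 20)
)

toy_median <- toy %>%
  # Drop the replicate part of the sample name: "Treatment1_rep1_t0" -> "Treatment1_t0".
  mutate(experimental_group = str_remove(sample, "_rep[0-9]+")) %>%
  group_by(gene, gene_type, experimental_group) %>%
  # Because the data is grouped, median() is computed once per group
  # and repeated on every row of that group.
  mutate(median_expression = median(expression)) %>%
  ungroup()
```

Here the two `t0` rows get a median of 3 and the two `t20` rows a median of 15.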

</details>

***

6. For each experimental group, retrieve the **lincRNA** that has the **highest median expression**.
    * Is it the same lincRNA gene for all 4 experimental groups?

<details>
<summary>
<h6 style="background-color: #fdffaf; display: inline-block;">*Stuck? Click here.*</h6>
</summary>

Check this [stackoverflow post](https://stackoverflow.com/questions/24237399/how-to-select-the-rows-with-maximum-values-in-each-group-with-dplyr) for inspiration.
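
One possible approach, sketched on a hypothetical toy tibble (groups `A` and `B`, genes `g1` and `g2`, and the values are made up), combines `group_by` with `slice_max`:

```r
library(dplyr)

toy <- tibble(
  experimental_group = c("A", "A", "B", "B"),
  gene = c("g1", "g2", "g1", "g2"),
  median_expression = c(5, 9, 7, 3)
)

# Within each experimental group, keep the row with the highest median expression.
top_per_group <- toy %>%
  group_by(experimental_group) %>%
  slice_max(median_expression, n = 1) %>%
  ungroup()
```

Note that `slice_max` keeps ties by default (`with_ties = TRUE`). If the same median is repeated on several rows, for example one row per replicate, you may want to reduce the data with `distinct` first.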

</details>