Getting Data for Analysis
Progress
- Git clone or wget??!
The git clone [URL] command copies an existing Git repository.
As a convenience, cloning automatically creates a remote connection called origin pointing back to the original repository. This makes it very easy to interact with a central repository.
git clone https://github.com/Jeanielmj/rnaseq_workshop.git
That creates a directory named “rnaseq_workshop”, initializes a .git directory inside it, pulls down all the data for that repository, and checks out a working copy of the latest version. If you go into the new rnaseq_workshop directory, you’ll see the project files in there, ready to be worked on or used.
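To see what a clone sets up, the remote named origin can be inspected afterwards. The sketch below is only an illustration: it clones a throwaway local repository (a stand-in for the GitHub URL, so it runs without network access), and the directory names and demo identity are assumptions.

```shell
# Throwaway local repository standing in for the GitHub repository (assumption)
workdir=$(mktemp -d)
git init -q "$workdir/source_repo"
git -C "$workdir/source_repo" -c user.name="Demo" -c user.email="demo@example.com" \
    commit -q --allow-empty -m "initial commit"

# Clone it; git records where the copy came from as the remote "origin"
git clone -q "$workdir/source_repo" "$workdir/rnaseq_workshop"

# List the remotes of the clone and the URL that "origin" points to
remote_name=$(git -C "$workdir/rnaseq_workshop" remote)
remote_url=$(git -C "$workdir/rnaseq_workshop" remote get-url origin)
echo "$remote_name -> $remote_url"
```

With the real repository, running `git remote -v` inside the rnaseq_workshop directory shows the same information.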
This guide will use __ bp, single/paired end reads generated using the Illumina Genome Analyzer. The RNA sample used was a ______ _______ cell line, taken from this experiment Link
The data comes off the machine in “fastq” format.
http://bioinf.wehi.edu.au/bioinfosummer2010/materials/RNAseq_Mapping_Tutorial.pdf
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32038
- Make sure you are logged in to ghpcc06.umassrc.org
- Check that you are in your home directory:
$ pwd
/home/username
- Create "rnaseq_workshop" directory
$ mkdir rnaseq_workshop
$ cd rnaseq_workshop/
- Create a new "data" folder, move into it, and download the reference data from the web
$ mkdir data
$ cd ./data
$ wget ftp://igenome:[email protected]/Drosophila_melanogaster/Ensembl/BDGP5.25/Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz
This might take a while (~ 1-2 min) depending on the network connection. After it is done, check that you have the right file, Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz, in your directory.
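Beyond looking for the file name, one possible check is that the file is non-empty and is a complete gzip stream. The helper name check_archive is hypothetical, and the demo builds a small archive of its own so the sketch runs without the real download.

```shell
# check_archive (hypothetical helper): succeed only if the file exists,
# is non-empty, and passes gzip's built-in integrity test.
check_archive() {
    archive="$1"
    if [ ! -s "$archive" ]; then
        echo "missing or empty: $archive"
        return 1
    fi
    if ! gzip -t "$archive" 2>/dev/null; then
        echo "corrupt or incomplete: $archive"
        return 1
    fi
    echo "ok: $archive"
}

# Demo on a small stand-in archive (not the real iGenomes file)
demo_dir=$(mktemp -d)
echo "placeholder" > "$demo_dir/README.txt"
tar -czf "$demo_dir/demo.tar.gz" -C "$demo_dir" README.txt
good_result=$(check_archive "$demo_dir/demo.tar.gz" || true)

# A truncated copy mimics an interrupted download
head -c 20 "$demo_dir/demo.tar.gz" > "$demo_dir/truncated.tar.gz"
bad_result=$(check_archive "$demo_dir/truncated.tar.gz" || true)
echo "$good_result"
echo "$bad_result"
```

For the real download, the call would be `check_archive Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz`.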
- Unpack the compressed file
$ tar -xvzf Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz
In your current directory, you should now have:
Drosophila_melanogaster Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz README.txt
We can now remove the compressed file. Run:
$ rm Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz
The README.txt contains information on the Illumina iGenome file you just decompressed. Read it by running more README.txt. To go down the page, press the spacebar. To quit, type q.
- Assuming we stored the package at rnaseq_workshop/data, the package expands to contain a folder Drosophila_melanogaster/Ensembl/BDGP5.25/, which has two more folders, Annotation and Sequence.
To check:
$ ls Drosophila_melanogaster/Ensembl/BDGP5.25/
Annotation Sequence
Basically, ~/Sequence/ contains subdirectories with genome sequences in various file formats, and ~/Annotation/ contains subdirectories with annotation files. (Here ~ is shorthand for the Drosophila_melanogaster/Ensembl/BDGP5.25/ directory, not your home directory.)
Notice that ~/Sequence/Bowtie2Index/ contains an index of the whole genome for use with the Bowtie2 aligner. TopHat2 uses Bowtie2 as well, so we will need this index later.
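The layout described above can be checked in one pass. This sketch relies only on the directory names mentioned in this section; check_layout is a hypothetical helper, and the demo runs against an empty mock tree rather than the real package.

```shell
# check_layout (hypothetical helper): report which expected
# subdirectories are present under the iGenomes build directory.
check_layout() {
    base="$1"
    status=0
    for sub in Annotation/Genes Sequence/WholeGenomeFasta Sequence/Bowtie2Index; do
        if [ -d "$base/$sub" ]; then
            echo "found   $sub"
        else
            echo "missing $sub"
            status=1
        fi
    done
    return "$status"
}

# Mock tree with the same directory names (stand-in for the real package);
# Bowtie2Index is left out on purpose to show the "missing" report.
mock=$(mktemp -d)/Drosophila_melanogaster/Ensembl/BDGP5.25
mkdir -p "$mock/Annotation/Genes" "$mock/Sequence/WholeGenomeFasta"
layout_report=$(check_layout "$mock" || true)
echo "$layout_report"
```

On the real data, `check_layout Drosophila_melanogaster/Ensembl/BDGP5.25` should report all three directories as found.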
You can also have a look at the first few lines of the genome FASTA file using the head command.
$ head -n 8 ./Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/WholeGenomeFasta/genome.fa
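Because a FASTA file marks the start of each sequence with a line beginning with `>`, grep can list the sequence names without scrolling through the bases. The sketch below uses a tiny made-up FASTA file as a stand-in for genome.fa; the sequence names are invented.

```shell
# Tiny made-up FASTA file standing in for genome.fa
fasta=$(mktemp)
printf '>chrA\nACGTACGT\nACGT\n>chrB\nGGCC\n' > "$fasta"

# Header lines start with ">"; count them, then list the names
n_sequences=$(grep -c '^>' "$fasta")
sequence_names=$(grep '^>' "$fasta" | sed 's/^>//')
echo "$n_sequences sequences:"
echo "$sequence_names"
```

The same two grep commands can be pointed at ./Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/WholeGenomeFasta/genome.fa.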
- The Annotation directory contains another directory called ‘Genes’, which contains a file called ‘genes.gtf’. For the time being, create a link to this file in your example working directory (to simplify the commands needed during the protocol). From your working directory (i.e. rnaseq_workshop/data), type:
$ ln -s ./Drosophila_melanogaster/Ensembl/BDGP5.25/Annotation/Genes/genes.gtf
- Similarly, create links to the Bowtie index included with the iGenome package:
$ ln -s ./Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/Bowtie2Index/genome.* .
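Symbolic links are only pointers, so it can be worth confirming that each one resolves to a real file. A minimal sketch on a mock directory tree follows; the files are empty placeholders and their names are assumptions, not the real index contents.

```shell
# Mock working directory with the same relative layout (placeholder files)
work=$(mktemp -d)
cd "$work" || exit 1
idx=./Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/Bowtie2Index
mkdir -p "$idx"
touch "$idx/genome.1.bt2" "$idx/genome.2.bt2" "$idx/genome.fa"

# The trailing "." tells ln to create one link per matched file here
ln -s "$idx"/genome.* .

# -L is true for a symlink; -e follows the link and is true if the target exists
n_ok=0
for link in genome.*; do
    if [ -L "$link" ] && [ -e "$link" ]; then
        n_ok=$((n_ok + 1))
        echo "$link -> $(readlink "$link")"
    fi
done
```

A link that fails the `-e` test is dangling, which usually means it was created from a different directory than the relative path assumes.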
- Download the sequencing data into your current directory, ~/rnaseq_workshop/data/
$ wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE32nnn/GSE32038/suppl/GSE32038_simulated_fastq_files.tar.gz -O GSE32038_simulated_fastq_files.tar.gz
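Downloads are sometimes interrupted, and rerunning wget with -O starts over and overwrites the output file. One possible pattern is to download to a temporary name and rename only on success. fetch_once is a hypothetical helper, not part of wget, and the URL in the demo is a placeholder.

```shell
# fetch_once (hypothetical helper): download URL to OUT unless OUT already
# exists; write to OUT.part first so a failed download never looks complete.
fetch_once() {
    url="$1"
    out="$2"
    if [ -s "$out" ]; then
        echo "skip: $out already present"
        return 0
    fi
    wget "$url" -O "$out.part" && mv "$out.part" "$out"
}

# Demo without network access: the file already exists, so wget is not called
demo=$(mktemp -d)
echo "placeholder" > "$demo/GSE32038_simulated_fastq_files.tar.gz"
fetch_result=$(fetch_once "ftp://example.invalid/placeholder.tar.gz" \
    "$demo/GSE32038_simulated_fastq_files.tar.gz")
echo "$fetch_result"
```

In practice the first argument would be the GEO URL from the wget command above.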
- Unpack the compressed raw sequence files
$ tar -xvzf GSE32038_simulated_fastq_files.tar.gz
In your current directory, you should have, in addition to "Drosophila_melanogaster" and "README.txt", files ending in ".fq.gz".
We can now remove the compressed file. Run:
$ rm GSE32038_simulated_fastq_files.tar.gz
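A standard FASTQ record occupies four lines (read name, sequence, a "+" separator, and the quality string), so the number of reads in a compressed file is its line count divided by four. The sketch below builds a two-read example file; the read names and sequences are made up.

```shell
# Two made-up reads in FASTQ format, gzip-compressed like the .fq.gz files
fq_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip -c > "$fq_dir/demo.fq.gz"

# Decompress to stdout, count lines, and divide by 4 lines per record
n_lines=$(gzip -dc "$fq_dir/demo.fq.gz" | wc -l)
n_reads=$((n_lines / 4))
echo "$n_reads reads"
```

Running the same pipeline on each downloaded .fq.gz file gives a quick sanity check that paired files contain the same number of reads.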
Now that we have got all the required data sets, we are ready to run pre-processing on our raw sequence data before analysis.
| Previous Section | This Section | Next Section |
|:------------------------------------:|:--------------------------:|:--------------------------------------------:|
| RNA-seq File Formats and Software-Specific Files | Getting Data for Analysis | Data Quality Assessment |