
Getting Data for Analysis

Jeanie Lim edited this page Jul 12, 2016 · 4 revisions

Progress

  • Git clone or wget??!

Cloning an Existing Repository

The git clone [URL] command copies an existing Git repository.

As a convenience, cloning automatically creates a remote connection called origin pointing back to the original repository. This makes it very easy to interact with a central repository.

git clone https://github.com/Jeanielmj/rnaseq_workshop.git

That creates a directory named “rnaseq_workshop”, initializes a .git directory inside it, pulls down all the data for that repository, and checks out a working copy of the latest version. If you go into the new rnaseq_workshop directory, you’ll see the project files in there, ready to be worked on or used.
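The directory naming and the origin remote can be checked without touching the network. The following is a minimal sketch, assuming git is installed; the empty bare repository in a scratch directory is only a stand-in for the GitHub URL, and the `demo_dir` variable is introduced here for illustration.

```shell
# Stand-in for the remote: an empty bare repository in a scratch directory.
demo_dir=$(mktemp -d)
git init --quiet --bare "$demo_dir/rnaseq_workshop.git"

# Clone it. Git drops the ".git" suffix when naming the new directory.
# (stderr is silenced only to hide the "empty repository" warning)
(cd "$demo_dir" && git clone --quiet rnaseq_workshop.git 2>/dev/null)

# The clone has its own .git directory ...
ls -d "$demo_dir/rnaseq_workshop/.git"

# ... and a single remote, called "origin", pointing back at the source.
git -C "$demo_dir/rnaseq_workshop" remote
git -C "$demo_dir/rnaseq_workshop" config --get remote.origin.url
```

The same two `git -C ... remote` commands can be run against the real rnaseq_workshop clone to confirm where origin points.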

The data we’ll be looking at

This guide will use __ bp, single/paired-end reads generated using the Illumina Genome Analyzer. The RNA sample used was a ______ _______ cell line, taken from this experiment Link

The data comes off the machine in “fastq” format.

http://bioinf.wehi.edu.au/bioinfosummer2010/materials/RNAseq_Mapping_Tutorial.pdf

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32038

Getting Data

  1. Make sure you are logged in to ghpcc06 massrc org
  2. Check that you are in your home directory:
    $ pwd
    /home/username
    
  3. Create "rnaseq_workshop" directory
    $ mkdir rnaseq_workshop
    $ cd rnaseq_workshop/
    
  4. Create a new "data" folder, move into it, and download the reference data from the web
    $ mkdir data
    $ cd ./data
    $ wget ftp://igenome:[email protected]/Drosophila_melanogaster/Ensembl/BDGP5.25/Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz
    

This might take a while (~1-2 min), depending on the network connection. After it is done, check that you have the right file, Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz, in your directory.
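Beyond looking for the file name, it can help to confirm that the download is complete. A minimal sketch follows; `check_archive` is a hypothetical helper name introduced here, and it relies only on gzip's built-in integrity test (`gzip -t`).

```shell
# check_archive FILE: succeed only if FILE exists, is non-empty,
# and passes gzip's integrity test.
check_archive() {
    if [ ! -s "$1" ]; then
        echo "missing or empty: $1" >&2
        return 1
    fi
    if ! gzip -t "$1" 2>/dev/null; then
        echo "not a valid gzip file (incomplete download?): $1" >&2
        return 1
    fi
    echo "OK: $1"
}
```

For this step it would be called as `check_archive Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz`. If it reports a problem, re-running wget is usually the simplest fix.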

  5. Unpack the compressed file
    $ tar -xvzf Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz  
    

In your current directory, you should now have: Drosophila_melanogaster, Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz, and README.txt.

We can now remove the compressed file. Run: rm Drosophila_melanogaster_Ensembl_BDGP5.25.tar.gz
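Unpacking and removing can also be chained so that the archive is deleted only when extraction succeeded. This is a sketch, not part of the original protocol; `unpack_and_remove` is a hypothetical helper name.

```shell
# unpack_and_remove FILE: extract a .tar.gz into the current directory and
# delete the archive only if tar reported success.
#   -x  extract    -z  decompress with gzip    -f  read from the named file
# (add -v, as in the step above, to list each file as it is extracted)
unpack_and_remove() {
    tar -xzf "$1" && rm -- "$1"
}
```

Because of the `&&`, a failed or interrupted extraction leaves the archive in place, so nothing has to be downloaded again.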

The README.txt contains information on the Illumina iGenome file you just decompressed. Read it by running more README.txt. To go down the page, press the spacebar. To quit, type q.

  6. Assuming we stored the package at rnaseq_workshop/data, the package expands to contain a folder Drosophila_melanogaster/Ensembl/BDGP5.25/, which has two more folders, Annotation and Sequence. To check:
    $ ls Drosophila_melanogaster/Ensembl/BDGP5.25/ 
    Annotation Sequence
    

Basically, Sequence/ contains subdirectories with genome sequences in various file formats, and Annotation/ contains subdirectories with annotation files.

Notice that Sequence/Bowtie2Index/ contains an index of the whole genome for use with the Bowtie2 aligner. TopHat2, which we will need later, also uses Bowtie2.
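A Bowtie2 index is a set of six files sharing one prefix. The sketch below checks that all six are present; `check_bt2_index` is a hypothetical helper, and it assumes a standard small index with the `.bt2` extension (large indexes use `.bt2l` instead) and the `genome` prefix used by the iGenome package.

```shell
# check_bt2_index PREFIX: report any of the six Bowtie2 index files
# (PREFIX.1.bt2 ... PREFIX.rev.2.bt2) that are missing.
check_bt2_index() {
    missing=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -e "$1.$suffix.bt2" ]; then
            echo "missing: $1.$suffix.bt2" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

From rnaseq_workshop/data it would be called as `check_bt2_index Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/Bowtie2Index/genome`. It prints nothing and returns success when the index is complete.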

You can also have a look at the first few lines of the genome FASTA file using the head command.

$ head -n 8 ./Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/WholeGenomeFasta/genome.fa
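In a FASTA file, every sequence starts with a header line beginning with `>`, so grep gives a quick overview of what the file contains. The two helper names below are hypothetical and only wrap standard grep calls.

```shell
# fasta_names FILE: print the header line of every sequence in a FASTA file.
fasta_names() {
    grep '^>' "$1"
}

# fasta_count FILE: print how many sequences the file contains.
fasta_count() {
    grep -c '^>' "$1"
}
```

For example, `fasta_names ./Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/WholeGenomeFasta/genome.fa` lists the sequence names in the genome file.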
  7. The Annotation directory contains another directory called ‘Genes’, which contains a file called ‘genes.gtf’. For the time being, create a link to this file in your working directory (to simplify the commands needed during the protocol). From your working directory (i.e. rnaseq_workshop/data), type:
$ ln -s ./Drosophila_melanogaster/Ensembl/BDGP5.25/Annotation/Genes/genes.gtf .
  8. Similarly, create links to the Bowtie index included with the iGenome package:
$ ln -s ./Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/Bowtie2Index/genome.* .
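A symbolic link is only useful if its target exists, and a mistyped path produces a dangling link without any error. The following sketch lists the links in a directory and flags dangling ones; `check_links` is a hypothetical helper built from the standard `-L`, `-e`, and `readlink` tools.

```shell
# check_links DIR: list every symbolic link in DIR with its target and
# return non-zero if any link is dangling (its target does not exist).
check_links() {
    status=0
    for link in "$1"/*; do
        [ -L "$link" ] || continue          # skip anything that is not a link
        if [ -e "$link" ]; then
            echo "ok      $link -> $(readlink "$link")"
        else
            echo "BROKEN  $link -> $(readlink "$link")"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_links .` from rnaseq_workshop/data should show genes.gtf and each genome.* file marked ok.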
  9. Download the sequencing data into your current directory, ~/rnaseq_workshop/data/
$ wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE32nnn/GSE32038/suppl/GSE32038_simulated_fastq_files.tar.gz -O GSE32038_simulated_fastq_files.tar.gz
  10. Unpack the compressed raw sequence files
$ tar -xvzf GSE32038_simulated_fastq_files.tar.gz

In your current directory, you should have, in addition to "Drosophila_melanogaster" and "README.txt", files ending in ".fq.gz"

We can now remove the compressed file. Run: rm GSE32038_simulated_fastq_files.tar.gz
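The .fq.gz files can be inspected without decompressing them on disk. The sketch below counts the reads in one file; `count_reads` is a hypothetical helper, and it assumes the usual four-line FASTQ records (header, sequence, `+`, qualities) with no wrapped sequence lines.

```shell
# count_reads FILE.fq.gz: a FASTQ record is four lines, so the number of
# reads is the decompressed line count divided by four.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}
```

To look at the first record of a file instead, `gzip -dc FILE.fq.gz | head -n 4` prints its four lines, where FILE.fq.gz stands for any of the unpacked files.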

Now that we have got all the required data sets, we are ready to run pre-processing on our raw sequence data before analysis.


| Previous Section | This Section | Next Section |
|:------------------------------------:|:--------------------------:|:--------------------------------------------:|
| RNA-seq File Formats and Software-Specific Files | Getting Data for Analysis | Data Quality Assessment |
