Skip to content

Files

Latest commit

 

History

History
928 lines (811 loc) · 37.3 KB

labnotebook.md

File metadata and controls

928 lines (811 loc) · 37.3 KB

Labnotebook for QAA Assignment

9/6/23

my QAA data assignment: Ross 15_3C_mbnl_S11_L008 24_4A_control_S18_L008

must begin module:

module spider fastqc
module load fastqc/0.11.5
  • here is the FastQC command used:
/usr/bin/time -v fastqc -o /projects/bgmp/roel/bioinfo/Bi623/QAA/part1 /projects/bgmp/shared/2017_sequencing/demultiplexed/15_3C_mbnl_S11_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing/demultiplexed/15_3C_mbnl_S11_L008_R2_001.fastq.gz /projects/bgmp/shared/2017_sequencing/demultiplexed/24_4A_control_S18_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing/demultiplexed/24_4A_control_S18_L008_R2_001.fastq.gz
  • here is my ouput for fastqc, seemed to take about 2 minutes:
    Command being timed: "fastqc -o /projects/bgmp/roel/bioinfo/Bi623/QAA/part1 /projects/bgmp/shared/2017_sequencing/demultiplexed/15_3C_mbnl_S11_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing/demultiplexed/15_3C_mbnl_S11_L008_R2_001.fastq.gz /projects/bgmp/shared/2017_sequencing/demultiplexed/24_4A_control_S18_L008_R2_001.fastq.gz"
        User time (seconds): 129.21
        System time (seconds): 5.34
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:14.29
        ...
        Exit status: 0

Here are where the files are found: /projects/bgmp/shared/2017_sequencing/demultiplexed

Command found in /projects/bgmp/roel/bioinfo/Bi623/QAA/part1

scp file to personal computer then I can look at the HTML file scp [email protected]:/projects/bgmp/roel/bioinfo/Bi623/QAA/part1/*.html .

9/7/23

Folder where the data is located: /projects/bgmp/shared/2017_sequencing/demultiplexed/ The four files I will be working with:

15_3C_mbnl_S11_L008_R1_001.fastq.gz
15_3C_mbnl_S11_L008_R2_001.fastq.gz

24_4A_control_S18_L008_R1_001.fastq.gz
24_4A_control_S18_L008_R2_001.fastq.gz

before I ran phredEncodingScript.sh I remembered I had to import bioinfo.py

  • phred encoding used bash script to run python script

part1 1. 2. The graphs I just created are very similar to what was created by fastqc. The distributions follow the same trend, and to my eyes, there are no differences. 3. When comparing the two files, it appears that the file beginning with 24 has a higher mean quality score compared to the file beginning with 15. Additionally, for both files, the Read 1 is higher quality compared to Read 2; this makes sense because Read 2 has spent more time on the sequencer.

Part2: created new conda env for QAA: conda install --name bgmp-QAA conda activate bgmp-QAA installed two programs in this environment:

conda install trimmomatic
conda install cutadapt

Now trying to use cutadapt on the files:

9/8/23

creating cutadapt command... here's the structure:

  • We have paired-end reads! cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

/usr/bin/time -v cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o R1fastq -p R2fastq R1path R2path

  • creating in slurm script located: /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/cutadaptScript.sh

cutadapt output for 15_3C_mbnl_S11_L008_R1_001

This is cutadapt 4.4 with Python 3.10.12
Command line parameters: -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o 15_3C_mbnl_S11_L008_R1_001.cut.fastq -p 15_3C_mbnl_S11_L008_R2_001.cut.fastq /projects/bgmp/shared/2017_sequencing/demultiplexed/15_3C_mbnl_S11_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing
/demultiplexed/15_3C_mbnl_S11_L008_R2_001.fastq.gz
Processing paired-end reads on 1 core ...
Finished in 47.151 s (6.040 µs/read; 9.93 M reads/minute).

=== Summary ===

Total read pairs processed:          7,806,403
  Read 1 with adapter:                 417,810 (5.4%)
  Read 2 with adapter:                 477,359 (6.1%)
Pairs written (passing filters):     7,806,403 (100.0%)

Total basepairs processed: 1,576,893,406 bp
  Read 1:   788,446,703 bp
  Read 2:   788,446,703 bp
Total written (filtered):  1,566,917,010 bp (99.4%)
  Read 1:   783,587,227 bp
  Read 2:   783,329,783 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA; Type: regular 3'; Length: 33; Trimmed: 417810 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-33 bp: 3

Bases preceding removed adapters:
  A: 17.0%
  C: 29.2%
  G: 37.9%
  T: 15.8%
  none/other: 0.2%

Overview of removed sequences
length  count   expect  max.err error counts
3       152119  121975.0        0       152119
4       40634   30493.8 0       40634
5       18152   7623.4  0       18152
6       11433   1905.9  0       11433
7       10612   476.5   0       10612
8       9967    119.1   0       9967
9       9897    29.8    0       9695 202
10      9771    7.4     1       9375 396
11      9272    1.9     1       8950 322
12      8958    0.5     1       8669 289
13      8435    0.1     1       8183 252
14      7935    0.0     1       7658 277
15      7804    0.0     1       7549 255
16      7350    0.0     1       7094 256
17      7259    0.0     1       6983 276
18      6676    0.0     1       6426 245 5
19      6321    0.0     1       6049 264 8
20      5859    0.0     2       5566 261 32
21      5824    0.0     2       5576 220 28
22      5455    0.0     2       5191 227 37
23      5009    0.0     2       4748 227 34
24      4769    0.0     2       4530 211 28
25      4490    0.0     2       4245 216 29
26      4138    0.0     2       3888 210 40
27      3984    0.0     2       3763 197 23 1
28      3806    0.0     2       3583 197 24 2
29      3369    0.0     2       3158 184 26 1
30      3224    0.0     3       3015 173 23 13
31      2883    0.0     3       2713 136 24 10
32      2616    0.0     3       2464 123 19 10
33      2510    0.0     3       2359 130 14 7
34      2256    0.0     3       2110 124 15 7
35      2010    0.0     3       1886 107 13 4
36      1944    0.0     3       1835 93 13 3
37      1836    0.0     3       1717 95 16 8
38      1672    0.0     3       1577 82 10 3
39      1575    0.0     3       1486 77 9 3
40      1355    0.0     3       1283 60 9 3
41      1188    0.0     3       1123 58 5 2
42      1074    0.0     3       1024 45 4 1
43      986     0.0     3       927 50 6 3
44      898     0.0     3       850 38 9 1
45      829     0.0     3       781 45 3
46      821     0.0     3       770 43 5 3
47      696     0.0     3       663 26 6 1
48      692     0.0     3       641 40 8 3
49      663     0.0     3       627 28 7 1
50      571     0.0     3       537 28 3 3
51      497     0.0     3       459 31 7
52      456     0.0     3       431 18 5 2
53      437     0.0     3       416 15 5 1
54      351     0.0     3       322 26 1 2
55      362     0.0     3       338 18 4 2
56      303     0.0     3       289 10 2 2
57      313     0.0     3       297 12 3 1
58      279     0.0     3       263 12 3 1
59      261     0.0     3       243 16 2
60      268     0.0     3       258 5 2 3
61      239     0.0     3       228 7 4
62      196     0.0     3       180 14 1 1
63      216     0.0     3       197 16 3
64      197     0.0     3       186 11
65      152     0.0     3       142 9 0 1
66      155     0.0     3       146 5 2 2
67      122     0.0     3       121 1
68      120     0.0     3       112 5 2 1
69      92      0.0     3       85 6 1
70      84      0.0     3       81 0 2 1
71      68      0.0     3       66 2
72      83      0.0     3       80 3
73      70      0.0     3       65 3 1 1
74      50      0.0     3       46 4
75      31      0.0     3       29 0 1 1
76      29      0.0     3       23 4 2
77      15      0.0     3       14 1
78      13      0.0     3       12 1
79      15      0.0     3       15
80      6       0.0     3       6
81      6       0.0     3       6
82      4       0.0     3       3 1
83      7       0.0     3       6 1
84      1       0.0     3       1
85      6       0.0     3       6
86      5       0.0     3       5
87      2       0.0     3       2
88      1       0.0     3       1
89      2       0.0     3       2
90      2       0.0     3       2
91      3       0.0     3       3
92      3       0.0     3       3
93      1       0.0     3       1
95      2       0.0     3       2
96      2       0.0     3       1 1
101     686     0.0     3       2 600 81 3


=== Second read: Adapter 2 ===


Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT; Type: regular 3'; Length: 33; Trimmed: 477359 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-33 bp: 3

Bases preceding removed adapters:
  A: 19.7%
  C: 29.8%
  G: 41.0%
  T: 9.4%
  none/other: 0.1%

Overview of removed sequences
length  count   expect  max.err error counts
3       199425  121975.0        0       199425
4       47599   30493.8 0       47599
5       19744   7623.4  0       19744
6       12401   1905.9  0       12401
7       10799   476.5   0       10799
8       10053   119.1   0       10053
9       10101   29.8    0       9767 334
10      9962    7.4     1       9460 502
11      9490    1.9     1       9136 354
12      9057    0.5     1       8791 266
13      8483    0.1     1       8292 191
14      8005    0.0     1       7815 190
15      7863    0.0     1       7612 251
16      7412    0.0     1       7224 188
17      7300    0.0     1       7070 230
18      6732    0.0     1       6426 304 2
19      6370    0.0     1       6165 201 4
20      5896    0.0     2       5661 200 35
21      5856    0.0     2       5582 242 32
22      5504    0.0     2       5276 197 31
23      5056    0.0     2       4850 179 27
24      4802    0.0     2       4575 197 30
25      4514    0.0     2       4303 183 28
26      4180    0.0     2       3950 201 29
27      4006    0.0     2       3820 154 32
28      3839    0.0     2       3635 176 28
29      3401    0.0     2       3220 150 28 3
30      3242    0.0     3       3080 134 19 9
31      2915    0.0     3       2662 202 33 18
32      2641    0.0     3       2494 116 18 13
33      2537    0.0     3       2380 118 30 9
34      2287    0.0     3       2147 108 23 9
35      2041    0.0     3       1923 89 18 11
36      1959    0.0     3       1832 100 21 6
37      1865    0.0     3       1738 94 27 6
38      1702    0.0     3       1590 89 14 9
39      1613    0.0     3       1512 73 16 12
40      1389    0.0     3       1311 56 13 9
41      1217    0.0     3       1138 56 12 11
42      1093    0.0     3       1040 38 12 3
43      1004    0.0     3       948 37 12 7
44      914     0.0     3       854 44 6 10
45      862     0.0     3       789 48 14 11
46      845     0.0     3       796 34 5 10
47      718     0.0     3       672 39 4 3
48      714     0.0     3       646 50 14 4
49      690     0.0     3       637 35 13 5
50      592     0.0     3       548 32 7 5
51      515     0.0     3       467 34 11 3
52      474     0.0     3       438 26 4 6
53      470     0.0     3       433 22 9 6
54      373     0.0     3       340 21 5 7
55      377     0.0     3       342 22 6 7
56      319     0.0     3       293 19 6 1
57      330     0.0     3       307 14 6 3
58      305     0.0     3       278 18 4 5
59      281     0.0     3       255 14 10 2
60      288     0.0     3       261 16 9 2
61      258     0.0     3       238 15 3 2
62      215     0.0     3       199 9 6 1
63      239     0.0     3       209 23 3 4
64      214     0.0     3       194 10 8 2
65      173     0.0     3       150 16 4 3
66      166     0.0     3       150 8 4 4
67      142     0.0     3       125 10 6 1
68      130     0.0     3       117 10 1 2
69      107     0.0     3       93 6 6 2
70      93      0.0     3       84 5 3 1
71      80      0.0     3       72 5 1 2
72      100     0.0     3       86 7 5 2
73      84      0.0     3       70 8 2 4
74      61      0.0     3       54 3 2 2
75      39      0.0     3       33 2 3 1
76      33      0.0     3       30 1 1 1
77      18      0.0     3       15 2 0 1
78      16      0.0     3       16
79      22      0.0     3       16 3 0 3
80      7       0.0     3       6 1
81      8       0.0     3       7 0 1
82      4       0.0     3       4
83      7       0.0     3       7
84      2       0.0     3       1 0 0 1
85      10      0.0     3       6 1 2 1
86      5       0.0     3       4 1
87      3       0.0     3       2 0 0 1
88      1       0.0     3       1
89      2       0.0     3       2
90      3       0.0     3       2 1
91      4       0.0     3       3 0 1
92      3       0.0     3       2 1
93      1       0.0     3       1
95      2       0.0     3       2
96      2       0.0     3       1 1
98      1       0.0     3       0 0 0 1
99      2       0.0     3       0 0 2
101     680     0.0     3       0 597 73 10
        Command being timed: "cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o 15_3C_mbnl_S11_L008_R1_001.cut.fastq -p 15_3C_mbnl_S11_L008_R2_001.cut.fastq /projects/bgmp/shared/2017_sequencing/demultiplexed/15_3C_mbnl_S11_L008_R1_001.fastq.gz /projects/bgmp/shared/2
017_sequencing/demultiplexed/15_3C_mbnl_S11_L008_R2_001.fastq.gz"
        User time (seconds): 44.00
        System time (seconds): 1.92
        Percent of CPU this job got: 93%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.03
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 40312
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 7298
        Voluntary context switches: 1851
        Involuntary context switches: 48
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

cutadapt output for 24_4A_control_S18_L008


This is cutadapt 4.4 with Python 3.10.12
Command line parameters: -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o 24_4A_control_S18_L008_R1_001.cut.fastq -p 24_4A_control_S18_L008_R2_001.cut.fastq /projects/bgmp/shared/2017_sequencing/demultiplexed/24_4A_control_S18_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing/demultiplexed/24_4A_control_S18_L008_R2_001.fastq.gz
Processing paired-end reads on 1 core ...
Finished in 60.490 s (5.752 µs/read; 10.43 M reads/minute).

=== Summary ===

Total read pairs processed:         10,515,874
  Read 1 with adapter:                 335,742 (3.2%)
  Read 2 with adapter:                 417,709 (4.0%)
Pairs written (passing filters):    10,515,874 (100.0%)

Total basepairs processed: 2,124,206,548 bp
  Read 1: 1,062,103,274 bp
  Read 2: 1,062,103,274 bp
Total written (filtered):  2,119,352,063 bp (99.8%)
  Read 1: 1,059,836,363 bp
  Read 2: 1,059,515,700 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA; Type: regular 3'; Length: 33; Trimmed: 335742 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-33 bp: 3

Bases preceding removed adapters:
  A: 23.7%
  C: 30.9%
  G: 27.9%
  T: 17.4%
  none/other: 0.1%

Overview of removed sequences
length  count   expect  max.err error counts
3       196453  164310.5        0       196453
4       44693   41077.6 0       44693
5       15081   10269.4 0       15081
6       6716    2567.4  0       6716
7       5895    641.8   0       5895
8       5084    160.5   0       5084
9       4848    40.1    0       4649 199
10      4653    10.0    1       4323 330
11      4084    2.5     1       3902 182
12      3829    0.6     1       3700 129
13      3519    0.2     1       3397 122
14      3100    0.0     1       3000 100
15      2907    0.0     1       2814 93
16      2728    0.0     1       2617 111
17      2514    0.0     1       2395 119
18      2389    0.0     1       2289 99 1
19      2047    0.0     1       1961 85 1
20      1936    0.0     2       1849 79 8
21      1781    0.0     2       1689 79 13
22      1681    0.0     2       1604 64 13
23      1447    0.0     2       1373 68 6
24      1334    0.0     2       1269 59 6
25      1309    0.0     2       1237 60 12
26      1172    0.0     2       1098 63 11
27      1159    0.0     2       1099 49 11
28      1047    0.0     2       983 50 14
29      927     0.0     2       866 53 7 1
30      858     0.0     3       807 42 7 2
31      780     0.0     3       728 39 9 4
32      716     0.0     3       661 46 6 3
33      646     0.0     3       596 34 11 5
34      641     0.0     3       601 30 4 6
35      546     0.0     3       509 27 6 4
36      541     0.0     3       507 26 5 3
37      511     0.0     3       466 36 5 4
38      482     0.0     3       451 25 3 3
39      419     0.0     3       381 33 5
40      438     0.0     3       410 23 5
41      364     0.0     3       340 17 7
42      362     0.0     3       330 26 3 3
43      287     0.0     3       267 12 6 2
44      285     0.0     3       258 22 4 1
45      270     0.0     3       248 14 3 5
46      238     0.0     3       220 9 8 1
47      251     0.0     3       225 22 3 1
48      216     0.0     3       200 11 5
49      212     0.0     3       198 9 4 1
50      206     0.0     3       183 16 4 3
51      158     0.0     3       146 9 3
52      167     0.0     3       156 6 2 3
53      158     0.0     3       142 10 4 2
54      121     0.0     3       111 6 2 2
55      90      0.0     3       81 6 1 2
56      103     0.0     3       97 6
57      87      0.0     3       76 5 2 4
58      96      0.0     3       87 4 3 2
59      64      0.0     3       58 4 1 1
60      68      0.0     3       63 3 1 1
61      81      0.0     3       77 2 2
62      63      0.0     3       57 5 1
63      63      0.0     3       57 3 3
64      32      0.0     3       30 1 0 1
65      46      0.0     3       40 5 0 1
66      51      0.0     3       48 0 1 2
67      40      0.0     3       36 3 0 1
68      40      0.0     3       36 3 0 1
69      22      0.0     3       13 4 4 1
70      29      0.0     3       27 2
71      22      0.0     3       19 2 0 1
72      18      0.0     3       15 2 1
73      17      0.0     3       16 0 1
74      9       0.0     3       9
75      20      0.0     3       13 5 1 1
76      13      0.0     3       13
77      7       0.0     3       3 3 0 1
78      7       0.0     3       5 0 1 1
79      5       0.0     3       5
80      2       0.0     3       2
81      5       0.0     3       4 1
82      3       0.0     3       3
83      7       0.0     3       7
84      6       0.0     3       5 0 1
85      8       0.0     3       7 0 0 1
86      3       0.0     3       3
87      6       0.0     3       5 1
88      1       0.0     3       1
89      5       0.0     3       4 1
90      3       0.0     3       2 1
91      6       0.0     3       6
92      4       0.0     3       4
93      1       0.0     3       1
94      1       0.0     3       1
95      2       0.0     3       2
96      6       0.0     3       5 1
97      3       0.0     3       3
98      5       0.0     3       5
99      6       0.0     3       6
100     1       0.0     3       0 0 0 1
101     359     0.0     3       1 281 72 5


=== Second read: Adapter 2 ===





=== Second read: Adapter 2 ===

Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT; Type: regular 3'; Length: 33; Trimmed: 417709 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-33 bp: 3

Bases preceding removed adapters:
  A: 27.7%
  C: 31.5%
  G: 28.5%
  T: 12.2%
  none/other: 0.1%

Overview of removed sequences
length  count   expect  max.err error counts
3       259093  164310.5        0       259093
4       58265   41077.6 0       58265
5       17491   10269.4 0       17491
6       7637    2567.4  0       7637
7       6056    641.8   0       6056
8       5204    160.5   0       5204
9       5028    40.1    0       4715 313
10      4890    10.0    1       4410 480
11      4295    2.5     1       3958 337
12      3932    0.6     1       3758 174
13      3578    0.2     1       3470 108
14      3150    0.0     1       3060 90
15      2942    0.0     1       2837 105
16      2773    0.0     1       2695 78
17      2554    0.0     1       2464 90
18      2412    0.0     1       2305 104 3
19      2083    0.0     1       2009 73 1
20      1968    0.0     2       1867 89 12
21      1805    0.0     2       1717 78 10
22      1708    0.0     2       1629 62 17
23      1473    0.0     2       1408 54 11
24      1363    0.0     2       1279 71 13
25      1334    0.0     2       1256 63 15
26      1198    0.0     2       1132 53 13
27      1189    0.0     2       1128 47 14
28      1082    0.0     2       1020 50 12
29      959     0.0     2       894 47 12 6
30      891     0.0     3       829 38 17 7
31      812     0.0     3       734 54 16 8
32      738     0.0     3       672 48 12 6
33      667     0.0     3       621 36 6 4
34      663     0.0     3       611 34 16 2
35      562     0.0     3       514 35 6 7
36      559     0.0     3       521 28 6 4
37      542     0.0     3       491 30 15 6
38      504     0.0     3       462 29 7 6
39      439     0.0     3       398 24 8 9
40      472     0.0     3       434 23 9 6
41      385     0.0     3       352 22 8 3
42      377     0.0     3       348 20 5 4
43      316     0.0     3       280 17 13 6
44      310     0.0     3       283 16 7 4
45      293     0.0     3       265 17 8 3
46      263     0.0     3       234 13 11 5
47      271     0.0     3       242 20 4 5
48      249     0.0     3       215 19 4 11
49      225     0.0     3       201 16 5 3
50      223     0.0     3       196 18 5 4
51      178     0.0     3       152 12 7 7
52      196     0.0     3       169 12 6 9
53      174     0.0     3       156 11 4 3
54      142     0.0     3       121 16 2 3
55      104     0.0     3       86 11 4 3
56      119     0.0     3       101 9 5 4
57      100     0.0     3       85 12 3
58      116     0.0     3       96 13 3 4
59      74      0.0     3       61 7 3 3
60      82      0.0     3       70 8 1 3
61      98      0.0     3       86 7 3 2
62      76      0.0     3       64 7 3 2
63      80      0.0     3       69 7 1 3
64      55      0.0     3       45 4 0 6
65      64      0.0     3       51 6 3 4
66      61      0.0     3       54 3 2 2
67      55      0.0     3       43 5 3 4
68      48      0.0     3       40 4 1 3
69      31      0.0     3       24 1 5 1
70      37      0.0     3       32 3 1 1
71      33      0.0     3       28 2 2 1
72      21      0.0     3       16 2 3
73      25      0.0     3       21 2 0 2
74      16      0.0     3       11 3 1 1
75      30      0.0     3       21 4 4 1
76      22      0.0     3       19 2 1
77      14      0.0     3       8 0 1 5
78      10      0.0     3       7 1 2
79      7       0.0     3       7
80      6       0.0     3       3 0 0 3
81      5       0.0     3       4 1
82      5       0.0     3       4 0 0 1
83      8       0.0     3       7 0 0 1
84      7       0.0     3       6 0 1
85      7       0.0     3       7
86      3       0.0     3       3
87      7       0.0     3       6 0 0 1
88      2       0.0     3       2
89      5       0.0     3       4 1
90      3       0.0     3       1 2
91      6       0.0     3       6
92      4       0.0     3       4
93      2       0.0     3       1 0 1
94      1       0.0     3       1
95      2       0.0     3       1 1
96      7       0.0     3       4 2 0 1
97      3       0.0     3       1 0 2
98      5       0.0     3       2 3
99      6       0.0     3       5 1
101     324     0.0     3       1 255 62 6
        Command being timed: "cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o 24_4A_control_S18_L008_R1_001.cut.fastq -p 24_4A_control_S18_L008_R2_001.cut.fastq /projects/bgmp/shared/2017_sequencing/demultiplexed/24_4A_control_S18_L008_R1_001.fastq.gz /projects/bgmp/shared/2017_sequencing/demultiplexed/24_4A_control_S18_L008_R2_001.fastq.gz"
        User time (seconds): 57.29
        System time (seconds): 2.86
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00.68
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 40240
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 7268
        Voluntary context switches: 1622
        Involuntary context switches: 52
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

searching for these adapters using UNIX commands:

  • googled illumina adapters and used these bash commands to confirm these are the proper adapters:

zcat 24_4A_control_S18_L008_R1_001.fastq.gz | grep "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" | head zcat 24_4A_control_S18_L008_R1_001.fastq.gz | grep "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" | wc -l = 8025 (number of records in data) Using this UNIX command, we can now tell that there are adapters in the sequences. So, we known that these are the proper adapters.

Trimmomatic:

(http://www.usadellab.org/cms/?page=trimmomatic)

  • we will use phred33 PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>

my script found here: /projects/bgmp/roel/bioinfo/Bi623/QAA/part2.trimmomaticScript.sh output of the files are found: /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/trimmed

  • now I have output .fastq files. I will gzip the files and delete all of the intermediate files gzip *fastq

  • read length distribution: creating new file containing the data zcat 15_3C_mbnl_S11_L008_R1_001.cut.trimmed.fastq.gz | sed -n '2~4p' | awk '{print length($0)}' | sort | uniq -c | sort -n > ../readLenDist.txt

zcat 15_3C_mbnl_S11_L008_R2_001.cut.trimmed.fastq.gz | sed -n '2~4p' | awk '{print length($0)}' | sort | uniq -c | sort -n > ../readLenDist15_3C_mbnl_S11_L008_R2.txt

zcat 24_4A_control_S18_L008_R1_001.cut.trimmed.fastq.gz | sed -n '2~4p' | awk '{print length($0)}' | sort | uniq -c | sort -n > ../readLenDist24_4A_control_S18_L008_R1.txt

zcat 24_4A_control_S18_L008_R2_001.cut.trimmed.fastq.gz | sed -n '2~4p' | awk '{print length($0)}' | sort | uniq -c | sort -n > ../readLenDist24_4A_control_S18_L008_R2.txt

graphing this distribution in /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/readLengthDist.py

commands used to graph the two samples and each of their reads:

./readLengthDist.py -R1 readLenDist15_3C_mbnl_S11_L008_R1.txt -R2 readLenDist15_3C_mbnl_S11_L008_R2.txt -o 15_3C_mbnl_S11_L008ReadLenDist.png
./readLengthDist.py -R1 readLenDist24_4A_control_S18_L008_R1.txt -R2 readLenDist24_4A_control_S18_L008_R2.txt -o 24_4A_control_S18_L008ReadLenDist.png

9/10/23

  • need to finish Part 2 question 7, put moving on to part 3 first:

Part 3:

  • installing following software onto QAA conda environment:
  • star
  • numpy
  • matplotlib
  • htseq
conda activate bgmp-QAA
conda install -c bioconda star
conda install numpy
conda install matplotlib
conda install -c bioconda htseq

The mouse files used: and downloaded to /projects/bgmp/roel/bioinfo/Bi623/QAA/part3

wget https://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/release-110/gtf/mus_musculus/Mus_musculus.GRCm39.110.gtf.gz

using bash script called ```starScript1.sh``
  • NOTE: look at Bi621 PS8 for instructions usign STAR
  • made this directory to store the alignment: /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.ens110.STAR_2.7.10b
  • naming file: the name of the assembly (e.g. “Danio_rerio.GRCz11.dna”)the release of Ensembl (e.g. “ens109”) and the version of STAR (e.g. “STAR_2.7.10b”).
  • unzip files fasta files: gunzip

slurm.out:

        /projects/bgmp/roel/miniconda3/envs/bgmp_star/bin/STAR-avx2 --runThreadN 8 --runMode genomeGenerate --genomeDir /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.ens110.STAR_2.7.10b --genomeFastaFiles /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile /projects/bgmp/roe
l/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.110.gtf
        STAR version: 2.7.10b   compiled: 2023-05-25T06:56:23+0000 :/opt/conda/conda-bld/star_1684997536154/work/source
Sep 10 13:28:36 ..... started STAR run
Sep 10 13:28:36 ... starting to generate Genome files
Sep 10 13:29:16 ..... processing annotations GTF
Sep 10 13:29:29 ... starting to sort Suffix Array. This may take a long time...
Sep 10 13:29:43 ... sorting Suffix Array chunks and saving them to disk...
Sep 10 13:39:31 ... loading chunks from disk, packing SA...
Sep 10 13:40:40 ... finished generating suffix array
Sep 10 13:40:40 ... generating Suffix Array index
Sep 10 13:43:24 ... completed Suffix Array index
Sep 10 13:43:24 ..... inserting junctions into the genome indices
Sep 10 13:45:57 ... writing Genome to disk ...
Sep 10 13:45:57 ... writing Suffix Array to disk ...
Sep 10 13:46:12 ... writing SAindex to disk
Sep 10 13:46:13 ..... finished successfully
        Command being timed: "STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.ens110.STAR_2.7.10b --genomeFastaFiles /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_mus
culus.GRCm39.110.gtf"
        User time (seconds): 5223.38
        System time (seconds): 50.05
        Percent of CPU this job got: 498%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 17:37.25
      ...
        Exit status: 0
~

using bash script called starScript2.sh
  • for 15_3C sample slurm out:
        STAR --runThreadN 8 --runMode alignReads --outFilterMultimapNmax 3 --outSAMunmapped Within KeepPairs --alignIntronMax 1000000 --alignMatesGapMax 1000000 --readFilesCommand zcat --readFilesIn /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/trimmed/15_3C_mbnl_S11_L008_R1_001.cut.trimmed.fastq.gz /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/trimmed/15_3C_mbnl_S11_L008_R2_001.cut.trimmed.fastq.gz --genomeDir /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.ens110.STAR_2.7.10b --outFileNamePrefix 15_3C_mbnl_S11_L008
        STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Sep 10 14:55:04 ..... started STAR run
Sep 10 14:55:04 ..... loading genome
Sep 10 14:55:42 ..... started mapping
Sep 10 14:56:41 ..... finished mapping
Sep 10 14:56:42 ..... finished successfully
        Command being timed: "STAR --runThreadN 8 --runMode alignReads --outFilterMultimapNmax 3 --outSAMunmapped Within KeepPairs --alignIntronMax 1000000 --alignMatesGapMax 1000000 --readFilesCommand zcat --readFilesIn /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/trimmed/15_3C_mbnl_S11_L008_R1_001.cut.trimmed.fastq.gz /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/trimmed/15_3C_mbnl_S11_L008_R2_001.cut.trimmed.fastq.gz --genomeDir /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.ens110.STAR_2.7.10b --outFileNamePrefix 15_3C_mbnl_S11_L008"
        User time (seconds): 398.41
        System time (seconds): 9.63
        Percent of CPU this job got: 418%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:37.53
       ...
        Exit status: 0

  • for 24_4A sample slurm out:
        STAR --runThreadN 8 --runMode alignReads --outFilterMultimapNmax 3 --outSAMunmapped Within KeepPairs --alignIntronMax 1000000 --alignMatesGapMax 1000000 --readFilesCommand zcat --readFilesIn /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/trimmed/24_4A_control_S18_L008_R1_001.cut.trimmed.fastq.gz /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/trimmed/24_4A_control_S18_L008_R2_001.cut.trimmed.fastq.gz --genomeDir /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.ens110.STAR_2.7.10b --outFileNamePrefix 24_4A_control_S18
        STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Sep 10 15:04:04 ..... started STAR run
Sep 10 15:04:04 ..... loading genome
Sep 10 15:05:32 ..... started mapping
Sep 10 15:06:55 ..... finished mapping
Sep 10 15:06:56 ..... finished successfully
        Command being timed: "STAR --runThreadN 8 --runMode alignReads --outFilterMultimapNmax 3 --outSAMunmapped Within KeepPairs --alignIntronMax 1000000 --alignMatesGapMax 1000000 --readFilesCommand zcat --readFilesIn /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/trimmed/24_4A_control_S18_L008_R1_001.cut.trimmed.fastq.gz /projects/bgmp/roel/bioinfo/Bi623/QAA/part2/trimmed/24_4A_control_S18_L008_R2_001.cut.trimmed.fastq.gz --genomeDir /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/Mus_musculus.GRCm39.ens110.STAR_2.7.10b --outFileNamePrefix 24_4A_control_S18"
        User time (seconds): 571.53
        System time (seconds): 12.50
        Percent of CPU this job got: 341%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:51.23
      ...
        Exit status: 0
~

using PS8 parseSam.py to get count of mapped and unmapped reads:

  • for 15_3C sample slurm out:
Mapped reads: 14436368
Unmapped reads: 400406
        Command being timed: "./parseSam.py -f 15_3C_mbnl_S11_L008Aligned.out.sam"
        User time (seconds): 15.46
        System time (seconds): 2.41
        Percent of CPU this job got: 92%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.34
   ...
        Exit status: 0
  • for 24_4A sample slurm out:
Mapped reads: 19780620
Unmapped reads: 710244
        Command being timed: "./parseSam.py -f 24_4A_control_S18Aligned.out.sam"
        User time (seconds): 21.50
        System time (seconds): 3.34
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.05
        ...
        Exit status: 0

htseq-count to count reads that mapped to features(genes)

htseq-count 15_3C_mbnl_S11_L008Aligned.out.sam Mus_musculus.GRCm39.110.gtf --stranded=yes

output:

__no_feature    6608774
__ambiguous     6210
__too_low_aQual 13834
__not_aligned   192826
__alignment_not_unique  325255

Now running bash script because it takes a while : /projects/bgmp/roel/bioinfo/Bi623/QAA/part3/htseqScript.sh

  • note: had to do cat slurm-54053.out | grep "__alignment_not_unique" to get each of the cases
15_3C_mbnl_S11_L008Aligned.out.sam, stranded=yes
__no_feature    6608774
__ambiguous     6210
__too_low_aQual 13834
__not_aligned   192826
__alignment_not_unique  325255

24_4A_control_S18Aligned.out.sam, stranded=yes
__no_feature    9046162
__ambiguous     7178
__too_low_aQual 11100
__not_aligned   349222
__alignment_not_unique  481470

15_3C_mbnl_S11_L008Aligned.out.sam, stranded=reverse
__no_feature    601623
__ambiguous     119790
__too_low_aQual 13834
__not_aligned   192826
__alignment_not_unique  325255

24_4A_control_S18Aligned.out.sam, stranded=reverse
__no_feature    874899
__ambiguous     151638
__too_low_aQual 11100
__not_aligned   349222
__alignment_not_unique  481470

9/12/23

  • use ICA4 for bash command to count amount of genes that map to a feature:
  • realized that my one htseq output file has all 4 files, so I will be sub-sectioning them: \

  • 15_3C_mbnl_S11_L008Aligned.out.sam, stranded=yes cat slurm-htseq.out | awk 'NR >= 1 && NR <= 57064 { print }' > htseq_15_3C_strandedyes.out

  • 24_4A_control_S18Aligned.out.sam, stranded=yes cat slurm-htseq.out | awk 'NR >= 57065 && NR <= 114156 { print }' > htseq_24_4A_strandedyes.out

  • 15_3C_mbnl_S11_L008Aligned.out.sam, stranded=reverse cat slurm-htseq.out | awk 'NR >= 114157 && NR <= 171220 { print }' > htseq_15_3C_strandedreverse.out

  • 24_4A_control_S18Aligned.out.sam, stranded=reverse cat slurm-htseq.out | awk 'NR >= 171221 && NR <= 228312 { print }' > htseq_24_4A_strandedreverse.out

Percentage of reads that map to features (parts from ICA4)

reads that mapped:

  • htseq_15_3C_strandedyes.out cat htseq_15_3C_strandedyes.out | grep -v "__" | awk '$2>0 {sum+=$2} END {print sum' = 271488 (genes that do NOT have 0 reads)

  • htseq_24_4A_strandedyes.out cat htseq_24_4A_strandedyes.out | grep -v "__" | awk '$2>0 {sum+=$2} END {print sum}' = 350300

  • htseq_15_3C_strandedreverse.out

cat htseq_15_3C_strandedreverse.out | grep -v "__" | awk '$2>0 {sum+=$2} END
 {print sum}'

= 6165059

  • htseq_24_4A_strandedreverse.out
{print sum}'``` = 8377103


Total reads:
- htseq_15_3C_strandedyes.out
```cat htseq_15_3C_strandedyes.out | awk '{sum+=$2} END {print sum}'``` = 7418387

- htseq_24_4A_strandedyes.out
```10245432```

- htseq_15_3C_strandedreverse.out
```7418387```

- htseq_24_4A_strandedreverse.out
```10245432```



Percentage of mapped reads:
- htseq_15_3C_strandedyes.out
271488/7418387 = 0.03659663 = 3.659663%

- htseq_24_4A_strandedyes.out
350300/10245432 = 0.03419085 = 3.419085%

- htseq_15_3C_strandedreverse.out
6165059/7418387 = 0.8310511 = 83.10511%

- htseq_24_4A_strandedreverse.out
8377103/10245432 = 0.8176427 = 81.76427%

The data are from strand-specific libraries because (percentages above. ^)
It is stranded and most of the reads came from the reverse strand
since 80% of reads reads came from reverse