Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Odd results of --SMsn using BAM #15

Open
pbasting opened this issue Aug 28, 2018 · 1 comment
Open

Odd results of --SMsn using BAM #15

pbasting opened this issue Aug 28, 2018 · 1 comment

Comments

@pbasting
Copy link

Hi,

I've been attempting to use SMALR to get SMsn scores. The script runs to completion without errors, but I have noticed a few odd things in the output that I am hoping you could clarify.

  • I am using a subread BAM file as input, and I have noticed that the number of rows produced in my output changes depending upon how the --procs parameter is set
  • For example, I ran SMALR with the provided test aligned_subreads.bam file using the command found in the run_test_SMsn_bam.sh script modified so --procs is set to 1,4,8,12,16,20 and 24. Each of these resulted in a different sized SMsn.out file.

cd ~/software/SMALR/test/

declare -a procs=(1 4 8 12 16 20 24)
for proc in "${procs[@]}";
do
    smalr -i --SMsn --motif=CATG --mod_pos=2 --procs=$proc --useZMW -c 1 --wgaCovThresh=1 input_SMsn_bam.txt &>/dev/null;
    echo -ne "lines in output using --procs="$proc": ";
    cat lambda_NEB3011_SMsn/SMsn.out | wc -l; 
done
#output
lines in output using --procs=1: 126
lines in output using --procs=4: 119
lines in output using --procs=8: 104
lines in output using --procs=12: 80
lines in output using --procs=16: 76
lines in output using --procs=20: 84
lines in output using --procs=24: 65
  • Is something going wrong when the program is chunking/combining results? I would expect that the number of processors used shouldn't have an effect on the final output.

I have also noticed when using an aligned_subread.bam file from my own data to get SMsn scores (run using --procs=28), that I have a number of entries that have the same Mol ID at the same position in the same strand multiple times.

  • My understanding is that SMALR is meant to merge subreads originating from the same ZMW read together into one SMsn score per position. So it shouldn't be possible for the mol ID to appear twice in the results at the same position, right?
strand	pos	score	mol	nat	wga	N_nat	N_wga	subread_len
1	999	0.076	91061_m170621_003657_42154_c101200072550000001823279810241715_s1_p0	0.475	0.399	7	755	24166
1	999	0.684	91061_m170621_003657_42154_c101200072550000001823279810241715_s1_p0	1.083	0.399	1	755	1156
1	999	0.749	91061_m170621_003657_42154_c101200072550000001823279810241715_s1_p0	1.148	0.399	1	755	26789

  • Also many of the values given for mean subread length of native molecule seem far too large.

    • ex. the molecule shown above is stated to have a mean subread length of 26789 in the last row at position 999. None of the subreads from hole number 91061 have a length this large in the BAM. I would expect them to be <1200 bp.
  • I also noticed that the header for the results often appears somewhere in the middle of the SMsn.out files rather than in the first line when using my aligned_subreads.bam

$ cat SMsn.out | grep -A 5 -B 5 "strand"
1	0	-4.389	117760_m170620_182037_42154_c101200072550000001823279810241714_s1_p0	-4.589	-0.200	1	395	315
1	0	-4.398	147220_m170621_003657_42154_c101200072550000001823279810241715_s1_p0	-4.598	-0.200	1	395	862
1	0	-4.463	124484_m170621_003657_42154_c101200072550000001823279810241715_s1_p0	-4.664	-0.200	1	395	1225
1	0	-4.529	120756_m170621_003657_42154_c101200072550000001823279810241715_s1_p0	-4.730	-0.200	1	395	1181
1	0	-4.743	150307_m170621_003657_42154_c101200072550000001823279810241715_s1_p0	-4.943	-0.200	1	395	1355
strand	pos	score	mol	nat	wga	N_nat	N_wga	subread_len
1	3	-0.004	52184_m170621_003657_42154_c101200072550000001823279810241715_s1_p0	-0.246	-0.242	3	591	11463
1	3	0.005	147823_m170620_182037_42154_c101200072550000001823279810241714_s1_p0	-0.237	-0.242	1	591	1369
1	3	-0.010	25856_m170621_003657_42154_c101200072550000001823279810241715_s1_p0	-0.252	-0.242	3	591	6043
1	3	-0.012	62736_m170620_120417_42154_c101200072550000001823279810241713_s1_p0	-0.254	-0.242	1	591	1171
1	3	-0.013	17313_m170620_182037_42154_c101200072550000001823279810241714_s1_p0	-0.255	-0.242	1	591	1167

  • It may be worth noting that I am producing the BAM file I am using as input by alignment of subreads using pbalign in SMRTLink v5.1.0
  • The SMsn scores for my data were produced using the following SMALR parameters:
    • smalr -d --SMsn --motif=A --mod_pos=1 --useZMW --procs=28 -c 1 --wgaCovThresh=10 $sample_input.txt

Thanks in advance for any help that you can provide. This tools should be very useful for my research I just want to make sure things are running correctly and that I understand the results.

-Preston

@pbasting
Copy link
Author

  • In case anyone runs into this same issue, I figured out the cause of most of the odd results described above.

  • When SMALR is dividing up subreads into chunks for multiprocessing, it's doing so based on the structure of the position-sorted bam file without attempting to group subreads from the same template to the same chunks. As the chunks are calculating the SMsn scores in parallel, if a template molecules subreads are split between multiple chunks, then multiple scores will be calculated for the same template molecule.

  • To avoid this, the program attempts to remove the scores for molecules that have subreads in different chunks, thus data is lost as more --procs are set due to more chunks being created and more split subread data.

  • See line 197-203 of parse_mol_aligns.py

def remove_split_up_molecules( mols, split_mols ):
	"""
	A few molecules have their alignments split between processes, so split_mols
	keeps track of these and now we'll exclude them from further analysis. This
	is a stop-gap measure until we can figure out a quick way of splitting the 
	original alignments file more cleverly.
	"""
  • As this was causing a significant amount of my data to be excluded from analysis, I made some modifications to SMALR that ensures that subreads from the same molecule are kept within the same chunk, thus allowing parallel processing without loss of data.
  • This fix resolved both the loss of data issues I reported earlier, as well as the duplicate scores I was finding in my output.
  • I still haven't resolved the subread length issue I mentioned in the previous post

-Preston

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant