I've been attempting to use SMALR to get SMsn scores. The script runs to completion without errors, but I have noticed a few odd things in the output that I am hoping you could clarify.
I am using a subread BAM file as input, and I have noticed that the number of rows in my output changes depending on how the --procs parameter is set.
For example, I ran SMALR on the provided test aligned_subreads.bam file using the command from the run_test_SMsn_bam.sh script, modified so that --procs is set to 1, 4, 8, 12, 16, 20, and 24. Each run produced an SMsn.out file of a different size.
cd ~/software/SMALR/test/
# Run the test once per --procs value and count the lines in SMsn.out
declare -a procs=(1 4 8 12 16 20 24)
for proc in "${procs[@]}"; do
    smalr -i --SMsn --motif=CATG --mod_pos=2 --procs="$proc" --useZMW -c 1 --wgaCovThresh=1 input_SMsn_bam.txt &>/dev/null
    echo -n "lines in output using --procs=$proc: "
    wc -l < lambda_NEB3011_SMsn/SMsn.out
done
#output
lines in output using --procs=1: 126
lines in output using --procs=4: 119
lines in output using --procs=8: 104
lines in output using --procs=12: 80
lines in output using --procs=16: 76
lines in output using --procs=20: 84
lines in output using --procs=24: 65
Is something going wrong when the program is chunking/combining results? I would expect that the number of processors used shouldn't have an effect on the final output.
I have also noticed, when using an aligned_subread.bam file from my own data to get SMsn scores (run using --procs=28), that a number of entries have the same Mol ID at the same position on the same strand multiple times.
My understanding is that SMALR is meant to merge subreads originating from the same ZMW read into one SMsn score per position. So it shouldn't be possible for the same Mol ID to appear twice in the results at the same position, right?
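A quick way to confirm this kind of duplication is to count how often each (strand, position, Mol ID) combination occurs. The sketch below is only an illustration: the function name `find_duplicate_keys`, the column order, and the example rows are assumptions made for the demo, not the documented SMsn.out layout, so `key_cols` would need to be adjusted to match the real file.

```python
from collections import Counter

def find_duplicate_keys(rows, key_cols=(0, 1, 2)):
    """Return {key: count} for every key that occurs more than once.

    rows     -- iterable of sequences (e.g. split lines of a results file)
    key_cols -- indices of the columns that should be unique together
                (hypothetical default: strand, position, molecule ID)
    """
    counts = Counter(tuple(row[i] for i in key_cols) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Made-up rows for illustration: (strand, position, molecule ID, score)
rows = [
    ("0", "999", "91061", "1.2"),
    ("0", "999", "91061", "0.8"),
    ("1", "450", "91061", "0.3"),
]
duplicates = find_duplicate_keys(rows)
print(duplicates)
```

If every molecule were scored once per position, the returned dictionary would be empty.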
Also, many of the values given for the mean subread length of the native molecule seem far too large.
For example, the molecule shown above is stated to have a mean subread length of 26789 in the last row, at position 999. None of the subreads from hole number 91061 are this long in the BAM. I would expect them to be <1200 bp.
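One way to sanity-check the reported value is to estimate subread lengths directly from the read names. This sketch assumes the standard PacBio subread naming scheme, `movie/holeNumber/qStart_qEnd`; the function name and the example names are made up for illustration. The span `qEnd - qStart` is the full subread length, so the aligned portion may be shorter.

```python
from collections import defaultdict

def mean_subread_length_by_zmw(read_names):
    """Map (movie, hole number) -> mean subread length.

    Assumes names of the form 'movie/holeNumber/qStart_qEnd'.
    """
    lengths = defaultdict(list)
    for name in read_names:
        movie, hole, span = name.split("/")[:3]
        q_start, q_end = (int(x) for x in span.split("_"))
        lengths[(movie, int(hole))].append(q_end - q_start)
    return {zmw: sum(v) / len(v) for zmw, v in lengths.items()}

# Hypothetical read names; in practice they would come from the BAM
names = [
    "movie1/91061/0_1100",
    "movie1/91061/1150_2250",
    "movie1/42/0_900",
]
means = mean_subread_length_by_zmw(names)
print(means)
```

Comparing these per-ZMW means against the values in SMsn.out should show whether the reported lengths could have come from a single molecule.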
I also noticed that, when using my aligned_subreads.bam, the header for the results often appears somewhere in the middle of the SMsn.out file rather than on the first line.
Thanks in advance for any help that you can provide. This tool should be very useful for my research; I just want to make sure things are running correctly and that I understand the results.
-Preston
In case anyone runs into this same issue, I figured out the cause of most of the odd results described above.
When SMALR divides subreads into chunks for multiprocessing, it does so based on the layout of the position-sorted BAM file, without trying to keep subreads from the same template in the same chunk. Because the chunks calculate SMsn scores in parallel, a template molecule whose subreads are split across multiple chunks gets multiple scores.
To avoid this, the program attempts to remove the scores for molecules that have subreads in different chunks. As a result, the higher --procs is set, the more chunks are created, the more molecules are split across them, and the more data is lost.
def remove_split_up_molecules( mols, split_mols ):
    """
    A few molecules have their alignments split between processes, so split_mols
    keeps track of these and now we'll exclude them from further analysis. This
    is a stop-gap measure until we can figure out a quick way of splitting the
    original alignments file more cleverly.
    """
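The effect described above can be reproduced with a toy model. The sketch below is not SMALR's code: it assumes that chunks are contiguous, equally sized slices of a position-sorted read list, and the name `split_molecules` and the ZMW IDs are invented for illustration.

```python
def split_molecules(zmw_ids, n_chunks):
    """Return the set of ZMW IDs whose reads land in more than one chunk.

    zmw_ids  -- ZMW ID of each read, in position-sorted order
    n_chunks -- number of contiguous, equally sized chunks (assumed scheme)
    """
    size = -(-len(zmw_ids) // n_chunks)  # ceiling division
    chunk_of = {}   # first chunk in which each ZMW was seen
    split = set()
    for i, zmw in enumerate(zmw_ids):
        chunk = i // size
        if chunk_of.setdefault(zmw, chunk) != chunk:
            split.add(zmw)
    return split

# In a position-sorted file, subreads from different ZMWs interleave
reads = [1, 2, 1, 3, 2, 4, 3, 4, 5, 6, 5, 6]
for n in (1, 2, 4):
    print(n, sorted(split_molecules(reads, n)))
```

With these made-up reads, one chunk splits no molecules, two chunks split 2 of the 6, and four chunks split 4 of the 6. That matches the trend in the line counts above: more chunks, more molecules discarded.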
As this was causing a significant amount of my data to be excluded from analysis, I made some modifications to SMALR that ensure subreads from the same molecule are kept within the same chunk, allowing parallel processing without loss of data.
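For anyone curious what molecule-aware chunking can look like, here is a minimal sketch of the general idea. It is not the actual modification made to SMALR: the function name `chunk_by_zmw`, the `(zmw_id, read)` input format, and the greedy balancing strategy are all assumptions chosen for illustration.

```python
from collections import defaultdict

def chunk_by_zmw(reads, n_chunks):
    """Distribute reads over n_chunks lists, never splitting a ZMW.

    reads -- iterable of (zmw_id, read) pairs in any order
    """
    by_zmw = defaultdict(list)
    for zmw, read in reads:
        by_zmw[zmw].append(read)

    chunks = [[] for _ in range(n_chunks)]
    # Greedy balancing: place the largest groups first, each into the
    # chunk that currently holds the fewest reads.
    for zmw, group in sorted(by_zmw.items(), key=lambda kv: -len(kv[1])):
        target = min(chunks, key=len)
        target.extend((zmw, r) for r in group)
    return chunks

# Made-up reads: ZMW IDs interleaved as in a position-sorted file
reads = [(z, f"read{i}") for i, z in enumerate([1, 2, 1, 3, 2, 4, 3, 4, 5, 6, 5, 6])]
chunks = chunk_by_zmw(reads, 4)
for i, chunk in enumerate(chunks):
    print(i, sorted({zmw for zmw, _ in chunk}))
```

Because every ZMW's subreads end up in exactly one chunk, no molecule needs to be discarded afterwards. The trade-off is that all reads must be grouped before any work is dispatched.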
Alignment: pbalign in SMRTLink v5.1.0
Command run on my own data: smalr -d --SMsn --motif=A --mod_pos=1 --useZMW --procs=28 -c 1 --wgaCovThresh=10 $sample_input.txt