Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distiller (bwa mem) cannot handle Hi-C data sequenced with 25bp #141

Open
sergpolly opened this issue May 14, 2019 · 9 comments
Open

distiller (bwa mem) cannot handle Hi-C data sequenced with 25bp #141

sergpolly opened this issue May 14, 2019 · 9 comments

Comments

@sergpolly
Copy link
Member

Rather major stuff! as discovered by @Marlies1993
bwa mem yields empty alignments for sequences of 25bp ...
apparently it isn't designed for 25bp, and something like bwa aln should be used instead ...

@golobor @nvictus @mimakaev any thought on that ? switch to minimap ? I know nothing about bwa aln ...

we'll try to proceed with bowtie2 for this dataset and will try to avoid 25bp in future, but I think this issue is important enough to consider and address

@meoomen
Copy link

meoomen commented May 14, 2019

Bowtie2 command that did work for mapping 25bp reads:
bowtie2 -x $genome -1 $read_1.fastq.gz -2 $read_2.fastq.gz

Thank you!

@Phlya
Copy link
Member

Phlya commented May 14, 2019

Is there currently a supported sequencer that produces 25 bp reads? I haven't heard of any data in the last ~5 years with such short reads!

@meoomen
Copy link

meoomen commented May 14, 2019

It's just regular MiSeq paired end run. Since I'm working with a small genome and just testing several conditions, I figured it would be sufficient. But now I wish I would have used longer reads (>50bp)...

@Phlya
Copy link
Member

Phlya commented May 15, 2019

Author of minimap says bwa-mem should be better for Hi-C.
"Furthermore, I also realize that bwa-mem will be better than minimap2 at Hi-C alignment because bwa-mem is more sensitive to short matches."
https://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa

@golobor
Copy link
Member

golobor commented May 16, 2019 via email

@sergpolly
Copy link
Member Author

sergpolly commented May 20, 2019

@golobor @Phlya @Marlies1993 after a little chat with Job - this is no longer URGENT, but should probably be addressed anyways - couple of arguments:

  • more people are looking to use these little personal sequencing machines (MiSeq etc) to run quick preliminary things and those are often 25bp
  • or it might be economical to do 25bp and the only option for the above case
  • 25 bp might be more than sufficient for some species - bacteria/archaea/yeast etc - why go 50/100 then ?

in our case it seems like bowtie2 did the job ok (for quick-n-dirty prelim check, at least), according to @Marlies1993 - we just plugged it in, instead of bwa: https://github.com/dekkerlab/distiller-nf/blob/a510c40471d267397164916d770bbe0d06dbb539/distiller.nf#L469
and rebuild the docker with bowtie2 inside

Importantly, mapping with aln is two-step: first, you run bwa aln and get .sai files, then you run bwa samse/sampe and convert ,sai files to .sams. We could introduce a switch to the logic of distiller if you feel that it could be important!

it's a shame it's not "plug n play" between bwa mem and bwa aln - but it seems like the best option to do this switch from mem to aln - with all of the intermediate steps - according to this story that @Phlya found #141 (comment)

@golobor
Copy link
Member

golobor commented May 21, 2019

Thank you for this update! These arguments seem very reasonable, indeed.
The only issue with bowtie2 that I see is that it requires a different index - but it seems to be working just fine, right?
re: bwa aln - we can hide away all of its logic into an .sh script that would create temp .sai files and then convert them into .sams, streaming out the result.

@agalitsyna
Copy link
Member

Hi, this is not the problem for MiSeq small reads only. In methods like Hi-CO (https://doi.org/10.1016/j.cell.2018.12.014) the length of meaningful read parts is required to be 15-36 bp due to complex ligation procedure and adaptors trimming.
bwa aln seems to handle the problem if inserted into distiller code at the step of mapping:

    bwa aln -t ${bwa_threads} ${bwa_index_base} ${fastq1} > ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.1.sai

    bwa aln -t ${bwa_threads} ${bwa_index_base} ${fastq2} > ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.2.sai

    bwa sampe ${bwa_index_base} ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.1.sai ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.2.sai ${fastq1} ${fastq2} \
        | pairtools parse ${dropsam_flag} ${dropreadid_flag} ${dropseq_flag} \
            ${parsing_options} \
            -c ${chrom_sizes} \
            | pairtools sort --nproc ${sorting_threads} \
                             -o ${library}.${run}.${ASSEMBLY_NAME}.${chunk}.pairsam.${suffix} \
                             --tmpdir \$TASK_TMP_DIR \
            | cat```

@agalitsyna
Copy link
Member

Hi, there is now a test distiller option for short reads mapping with bwa aln (see long_reads option of map):
https://github.com/open2c/distiller-nf/tree/mapping-options

Any suggestions/improvements and testing are much appreciated!
@Phlya @Marlies1993 @sergpolly @golobor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants