Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: a way to extract read-only k-mers #9

Open
bmansfeld opened this issue Jun 26, 2020 · 14 comments
Open

Question: a way to extract read-only k-mers #9

bmansfeld opened this issue Jun 26, 2020 · 14 comments
Labels
documentation Improvements or additions to documentation

Comments

@bmansfeld
Copy link

Hi again Dr. Rhie,
I'm having a great time using merqury to asses quality and completeness of different versions of our assembly. It's an awesome tool and it is allowing us to make more informed decisions about haplotig purging and other parameter tweaks. So thanks again!

As I was examining our spectra I noticed our assembly is missing some heterozygosity ie a black (read-only) peak at 1-copy.

I was wondering if there is a way to pull out those kmers specifically and then the reads they came from. My thought is that we could then assemble at least part of the missing heterozygosity using the short reads then add that to our haplotigs to get a more complete assembly.

Can you suggest a way to extract those kmers and their sequence for such a purpose?

Thanks in advance,
Ben

@arangrhie
Copy link
Contributor

arangrhie commented Jun 26, 2020

Hi Ben,

Glad to hear that merqury is being useful!

1. Collect all k-mers in your assembly

meryl count k=$k asm1.fasta output asm1.meryl
# If you have 2 haplotype assemblies, do the next as well
meryl count k=$k asm2.fasta output asm2.meryl
meryl union-sum asm1.meryl asm2.meryl output asm1_and_asm2.meryl

2. Subtract them from the read set

meryl difference read.meryl asm1.meryl output read.0.meryl
# or
meryl difference read.meryl asm1_and_asm2.meryl output read.0.meryl

read.0.meryl contains k-mer counts in the reads, not observed in your assemblies.
You can double check if the histogram matches what you saw in the spectra-cn or spectra-asm plot. meryl histogram generates the histogram.

3. Extract reads containing the read-only mers

meryl-lookup -include is designed for this purpose.

# For paired-end reads (which I assume you are planning to apply):
meryl-lookup -include -sequence my_R1.fastq.gz -mers read.0.meryl -sequence2 R2.fastq.gz -r2 read_only_R2.fastq.gz | pigz -c > read_only_R1.fastq.gz

# For single-end reads (useful if you have long-reads):
meryl-lookup -include -sequence my_read.fastq.gz -mers read.0.meryl | pigz -c > read_only.fastq.gz

Hope this helps!
Arang

@arangrhie arangrhie added the documentation Improvements or additions to documentation label Jun 26, 2020
@bmansfeld
Copy link
Author

Awesome! thanks so much for the quick and thoroughly laid out answer!

@arangrhie
Copy link
Contributor

Forgot to add the -c in pigz. Fixed in my above comment. Feel free to come back in case there are any other errors.
Will leave this issue open so I remember to document this somewhere.

@bmansfeld
Copy link
Author

bmansfeld commented Jul 3, 2020

Arang, I realize that this is out of the scope of Merqury issues, but I'm having some memory issues running the read extraction steps.

Meryl-lookup was using something like 260G of RAM (out of 400+ available) but crashed with a cannot allocate memory error:

-- Loading kmers from 'read.0.meryl' into lookup table.

 p       prefixes             bits gigabytes (allowed: 472 GB)
-- -------------- ---------------- ---------
22        4194304     137555580373    16.014
23        8388608     134303832626    15.635
24       16777216     131320520335    15.288
25       33554432     128874078956    15.003
26       67108864     127501379401    14.843 (smallest)
27      134217728     128276163494    14.933
28      268435456     133345914883    15.524
29      536870912     147005600864    17.114
30     1073741824     177845156029    20.704
31     2147483648     243044449562    28.294
32     4294967296     376963219831    43.884
33     8589934592     648320943572    75.474
34    17179869184    1194556574257   139.065
35    34359738368    2290548018830   266.655 (used)
36    68719476736    4486051091179   522.245
37   137438953472    8880577419080  1033.835
38   274877906944   17673150258085  2057.425
39   549755813888   35261816119298  4105.016
-- -------------- ---------------- ---------

For 3520183203 distinct 20-mers (with 35 bits used for indexing and 5 bits for tags):
  266.655 GB memory
  256.000 GB memory for index (34359738368 elements 64 bits wide)
    2.049 GB memory for tags  (3520183203 elements 5 bits wide)
    8.606 GB memory for data  (3520183203 elements 21 bits wide)

Will load 3520183203 kmers.  Skipping 0 (too low) and 0 (too high) kmers.
Allocating space for 3520183203 suffixes of 5 bits each -> 17600916015 bits (2.049 GB) in blocks of 32.000 MB
                     3520183203 values   of 21 bits each -> 73923847263 bits (8.606 GB) in blocks of 32.000 MB
Loaded 3520183203 kmers.  Skipped 0 (too low) and 0 (too high) kmers.
-- Opening sequences in 'R1.uniq.fastq.gz'.
ERROR:  Failed to open input file 'R1.uniq.fastq.gz': Cannot allocate memory

I'm using meryl from here: https://github.com/marbl/meryl

Any advice?
Thanks in advance!
Ben

@bmansfeld
Copy link
Author

Arang, I've just upped the memory requested to 512G and seems to be running well.
I'll update if something still fails.
Thanks in anycase
Ben

@arangrhie
Copy link
Contributor

Hi Ben,

Sorry for the delayed reply. You can set -memory option in meryl-lookup.
For your example, you could use -memory 20 which will limit to use at maximum 20G.

Arang

@bmansfeld
Copy link
Author

Not a delayed response at all - Nothing to apologize for.
Thanks I found that -memory option and raised it to match my requested memory from the scheduler and all seems to run well now but rather slow.
I wonder if we are wasting time extracting a lot of "error-kmer" reads (ie those w/ less multiplicity than the 1-copy mers)?
I'm looking in to first clipping the data base using meryl greater-than N and then running lookup.
What do you think? is this a smart strategy or am I going to loose a lot of data that might be useful for the purpose of assembling those reads?
Thanks in advance,
Ben

@arangrhie
Copy link
Contributor

I think that's a fair strategy. I'd have done the same thing.
From the histogram (meryl histogram read.0.meryl), I'll pick the multiplicity N (first col) which has the minimum count (2nd col).
All k-mers below this N are likely erroneous anyway.

You could also limit the high-copy k-mers, with meryl less-than, to only include the single copy k-mers to avoid reads from repetitive regions not assembled.

Arang

@bmansfeld
Copy link
Author

Awesome thanks!

@kneubehl
Copy link

kneubehl commented Aug 25, 2021

Hi there,

Has the syntax for meryl-lookup changed? I updated to v1.3 and ran the same commands as I have in the past that worked fine but now it seems parts to the command aren't being recognized. I have attached a screenshot of the error. Any help would be appreciated. Thanks.
meryl error

@arangrhie
Copy link
Contributor

Hello @kneubehl , it seems like your meryl is not updated, or used from the previous path. Could you double check this is the path where v1.3 was installed?

@kneubehl
Copy link

kneubehl commented Sep 15, 2021 via email

@arangrhie
Copy link
Contributor

arangrhie commented Sep 15, 2021

Alex, this is the help message for meryl-lookup in v1.3:

usage: ./meryl-lookup <report-type> \
         -sequence <input1.fasta> [<input2.fasta>] \
         -output   <output1>      [<output2>] \
         -mers     <input1.meryl> [<input2.meryl>] [...] [-estimate] \
         -labels   <input1name>   [<input2name>]   [...]

  Compare kmers in input sequences against kmers in input meryl databases.

  Input sequences (-sequence) can be FASTA or FASTQ, uncompressed, or
  compressed with gzip, xz, or bzip2.

  Report types:

  -bed:
     Generate a BED format file showing the location of kmers in
     any input database on each sequence in 'input1.fasta'.

  -wig-count:
     Generate a WIGGLE format file showing the multiplicity of the
     kmer starting at each position in the sequence, if it exists in
     an input kmer database.

  -wig-depth:
     Generate a WIGGLE format file showing the number of kmers in
     any input database that cover each position in the sequence.

  -existence:
     Generate a tab-delimited line for each input sequence with the
     number of kmers in the sequence, in the database and common to both.

  -include:
  -exclude:
     Copy sequences from 'input1.fasta' (and 'input2.fasta') to the
     corresponding output file if the sequence has at least one kmer
     present (include) or no kmers present (exclude) in 'input1.meryl'.

Run `./meryl-lookup <report-type> -help` for details on each method.

No report-type (-bed, -wig-count, -wig-depth, -existence, -include, -exclude) supplied.

How did you install meryl?

@arangrhie
Copy link
Contributor

Also please use this script for excluding reads having kmers from a given kmer db.
The syntax is:

meryl-lookup -exclude -sequence $read1 $read2 -mers $meryl -output ${out_1} ${out_2}

for paired-end (short) reads

or

meryl-lookup -exclude -sequence $read1 -mers $meryl -output ${out_1}

for single-end (long) reads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

3 participants