-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Difference in 2 and 3 copy kmers between read data sets #40
Comments
Hi Jenni, The spectra-cn.hist and only.hist reflects what was found in the assembly. Hope this answer your question! |
Thanks Arang. I think maybe I wasn't clear. I'm looking to compare Genome A with Genome B, where genome B should be an improved, less haplotypic version of genome A. So I thought I might be able to do that by comparing the number of kmers which occur twice in genome A with genome B etc. As I understand it, the spectra-cn.hist output file contains the information required for the layered histogram plots - the first column relating to the colour (read-only/1/2/3), the next the multiplicity of the kmers in the read set and the third the total count of distinct kmers. While the only.hist output file contains those kmers found in the assembly, but not the read data, and only for kmers found once or twice in the genome assembly. So, I was looking to extract the total number of kmers which occured in the genome twice - adding the kmers occuring twice in the only.hist file with those occuring twice ('2' in column 2) in the spectra-cn.hist file. And then compare the data for genome A and genome B, to see if there was a reduction in the number of 2-copy kmers found in the assembly. But first I decided to look at genome A twice, using two different sets of read data. Which was when I found that the number of kmers occurring twice in genome A was different for each read set. I don't understand why this is - apologies if I've totally missed the point! If so, is there a way to compare the genomes like this - to see for reduction of the falsely duplicated regions? Thanks, |
Hi Jenni, Is the purpose to compare the falsely duplicated kmers in each assembly A and B?
I assume your "haplotypic version of genome A" is a pseudo-haplotype or a haploid representation of a diploid genome. The asm_only.hist are considered as artificial kmers (consensus error in the assembly), that do not exist in a genome. Of note - So if you want to extract the num. of kmers occuring twice in your assembly,
is doing what you want. This is extracting lines where the 1st column = 2, and adds up all the kmers in the 3rd column where the 2nd column is smaller than the cutoff. Might be easier to explain if you could share your spectra-cn line plot. Thanks, |
Hi Arang, Thanks for your reply and sorry for my delayed response - had several things on and needed to think about it.
Yes, that's exactly what I'm wanting to do. Our worms are quite heterozygous (the plots also look quite odd - might be issues with sequencing, not sure, maybe just low sequencing and a peak is hidden in the 'error' peak...??). However, we basically have an assembly which we know still contains a lot of falsely duplicated regions. It's been made from a pool of worms using PacBio, and we don't expect to find all the kmers in the Illumina data to be also present in the assembly. So, I'm more interested in the overall duplicated kmers in the assembly, rather than the ones only found in the read data. I think I've been doing what you suggest above using the spectra-cn.hist file. (Do your false duplication.sh script and the awk command essentially do the same thing)? I used awk to first extract the lines in the spectra-cn.hist file with $1==2, and then summed $3 separately from the extracted lines. I then added the 2-copy data from the only.hist file. But I found that my total differed if I used different read data on the same genome, and I can't understand why that would be happening. This is my plot, and my assumption.... Thanks again, |
Hi Jenni, Sorry for the delay! There was a lot going on.. See this thread to obtain better visibility on your spectra-cn plot. Anyways, it seems like your read dataset is containing a lot more kmers that aren't seen in the assembly, To collect false duplications, I suggested to use
This is collecting all kmers seen twice in the assembly. Without limiting to the ~2copy region, it's likely you are counting kmers that should be found multiple times in an assembly.
This is actually representing the number of distinct kmers found multiple times only in the assembly. However, my point was that without knowing what is expected to be found once in the assembly, it's impossible to know what is duplicated. We can't assume the correct copy number of the assembly-only kmers are always 1. Thanks, |
Hi Arang, Thanks! I realise the difference now between your false_duplications and spectra_cn file. That's a very good point about the 2-copy vs the read 2-copy. For the plots - thanks for the link to the other thread. I'd actually found it after I'd posted the above. I don't know if there is an issue with the plotting script though. I'd already been using the -m, -n and -z to re-plot the plots before, but couldn't get the y-axis to scale as I requested, although the x-axis always did. After finding that thread I tried it without the asm file and the y-axis scaled beautifully. I'm not sure if it's picking it up to plot the plot, then over-riding it with my asm errors - the plots would scale as I requested if the error bar was less than the specified y-axis limit. Thanks again for the tool! |
Ah right, yes, I purposely made the Y scale to include the maximum from the error (asm-only) when it is given. Enjoy evaluating your assemblies! |
Thanks Arang! It's been really good. :D |
Hi,
I was looking to see what else I could pull out of the data and decided to compare the 2-copy and 3-copy kmers in different genomes.
For this I used the output files:
genome.genome.spectra-cn.hist
genome.genome.only.hist
And then I used awk to extract 2 and 3 copy kmers from the spectra-cn.hist file and sum them ($3 in file).
For the 2-copy kmers I then added the 2-copy number from the only.hist file.
However, using the same genome, but two different data sets I found different results:
I can't understand why this would be? I'm assuming that the 2 and 3 copy etc are related to the number of times they occur in the genome. So, for the 2-copy, although the output in only.hist will be different between different read sets, I don't understand why the sum of all the 2-copy kmers in both files would change.
What am I getting wrong?
Thanks!
Jenni
The text was updated successfully, but these errors were encountered: