Question regarding weight distribution #27

Open
Nirmal2310 opened this issue Oct 22, 2024 · 3 comments

Comments

@Nirmal2310

Hey @Adoni5, I hope you are doing well. I have a basic question regarding weight distribution. Please forgive me if I have misunderstood it.

As you mention in the README, the weight gives the likelihood of a read being taken from a given target genome.
The distribution.json file that you added to the repo looks like this:
{"weights": [6264404, 5227293], "names": ["NC_002516.2", "NC_003997.3"]}

I am assuming that, since the ratio is ~1.2, if I generate 1,000,000 bp, about 454,546 bp will come from NC_003997.3 and about 545,454 bp from NC_002516.2. Please let me know if I have understood correctly.

Secondly, suppose I want to create a mock community, like the ZymoBIOMICS gut community, for which I know the concentration distribution across multiple species. How should I go about simulating this community through Icarust? One idea I have is to create a distribution.json file and add the ratios of the different genomes.

Can you tell me whether this is the right approach? If not, please advise me on how to go about it.

@Adoni5
Contributor

Adoni5 commented Oct 23, 2024

Hi @Nirmal2310 - You have understood correctly! The ratio in this case is between species, so 6264404 / (6264404 + 5227293) for NC_002516.2 and 5227293 / (6264404 + 5227293) for NC_003997.3.
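The normalisation described above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Icarust's own code: the helper name `expected_split` is hypothetical, and it treats each weight as the expected share of the total output, which is a simplification when read lengths differ between genomes.

```python
def expected_split(weights, names, total_bases):
    """Split total_bases between genomes in proportion to their weights."""
    total_weight = sum(weights)
    return {
        name: total_bases * weight / total_weight
        for name, weight in zip(names, weights)
    }


# Weights and names taken from the distribution.json quoted in the question.
split = expected_split(
    [6264404, 5227293],
    ["NC_002516.2", "NC_003997.3"],
    1_000_000,
)

for name, bases in split.items():
    print(f"{name}: {bases:,.0f} bp")
```

With the exact weights the split comes out at roughly 545,000 bp and 455,000 bp, close to the figures obtained from the rounded ratio of 1.2.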

If you are producing R9 data, it would absolutely work to just alter the distributions.json weights; you could even just use 1, 2, 3, 4, 5, etc.

If you are producing R10 data, you could instead list this in the Simulation Profile TOML, where each bacterium is a sample and the weight is given underneath each sample table.
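As a rough illustration of the R9 route, the sketch below builds a weights file in the same shape as the JSON quoted in the question (`"weights"` and `"names"` lists). The genome names are placeholders, the helper names `make_distribution` and `normalise` are hypothetical, and the assumption that the names should match the sequence identifiers in the reference is inferred from the accession numbers in the example rather than confirmed in this thread.

```python
import json


def make_distribution(abundances):
    """Build a dict with parallel 'weights' and 'names' lists."""
    names = list(abundances)
    return {"weights": [abundances[name] for name in names], "names": names}


def normalise(weights):
    """Convert raw weights to fractions that sum to 1."""
    total = sum(weights)
    return [weight / total for weight in weights]


# Placeholder genome names with simple relative weights.
community = {"genome_A": 1, "genome_B": 2, "genome_C": 3}
distribution = make_distribution(community)

print(json.dumps(distribution))
print(normalise(distribution["weights"]))
```

Because only the ratios matter after normalisation, weights of 1, 2, 3 and 10, 20, 30 describe the same community.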

@Nirmal2310
Author

Hey @Adoni5, thank you so much for the reply. I will try this approach and get back to you if any problems occur.

@Nirmal2310
Author

Dear @Adoni5,
Sorry for reopening this issue. I am facing some problems with the simulation of metagenomic data and would like your opinion.
As mentioned, I am trying to simulate the ZymoBIOMICS mock community with an R10 flow cell. The TOML file looks like this:

output_path = "/DATA2/zymo_enrichment"
global_mean_read_length = 5000 #optional
random_seed = 10
target_yield = 100000000
working_pore_percent = 90 # optional (default 85)
pore_type = "R10" #Optional, one of R10 or R9, default R9

[parameters]
sample_name = "Zymobiomics"
experiment_name = "Zymobiomics_Normal"
flowcell_name = "AGE401"
experiment_duration = 4800 # unused currently
device_id = "MN37483"
position = "Bentasaurus"
break_read_ms = 400 # optional, default 400

[[sample]]
name = "Pseudomonas aeruginosa"
input_genome = "/DATA2/zymo_enrichment/Genomes/Pseudomonas_aeruginosa_complete_genome.fasta"  # Path to (directory of) FASTA file(s)
mean_read_length = 5000.0
weight = 12
uneven = false # Optional

[[sample]]
name = "Escherichia coli"
input_genome = "/DATA2/zymo_enrichment/Genomes/Escherichia_coli_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Salmonella enterica"
input_genome = "/DATA2/zymo_enrichment/Genomes/Salmonella_enterica_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Lactobacillus fermentum"
input_genome = "/DATA2/zymo_enrichment/Genomes/Lactobacillus_fermentum_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Enterococcus faecalis"
input_genome = "/DATA2/zymo_enrichment/Genomes/Enterococcus_faecalis_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Staphylococcus aureus"
input_genome = "/DATA2/zymo_enrichment/Genomes/Staphylococcus_aureus_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Listeria monocytogenes"
input_genome = "/DATA2/zymo_enrichment/Genomes/Listeria_monocytogenes_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Bacillus subtilis"
input_genome = "/DATA2/zymo_enrichment/Genomes/Bacillus_subtilis_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Saccharomyces cerevisiae"
input_genome = "/DATA2/zymo_enrichment/Genomes/Saccharomyces_cerevisiae_complete_genome.fasta"
mean_read_length = 5000
weight = 2
uneven = false # Optional

[[sample]]
name = "Cryptococcus neoformans"
input_genome = "/DATA2/zymo_enrichment/Genomes/Cryptococcus_neoformans_complete_genome.fasta"
mean_read_length = 5000
weight = 2
uneven = false # Optional

Now, when I run Icarust for 6 hours using this TOML file, it produces only 5 FAST5 files, each containing 4000 reads. Compared with a normal run, the throughput is far lower.
My question is whether this is normal behaviour when running Icarust in this manner. If not, what should I change in my TOML file?
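A quick arithmetic check on the profile above may help when comparing runs. It is a sketch under two assumptions that this thread does not confirm: that `target_yield` is counted in bases, and that the simulation stops producing reads once that yield is reached. The sample weights and the two settings are copied from the TOML above.

```python
# Weights copied from the [[sample]] tables in the profile.
weights = {
    "Pseudomonas aeruginosa": 12,
    "Escherichia coli": 12,
    "Salmonella enterica": 12,
    "Lactobacillus fermentum": 12,
    "Enterococcus faecalis": 12,
    "Staphylococcus aureus": 12,
    "Listeria monocytogenes": 12,
    "Bacillus subtilis": 12,
    "Saccharomyces cerevisiae": 2,
    "Cryptococcus neoformans": 2,
}

# Expected share of reads per sample after normalising the weights.
total = sum(weights.values())
shares = {name: weight / total for name, weight in weights.items()}

# Assumption: target_yield is in bases and caps the run.
target_yield = 100_000_000
mean_read_length = 5000
expected_reads = target_yield // mean_read_length

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
print(f"Reads needed to reach target_yield: {expected_reads}")
```

Under those assumptions the profile calls for about 20,000 reads in total, which is the same number as 5 files of 4000 reads each.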

Thank you so much for helping me out.
