Question regarding weight distribution #27

Nirmal2310 · 2024-10-22T20:09:09Z

Hey @Adoni5, I hope you are doing well. I have a basic question regarding weight distribution. Please forgive me if I understand it incorrectly.

As you mentioned in the README, the weight gives the likelihood of a read being drawn from the given target genome.
The distribution.json file that you added to the repo looks like this:
{"weights": [6264404, 5227293], "names": ["NC_002516.2", "NC_003997.3"]}

I am assuming that, since the ratio is ~1.2, if I generate 1,000,000 bp, roughly 454,546 bp will come from NC_003997.3 and 545,454 bp from NC_002516.2. Please let me know whether I have understood this correctly.

Secondly, suppose I want to create a mock community like the ZymoBIOMICS gut community, for which I know the concentration distribution across multiple species. How should I go about simulating this community with Icarust? One idea I have is to create a distribution.json file and add the ratios of the different genomes.

Can you tell me whether this is the right approach? If not, please advise me on how to go about it.

Adoni5 · 2024-10-23T10:52:15Z

Hi @Nirmal2310 - You have understood correctly! The ratio in this case is between species: 6264404 / (6264404 + 5227293) ≈ 0.545 for NC_002516.2 and 5227293 / (6264404 + 5227293) ≈ 0.455 for NC_003997.3.
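
To make the arithmetic concrete, here is a minimal Python sketch. It is not Icarust's own code: the function name `expected_split` is made up for illustration, and it assumes the weights act as simple sampling proportions and that both genomes share the same read length distribution, so a real run will only approximate these numbers.

```python
import json


def expected_split(distribution: dict, total_bp: int) -> dict:
    """Split total_bp across genomes in proportion to their weights."""
    weights = distribution["weights"]
    names = distribution["names"]
    total_weight = sum(weights)
    # Each genome's share is its weight divided by the sum of all weights.
    return {name: total_bp * w / total_weight for name, w in zip(names, weights)}


# Same layout as the distribution JSON quoted in the question.
distribution = json.loads(
    '{"weights": [6264404, 5227293], "names": ["NC_002516.2", "NC_003997.3"]}'
)

for name, bp in expected_split(distribution, 1_000_000).items():
    print(f"{name}: {bp:,.0f} bp")
```

With the exact weights, rather than the rounded ratio of 1.2, this gives about 545,124 bp for NC_002516.2 and 454,876 bp for NC_003997.3.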

If you are producing R9 data, it would absolutely work to just alter the weights in distributions.json; you could even just use 1, 2, 3, 4, 5, etc.
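
As a rough illustration of that, the Python sketch below builds a JSON string in the same weights/names layout from a table of relative abundances. The helper `build_distribution` and the contig names are hypothetical placeholders, and I am assuming here that each name has to match the corresponding reference sequence ID, as the accessions in the example file suggest.

```python
import json


def build_distribution(abundances: dict) -> str:
    """Serialise {sequence_name: relative_weight} into the weights/names layout."""
    names = list(abundances)
    weights = [abundances[name] for name in names]
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    # Weights are relative, so they do not need to sum to any particular total.
    return json.dumps({"weights": weights, "names": names})


# Hypothetical sequence names and relative abundances, for illustration only.
community = {"contig_A": 12, "contig_B": 12, "contig_C": 2}
print(build_distribution(community))
```

Only the ratios between the weights matter here, so 12, 12, 2 and 6, 6, 1 would describe the same community.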

If you are producing R10 data, you could instead list this in the Simulation Profile TOML, where each bacterium is a sample and the weight is given underneath each sample table.

Nirmal2310 · 2024-10-23T11:12:31Z

Hey @Adoni5, thank you so much for the reply. I will try this approach and get back to you if any problem occurs.

Nirmal2310 · 2024-11-25T06:40:24Z

Dear @Adoni5,
Sorry for reopening this issue. I am facing some problems with the simulation of metagenomic data and would like your opinion on this.
As mentioned, I was trying to simulate the ZymoBIOMICS mock community with an R10 flow cell. The TOML file looks like this:

output_path = "/DATA2/zymo_enrichment"
global_mean_read_length = 5000 #optional
random_seed = 10
target_yield = 100000000
working_pore_percent = 90 # optional (default 85)
pore_type = "R10" #Optional, one of R10 or R9, default R9

[parameters]
sample_name = "Zymobiomics"
experiment_name = "Zymobiomics_Normal"
flowcell_name = "AGE401"
experiment_duration = 4800 # unused currently
device_id = "MN37483"
position = "Bentasaurus"
break_read_ms = 400 # optional, default 400

[[sample]]
name = "Pseudomonas aeruginosa"
input_genome = "/DATA2/zymo_enrichment/Genomes/Pseudomonas_aeruginosa_complete_genome.fasta"  # Path to (directory of) FASTA file(s)
mean_read_length = 5000.0
weight = 12
uneven = false # Optional

[[sample]]
name = "Escherichia coli"
input_genome = "/DATA2/zymo_enrichment/Genomes/Escherichia_coli_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Salmonella enterica"
input_genome = "/DATA2/zymo_enrichment/Genomes/Salmonella_enterica_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Lactobacillus fermentum"
input_genome = "/DATA2/zymo_enrichment/Genomes/Lactobacillus_fermentum_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Enterococcus faecalis"
input_genome = "/DATA2/zymo_enrichment/Genomes/Enterococcus_faecalis_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Staphylococcus aureus"
input_genome = "/DATA2/zymo_enrichment/Genomes/Staphylococcus_aureus_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Listeria monocytogenes"
input_genome = "/DATA2/zymo_enrichment/Genomes/Listeria_monocytogenes_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Bacillus subtilis"
input_genome = "/DATA2/zymo_enrichment/Genomes/Bacillus_subtilis_complete_genome.fasta"
mean_read_length = 5000
weight = 12
uneven = false # Optional

[[sample]]
name = "Saccharomyces cerevisiae"
input_genome = "/DATA2/zymo_enrichment/Genomes/Saccharomyces_cerevisiae_complete_genome.fasta"
mean_read_length = 5000
weight = 2
uneven = false # Optional

[[sample]]
name = "Cryptococcus neoformans"
input_genome = "/DATA2/zymo_enrichment/Genomes/Cryptococcus_neoformans_complete_genome.fasta"
mean_read_length = 5000
weight = 2
uneven = false # Optional

Now, when I run Icarust for 6 hours using this TOML file, it produces only 5 fast5 files, each containing 4000 reads. Compared with a normal run, the throughput is far lower.
My question is whether this is normal behaviour when running Icarust in this manner. If not, what should I change in my TOML file?

Thank you so much for helping me out.

Nirmal2310 closed this as completed Oct 23, 2024

Nirmal2310 reopened this Nov 25, 2024
