cutoff or threshold for determining similarity #13

PlantHealth-Analytics · 2024-11-18T16:55:44Z

Thank you for providing this tool! I am curious if there is an established cutoff or threshold for determining similarity during analysis. Specifically, have you tested how well the tool can discriminate between closely related genomes in real-time? For example, have there been evaluations using raw signals to differentiate such genomes effectively?
Thanks. Jose C. Huguet-Tapia

canfirtina · 2024-11-24T19:53:15Z

Dear Jose @PlantHealth-Analytics,

Thank you for your interest and your question. For a quick answer, please jump directly to 2) under 2. Differentiating genomes.

I will answer your question in two parts with more detail. First, I will describe the thresholding mechanism used to determine the best mapping position for a read. Second, I will explain our efforts related to differentiating/classifying closely related genomes.

Thresholding mechanism. RawHash2 identifies the best mapping position for a read by generating a weighted score based on several metrics from the best chain (i.e., the chain with the highest score). If the weighted score exceeds a certain threshold, the read is considered mapped to the position covered by the best chain:

RawHash/src/rmap.cpp

Lines 492 to 497 in d3956ea

    
           if (weighted_sum >= opt->w_threshold || (opt->flag&RI_M_ALL_CHAINS && reg0->creg[ic].score >= opt->min_chaining_score2)) { 
        
           	reg0->n_maps++; 
        
           	reg0->maps = (ri_map_t*)ri_krealloc(0, reg0->maps, reg0->n_maps*sizeof(ri_map_t)); 
        
           	reg0->maps[reg0->n_maps-1].c_id = ic; 
        
           	// fprintf(stderr, "Aligned ic: %d\n", ic); 
        
           }

weighted_sum is a weighted summation of several metrics. Currently, we calculate:

1) The ratio of the mean score of all chains to the score of the best chain:

RawHash/src/rmap.cpp

Line 485 in d3956ea

    
           r_bestmc = (bestC > 0)?(1.0f - (meanC/bestC)):0.0f; if(r_bestmc < 0) r_bestmc = 0.0f;

2) The ratio of the mean MAPQ of all chains to the MAPQ of the best chain:

RawHash/src/rmap.cpp

Line 484 in d3956ea

    
           r_bestmq = (bestQ > 0)?(1.0f - (meanQ/bestQ)):0.0f; if(r_bestmq < 0) r_bestmq = 0.0f;

3) The ratio of the MAPQ of the best chain to 30 (half of the maximum MAPQ, which is 60):

RawHash/src/rmap.cpp

Line 483 in d3956ea

r_bestq = (bestQ > 0)?(bestQ/30.0f):0.0f; if(r_bestq > 1) r_bestq = 1.0f;

These ratios (metrics) are summed in a weighted manner to calculate weighted_sum:

RawHash/src/rmap.cpp

Line 487 in d3956ea

    
           weighted_sum = opt->w_bestq*r_bestq + opt->w_bestmq*r_bestmq + opt->w_bestmc*r_bestmc;

For each metric, we assign a corresponding weight to make the thresholding decision (similar to a simple perceptron). The default values of these weights are not shown in the help message (since this is an advanced feature), but you can still modify them using the arguments in the comments here (default values are also provided here):

RawHash/src/roptions.c

Lines 80 to 82 in d3956ea

    
           opt->w_bestmq=0.05f; //--w-bestmq 
        
           opt->w_bestmc=0.6f; //--w-bestmc 
        
           opt->w_bestq=0.35f; //--w-bestq

Please note that the weights should sum to 1 for the thresholding mechanism to work effectively.

Differentiating genomes. We evaluated relative abundance estimation based on the genomes that reads map to. The Euclidean distance between RawHash2's estimations and the ground truth is the smallest compared to previous tools. However, there are two potential caveats:

1) The mapping procedure is still at the base level. It is computationally expensive as it requires identifying chains (a costly operation) based on seed matches. To differentiate genomes, it should be possible to perform cheaper computations (e.g., seed voting). I tried this but did not succeed on the first attempt. My approach was to differentiate distant genomes by assigning a read to the genome with the highest seed match count. However, more sophisticated filters/selection strategies are likely needed.

2) I focused on differentiating distant genomes, not closely related ones. I am not entirely sure how well RawHash2 performs in such cases. What we do know is that performing DTW alignment after mapping significantly improves the confidence of relative abundance estimation. Although DTW from our earlier RawAlign work has been integrated into RawHash, it needs further optimization for RawHash2. You may want to integrate alignment and evaluate its performance.

I am happy to help if you would like to explore this further.

Thanks,
Can

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cutoff or threshold for determining similarity #13

cutoff or threshold for determining similarity #13

PlantHealth-Analytics commented Nov 18, 2024 •

edited

Loading

canfirtina commented Nov 24, 2024 •

edited

Loading

cutoff or threshold for determining similarity #13

cutoff or threshold for determining similarity #13

Comments

PlantHealth-Analytics commented Nov 18, 2024 • edited Loading

canfirtina commented Nov 24, 2024 • edited Loading

PlantHealth-Analytics commented Nov 18, 2024 •

edited

Loading

canfirtina commented Nov 24, 2024 •

edited

Loading