
Inquiry About Character-Level Basis of Duplication Calculation #116

Open
luc1fer3 opened this issue Jul 23, 2024 · 1 comment

Comments

@luc1fer3

Hi, thank you for your release. I've been reviewing the method used to calculate the repetition score for identifying duplicate content in documents, specifically the segment that computes the score from the number of characters within duplicated n-grams:

import numpy as np  # the snippet assumes numpy; duplicated_grams appears to be a per-word 0/1 mask

word_lengths = np.array(list(map(len, document.normalized_words)))  # chars per normalized word
chars_duped = np.sum(word_lengths * duplicated_grams)               # chars inside duplicated n-grams
total_chars = np.sum(word_lengths)                                  # total chars in the document
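
For concreteness, here is a self-contained toy version of those three lines (normalized_words and the duplicated_grams mask below are made up purely for illustration):

import numpy as np

# Hypothetical normalized words; duplicated_grams marks, per word, whether
# that word falls inside a duplicated n-gram (1) or not (0).
normalized_words = ["the", "cat", "sat", "the", "cat", "sat", "down"]
duplicated_grams = np.array([1, 1, 1, 1, 1, 1, 0])

word_lengths = np.array(list(map(len, normalized_words)))  # [3 3 3 3 3 3 4]
chars_duped = np.sum(word_lengths * duplicated_grams)      # 18
total_chars = np.sum(word_lengths)                         # 22
print(chars_duped / total_chars)                           # 0.8181...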

I noticed that character counts (word_lengths) are used to determine the extent of duplication, so the metric operates at the granularity of characters rather than whole words. Could you help me understand the rationale behind choosing character-level analysis for this metric instead of basing the calculation directly on word counts? Are there specific advantages or scenarios where character-level detail reveals more about data quality or model-training effectiveness than word-level analysis would?

Looking forward to your insights.

@mauriceweber
Collaborator

Hi @luc1fer3, and thanks for your question. This repetition score measures the ratio between the number of characters that appear in duplicated n-grams and the total number of characters in the document. As such, the score carries information both at the character level and at the (word-)n-gram level. Computing character-based metrics essentially means you normalize at a finer level of granularity, taking into account more information than when counting words (e.g., think of long words that are repeated often). That said, a combination with word-level statistics may also give you a good indicator.
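
To make the long-word point concrete, here is a small sketch (with a made-up duplicated-n-gram mask, in the style of the snippet above) showing how the character-level ratio and a word-level ratio can diverge when a long token sits inside the duplicated span:

import numpy as np

# One long repeated token dominates the character count, but counts as
# just one word per occurrence.
words = ["internationalization", "internationalization", "a", "b", "c", "d"]
duplicated = np.array([1, 1, 0, 0, 0, 0])  # only the long token is in a duplicated n-gram

lengths = np.array([len(w) for w in words])
char_ratio = np.sum(lengths * duplicated) / np.sum(lengths)  # 40 / 44 ≈ 0.91
word_ratio = np.sum(duplicated) / len(words)                 # 2 / 6 ≈ 0.33
print(char_ratio, word_ratio)

Here the character-level score flags the document as heavily repetitive, while a purely word-level score would not.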

Hope this helps!
