Using pre-tokenized queries / documents does not work at the moment #50

mam10eks · 2024-07-30T11:14:18Z

This commit adds some failing unit tests: 4a747d4

This should be simple to resolve. We load the term pipeline from the Terrier index, and we implemented that at a time when the pre-tokenized feature was not yet available in PyTerrier, so we likely use the wrong pipeline when pre-tokenized is specified.


mam10eks · 2024-07-30T11:15:06Z

cc @Parry-Parry, @heinrichreimer.

mam10eks · 2024-07-30T11:19:27Z

Alright, for pretokenized indexes, termpipelines= is in the index/data.properties file, and in this case ir_axioms uses a default term-pipeline that applies some normalization.
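To make the distinction concrete, here is a minimal sketch of how one could tell an empty `termpipelines=` entry apart from a missing one when reading the properties text. The function name `read_term_pipeline` is hypothetical and not part of ir_axioms or PyTerrier, and the parsing only covers simple `key=value` lines, not the full Java properties syntax.

```python
from typing import List, Optional


def read_term_pipeline(properties_text: str) -> Optional[List[str]]:
    """Return the stages listed under `termpipelines` in a properties text.

    Hypothetical helper: returns None if the key is absent and an empty
    list if the key is present but has no value (`termpipelines=`).
    """
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and properties-style comments.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "termpipelines":
            return [stage.strip() for stage in value.split(",") if stage.strip()]
    return None
```

With this, an empty list could be treated as "apply no normalization" rather than falling back to a default pipeline, which is the case that seems to go wrong here.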

@heinrichreimer Do you have any preferences how we could solve this? E.g., so that it is usable but maybe still compatible with previous behaviour?

Parry-Parry · 2024-07-30T11:22:06Z

@heinrichreimer @mam10eks So I assume the default pipe is stopwords, porter stemmer. This is always included in data.properties, so it shouldn't be an issue in the default case.

mam10eks · 2024-07-30T12:20:22Z

One possible suggestion could also be that we introduce a new PreTokenizedTerrierIndexContext that is a TerrierIndexContext and just overrides the termpipeline property?
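A rough sketch of what that subclass could look like. The base class below is only a stand-in so the example runs on its own. It is not the real TerrierIndexContext, and the `term_pipeline` property name and the lowercase-and-split default are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


class TerrierIndexContextStandIn:
    """Hypothetical stand-in for TerrierIndexContext (not the real class)."""

    @property
    def term_pipeline(self) -> Callable[[str], List[str]]:
        # Stands in for a default pipeline that normalizes the text;
        # here simply lowercasing before splitting on whitespace.
        return lambda text: text.lower().split()

    def terms(self, text: str) -> Sequence[str]:
        return self.term_pipeline(text)


class PreTokenizedTerrierIndexContext(TerrierIndexContextStandIn):
    """Same context, but the pipeline leaves pre-tokenized terms untouched."""

    @property
    def term_pipeline(self) -> Callable[[str], List[str]]:
        # Pre-tokenized input: split on whitespace only, no normalization.
        return lambda text: text.split()
```

Existing users of the default context would keep the previous behaviour, and only callers that opt into the pre-tokenized context would get the unnormalized terms.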

janheinrichmerker · 2024-08-12T10:55:02Z

I'd say it would be best to fix this in the PyTerrier backend here:

ir_axioms/ir_axioms/backend/pyterrier/__init__.py

Lines 229 to 234 in 4212946

    
           def terms( 
        
                   self, 
        
                   query_or_document: Union[Query, Document] 
        
           ) -> Sequence[str]: 
        
               text = self.contents(query_or_document) 
        
               return self._terms(text)

Is there a PyTerrier API to access the pre-tokenized terms given the document ID?

mam10eks self-assigned this Jul 30, 2024

mam10eks added a commit that referenced this issue Sep 16, 2024

Prepare usage of pre-tokenized index #50

3644225
