Problem:
The sumy module uses the nltk package for stemming and stop words, but nltk does not support some languages, e.g. Polish, out of the box.
Solution:
Stop words:
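Download a Polish stop words file (e.g. from here), rename it to polish.txt, and place it in the sumy stop words directory. For a per-user install under Python 3.10 that is ~/.local/lib/python3.10/site-packages/sumy/data/stopwords/polish.txt; the exact location depends on your Python version and on how sumy was installed.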
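Since the site-packages path differs between machines, the copy step can also be scripted. The following is a minimal sketch, not part of sumy: the helper names `sumy_stopwords_dir` and `install_stop_words` and the file name `stopwords-pl.txt` are made up for illustration, and it assumes the stop words live in `data/stopwords` inside the installed package, as in the path above.

```python
import shutil
from importlib.util import find_spec
from pathlib import Path


def sumy_stopwords_dir():
    """Locate data/stopwords inside the installed sumy package (hypothetical helper)."""
    spec = find_spec("sumy")
    if spec is None or spec.origin is None:
        raise RuntimeError("sumy is not installed in this environment")
    # spec.origin points at sumy/__init__.py; the data directory sits next to it.
    return Path(spec.origin).parent / "data" / "stopwords"


def install_stop_words(source, target_dir, language="polish"):
    """Copy a stop words file into target_dir as <language>.txt (hypothetical helper)."""
    target_dir = Path(target_dir)
    if not target_dir.is_dir():
        raise FileNotFoundError(f"stop words directory not found: {target_dir}")
    target = target_dir / f"{language}.txt"
    shutil.copyfile(source, target)
    return target


# Example call, assuming the downloaded file is named stopwords-pl.txt:
# install_stop_words("stopwords-pl.txt", sumy_stopwords_dir())
```

Resolving the directory from the package itself avoids hard-coding the Python version or the install prefix.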
Stemming:
Use the pystempel package, which provides a stemmer for the Polish language. Here’s the code:
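```python
from stempel import StempelStemmer
# sumy's own stemmer; skip this import if the file you are editing already has it
from sumy.nlp.stemmers import Stemmer


class CallableStemmer:
    """Wrap an object that has a .stem() method so it can be called like a function."""
    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        return self.stemmer.stem(word)


def get_stemmer(language):
    if language == 'pol':
        # Create a StempelStemmer object for Polish
        stemmer_obj = StempelStemmer.default()
        # Wrap it in a CallableStemmer
        return CallableStemmer(stemmer_obj)
    else:
        # For non-Polish languages, use the original Stemmer
        return Stemmer(language)
```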
Then in this section, in the handle_arguments function, replace the line where the stemmer is created with a call to get_stemmer.
This way, if the language is Polish, get_stemmer will return a CallableStemmer that wraps a StempelStemmer. For any other language, it will return the original Stemmer.
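The wrapper is needed because the stemmer is used by calling it directly on a word, while StempelStemmer only exposes a .stem() method. The sketch below shows that dispatch end to end without requiring sumy or pystempel to be installed. `FakePolishStemmer`, `default_stemmer` and `normalize_words` are hypothetical stand-ins invented for this example (the toy rule is not real Polish stemming), and the variable name `language` at the call site is an assumption.

```python
class FakePolishStemmer:
    """Hypothetical stand-in for StempelStemmer: has .stem() but is not callable."""
    def stem(self, word):
        # Toy rule for illustration only: strip a trailing "y".
        return word[:-1] if word.endswith("y") else word


class CallableStemmer:
    """Adapter that turns an object with .stem() into a callable."""
    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        return self.stemmer.stem(word)


def default_stemmer(word):
    """Hypothetical stand-in for the original Stemmer: just lowercases."""
    return word.lower()


def get_stemmer(language):
    if language == "pol":
        return CallableStemmer(FakePolishStemmer())
    return default_stemmer


def normalize_words(words, stemmer):
    """Mimics a consumer that only knows how to call stemmer(word)."""
    return [stemmer(word) for word in words]


language = "pol"
stemmer = get_stemmer(language)  # the call that replaces the direct stemmer creation
result = normalize_words(["koty", "dom"], stemmer)
print(result)
```

Because both branches return something callable, the code that consumes the stemmer does not need to know which language was selected.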
Credit for most of the code: MS Copilot aka Bing