Add Stopword Tala 2003, Lru Cache, and Support List Tokens #17

Open
wants to merge 35 commits into
base: master
Commits
426bb88
Update .travis.yml
har07 Jan 16, 2016
d97f73d
Update .travis.yml
har07 Jan 16, 2016
343493b
Update .travis.yml
har07 Jan 16, 2016
4e374d2
Create README.md
har07 Jan 16, 2016
983b092
Update .travis.yml
har07 Jan 16, 2016
2fc0a73
Create .coveragerc
har07 Jan 16, 2016
b905267
Update .travis.yml
har07 Jan 16, 2016
c7a05e3
Update .travis.yml
har07 Jan 16, 2016
450b513
Update .travis.yml
har07 Jan 16, 2016
0eb9f21
Update README.md
har07 Jan 16, 2016
e37f4c1
Update README.md
har07 Jan 16, 2016
900eb35
turn off travis-ci email notif
har07 Jan 16, 2016
3625027
Add Stopwords Tala 2003, Add lru_cache
MufidJamaluddin Mar 14, 2019
9890fcf
Test Stopword Tala
MufidJamaluddin Mar 14, 2019
7a55cbf
Boost Performance
MufidJamaluddin Mar 15, 2019
5630ad6
add stem word
MufidJamaluddin Mar 15, 2019
1d9554f
add stem & stopword removal from tokens/word list
MufidJamaluddin Mar 15, 2019
81b06a4
add python 3.7
MufidJamaluddin Mar 15, 2019
748e608
Merge branch 'development' into master
Mar 15, 2019
150a839
Minor
MufidJamaluddin Mar 15, 2019
99bfac5
Fix Error
MufidJamaluddin Mar 15, 2019
abccaca
Merge branch 'master' of https://github.com/MufidJamaluddin/PySastrawi
MufidJamaluddin Mar 15, 2019
edf2c81
fix error python 2.7
MufidJamaluddin Mar 15, 2019
a47d9b2
LruCache python 2.7
MufidJamaluddin Mar 15, 2019
58d35a7
minor
MufidJamaluddin Mar 15, 2019
345edd1
Fix critical bugs
MufidJamaluddin Mar 15, 2019
1a5f7d6
Travis for Python 3.7
MufidJamaluddin Mar 15, 2019
ae3bc91
add test case
MufidJamaluddin Mar 15, 2019
9fc1b3e
Add Test Case
MufidJamaluddin Mar 15, 2019
15fe5d6
Test Case
MufidJamaluddin Mar 15, 2019
3e4151a
Define Abstract Method & Update Test Case
MufidJamaluddin Mar 15, 2019
6d9fd87
Minor
MufidJamaluddin Mar 15, 2019
8bfc448
LruCache
MufidJamaluddin Apr 19, 2019
3470898
minor
MufidJamaluddin Apr 19, 2019
169edcf
remove lrucache stemword
MufidJamaluddin Apr 19, 2019
9 changes: 8 additions & 1 deletion .travis.yml
@@ -5,11 +5,18 @@ python:
   - "3.4"
   - "3.5"
 sudo: false
+# Enable 3.7 without globally enabling sudo and dist: xenial for other build jobs
+matrix:
+  include:
+    - python: 3.7
+      dist: xenial
+      sudo: true
 install:
   - pip install python-coveralls
   - pip install coveralls
+  - pip install cachetools
 script: nosetests tests --verbose --with-coverage
 after_success:
   - coveralls
 notifications:
-  email: false
+  email: false
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
+{
+    "python.linting.pylintEnabled": true
+}
60 changes: 3 additions & 57 deletions README.md
@@ -1,58 +1,4 @@
-Sastrawi Python
-===============
+# sastrawi
+Indonesian stemmer. Python port of PHP Sastrawi project.

-Sastrawi Python is a simple Python library which allows you to reduce inflected words in the Indonesian language (Bahasa Indonesia) to their base form ([stem](http://en.wikipedia.org/wiki/Stemming)).
-This is a Python port of the original [Sastrawi](https://github.com/sastrawi/sastrawi) project written in PHP (credit goes to the original author and contributors of Sastrawi PHP).
-
-
-[![Build Status](https://travis-ci.org/har07/PySastrawi.svg?branch=master)](https://travis-ci.org/har07/PySastrawi)
-[![Coverage Status](https://coveralls.io/repos/github/har07/PySastrawi/badge.svg?branch=master)](https://coveralls.io/github/har07/PySastrawi?branch=master)
-[![PyPI version](https://badge.fury.io/py/PySastrawi.svg)](https://badge.fury.io/py/PySastrawi)
-
-Installation
-------------
-
-Sastrawi can be installed using [pip](https://docs.python.org/3.6/installing/index.html) by running the following command in a terminal/command prompt: `pip install PySastrawi`
-
-Usage
------
-
-Run the following lines of code in a *Python interactive terminal*:
-
-```python
-# import StemmerFactory class
-from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
-
-# create stemmer
-factory = StemmerFactory()
-stemmer = factory.create_stemmer()
-
-# stemming process
-sentence = 'Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan'
-output = stemmer.stem(sentence)
-
-print(output)
-# ekonomi indonesia sedang dalam tumbuh yang bangga
-
-print(stemmer.stem('Mereka meniru-nirukannya'))
-# mereka tiru
-```
-
-Demo
---------
-
-Live demo URL : https://pysastrawi-demo.appspot.com/
-
-Repository : https://github.com/har07/pystastrawi-demo
-
-License
--------
-
-Sastrawi Python is licensed under the MIT License (MIT).
-
-This project includes a base-word dictionary derived from Kateglo, licensed under [CC-BY-NC-SA 3.0](http://creativecommons.org/licenses/by-nc-sa/3.0/).
-
-Further Information
--------------------
-
-- [Sastrawi PHP Repository page](https://github.com/sastrawi/sastrawi)
+[![Coverage Status](https://coveralls.io/repos/har07/sastrawi/badge.svg?branch=development&service=github)](https://coveralls.io/github/har07/sastrawi?branch=development)
20 changes: 11 additions & 9 deletions src/Sastrawi/Dictionary/ArrayDictionary.py
@@ -1,10 +1,17 @@
-class ArrayDictionary(object):
+from Sastrawi.Dictionary.DictionaryInterface import DictionaryInterface
+
+class ArrayDictionary(DictionaryInterface):
     """description of class"""

     def __init__(self, words=None):
-        self.words = {}
-        if words:
+        if words is None:
+            self.words = {}
+        elif type(words) is dict:
+            self.words = words
+        elif type(words) is list:
             self.add_words(words)
+        else:
+            self.words = {}

     def contains(self, word):
         return word in self.words
@@ -20,9 +27,4 @@ def add(self, word):
         """Add a word to the dictionary"""
         if not word or word.strip() == '':
             return
-        self.words[word]=word
-
-
-
-
-
+        self.words[word] = word
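Review note on the constructor: if the initial `self.words = {}` assignment is among the removed lines, as the addition and deletion counts for this file suggest, then the `list` branch calls `add_words` before `self.words` exists. Assuming `add_words` still delegates to `add` (that part of the file is not shown in the diff), this would raise `AttributeError`. A minimal sketch of one way to avoid that, using a hypothetical stand-in class named `WordDictionary` rather than the PR's `ArrayDictionary`:

```python
class WordDictionary(object):
    """Hypothetical stand-in for ArrayDictionary; not the PR's code."""

    def __init__(self, words=None):
        # Initialise storage first so every branch below can rely on it.
        self.words = {}
        if isinstance(words, dict):
            self.words = words
        elif isinstance(words, (list, tuple, set)):
            self.add_words(words)

    def contains(self, word):
        return word in self.words

    def add_words(self, words):
        for word in words:
            self.add(word)

    def add(self, word):
        # Skip empty or whitespace-only entries, as the PR's add() does.
        if not word or word.strip() == '':
            return
        self.words[word] = word


d = WordDictionary(['makan', '', 'minum'])
print(d.contains('makan'))
print(len(d.words))
```

Using `isinstance` instead of `type(...) is list` also accepts subclasses and other common containers; whether that is desirable here is a design choice for the maintainers.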
11 changes: 9 additions & 2 deletions src/Sastrawi/Dictionary/DictionaryInterface.py
@@ -1,5 +1,12 @@
-class DictionaryInterface(object):
+# @update_by Mufid Jamaluddin
+# @update_date 16/03/2019
+
+from abc import ABCMeta, abstractmethod
+
+class DictionaryInterface:
     """Interface definition of dictionary"""
+    __metaclass__ = ABCMeta

+    @abstractmethod
     def contains(self, word):
-        raise NotImplementedError('you must implement this method manually')
+        pass
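Review note on `__metaclass__ = ABCMeta`: this class attribute is only honoured by Python 2. On Python 3 it is ignored, so `@abstractmethod` is not enforced and the interface can be instantiated directly. Since the PR targets both 2.7 and 3.x, one portable option is to build the base class by calling the metaclass. The sketch below uses hypothetical names (`PortableABC`, `DictionaryLike`, `SetDictionary`) and is not taken from the PR:

```python
from abc import ABCMeta, abstractmethod

# Calling the metaclass directly builds a base class that is abstract on
# both Python 2.7 and Python 3, without relying on __metaclass__.
PortableABC = ABCMeta('PortableABC', (object,), {})


class DictionaryLike(PortableABC):
    @abstractmethod
    def contains(self, word):
        """Return True if word is in the dictionary."""


class SetDictionary(DictionaryLike):
    def __init__(self, words):
        self._words = set(words)

    def contains(self, word):
        return word in self._words


# Instantiating the abstract class should fail; the subclass should work.
try:
    DictionaryLike()
    enforced = False
except TypeError:
    enforced = True

print(enforced)
print(SetDictionary(['buku']).contains('buku'))
```

The same pattern would apply to `ContextInterface` and `RemovalInterface`, which use the same `__metaclass__` attribute.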
19 changes: 0 additions & 19 deletions src/Sastrawi/Stemmer/Cache/ArrayCache.py

This file was deleted.

13 changes: 0 additions & 13 deletions src/Sastrawi/Stemmer/Cache/CacheInterface.py

This file was deleted.

Empty file.
27 changes: 0 additions & 27 deletions src/Sastrawi/Stemmer/CachedStemmer.py

This file was deleted.

3 changes: 1 addition & 2 deletions src/Sastrawi/Stemmer/Context/Context.py
@@ -160,5 +160,4 @@ def restore_prefix(self):

         for removal in self.removals:
             if removal.get_affix_type() == 'DP':
-                self.removals.remove(removal)
-
+                self.removals.remove(removal)
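Review note: this hunk only touches whitespace, but the loop it shows removes items from `self.removals` while iterating over it, which skips the element that follows each removed one. Whether two consecutive 'DP' removals occur in practice is not established by the diff. The sketch below illustrates the pitfall and a filtering alternative; `FakeRemoval` and `drop_derivational_prefixes` are hypothetical names that model only `get_affix_type()`:

```python
class FakeRemoval(object):
    """Hypothetical removal record; only get_affix_type() is modelled."""

    def __init__(self, affix_type):
        self.affix_type = affix_type

    def get_affix_type(self):
        return self.affix_type


def drop_derivational_prefixes(removals):
    # Build a new list instead of mutating the one being iterated.
    return [r for r in removals if r.get_affix_type() != 'DP']


removals = [FakeRemoval('DP'), FakeRemoval('DP'), FakeRemoval('DS')]

# Same loop shape as in the hunk: the second 'DP' entry is skipped.
mutated = list(removals)
for r in mutated:
    if r.get_affix_type() == 'DP':
        mutated.remove(r)

print([r.get_affix_type() for r in mutated])
print([r.get_affix_type() for r in drop_derivational_prefixes(removals)])
```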
24 changes: 17 additions & 7 deletions src/Sastrawi/Stemmer/Context/ContextInterface.py
@@ -1,30 +1,40 @@
-class ContextInterface(object):
-    """description of class"""
+# @update_by Mufid Jamaluddin
+# @update_date 16/03/2019
+
+from abc import ABCMeta, abstractmethod
+
+class ContextInterface:
+    """description of abs class"""
+    __metaclass__ = ABCMeta

+    @abstractmethod
     def getOriginalWord(self):
         pass

+    @abstractmethod
     def setCurrentWord(self, word):
         pass

+    @abstractmethod
     def getCurrentWord(self):
         pass

+    @abstractmethod
     def getDictionary(self):
         pass

+    @abstractmethod
     def stopProcess(self):
         pass

+    @abstractmethod
     def processIsStopped(self):
         pass

+    @abstractmethod
     def addRemoval(self, removal):
         pass

+    @abstractmethod
     def getRemovals(self):
-        pass
-
-
-
-
+        pass
13 changes: 12 additions & 1 deletion src/Sastrawi/Stemmer/Context/RemovalInterface.py
@@ -1,18 +1,29 @@
-class RemovalInterface(object):
+# @update_by Mufid Jamaluddin
+# @update_date 16/03/2019
+
+from abc import ABCMeta, abstractmethod
+
+class RemovalInterface:
     """description of class"""
+    __metaclass__ = ABCMeta

+    @abstractmethod
     def get_visitor(self):
         pass

+    @abstractmethod
     def get_subject(self):
         pass

+    @abstractmethod
     def get_result(self):
         pass

+    @abstractmethod
     def get_removed_part(self):
         pass

+    @abstractmethod
     def get_affix_type(self):
         pass

11 changes: 11 additions & 0 deletions src/Sastrawi/Stemmer/Stemmer.py
@@ -2,6 +2,7 @@
 from Sastrawi.Stemmer.Context.Visitor.VisitorProvider import VisitorProvider
 from Sastrawi.Stemmer.Filter import TextNormalizer
 from Sastrawi.Stemmer.Context.Context import Context
+from cachetools import cached, LRUCache

 class Stemmer(object):
     """Indonesian Stemmer.
@@ -35,6 +36,16 @@ def stem_word(self, word):
         else:
             return self.stem_singular_word(word)

+    # Stemming word in Tokens
+    # @author Mufid Jamaluddin <[email protected]>
+    def stem_tokens(self, tokens):
+        stemmed_tokens = []
+        for token in tokens:
+            if not token or token.strip() == '':
+                continue
+            stemmed_tokens.append(self.stem_word(token))
+        return stemmed_tokens
+
     def is_plural(self, word):
         #-ku|-mu|-nya
         #nikmat-Ku, etc
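Review note on `stem_tokens`: because empty and whitespace-only tokens are skipped, the returned list can be shorter than the input, so callers should not assume the two lists stay aligned by index. A minimal sketch of that behaviour, using a hypothetical `ToyStemmer` whose `stem_word` merely lowercases (the real stemmer does far more):

```python
class ToyStemmer(object):
    """Hypothetical stand-in; stem_word only lowercases its input."""

    def stem_word(self, word):
        return word.lower()

    def stem_tokens(self, tokens):
        # Same control flow as the method added in this PR.
        stemmed_tokens = []
        for token in tokens:
            if not token or token.strip() == '':
                continue
            stemmed_tokens.append(self.stem_word(token))
        return stemmed_tokens


tokens = ['Makan', '', '  ', 'Minum']
result = ToyStemmer().stem_tokens(tokens)
print(result)
print(len(tokens), len(result))
```

If positional alignment matters to callers, an alternative would be to keep a placeholder for skipped tokens; the PR does not say which behaviour is intended.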
38 changes: 16 additions & 22 deletions src/Sastrawi/Stemmer/StemmerFactory.py
@@ -1,43 +1,37 @@
 import os
+from cachetools import cached, LRUCache
 from Sastrawi.Dictionary.ArrayDictionary import ArrayDictionary
 from Sastrawi.Stemmer.Stemmer import Stemmer
-from Sastrawi.Stemmer.CachedStemmer import CachedStemmer
-from Sastrawi.Stemmer.Cache.ArrayCache import ArrayCache

 class StemmerFactory(object):
     """ Stemmer factory helps creating pre-configured stemmer """
     APC_KEY = 'sastrawi_cache_dictionary'

     def create_stemmer(self, isDev=False):
         """ Returns Stemmer instance """
+        if isDev:
+            words = self.get_words_from_file()
+            dictionary = ArrayDictionary(words)
+        else:
+            dictionary = self.get_prod_words_dictionary()

-        words = self.get_words(isDev)
-        dictionary = ArrayDictionary(words)
         stemmer = Stemmer(dictionary)

-        resultCache = ArrayCache()
-        cachedStemmer = CachedStemmer(resultCache, stemmer)
-
-        return cachedStemmer
+        return stemmer

-    def get_words(self, isDev=False):
-        #if isDev or callable(getattr(self, 'apc_fetch')):
-        # words = self.getWordsFromFile()
-        #else:
-        # words = apc_fetch(self.APC_KEY)
-        # if not words:
-        # words = self.getWordsFromFile()
-        # apc_store(self.APC_KEY, words)
-        return self.get_words_from_file()
+    @cached(cache=LRUCache(maxsize=32))
+    def get_prod_words_dictionary(self):
+        words = self.get_words_from_file()
+        dictionary = ArrayDictionary(words)
+        return dictionary

     def get_words_from_file(self):
         current_dir = os.path.dirname(os.path.realpath(__file__))
         dictionaryFile = current_dir + '/data/kata-dasar.txt'

         if not os.path.isfile(dictionaryFile):
             raise RuntimeError('Dictionary file is missing. It seems that your installation is corrupted.')

-        dictionaryContent = ''
+        text = ''
         with open(dictionaryFile, 'r') as f:
-            dictionaryContent = f.read()
-
-        return dictionaryContent.split('\n')
+            text = f.read()
+        return text.split('\n')
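Review note on the `@cached` decorator: as far as I can tell, cachetools builds its default cache key from all positional arguments, `self` included. If so, two separate `StemmerFactory()` instances would not share the cached dictionary, and the cache would keep each factory alive. A minimal sketch of an alternative that caches per file path at module level, using only the standard library so it also runs on Python 2.7. The names `_DICTIONARY_CACHE`, `load_words`, `get_shared_words` and `fake_loader` are hypothetical and not part of the PR:

```python
# Module-level cache keyed by file path, shared by every caller.
_DICTIONARY_CACHE = {}


def load_words(path):
    """Read one word per line, dropping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def get_shared_words(path, loader=load_words):
    """Load the word list at most once per path, whichever factory asks."""
    if path not in _DICTIONARY_CACHE:
        _DICTIONARY_CACHE[path] = loader(path)
    return _DICTIONARY_CACHE[path]


# Demonstration with a counting loader so no real file is needed.
calls = []


def fake_loader(path):
    calls.append(path)
    return ['makan', 'minum']


first = get_shared_words('kata-dasar.txt', loader=fake_loader)
second = get_shared_words('kata-dasar.txt', loader=fake_loader)
print(first is second)
print(len(calls))
```

This swaps the LRU cache for a plain dictionary, which seems sufficient when only one dictionary file is ever loaded; an LRU policy would matter only if many different paths were cached.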
8 changes: 5 additions & 3 deletions src/Sastrawi/StopWordRemover/StopWordRemover.py
@@ -14,6 +14,8 @@ def remove(self, text):

         return ' '.join(stopped_words)

-
-
-
+    # Remove Stopword in Tokens
+    # @author Mufid Jamaluddin <[email protected]>
+    def remove_tokens(self, tokens):
+        clean_tokens = [token for token in tokens if not self.dictionary.contains(token)]
+        return clean_tokens
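Review note on `remove_tokens`: tokens are compared against the dictionary verbatim, so a capitalised token would survive unless the caller normalises case first. This assumes the stopword dictionary stores lowercase entries, which the diff does not show. A minimal sketch with a hypothetical `TinyStopwords` class and a free function mirroring the new method:

```python
class TinyStopwords(object):
    """Hypothetical dictionary exposing only contains()."""

    def __init__(self, words):
        self._words = set(words)

    def contains(self, word):
        return word in self._words


def remove_tokens(dictionary, tokens):
    # Same list comprehension as the method added in this PR.
    return [token for token in tokens if not dictionary.contains(token)]


stopwords = TinyStopwords(['yang', 'di', 'dan'])
tokens = ['Yang', 'makan', 'di', 'rumah']

print(remove_tokens(stopwords, tokens))
print(remove_tokens(stopwords, [t.lower() for t in tokens]))
```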