All notable changes to this project will be documented in this file.
- Force mwparserfromhell to use version older than 0.6.5 as there was a memory leak solved. earwig/mwparserfromhell#303
- Switch yamlconf to wmf maintained fork.
- Bump tabulate and numpy versions.
- Update (temporarily) yamlconf's version to a git commit while we wait for halfak/yamlconf#8
- Bump numpy version to 1.24 and fix failing test (numpy.float has been deprecated)
- Loosen upper bound restriction for scipy
- Add github actions for CI with debian bullseye image
- Updated dictionaries that have a hunspell package in debian. hunspell is a "superset" of myspell.
- Add new github action that pushes to PYPI - no code relevant changes
- Allow mwapi up to 0.6.x in requirements.txt
- Fixed unit tests for Hindi Wikipedia.
- Add MWApiCache to the API extractor to bypass the default mwapi usage. Useful when revscoring is used in frameworks (like KServe) where asyncio/tornado and co-routines are needed.
- Allow numpy versions 1.19.x.
- No-op release, the artifact uploaded for 2.11.3 may not be the correct one, hence a new release to avoid issues. The wheel was compiled with Python 3.9 that should work just fine, but since the project supports only Python 3.7 we want to be extra careful. The new wheel has been created from a Python 3.7 venv.
- No-op release, the artifact uploaded for 2.11.2 may not be the correct one, hence a new release to avoid issues (first time publishing a revscoring artifact for a new uploader).
- Improved hindi language assets
- Improved error message when decoding a non-item revision from Wikidata
- revscoring score utility now works with rev_docs
- wikitext.revision.list_items feature for counting list items in a page
- Removes CJK tokenization. Needs more work.
- Adds explicit support for CJK tokenization. See revscoring.features.wikitext.revision.cjk.tokens
- Updated deltas to 0.5.1 to address regression in ref singleton matching
- revscoring.languages.english.idioms is now an order of magnitude faster
- Support for python 3.7 and 3.8
- Code example to use revscoring.Model
- Depends on scikit-learn-0.22.1
- Removes flake8 as an install dependency
- languages.portuguese.words_to_watch
- fetch_text was missing drvslots param
- revscoring.languages.engligh.idioms
- textstat
- Explicitly use myspell-pt-(pt|br)
- Pin sphinx to 2.4.4 to manage m2r compatibility
- Handles revision slots in fetch_text utility
- revscoring.datasources.meta.filters.not_none -- Filters a list for None elements.
- revscoring.features.wikitext.revision.sections -- Returns mwparserfromhell.Wikicode objects of every section in a page.
- Reference to english idioms data file
- English language idioms
- tag_str and template_str datasources in features.wikitext.revision
- Support for native gensim vectors and mmaps
- word2vec constructor is reverted back to old behavior for memory usage reasons
- word2vec is generalized and constructor now takes keyed_vectors
- Bumps gensim version to 3.8.1 to deal with smart_open warning
- Fixes name of german language utils
- Rates are now properly formatted when labels are long or numerous
- word2vec generator now yields nothing rather than zero'd vectors when a word isn't found.
- Adds explicit multi-dictionary support to English and German
- Minor fix to serbian badwords
- Add "sudo" to installation commands in readme
- Minor fix to English regexes
- revscoring.Model.model_info now has metrics sorted in label-order
- Added release automation to PyPI via TravisCI.
- Added CHANGELOG.md
- Added revscoring.languages.basque (Minimal dictionary support)
- Added to feature_csv utility.
- Pin sklearn to 0.20.3
- Bumped more-itertools requirement to 7.2.x
- Added a specific feature for reference claims in Wikibase sources.
- Updated the versions of numpy,scipy & scikit-learn - @urstrulykkr
- No dupes in `trim()` function.
- Count ref statements instead of ref claims in Wikibase sources.
- Minor fix to setup.py for PyPI distribution.
- Badwords, informals, and `words_to_watch` for Chinese.
- Pre-processing to regex matches so that we can have traditional Chinese converted to simplified.
- Better error logging for `cv_train`.
- Updated language assets for Dutch.
- Added the part for running tests to README.
- Moved tests out of production code.
- Move pytest out of requirements.txt
- Upgrade travis image to xenial.
- Fixed order of imports / isort.
- Bumped yamlconf version to 0.2.4.
- Removed python 3.4 from travis build.
- Turn members of dependents to string value.
- Use `sklearn.model_selection` instead of `sklearn.cross_validation`.
- Send `all_dependents` as a list, make it a set everywhere else.
- Properly set Datasource/Feature names for b/c aliases.
- Make basic datasources all json-serializable.
- Fix revscoring extract for users with 2FA.
- Handling for schema issues for multilabel and "boolean".
- Added Galician language assets.
- Added enwiki words to watch.
- README: Fix apt-get command block
- Fixed meta feature naming for sum/max/etc.
- API changes to use the new MCR-aware params and format.
- Updated README.md formatting for language installs.
- Started using `mwbase` instead of `pywikibase` in wikibase-related datasources.
- usage string and API fixes
- Fixed score schema for ProbabilityClassifier.
- Handles 'nazism' as an english badword.
- Use `--labels` in tune.
- revscoring.features.meta.aggregators handles numpy array.
- Word vectors - emit a null vector on empty string.
- Restricts numpy and scipy requirements to higher versions and sorts requirements.txt.
- Bumps scikit-learn's requirement to 0.19.x.
- Added intersection utility.
- Minor updates to spanish.py formatting.
- Use global vectors for better multiprocessing.
- Upgrade mwph to v0.5.x
- Fixed a bug in `load_kv` for Word2Vec.
- Removed `english_vectors`.
- Word2Vec features
- Major refactoring to eliminate sklearn's native multilabel classification and introduce per label binary classification.
- List of sklearn's binary classifiers are stored in Revscoring's classifier wrapper, one for each label.
- Tests modified for multilabel.
- Added Icelandic language support.
- Multilabel random forest.
- Added Catalan language assets.
- Fast scoring for fast cross validation.
- Python 3 is now a hard requirement.
- Row formatting for long label names.
- Support for specifying label weights in multilabel classification.
- Use nltk stopwords library instead of homemade list wherever possible.
- Selectors: Count documents, labels per document instance, not per token.
- Fixed label type issue in label-config.
- Use pytest-cov to collect coverage.
- Ignore test-like functions using tox.ini
- Migrate from nosetests to pytests.
- Updated nltk download instructions to include additional corpora.
- Handle ModelInfo lookup error.
- Key pattern in `format_str` for ModelInfo.
- Additional tests to filters and mappers.
- Limit memory usage of threshold statistics.
- Use `__slots__` in ScaledClassificationMatrix to reduce memory usage.
- Added `threshold_ndigits` to classification params.
- Added Serbian language assets.
- Natively handle bz2 compressed model files.
- Bosnian language assets.
- Changes uk dict recommendation to aspell (from myspell).
- Fixed sorting in tune utility.
- Use tqdm (progress bar) in extract utility.
- Context manager to close model file after load.
- Removes setuptools requirement.
- Croatian language support.
- Better handling of `label_weights` in Linear model.
- Fixed `tune` utility bug T174704.
- Latvian language support
- Implemented info paths in `model_info` utility.
- Adds label information to model_info "score_schema".
- Allow stats in ThresholdOptimization to include exclamation points.
- Fixed arg parsing issues in `model_info` utility.
- Added documentation to ScaledThresholdStatistics test.
- Implemented access to thresholds. - @mdew192837
- Fixed model info formatting bug in `cv_train`.
- Fixed test for ScaledThresholdStatistics.
- Updated ScorerModel references in README. - @profgiuseppe
- Fixed `python setup.py upload_docs` call for pypi.
- New utility `union_merge_observations`.
- Included ModelInfo to docs and cleans up some references.
- Extends tests for `model_info` and `score_processor`.
- Fixed typo in module path.
- Fixed main languages documentation.
- Added thresholds class.
- Added tests for threshold optimization.
- Added threshold optimizations to `cv_train` utility.
- Added logistic regression to basic model documentation.
- Total overhaul of scorers and statistics.
- Updated `tune` utility to use the new statistics.
- Centralized statistic pattern parsing.
- Moved `OrderedDict` to `_data` attr in `ModelInfo`.
- Moved to `model_info` pattern.
- Minor fix to utilities to fix tests.
- AUC metrics
- Fixed model's composition strategy.
- Removed statistics schema.json
- Bumped requirement for mwtypes to include 0.3.x
- Disabled tamil dictionary.
- Added Albanian language assets.
- Tweak lol regexp + use same regexp for portuguese.py
- Disabled Bengali dictionary.
- Fixed typo in `BernoulliNB()`
- Put missing options where docopt can find them.
- Added mysqltsv to requirements.txt
- Fixed old reference to README.rst in setup.py.
- Fixed pathological backtracking regexp
- Added ascii transliterations to Tamil badwords.
- Added `CODE_OF_CONDUCT.md`
- Bumped pytz requirement to 2017.2
- Convert README from rst to .md
- Included another German badword.
- Use trusty image for travis build.
- Minor fix to `extractors.regex` so that it can handle backwards compatibility.
- Minor fixes to regex exclusions.
- Greek language assets.
- Regex exclusion strategy for RegexMatches.
- Added Bengali language assets.
- Implemented Flesch readability complexity.
- Bumped deltas to 0.4.6
- included bash-command for language install
- Fixed issue caused by old deltas library
- Korean language assets and tests.
- Fixed naming issues with datasources.meta.filters
- Fixed lower() to work for Turkish chars.
- Adds datasources.meta.mappers.derepeat token processor.
- Adds datasources.meta.mappers.de1337 token processor.
- Bumped deltas requirement to 0.4.5.
- Test for gramming.
- Included tests for finnish
- Finnish dict dependency.
- Finnish language assets
- Extends Estonian language assets.
- Added worker count param to `cv_train`.
- Better link for enchant docs.
- Fix broken link in README
- Added Romanian language assets.
- Added about.py file for tracking metadata.
- Fixes `rev_doc` injection bug in `api.Extractor`.
- Added `fetch_text` utility.
- Added shuffling to cross-validation.
- Updated revscoring.py docstrings to be consistent.
- Added basic sentence datasources.
- Normalize NaN, Inf, and -Inf in JSON formatting.
- Bumped deltas requirement to 0.4.x
- Removed `demo_load_model` script.
- Cross-validation support for models.
- Added `recall_at_precision` metric.
- Minor fixes to tune utility
- Fixes `recall_at_fpr` metric.
- Added `FeatureVector` class.
- Added `vectorizers.vectorize()` method.
- Added term frequency gramming, hashing and selection.
- Updated `SklearnClassifier` to handle `FeatureVector`.
- Moved `extract_features` to `extract` (a more general name).
- Generalized feature extraction pattern.
- Included Cache & JSON style for utilities.
- OSX install instructions in README.
- Tamil language utilities
- Updated hashing vectorizer example with a `feature_importance` histogram.
- README usage updated to reflect renamed extractor.
- Tests for API extractor datasources.
- Minor cache preservation issue in `dependencies.solve()`.
- Czech language assets
- Norwegian language utilities
- Prepend local dir on path for all utilities that use classpaths.
- Minor fix to OfflineExtractor's use of caches.
- Initialize caches in extractor.
- Swedish language utilities.
- Improved performance on Persian regexes.
- Fixed test for `time_since_registration` for anons.
- Improved English badwords regexes.
- Intermediate filtered stage for mwparserfromhell wikicode.
- Makes cache preserve through `extractor.extract()` and `context.solve()`
- Sped up dictionary features by ~2
- `cpu` and `IO` config for ScoreProcessor & score utility.
- Added Hungarian language support.
- Cleaned verbose of `feature_extractor` for deleted stuff
- Made score utility faster.
### Fixed
- Fixed extractor error.
- Added the possibility to log in to the feature extractor.
- Implemented feature profiling
- Added Russian language utility.
- Added hashing vectorizer example notebook.
- Language assets for Hindi.
- Updated regexes to match new tests.
- Missing commas and additional entries for English language tests.
- Updated LICENSE due to a mistake from an old copy-paste.
- Minor fix to docs for regex matcher
- Fixed feature extraction for page creation revisions.
- Added a basic test for `solve()`.
- Extends scipy requirement back to 0.13.3
- Minor fixes to `regex_matches` feature names.
- Japanese language support.
- Modifications to regex extraction to not use word boundary chars sometimes.
- Fixed cache usage in extraction.
- yamlconf requirement range error.
- Bumped yamlconf requirement to > 0.1.0
- Changes `cache` argument to `caches` in api.Extractor.
- Added Amir to README.
- Handling of revision caches to API extractor.
- Fixed minor issue making f1 test statistic unavailable.
- F1 test statistic generator.
- Random print to stdout in `scorer_models.util`.
- Added tests for model utilities.
- Added `balanced_sample` option to all scorer models.
- Typo in Feature Engineering notebook.
- 'NoneType' has no len() when processing `longest_repeated_char`.
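
For context on the `longest_repeated_char` entry above, here is a minimal sketch of a None-tolerant version of such a computation. The function name `longest_repeated_char_length` and the choice to return 0 for missing text are assumptions made for illustration; this is not taken from the revscoring source.

```python
from itertools import groupby
from typing import Optional


def longest_repeated_char_length(text: Optional[str]) -> int:
    """Return the length of the longest run of a single repeated character.

    Hypothetical helper: treats None and the empty string as "no run" (0)
    instead of raising a TypeError or ValueError.
    """
    if not text:  # covers both None and ""
        return 0
    # groupby yields one group per run of consecutive identical characters
    return max(len(list(run)) for _, run in groupby(text))


print(longest_repeated_char_length("heeeello"))
print(longest_repeated_char_length(None))
```

Returning 0 rather than raising keeps the value usable as a numeric feature when a revision has no text.
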
First official 1.x release!
- Added proportional features to wikibase.
- Basic `revision_oriented` features implemented and tested.
- Added `GradientBoosting` scorer model
- Added Polish language support and tests.
- Added Arabic language support.
- Added `DependentSet` to dependencies.
- Added documentation for languages.
- Updated IPython notebooks.
- Fixed documentation around test statistics and base `revision_oriented` features.
- Included "sup" to informals for English features.
- Handled `CaughtDependencyError` error.
- Added `ref_tags` to `wikitext.revision`.
- Bumped pywikibase requirement to 0.0.4
- Updated documentation for features.
- Applied new API Extractor to utilities.
- Improved test coverage
- Minor generalization to hebrew's test cases for older dictionaries.
- Fixed typo in wikibase tests. alan (touring --> turing)
- Refactored entire feature structure.
- Added pywikibase to requirements.
- Moved tokenized.delta to tokenized.diff.
- User.name --> User.text
- Moved basic regex operations.
- Added token frequencies to datasources/meta.
- Persian badwords explained.
- Added `revision.diff` to datasources.
- `revision_oriented` uses `DependentSet` now.
- Improved Documentation.
- Fixed a minor bug in accuracy calculation when using scaling or centering in a scorer model.
- Removed backtracking requirements from english badwords and informals. Increased performance X3
- Removed old backwards compatible ...Model classes from `scorer_models`
- Tests for more Portuguese badwords
- Switched to using the `average_precision` scorer when `pr_auc` is required.
- Removes 'ha' from italian informals and add test to be sure.
- Fixed issues in offline extractor.
- Adds `pr_auc_score` to metrics available for tuning.
- Improved comments and added variations for a few Portuguese bad words.
- Adds yamlconf to requirements.txt
- Added `OfflineExtractor` to extractors.
- Section about "changelog" vs "CHANGELOG".
- Fixed logging and table formatting in `tune` utility.
- Fixed merge issue in `tune` utility.
- Adds `cv-timeout` option to tune utility.
- Updated revscoring utility to list `model_info` and tune utilities.
- Added error handling to cross-validation.
- Updated README w/ badges and source highlighting.
- Removed `import_from_path` issue in `extract_features` utility.
- Fixed test for sklearn classifier.
- Minor fix in Bernoulli spelling.
- Fixed minor issue in svc params config.
- Minor fixes to tuning.
- Switched tuning utility to use multiprocessing directly.
- Cleanup to tuning utility and add config files for each classifier's param space.
- Removed old ptwiki config files that were never used.
- Added estonian and ukrainian languages
- Switches to aspell packages for et and uk in travis
- Replaces duplicate myspell-de-ch with myspell-de-de in README.rst
- TravisCI errors
- Fixed binary operators.
- Added `and/or/not` operators to features
- Fixes enchant link in README.rst
- Adds `--test-prop` param to `train_test` utility.
- Adds a Dockerfile for building an image that will run an ipython notebook to build a revscoring project.
- Adds a `trim()` function for reducing a `feature_list` to its basic 'features' -- a prerequisite for wikimedia/ores#100
- Adds basic language features for Dutch, German and Italian.
- Widens version requirements for scipy and numpy to make compiling dependencies from source less common.
- Substantial improvements to documentation. Now using 'alabaster' theme and simplified examples.
Since 0.5.0:
- Adds batching to Feature extraction for more speed.
- Adds wheel support
- Adds `model_info` storage & utility
- Improved error reporting in api extractor
- Change to configuration -- APIExtractor now requires host instead of url
- Silences a utf16 encoding warning in enchant
- Fixes an issue with looking up user info in APIExtractor
- Fix travis builds, add coverage to reports
- Drops `mediawiki-utilities` in favor of `mwapi` and `mwtypes`
This release represents a major backwards incompatibility
- Languages as feature sets.
- Codebase is now PEP8 compliant.
- APIExtractor can't find language utilities
- Extended badwords and adds informal words for Persian language
- Fixes pickling issue with languages (see #159)
- Synchronizes dependency versions with https://github.com/wiki-ai/ores and generalizes both.
- Improved installation instructions
- Selective language imports (no need to download all the dictionaries anymore)
- Math domain error when processing imported revisions (`user.age`).
- Handle `RevisionDocumentNotFound` when scoring new pages
- Adds Vietnamese support
- Feature `user.is_bot` errors out with None 'groups'.
- Adds spanish language utilities
- Adds informal words utility to English and Spanish languages
- Converts English language badwords detection to regex based strategy.
- Adds 'indonesian' language (thanks @kenrick95!)
- Added `balance_labels` arg to constructor of SVC models.
- Improves formatting of `train_test` results (and implements one vs. rest ROC for multiclass models)
- Move "scorer" out of the library (and into ORES).
- Completed documentation (see http://pythonhosted.org/revscoring)
- Implemented a refactoring for the 'dependencies'.
- Also implements some new functions like dig() and expand()
- Specify version of scipy same as in Ubuntu Trusty
- New `Features` and `Datasources`.
- Better performance.
- Test completeness.
- Batch feature extraction
- Fix bug where `max()` arg is an empty sequence.
- Added minimal implementation of Turkish language
- Improved behavior of MLScorer and MLScorer model.
- Improved README.
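
For context on the `max()` entry above: Python's built-in `max()` raises `ValueError` when given an empty sequence unless a `default` is supplied. A minimal sketch of guarding against that, assuming a fallback value of 0.0 is acceptable (the helper `safe_max` is hypothetical and is not how revscoring necessarily fixed the bug):

```python
from typing import Iterable


def safe_max(values: Iterable[float], default: float = 0.0) -> float:
    """Return the largest value, or `default` when `values` is empty.

    Hypothetical helper: relies on the `default` keyword of the built-in
    max(), available since Python 3.4.
    """
    return max(values, default=default)


print(safe_max([3, 7, 5]))
print(safe_max([]))
```
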
This is an early release of the revscoring library.
- Added revscoring.datasources (Datasource, 20 implemented)
- Added revscoring.extractors (APIExtractor)
- Added revscoring.features (Feature, 42 implemented)
- Added revscoring.languages (english & portuguese)
- Added revscoring.scorers (MLScorer/MLScorerModel, SVCModel, LinearSVCModel & RBFSVCModel)
- Added revscoring.dependent (Dependent, Solve)
- Tests for all new modules.
- Explanation of the recommended reverse chronological release ordering.
- Basic project setup.