Releases: rspeer/wordfreq
v3.0.2: packaging fixes
v3.0: The "handle numbers better" release
Previously, wordfreq would group all digit sequences of the same 'shape',
with length 2 or more, into a single token and return the frequency of that
token, which would be a vast overestimate.
Now it distributes the frequency over all numbers of that shape, with an
estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).
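As a rough illustration of the Benford's-law idea, the sketch below spreads the frequency of one digit shape over the numbers that share it. It is not wordfreq's actual code: the function name `estimated_number_frequency` is hypothetical, leading zeros are simply given zero weight, and the special distribution for year-like 4-digit numbers is left out.

```python
import math


def estimated_number_frequency(number_text: str, shape_freq: float) -> float:
    """Estimate the frequency of one multi-digit number (hypothetical helper).

    shape_freq is the combined frequency of every number with the same
    number of digits. Each number n gets the weight log10(1 + 1/n), the
    generalized form of Benford's law. Over all k-digit numbers without a
    leading zero these weights sum to exactly 1, so the shape's total
    frequency is preserved.
    """
    if not number_text.isdigit() or number_text[0] == "0":
        # Assumption for this sketch: leading-zero forms get no weight.
        return 0.0
    n = int(number_text)
    return shape_freq * math.log10(1 + 1 / n)


# Lower numbers receive a larger share of the shape's frequency.
low = estimated_number_frequency("10", 1e-3)
high = estimated_number_frequency("99", 1e-3)
```

Under this weighting, "10" receives a larger share than "99", which matches the "lower numbers are more frequent" behavior described above.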
More changes related to digits:
- Functions such as iter_wordlist and top_n_list no longer return
  multi-digit numbers (they used to return them in their "smashed" form, such
  as "0000").
- lossy_tokenize no longer replaces digit sequences with 0s. That happens
  instead in a place that's internal to the word_frequency function, so we can
  look at the values of the digits before they're replaced.
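To make the "smashed" form concrete, here is a minimal sketch of replacing digit sequences with 0s. The name `digit_shape` and the regular expression are assumptions made for illustration. wordfreq's internal implementation may differ, for example in how it treats separators inside numbers.

```python
import re

# Two or more consecutive digits; a single digit is left as it is.
MULTI_DIGIT_RE = re.compile(r"\d{2,}")


def digit_shape(token: str) -> str:
    """Replace every digit in a multi-digit run with '0' (hypothetical helper)."""
    return MULTI_DIGIT_RE.sub(lambda match: "0" * len(match.group()), token)


# Both years collapse to the same shape, so a lookup by shape alone
# cannot tell them apart.
same_shape = digit_shape("2010") == digit_shape("1020")
```

Because the original digits are lost once the shape is computed, any per-number estimate has to read the digits before this step, which is the reason given above for moving the replacement inside word_frequency.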
Other changes:
- wordfreq is now developed using poetry as its package manager, and with
  pyproject.toml as the source of configuration instead of setup.py.
- The minimum version of Python supported is 3.7.
- Type information is exported using py.typed.
v2.5.1
Version 2.5.1 (2021-09-02)
- Import ftfy and use its uncurl_quotes method to turn curly quotes into
  straight ones, providing consistency with multiple forms of apostrophes.
- Set minimum version requirements on regex, jieba, and langcodes so that
  tokenization will give consistent results.
- Work around an inconsistency in the msgpack API around
  strict_map_key=False.
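The effect of quote normalization can be sketched with the standard library alone. This is a stand-in for illustration, not ftfy's uncurl_quotes itself. The name `straighten_quotes` is hypothetical, and the character table covers only the four most common curly quotes, so ftfy may handle more cases.

```python
# Map the common curly quotation marks to their ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark, often used as an apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ones (hypothetical helper)."""
    return text.translate(CURLY_TO_STRAIGHT)


# Both spellings of the contraction normalize to the same string,
# so they would be looked up as the same word.
same_word = straighten_quotes("don\u2019t") == straighten_quotes("don't")
```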
Version 2.5 (2021-04-15)
- Incorporate data from the OSCAR corpus.