Commit

packaging updates

rspeer committed Mar 11, 2022
1 parent 3180972 commit 0fc7756
Showing 7 changed files with 42 additions and 93 deletions.
11 changes: 10 additions & 1 deletion CHANGELOG.md
@@ -13,7 +13,7 @@
estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).

-Relatedly:
+More changes related to digits:

- Functions such as `iter_wordlist` and `top_n_list` no longer return
multi-digit numbers (they used to return them in their "smashed" form, such
@@ -23,6 +23,15 @@ Relatedly:
instead in a place that's internal to the `word_frequency` function, so we can
look at the values of the digits before they're replaced.

+Other changes:
+
+- wordfreq is now developed using `poetry` as its package manager, and with
+  `pyproject.toml` as the source of configuration instead of `setup.py`.
+
+- The minimum version of Python supported is 3.7.
+
+- Type information is exported using `py.typed`.
+
## Version 2.5.1 (2021-09-02)

- Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
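As an illustration of the "smashed" digit handling these CHANGELOG entries describe, here is a minimal, self-contained sketch; the helper names are hypothetical, and this is not wordfreq's actual implementation:

```python
import re

def smash_digits(token: str) -> str:
    """Aggregate a multi-digit token into its binned form, e.g. '2022' -> '0000'.
    Single-digit tokens keep their own wordlist entries and are left alone."""
    if sum(ch.isdigit() for ch in token) > 1:
        return re.sub(r"\d", "0", token)
    return token

def iter_wordlist_entries(entries):
    """Yield wordlist entries, skipping the aggregated multi-digit bins
    (the entries that iter_wordlist and top_n_list no longer return)."""
    for entry in entries:
        if len(entry) > 1 and re.fullmatch(r"[0#]+", entry):
            continue
        yield entry

assert smash_digits("2022") == "0000"
assert smash_digits("7") == "7"
assert list(iter_wordlist_entries(["the", "0000", "7", "cat"])) == ["the", "7", "cat"]
```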
4 changes: 0 additions & 4 deletions Jenkinsfile

This file was deleted.

40 changes: 20 additions & 20 deletions README.md
@@ -11,7 +11,7 @@
in the usual way, either by getting it from pip:

pip3 install wordfreq

-or by getting the repository and installing it using [poetry][]:
+or by getting the repository and installing it for development, using [poetry][]:

poetry install

@@ -23,8 +23,8 @@
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage

wordfreq provides access to estimates of the frequency with which a word is
-used, in 36 languages (see *Supported languages* below). It uses many different
-data sources, not just one corpus.
+used, in over 40 languages (see *Supported languages* below). It uses many
+different data sources, not just one corpus.

It provides both 'small' and 'large' wordlists:

@@ -144,8 +144,8 @@
as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
with earlier versions of wordfreq, our stand-in character is actually `0`.) This
is the same form of aggregation that the word2vec vocabulary does.

-Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
-their own entries in each language's wordlist.
+Single-digit numbers are unaffected by this process; "0" through "9" have their own
+entries in each language's wordlist.

When asked for the frequency of a token containing multiple digits, we multiply
the frequency of that aggregated entry by a distribution estimating the frequency
@@ -158,10 +158,10 @@
The first digits are assigned probabilities by Benford's law, and years are assigned
probabilities from a distribution that peaks at the "present". I explored this in
a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.

-The part of this distribution representing the "present" is not strictly a peak;
-it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
-Ngrams was updated, and 2039 is a time by which I will probably have figured out
-a new distribution.)
+The part of this distribution representing the "present" is not strictly a peak and
+doesn't move forward with time as the present does. Instead, it's a 20-year-long
+plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
+and 2039 is a time by which I will probably have figured out a new distribution.)

Some examples:

@@ -172,7 +172,7 @@
Some examples:
>>> word_frequency("1022", "en")
1.28e-07

-Aside from years, the distribution does **not** care about the meaning of the numbers:
+Aside from years, the distribution does not care about the meaning of the numbers:

>>> word_frequency("90210", "en")
3.34e-10
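A hedged sketch of the estimate described above; this is not wordfreq's actual code (which also models year-like 4-digit strings with the plateau described earlier), and all names here are hypothetical:

```python
import math

# Hypothetical constants: the plateau bounds come from the README text above.
PLATEAU_START, PLATEAU_END = 2019, 2039

def benford_first_digit(d: int) -> float:
    """Benford's law: P(first digit = d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

def digit_sequence_weight(digits: str) -> float:
    """Rough weight for a specific digit string within its aggregated bin:
    Benford's law for the first digit, uniform for the remaining digits.
    (A fuller model would boost strings that look like years.)"""
    weight = benford_first_digit(int(digits[0])) if digits[0] != "0" else 0.1
    weight *= (1 / 10) ** (len(digits) - 1)
    return weight

def estimated_frequency(binned_freq: float, digits: str) -> float:
    """Multiply the aggregated entry's frequency by the digit distribution."""
    return binned_freq * digit_sequence_weight(digits)

# Benford's law is a proper distribution over first digits 1-9:
assert abs(sum(benford_first_digit(d) for d in range(1, 10)) - 1.0) < 1e-9
# Lower leading digits are more frequent:
assert digit_sequence_weight("1022") > digit_sequence_weight("9022")
```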
@@ -419,19 +419,16 @@
As much as we would like to give each language its own distinct code and its
own distinct word list with distinct source data, there aren't actually sharp
boundaries between languages.

-Sometimes, it's convenient to pretend that the boundaries between
-languages coincide with national borders, following the maxim that "a language
-is a dialect with an army and a navy" (Max Weinreich). This gets complicated
-when the linguistic situation and the political situation diverge.
-Moreover, some of our data sources rely on language detection, which of course
-has no idea which country the writer of the text belongs to.
+Sometimes, it's convenient to pretend that the boundaries between languages
+coincide with national borders, following the maxim that "a language is a
+dialect with an army and a navy" (Max Weinreich). This gets complicated when the
+linguistic situation and the political situation diverge. Moreover, some of our
+data sources rely on language detection, which of course has no idea which
+country the writer of the text belongs to.

So we've had to make some arbitrary decisions about how to represent the
fuzzier language boundaries, such as those within Chinese, Malay, and
-Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
-the mutual intelligibility or unintelligibility of languages.
-
-[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
+Croatian/Bosnian/Serbian.

Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
module to find the best match for a language code. If you ask for word
@@ -446,6 +443,9 @@
the 'cjk' feature:

pip install wordfreq[cjk]

+You can put `wordfreq[cjk]` in a list of dependencies, such as the
+`[tool.poetry.dependencies]` list of your own project.
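For instance, a downstream project's dependency declaration might look like this (a sketch; the version constraint is a placeholder, not something this commit prescribes):

```toml
[tool.poetry.dependencies]
python = "^3.7"
# Pull in wordfreq together with its CJK tokenizer dependencies
wordfreq = { version = "*", extras = ["cjk"] }
```

Within a checkout of this repository itself, the same extras can be requested at install time with `poetry install -E cjk`.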

Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
and `mecab-ko-dic`.
7 changes: 6 additions & 1 deletion poetry.lock

Some generated files are not rendered by default.

6 changes: 6 additions & 0 deletions pyproject.toml
@@ -5,6 +5,7 @@
description = "Look up the frequencies of words in many languages, based on many
authors = ["Robyn Speer <[email protected]>"]
license = "MIT"
readme = "README.md"
+homepage = "https://github.com/rspeer/wordfreq/"

[tool.poetry.dependencies]
python = "^3.7"
@@ -25,6 +26,11 @@
black = "^22.1.0"
flake8 = "^4.0.1"
types-setuptools = "^57.4.9"

+[tool.poetry.extras]
+cjk = ["mecab-python3", "ipadic", "mecab-ko-dic", "jieba >= 0.42"]
+mecab = ["mecab-python3", "ipadic", "mecab-ko-dic"]
+jieba = ["jieba >= 0.42"]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
2 changes: 0 additions & 2 deletions setup.cfg

This file was deleted.

65 changes: 0 additions & 65 deletions setup.py

This file was deleted.
