Commit

packaging updates

rspeer committed Mar 11, 2022
1 parent 3180972 commit 0fc7756
Showing 7 changed files with 42 additions and 93 deletions.
11 changes: 10 additions & 1 deletion CHANGELOG.md
@@ -13,7 +13,7 @@
estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).

-Relatedly:
+More changes related to digits:

- Functions such as `iter_wordlist` and `top_n_list` no longer return
multi-digit numbers (they used to return them in their "smashed" form, such
@@ -23,6 +23,15 @@ Relatedly:
instead in a place that's internal to the `word_frequency` function, so we can
look at the values of the digits before they're replaced.

+Other changes:
+
+- wordfreq is now developed using `poetry` as its package manager, and with
+  `pyproject.toml` as the source of configuration instead of `setup.py`.
+
+- The minimum version of Python supported is 3.7.
+
+- Type information is exported using `py.typed`.
+
## Version 2.5.1 (2021-09-02)

- Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
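As an illustration of the "smashed" digit handling these CHANGELOG entries describe, here is a minimal, self-contained sketch; the helper names are hypothetical, and this is not wordfreq's actual implementation:

```python
import re

def smash_digits(token: str) -> str:
    """Aggregate a multi-digit token into its binned form, e.g. '2022' -> '0000'.
    Single-digit tokens keep their own wordlist entries and are left alone."""
    if sum(ch.isdigit() for ch in token) > 1:
        return re.sub(r"\d", "0", token)
    return token

def iter_wordlist_entries(entries):
    """Yield wordlist entries, skipping the aggregated multi-digit bins
    (the entries that iter_wordlist and top_n_list no longer return)."""
    for entry in entries:
        if len(entry) > 1 and re.fullmatch(r"[0#]+", entry):
            continue
        yield entry

assert smash_digits("2022") == "0000"
assert smash_digits("7") == "7"
assert list(iter_wordlist_entries(["the", "0000", "7", "cat"])) == ["the", "7", "cat"]
```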
4 changes: 0 additions & 4 deletions Jenkinsfile

This file was deleted.

40 changes: 20 additions & 20 deletions README.md
@@ -11,7 +11,7 @@
in the usual way, either by getting it from pip:

pip3 install wordfreq

-or by getting the repository and installing it using [poetry][]:
+or by getting the repository and installing it for development, using [poetry][]:

poetry install

@@ -23,8 +23,8 @@
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage

wordfreq provides access to estimates of the frequency with which a word is
-used, in 36 languages (see *Supported languages* below). It uses many different
-data sources, not just one corpus.
+used, in over 40 languages (see *Supported languages* below). It uses many
+different data sources, not just one corpus.

It provides both 'small' and 'large' wordlists:

@@ -144,8 +144,8 @@
as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
with earlier versions of wordfreq, our stand-in character is actually `0`.) This
is the same form of aggregation that the word2vec vocabulary does.

-Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
-their own entries in each language's wordlist.
+Single-digit numbers are unaffected by this process; "0" through "9" have their own
+entries in each language's wordlist.

When asked for the frequency of a token containing multiple digits, we multiply
the frequency of that aggregated entry by a distribution estimating the frequency
@@ -158,10 +158,10 @@
The first digits are assigned probabilities by Benford's law, and years are assigned
probabilities from a distribution that peaks at the "present". I explored this in
a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.

-The part of this distribution representing the "present" is not strictly a peak;
-it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
-Ngrams was updated, and 2039 is a time by which I will probably have figured out
-a new distribution.)
+The part of this distribution representing the "present" is not strictly a peak and
+doesn't move forward with time as the present does. Instead, it's a 20-year-long
+plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
+and 2039 is a time by which I will probably have figured out a new distribution.)

Some examples:

@@ -172,7 +172,7 @@
Some examples:
>>> word_frequency("1022", "en")
1.28e-07

-Aside from years, the distribution does **not** care about the meaning of the numbers:
+Aside from years, the distribution does not care about the meaning of the numbers:

>>> word_frequency("90210", "en")
3.34e-10
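A hedged sketch of the estimate described above; this is not wordfreq's actual code (which also models year-like 4-digit strings with the plateau described earlier), and all names here are hypothetical:

```python
import math

# Hypothetical constants: the plateau bounds come from the README text above.
PLATEAU_START, PLATEAU_END = 2019, 2039

def benford_first_digit(d: int) -> float:
    """Benford's law: P(first digit = d) = log10(1 + 1/d)."""
    return math.log10(1 + 1 / d)

def digit_sequence_weight(digits: str) -> float:
    """Rough weight for a specific digit string within its aggregated bin:
    Benford's law for the first digit, uniform for the remaining digits.
    (A fuller model would boost strings that look like years.)"""
    weight = benford_first_digit(int(digits[0])) if digits[0] != "0" else 0.1
    weight *= (1 / 10) ** (len(digits) - 1)
    return weight

def estimated_frequency(binned_freq: float, digits: str) -> float:
    """Multiply the aggregated entry's frequency by the digit distribution."""
    return binned_freq * digit_sequence_weight(digits)

# Benford's law is a proper distribution over first digits 1-9:
assert abs(sum(benford_first_digit(d) for d in range(1, 10)) - 1.0) < 1e-9
# Lower leading digits are more frequent:
assert digit_sequence_weight("1022") > digit_sequence_weight("9022")
```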
@@ -419,19 +419,16 @@
As much as we would like to give each language its own distinct code and its
own distinct word list with distinct source data, there aren't actually sharp
boundaries between languages.

-Sometimes, it's convenient to pretend that the boundaries between
-languages coincide with national borders, following the maxim that "a language
-is a dialect with an army and a navy" (Max Weinreich). This gets complicated
-when the linguistic situation and the political situation diverge.
-Moreover, some of our data sources rely on language detection, which of course
-has no idea which country the writer of the text belongs to.
+Sometimes, it's convenient to pretend that the boundaries between languages
+coincide with national borders, following the maxim that "a language is a
+dialect with an army and a navy" (Max Weinreich). This gets complicated when the
+linguistic situation and the political situation diverge. Moreover, some of our
+data sources rely on language detection, which of course has no idea which
+country the writer of the text belongs to.

So we've had to make some arbitrary decisions about how to represent the
fuzzier language boundaries, such as those within Chinese, Malay, and
-Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
-the mutual intelligibility or unintelligibility of languages.
-
-[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
+Croatian/Bosnian/Serbian.

Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
module to find the best match for a language code. If you ask for word
@@ -446,6 +443,9 @@
the 'cjk' feature:

pip install wordfreq[cjk]

+You can put `wordfreq[cjk]` in a list of dependencies, such as the
+`[tool.poetry.dependencies]` list of your own project.
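For instance, a downstream project's dependency declaration might look like this (a sketch; the version constraint is a placeholder, not something this commit prescribes):

```toml
[tool.poetry.dependencies]
python = "^3.7"
# Pull in wordfreq together with its CJK tokenizer dependencies
wordfreq = { version = "*", extras = ["cjk"] }
```

Within a checkout of this repository itself, the same extras can be requested at install time with `poetry install -E cjk`.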

Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
and `mecab-ko-dic`.
7 changes: 6 additions & 1 deletion poetry.lock

Some generated files are not rendered by default.

6 changes: 6 additions & 0 deletions pyproject.toml
@@ -5,6 +5,7 @@
description = "Look up the frequencies of words in many languages, based on many
authors = ["Robyn Speer <[email protected]>"]
license = "MIT"
readme = "README.md"
+homepage = "https://github.com/rspeer/wordfreq/"

[tool.poetry.dependencies]
python = "^3.7"
@@ -25,6 +26,11 @@
black = "^22.1.0"
flake8 = "^4.0.1"
types-setuptools = "^57.4.9"

+[tool.poetry.extras]
+cjk = ["mecab-python3", "ipadic", "mecab-ko-dic", "jieba >= 0.42"]
+mecab = ["mecab-python3", "ipadic", "mecab-ko-dic"]
+jieba = ["jieba >= 0.42"]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
2 changes: 0 additions & 2 deletions setup.cfg

This file was deleted.

65 changes: 0 additions & 65 deletions setup.py

This file was deleted.
