Releases: scribe-org/Scribe-Data

Scribe-Data 4.0.0

28 Nov 18:27

✨ Features

  • Queries were added and expanded for many more data types and languages ❤️
  • Scribe-Data is now a fully functional CLI.
    • Querying Wikidata lexicographical data can be done via the get command (#159).
    • Query output can be written as JSON, CSV, TSV or SQLite, and outputs can also be converted between these types (#145, #146).
    • Output paths can be set for query results (#144).
    • The version of the CLI can be printed to the command line, and the CLI can also be used to upgrade itself (#186, #157).
    • Total Wikidata lexemes for languages and data types can be derived with the total command (#147).
    • The get and total commands can be used via an interactive mode with the --interactive argument (#158, #203).
    • Outputs were standardized to ensure that the CLI experience is consistent.
  • The machine translation process has been removed to make way for the Wiktionary-based implementation (#292).
  • Package metadata files were standardized for languages, data types and Wikidata lexeme forms.
  • CLI commands have an argument check that can suggest correct languages and data types (#341).
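
The argument check in the last item can be pictured with a short sketch. This is only an illustration using the standard library's difflib, not the package's actual implementation: the check_argument function and the KNOWN_LANGUAGES sample list are hypothetical names introduced here.

```python
from difflib import get_close_matches

# Hypothetical sample of valid values; the real CLI reads these from its metadata files.
KNOWN_LANGUAGES = ["english", "french", "german", "italian", "portuguese", "spanish"]


def check_argument(value, valid_options, cutoff=0.6):
    """Return (is_valid, suggestion) for a user-supplied argument.

    suggestion is the closest valid option when the value is not valid,
    or None when the value is valid or nothing is similar enough.
    """
    normalized = value.strip().lower()
    if normalized in valid_options:
        return True, None
    # get_close_matches ranks candidates by similarity and drops those below the cutoff.
    matches = get_close_matches(normalized, valid_options, n=1, cutoff=cutoff)
    return False, (matches[0] if matches else None)


is_valid, suggestion = check_argument("germn", KNOWN_LANGUAGES)
if not is_valid and suggestion:
    print(f"Invalid language 'germn'. Did you mean '{suggestion}'?")
```

A similarity cutoff keeps the check from suggesting unrelated values when the input is far from every valid option.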

🐞 Bug Fixes

  • Wikidata query process stages no longer trigger the tqdm progress bar when they're unsuccessful (#155).

✅ Tests

  • Tests have been written for the CLI to ensure that its functionality remains consistent.
  • Workflows were created to check that the Wikidata queries and project structure are consistent, which helps ensure package functionality (#339, #357).
    • The project's queries and structure have been updated to match the rules developed for the checks.

📝 Documentation

  • The CLI's functionality has been fully documented (#152, #208).
  • Documentation was created to show how to write Scribe-Data queries (#395).

♻️ Code Refactoring

  • word_type has been switched to data_type throughout the codebase (#160).
  • Case, gender and annotation utility functions were removed as the formatting process that used them has changed.
  • The SPARQLWrapper access method has been extracted to the Wikidata utils and is imported into the files that need it (#164).
  • Export data paths have been converted to centrally saved variables to reduce hard coded string repetition.
  • Many files were renamed, including update_data.py being renamed to query_data.py.
  • Paths within the package have been updated to work for all operating systems via pathlib (#125).
  • The language formatting scripts have been dramatically simplified now that all export paths are the same.
  • The update_files directory was removed in preparation for other means of showing data totals.
  • The language_data_extraction directory was moved under the Wikidata directory as it's only used for those processes now (#446).
  • The emoji keyword process was centralized to simplify project maintenance (#359).
  • PyICU was removed as a dependency, and a process was made to install it and its needed dependencies based on the user's operating system (#196).
  • The data formatting step was centralized such that we only have one for all languages (#142).
  • Sub-query processes are no longer hard coded, so the total number of possible sub-queries no longer needs to be maintained within the query_data.py process.
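
As an illustration of the pathlib and centralized export path items above, here is a minimal sketch. The DEFAULT_EXPORT_DIR variable, the export_path function and the directory layout are assumptions made for this example and may not match the package's real names.

```python
from pathlib import Path

# Hypothetical central definition of the export directory, so the string is written once.
DEFAULT_EXPORT_DIR = Path("scribe_data_json_export")


def export_path(language, data_type, base_dir=DEFAULT_EXPORT_DIR):
    """Build the output file path for one language and data type.

    The / operator joins path segments with the separator of the
    current operating system, so no separator is hard coded.
    """
    return Path(base_dir) / language / f"{data_type}.json"


print(export_path("German", "nouns"))
```

Because the segments are joined by pathlib rather than by string concatenation, the same code yields a valid path on Windows, macOS and Linux.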

Scribe-Data v3.3.0

09 Jun 12:51

✨ Features

  • The translation process has been updated to allow for translations from non-English languages (#72, #73, #74, #75, #76, #77, #78, #79).

📝 Documentation

  • The documentation has been given a new layout with the logo in the top left (#90).
  • The documentation now has links to the code at the top of each page (#91).

🐞 Bug Fixes

  • Annotation bugs such as repeated or empty values were fixed.
  • Perfect tenses of Portuguese verbs were fixed via finding the appropriate PID (#68).
    • Note that the most common past perfect property is not the standard one, so this will need to be fixed.

♻️ Code Refactoring

  • pre-commit hooks have been added to the repo to improve the development experience (#137).
  • Code formatting was shifted from black to Ruff.
  • A Ruff-based GitHub workflow was added to check the code formatting and lint the codebase on each pull request (#109).
  • The _update_files directory was renamed update_files as these files are used in non-internal manners now (#57).
  • A common function has been created to map Wikidata ids to noun genders (#69).
  • The project is now installed locally for development and command line usage, so usages of sys.path have been removed from files (#122).
  • The directory structure has been dramatically streamlined and includes folders for future projects where language data could come from other sources like Wiktionary (#139).
    • Translation files are moved to their own directory.
    • The extract_transform directory has been removed and all files within it have been moved one level up.
    • The languages directory has been renamed language_data_extraction.
    • All files within wikidata/_resources have been moved to the resources directory.
    • The gender and case annotations for data formatting have now been commonly defined.
    • All language directory formatted_data files have now been moved to the scribe_data_json_export directory, in preparation for outputs needing to be directed to a directory outside of the package.
    • Path computing has been refactored throughout the codebase, and unneeded functions for data transfers have been removed.

Scribe-Data v3.2.2

24 Feb 02:40
  • Minor fixes to documentation index and file docstrings to fix errors.
  • Revert change to package path definition to hopefully register the resources directory.

Scribe-Data v3.2.1

24 Feb 01:54

♻️ Code Refactoring

  • The docs and tests were grafted into the package using MANIFEST.in.
  • Minor fixes to file and function docstrings and documentation files.
  • include_package_data=True is used in setup.py to hopefully include all files in the package distribution.

Scribe-Data v3.2.0

24 Feb 01:47

✨ Features

  • The data and process needed for an English keyboard have been added (#39).
    • The Wikidata queries for English have been updated to get all nouns and verbs.
    • Formatting scripts have been written to prepare the queried data and load it into an SQLite database.
  • The data update process has been cleaned up in preparation for future changes to Scribe-Data and to implement better practices.
  • Language data was extracted into a JSON file for more succinct referencing (#52).
  • Language codes are now checked with the package langcodes for easier expansion.
  • A process has been created to check and update words that can be translated for each Scribe language (#44).
  • The baseline data returned from Wikidata queries is now removed once a formatted data file is created.

🐞 Bug Fixes

  • Tensorflow was removed from the download wiki process to fix build problems on Macs.

✅ Tests

  • A full testing suite has been added to run on GitHub Actions (#37).
  • Unit tests have been added for Wikidata queries (#48) and utility functions (#50).

♻️ Code Refactoring

  • The Anaconda based virtual environment was removed and documentation was updated to reflect this.
  • Language data processes were moved into the src/scribe_data/extract_transform/languages directory to clean up the structure.
  • Code formatting processes were defined with common structures based on language and word type variables defined at the top of files.

Scribe-Data 3.1.0

30 Apr 15:29

✨ Features

  • The word "Scribe" is now added to language database nouns files if it's not already present.
  • German contracted prepositions have been added to the German prepositions formatting process.
  • Words that are upper case are now better included in the autocomplete lexicon with their lower case equivalents being removed.
  • Words with apostrophes have been removed from the autocomplete lexicon.
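
The two autocomplete lexicon rules above can be sketched as follows. This is an assumption-laden illustration rather than the package's code: build_lexicon is a hypothetical name, and the exact rules used in Scribe-Data may differ.

```python
def build_lexicon(words):
    """Filter candidate words for an autocomplete lexicon.

    Words containing an apostrophe are dropped, and a lower case word is
    dropped when a variant with upper case letters is also present.
    """
    candidates = {w for w in words if "'" not in w and "\u2019" not in w}
    # Lower case forms of every word that contains at least one upper case letter.
    shadowed = {w.lower() for w in candidates if w != w.lower()}
    return sorted(w for w in candidates if not (w == w.lower() and w in shadowed))


print(build_lexicon(["Haus", "haus", "gehen", "geht's", "Berlin"]))
# → ['Berlin', 'Haus', 'gehen']
```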

♻️ Code Refactoring

  • Database output column names are now zero indexed to better align with Python and other language standards.

Scribe-Data 3.0.0

19 Apr 00:35

✨ Features

  • Scribe-Data now has the ability to generate SQLite databases from formatted language data.
    • data_to_sqlite.py is used to read available JSON files and input their information into the databases.
  • These databases are now sent to Scribe apps via defined paths.
    • send_dbs_to_scribe.py finds all available language databases and copies them.
    • Separating this step from the data update is in preparation for data import in the future where this will be an individual step.
  • Scribe-Data now also creates autocomplete lexicons for each language within data_to_sqlite.py.
  • JSON data is no longer able to be uploaded to Scribe app directories directly, with the SQLite directories now being exported instead.
  • Emojis of singular nouns are now also linked to their plural counterparts if the plural isn't present in the emoji keyword outputs.
  • The emoji process now also updates a column in the data_table.txt file for sharing on readmes, with update_data.py maintaining it in the data update process.
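
The JSON to SQLite step described above can be sketched with the standard library. This is a minimal illustration, not the contents of data_to_sqlite.py: the load_json_into_sqlite function, the two-column table layout and the sample data are all assumptions.

```python
import json
import sqlite3


def load_json_into_sqlite(json_text, connection, table_name):
    """Insert a JSON object of word -> value pairs into a two-column table."""
    entries = json.loads(json_text)
    # Table names cannot be bound as query parameters, so only plain identifiers are accepted.
    if not table_name.isidentifier():
        raise ValueError(f"Unsafe table name: {table_name!r}")
    connection.execute(
        f"CREATE TABLE IF NOT EXISTS {table_name} (word TEXT PRIMARY KEY, value TEXT)"
    )
    connection.executemany(
        f"INSERT OR REPLACE INTO {table_name} (word, value) VALUES (?, ?)",
        entries.items(),
    )
    connection.commit()
    return len(entries)


# An in-memory database keeps the example free of side effects.
connection = sqlite3.connect(":memory:")
load_json_into_sqlite('{"Buch": "N", "Katze": "F"}', connection, "nouns")
print(connection.execute("SELECT word, value FROM nouns ORDER BY word").fetchall())
```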

♻️ Code Refactoring

  • The Jupyter notebooks for autosuggestions and emojis as well as update_data.py were moved to the extract_transform directory given that they're not used to load data anymore.
    • Their code was refactored to reflect their new locations.
  • Massive amounts of refactoring happened to achieve the shift in the data export method:
    • format_WORD_TYPE.py files export to a formatted_data directory within extract_transform.
    • Copies of all data JSONs that were originally in Scribe apps are now in the formatted_data directories.
    • Functions in update_utils.py were switched given that data is no longer uploaded into a Data directory within the language keyboard directories within Scribe apps.
    • Lots of functions and variables were renamed to make them more understandable.
    • Code to derive appropriate export locations within format_WORD_TYPE.py files was removed in favor of a language formatted_data directory.
    • regex was added as a dependency.
    • pylint comments were removed.
  • Verb SPARQL query scripts for Spanish and Italian were simplified to remove unneeded repeat conditions.

🐞 Bug Fixes

  • The statements in translation files have been fixed as they were improperly defined after a file was moved.

Scribe-Data 2.1.0

05 Nov 09:04

✨ Features

  • Scribe-Data can now split Wikidata queries into multiple stages to break up those that were too large to run.
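
One way to picture staged querying is sketched below. It is not the package's implementation: the run_in_stages function, the injected run_query callable and the "lexemeID" key are assumptions chosen for this example. Each stage returns only some of the columns for a lexeme, and the partial rows are merged by ID.

```python
def run_in_stages(stage_queries, run_query):
    """Run each stage query and merge the partial rows by lexeme ID.

    run_query is any callable that takes a query string and returns a
    list of dicts, each containing a "lexemeID" key (an assumed name).
    """
    merged = {}
    for query in stage_queries:
        for row in run_query(query):
            # Later stages add their columns to the row started by earlier stages.
            merged.setdefault(row["lexemeID"], {}).update(row)
    return list(merged.values())


# A stand-in for a real endpoint call, used only to show the merging behavior.
fake_results = {
    "stage 1": [{"lexemeID": "L1", "infinitive": "hablar"}],
    "stage 2": [{"lexemeID": "L1", "presFPS": "hablo"}],
}
print(run_in_stages(["stage 1", "stage 2"], fake_results.__getitem__))
```

Passing the query runner in as an argument keeps the merging logic independent of the network call.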

Scribe-Data 2.0.0

10 Oct 10:21

✨ Features

  • Scribe-Data now has the ability to download Wikipedia dumps of any language.
  • Functions have been added to parse and clean the above dumps.
  • Autosuggestions are generated from the cleaned texts by deriving most common words and those words that most commonly follow them.
  • A query for profane words has been added and integrated into the autosuggest flow to make sure that inappropriate words are not included.
    • The adjectives column has been removed from Scribe data tables until support is offered.
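
The autosuggestion idea above (most common words, the words that most often follow them, and a profanity filter) can be sketched as follows. This is a simplified illustration under stated assumptions: the autosuggestions function, its parameters and the sample text are hypothetical, and the real process works on cleaned Wikipedia dumps.

```python
from collections import Counter, defaultdict


def autosuggestions(tokens, top_words=2, per_word=2, blocked=frozenset()):
    """Map the most common words to the words that most often follow them.

    Tokens in blocked are never used as a key or as a suggestion.
    """
    tokens = [t.lower() for t in tokens]
    counts = Counter(t for t in tokens if t not in blocked)
    followers = defaultdict(Counter)
    # Count each adjacent pair (current word, following word).
    for current, following in zip(tokens, tokens[1:]):
        if current not in blocked and following not in blocked:
            followers[current][following] += 1
    return {
        word: [w for w, _ in followers[word].most_common(per_word)]
        for word, _ in counts.most_common(top_words)
    }


text = "the cat sat the cat ran the dog sat"
print(autosuggestions(text.split(), top_words=1))
# → {'the': ['cat', 'dog']}
```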

♻️ Code Refactoring

  • The error messages for incorrect args in update_data.py have been updated.

Scribe-Data 1.0.1

07 Apr 11:17

✨ Features

  • update_data.py now functions using SPARQLWrapper instead of wikidataintegrator.

🐞 Bug Fixes

  • The data update process has been fixed to work for all queries.
  • Hard coded strings for Spanish formatting files were fixed.
  • The paths of update_data.py were changed to match the new package structure.