CorPy


Installation

$ python3 -m pip install corpy

Only recent versions of Python 3 (3.10+) are supported by design.

Help and feedback

If you get stuck, a good first step is to search the documentation, available at https://corpy.rtfd.io/.

The project is developed on GitHub. You can ask for help via GitHub discussions and report bugs and give other kinds of feedback via GitHub issues. Support is provided gladly, time and other engagements permitting, but cannot be guaranteed.

What is CorPy?

A fancy plural for corpus ;) Also, a collection of handy but loosely coupled tools for dealing with linguistic data. It abstracts away functionality that is often needed in practice for teaching and day-to-day work at the Czech National Corpus, without aspiring to be a full-featured or consistent NLP framework.

Here's an idea of what you can do with CorPy:

Note

Should I pick UDPipe or MorphoDiTa?

Both are developed at ÚFAL MFF UK. UDPipe has more features at the cost of being somewhat more complex: it does both morphological tagging (including lemmatization) and syntactic parsing, and it handles a number of different input and output formats. You can also download pre-trained models for many different languages.

By contrast, MorphoDiTa only has pre-trained models for Czech and English, and only performs morphological tagging (including lemmatization). However, its output is more straightforward -- it just splits your text into tokens and annotates them, whereas UDPipe can (depending on the model) introduce additional tokens necessary for a more explicit analysis, add multi-word tokens, etc. This is because UDPipe is tailored to the type of linguistic analysis conducted within the Universal Dependencies project, using the CoNLL-U data format.

MorphoDiTa can also help you if you just want to tokenize text and don't have a language model available.

Development

Dependencies and building the docs

corpy needs to be installed in the Read the Docs virtualenv for autodoc to work. The optional dependencies in the doc group are also needed. All of this is configured in .readthedocs.yml.

License

Copyright © 2016--present ÚČNK/David Lukeš

Distributed under the GNU General Public License v3.