Releases: weblyzard/inscriptis
Fixed annotations for borderline cases
Please refer to https://github.com/weblyzard/inscriptis/releases/tag/2.0rc1 for a list of all new features. This release candidate fixes the following issues in rc1:
- fixed annotations for some borderline cases
- improved documentation compared to 2.0rc2
Improved document model, parsing of borderline cases & HTML annotation support
- HTML parsing:
  - new: new model for handling blocks and lines
  - chg: improved HTML parsing of tables, enumerations and margins; fixed borderline cases
  - chg: improved whitespace handling
  - add: cover more borderline cases with unit tests
- Inscriptis core:
  - new: support for annotation rules and annotation output
  - new: annotation post-processors (html, xml, surface form)
  - new: type hints
  - chg: extended and improved documentation
- Inscript command line client:
  - chg: apply `--encoding` to Web URLs as well
  - chg: apply
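To make the annotation output more concrete, the following is a minimal, standard-library-only sketch of what a surface-form post-processor might do. It assumes that annotations are given as `(start, end, label)` offsets into the converted text; the function name `surface_forms` and the sample data are hypothetical and do not reproduce the Inscriptis implementation.

```python
from typing import List, Tuple

# Assumed annotation format: (start offset, end offset, label).
Annotation = Tuple[int, int, str]


def surface_forms(text: str, annotations: List[Annotation]) -> List[Tuple[str, str]]:
    """Return (label, covered text) pairs for every annotation."""
    return [(label, text[start:end]) for start, end, label in annotations]


# Made-up sample: a heading followed by a sentence with an emphasized word.
text = "Chur\n\nChur is the capital of Grisons."
annotations = [(0, 4, "heading"), (6, 10, "emphasis")]
print(surface_forms(text, annotations))
```

Keeping the offsets separate from the text means the same annotations can feed several post-processors without re-parsing the HTML.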
1.2
Improved margin handling & more liberal licensing
- ignore top margins at the beginning of a document.
- more liberal licensing:
  - the license change has been triggered by another project that created a Java port of inscriptis.
  - to facilitate the free sharing of code and ideas between our two projects, we have (i) obtained the permission of all contributors for a license change, and (ii) changed the inscriptis license to the "Apache License 2.0".
Improved testing and Python 3.9 support
- minor performance improvements and code optimizations
- added Python 3.9 test environment
- improved test coverage
- updated package metadata
- improved tox configuration
Improved HTML rendering, command line client and Web service
- added support for rendering tags with the `white-space: pre` CSS attribute (e.g. `<pre>`, which is often used for formatting code).
- API change: A `ParserConfig` object replaces the parameters `display_images`, `deduplicate_captions`, `display_links` and `indentation` in `get_text()` and for initializing the `Inscriptis` class.

  ```python
  from lxml.html import fromstring
  from inscriptis.html_engine import Inscriptis
  from inscriptis.model.config import ParserConfig

  # sample input; replace with the HTML to convert
  html = "<p>Hello <a href='https://example.com'>world</a></p>"
  html_tree = fromstring(html)
  # optional parser configuration fine tuning
  config = ParserConfig(display_links=True, display_anchors=True)
  parser = Inscriptis(html_tree, config)
  text = parser.get_text()
  ```
- command line client:
  - added option for displaying anchor links
  - `--encoding` now sets the HTML and output encoding
  - new `--version` option
- Web service
  - use the relaxed CSS profile per default
  - added `version` call
- Documentation fixes and improvements
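The effect of `white-space: pre` can be illustrated without the library. The following is a sketch in plain Python, not the Inscriptis code; `render_text` and its `pre` flag are hypothetical names. With normal white-space handling, runs of whitespace collapse to a single space, whereas with `pre` the spaces and line breaks are kept as written.

```python
import re


def render_text(content: str, pre: bool = False) -> str:
    """Render element text with CSS-like white-space handling."""
    if pre:
        # white-space: pre -> keep spaces and line breaks untouched
        return content
    # white-space: normal -> collapse whitespace runs and trim the ends
    return re.sub(r"\s+", " ", content).strip()


code = "def f(x):\n    return x + 1"
print(render_text(code))            # collapsed onto a single line
print(render_text(code, pre=True))  # line break and indentation preserved
```

This is why code snippets inside `<pre>` stay readable after conversion, while ordinary paragraphs are reflowed.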
Improved performance and code structure, documentation and unit testing
- improved performance and code structure.
- use metadata published in `./inscriptis/__init__.py` for versioning and in setup.py.
- improved test coverage
- created sphinx API, usage and testing documentation which is published on https://inscriptis.readthedocs.org
- requires Python 3.5+ (dropped support for Python 2.7)
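Publishing metadata in `__init__.py` and reusing it in `setup.py` is a common single-source pattern. One way to do this, sketched below, is to read the dunder assignments with a regular expression so that the build script does not need to import the package. The helper `read_metadata` and the sample values are hypothetical; the actual Inscriptis build scripts may work differently.

```python
import re
from typing import Dict

# Matches lines such as: __version__ = '1.0'
_METADATA = re.compile(r"^__(\w+)__\s*=\s*['\"]([^'\"]*)['\"]", re.MULTILINE)


def read_metadata(source: str) -> Dict[str, str]:
    """Extract __name__ = 'value' style assignments from module source."""
    return dict(_METADATA.findall(source))


# Sample content standing in for ./inscriptis/__init__.py (values are made up).
init_py = "__version__ = '0.0.0'\n__license__ = 'Example License'\n"
metadata = read_metadata(init_py)
print(metadata["version"])
```

In a real `setup.py`, the source string would come from reading the package's `__init__.py` file, and the resulting dictionary would be passed on to `setup()`.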
Correct inscript.py default indentation strategy.
Use the `extended` indentation strategy per default as outlined in the README.md.
Improved indentation and custom rendering styles
- improved indentation, if span and div tags are used
- support for custom rendering styles
- improved documentation
- use travis for auto CI
- requires Python 2.7+ or Python 3.5+ since lxml does not support Python 3 versions <3.5
Improved table rendering (nested tables and line breaks in tables)
- Correctly handle nested tables and line breaks (e.g. due to enumerations, list or paragraph breaks) in tables.
- Improved content stripping.
Please take a look at the Rendering document for an overview of how Inscriptis renders different tables.
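As a rough illustration of why line breaks inside cells need special care, here is a simplified sketch of rendering one table row whose cells span several lines. It is not the Inscriptis algorithm; `render_row` is a hypothetical helper that pads every cell to a common height and width before joining the columns.

```python
from typing import List


def render_row(cells: List[str], separator: str = "  ") -> str:
    """Render one table row; cells may contain line breaks."""
    cell_lines = [cell.split("\n") for cell in cells]
    height = max(len(lines) for lines in cell_lines)
    columns = []
    for lines in cell_lines:
        width = max(len(line) for line in lines)
        # pad to the common row height, then to the column width
        padded = lines + [""] * (height - len(lines))
        columns.append([line.ljust(width) for line in padded])
    # join the n-th line of every column into the n-th output line
    return "\n".join(separator.join(parts).rstrip() for parts in zip(*columns))


# The middle cell holds an enumeration and therefore spans two lines.
print(render_row(["Fruits", "- apple\n- pear", "2"]))
```

Without the padding step, the second line of the enumeration would slide into the first column and break the table layout.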