This service compiles LaTeX code into PDFs and extracts detailed layout and reading information. Designed specifically for academic publications, this tool not only compiles LaTeX but also annotates each token and figure, retrieves their positions in the PDF, identifies the corresponding semantic structure labels, and marks the correct reading order.
[2024-06-13]:
- Refactored implementation of multiprocessing.
- Refactored the calls to LaTeXML.
- Guess the label and reading order of words that could not be colored, based on line numbers.
- Aligned PyMuPDF and pdfplumber based on position. We now recognize font flags such as bold and italic.
[2024-03-26]:
- Added de-macro preprocessing, which cleans up LaTeX code by expanding some of the simple custom macros defined by `\newcommand`, and also inserts `\input` files into the main file. This should increase the success rate of parsing.
- Improved the conservative parsing strategy for `LatexGroupNode`. We previously didn't color-code these nodes because we were concerned that doing so would break the parameters of unknown macros. Now we include `LatexGroupNode` with a length of more than 20 characters in the color annotation.
- Added a black font to the PDF output. Color markup still changes the layout of the original paper to a greater or lesser extent, so to keep the annotation looking the same as the original document, we now compile the "original document" in black font.
[2024-04-08]:
- LaTeXML is installed in the AutoTeX container and is used to standardize the code for the `tex` column in the annotation.
- Applied standardized annotation to `equation` and `table` annotation.
- Refined the annotation strategy for `algorithm` environments.
- LaTeX Compilation: Compile LaTeX into PDF using a dockerized environment, leveraging TeX Live 2023.
- LaTeX Annotation: Add color labels to each token and figure in LaTeX code to facilitate automatic extraction of document layout.
- Data Extraction: Extract fine-grained information about every token and figure, such as its type, position, and corresponding section in the compiled PDF document, and output this as a pandas DataFrame.
- Docker
- Python 3.8+
git clone https://github.com/InsightsNet/texannotate.git
cd texannotate
pip install -r requirements.txt
For example, suppose you wish to annotate the LaTeX project for arXiv paper 1601.00978. First, fetch the sources for the project:
mkdir downloaded
wget -O downloaded/1601.00978.tar.gz https://arxiv.org/e-print/1601.00978 --user-agent "Name <email>"
python main.py
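The download step above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `build_request` and `fetch_source` are illustrative assumptions and are not part of this repository. The URL, the `downloaded` directory, and the user-agent placeholder mirror the `wget` command.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def build_request(arxiv_id: str, contact: str) -> Request:
    """Build the e-print request with an identifying User-Agent header."""
    url = f"https://arxiv.org/e-print/{arxiv_id}"
    return Request(url, headers={"User-Agent": contact})


def fetch_source(arxiv_id: str, contact: str, out_dir: str = "downloaded") -> Path:
    """Download the source archive to <out_dir>/<arxiv_id>.tar.gz."""
    target = Path(out_dir) / f"{arxiv_id}.tar.gz"
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(build_request(arxiv_id, contact)) as response:
        target.write_bytes(response.read())
    return target


# Usage (needs network access; replace the placeholder with your own contact):
# fetch_source("1601.00978", "Name <email>")
```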
The tool outputs two pandas DataFrames for each input LaTeX source package; their columns are listed below.
section_id | nested_to |
---|---|
int | int |
The first row is the root node of the table of contents, whose section_id is 0 and nested_to is -1.
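If nested_to is read as the section_id of the parent node (an assumption based on the root row having -1), the section hierarchy can be rebuilt from this DataFrame. The sketch below works on plain dictionaries, such as those returned by `DataFrame.to_dict("records")`; the function names and sample rows are made up for illustration.

```python
from collections import defaultdict


def build_children(rows):
    """Map each section_id to the section_ids nested directly under it."""
    children = defaultdict(list)
    for row in rows:
        children[row["nested_to"]].append(row["section_id"])
    return children


def section_path(rows, section_id):
    """Return the chain of section_ids from the root (0) down to section_id."""
    parent = {row["section_id"]: row["nested_to"] for row in rows}
    path = []
    while section_id != -1:  # -1 marks "no parent", i.e. we passed the root
        path.append(section_id)
        section_id = parent[section_id]
    return path[::-1]


# Hypothetical rows, e.g. toc_df.to_dict("records"); the first row is the root.
toc_rows = [
    {"section_id": 0, "nested_to": -1},
    {"section_id": 1, "nested_to": 0},
    {"section_id": 2, "nested_to": 1},
    {"section_id": 3, "nested_to": 0},
]

print(build_children(toc_rows)[0])  # direct children of the root
print(section_path(toc_rows, 2))    # ancestors of section 2, root first
```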
reading_order | label | block_id | section_id | text | page_no | x0 | y0 | x1 | y1 | font | font_size | flags | tex | page_size | line_no |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
int | str | int | int | str | int | float | float | float | float | str | float | list | str | list | int |
Each row is a figure or token extracted from the PDF. The integer reading_order, starting from 0, follows the author's writing order. If it is -1, the token is not content written by the author (e.g., watermarks and headers). label is the semantic structure label, which is one of: Abstract, Author, Caption, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title.
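A common use of these columns is to recover the author's text in reading order. Here is a minimal sketch, assuming the rows are available as dictionaries (for example via `DataFrame.to_dict("records")`); the function name and the sample rows are hypothetical and only carry the columns the sketch needs.

```python
from itertools import groupby


def author_text_by_label(rows):
    """Drop non-author tokens, sort by reading order, and merge runs of equal labels."""
    content = [r for r in rows if r["reading_order"] != -1]
    content.sort(key=lambda r: r["reading_order"])
    merged = []
    for label, run in groupby(content, key=lambda r: r["label"]):
        merged.append((label, " ".join(r["text"] for r in run)))
    return merged


# Hypothetical token rows; the row with reading_order -1 is not author content.
token_rows = [
    {"reading_order": 2, "label": "Paragraph", "text": "world"},
    {"reading_order": -1, "label": "Footer", "text": "arXiv:1601.00978"},
    {"reading_order": 0, "label": "Title", "text": "Hello"},
    {"reading_order": 1, "label": "Paragraph", "text": "big"},
]

for label, text in author_text_by_label(token_rows):
    print(f"{label}: {text}")
```

Grouping only consecutive tokens (rather than all tokens with the same label) keeps separate paragraphs apart when another element, such as an equation, sits between them.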
See the example of annotating one paper.
Here's another example summarizing the details of the paper with an LLM.
This work was presented at The 2nd Workshop on Information Extraction from Scientific Publications (WIESP) @ IJCNLP-AACL 2023.
@inproceedings{duan-etal-2023-latex,
title = "{L}a{T}e{X} Rainbow: Universal {L}a{T}e{X} to {PDF} Document Semantic {\&} Layout Annotation Framework",
author = "Duan, Changxu and
Tan, Zhiyin and
Bartsch, Sabine",
editor = "Ghosal, Tirthankar and
Grezes, Felix and
Allen, Thomas and
Lockhart, Kelly and
Accomazzi, Alberto and
Blanco-Cuaresma, Sergi",
booktitle = "Proceedings of the Second Workshop on Information Extraction from Scientific Publications",
month = nov,
year = "2023",
address = "Bali, Indonesia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.wiesp-1.8",
doi = "10.18653/v1/2023.wiesp-1.8",
pages = "56--67",
}
This work was also presented in non-archived form at the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS) @ EMNLP 2023; you can read our poster here.
- The compilation service is a dockerized wrapper around the AutoTeX library used by arXiv to automatically compile submissions to arXiv. We modified part of it.
- The code for the compilation service is essentially inherited from texcompile, and this repository was formerly a fork of it.
- pylatexenc 3.0alpha is used to parse and traverse the LaTeX code.
- pdfplumber is used to extract shapes and texts from PDF files.
- de-macro parses and expands simple macro definitions.
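To illustrate what expanding a simple macro definition means, here is a toy sketch that handles only argument-free `\newcommand{\name}{body}` definitions. It is not de-macro's implementation or API; the function names are hypothetical, and cases such as escaped braces or macros with arguments are deliberately left alone.

```python
import re


def read_group(text, start):
    """Return (content, end) for the balanced {...} group that opens at text[start]."""
    assert text[start] == "{"
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
    raise ValueError("unclosed group")


def expand_simple_macros(source):
    r"""Remove argument-free \newcommand definitions and substitute their bodies."""
    macros = {}
    # The lookahead requires the body to follow directly, so definitions
    # with an argument count such as [1] are not matched.
    pattern = re.compile(r"\\newcommand\{(\\[A-Za-z]+)\}(?=\{)")
    pieces, pos = [], 0
    for match in pattern.finditer(source):
        if match.start() < pos:
            continue  # definition nested inside a body we already consumed
        body, end = read_group(source, match.end())
        macros[match.group(1)] = body
        pieces.append(source[pos:match.start()])
        pos = end
    pieces.append(source[pos:])
    text = "".join(pieces)
    for name, body in macros.items():
        # (?![A-Za-z]) keeps \be from matching inside \begin.
        text = re.sub(re.escape(name) + r"(?![A-Za-z])", lambda _: body, text)
    return text


sample = r"""\newcommand{\be}{\begin{equation}}
\newcommand{\ee}{\end{equation}}
\be E = mc^2 \ee"""
print(expand_simple_macros(sample).strip())
```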
- A prettier frontend (like streamlit) to interact with papers and/or to bundle with LLMs.
- Parse `.cls` and `.sty` files.
    - Some environments cannot be parsed; we need to update `pylatexenc`.
- Make our own LaTeX package inheriting from xcolor in CTAN to avoid conflicts.
- Investigate the underlying logic of the coloring order.
- Explore the method of SyncTeX.
- Line based label correction.
- Rainbow colors #1
- Improve parsing rules (from Overleaf and TeX-Workshop):
    - Package command definitions from TeX-Workshop and Overleaf.
        - Adapt `pylatexenc` for cases such as `\pagebreak<blah>` and `\verb|blah|`.
    - Refine the parsing function for cases such as `\newcommand{\be}{\begin{equation}}`. Expanded by `de-macro`.
    - Unclosed open group `{`.
    - Standardize LaTeX code annotation in math formulas, tables, etc. with LaTeXML, because they may include user-defined macros.
    - Parse tabulars (with LaTeXML).
    - Parse math in detail, which requires understanding alignments.
    - Combine `pylatexenc` with `latex-utensils` and `unified-latex`.
    - Learn how LaTeXML parses and expands `\def` and `\if`s.
    - `\newcommand` parsing strategy from TeX-Workshop (using unified-latex), Overleaf (using Lezer), and `pylatexenc`.
- Improve the document structure extraction rules from TeX-Workshop
- Parallelization
- Evaluate annotation
- Documentation and Unit testing
Apache 2.0.