LaTeX Rainbow

This service compiles LaTeX code into PDFs and extracts detailed layout and reading-order information. It is designed specifically for academic publications. The tool not only compiles LaTeX but also annotates each token and figure, retrieves their positions in the PDF, identifies the corresponding semantic structure labels, and marks the correct reading order.

Update

[2024-06-13]:

Refactored implementation of multiprocessing.
Refactored the calls to LaTeXML.
Added guessing of the label and reading order of words that could not be colored, based on line numbers.
Aligned PyMuPDF and pdfplumber output based on position. We now recognize font flags such as bold and italic.
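
The position-based alignment mentioned above can be pictured with a small, library-free sketch. It is not the repository's implementation: the dictionaries, the function names and the 0.5 threshold are all hypothetical, and both inputs are assumed to use the same page coordinate system. The only external fact relied on is PyMuPDF's documented meaning of the span flags bits (2 for italic, 16 for bold).

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def coverage(word: Box, span: Box) -> float:
    """Fraction of the word box that lies inside the span box."""
    ix0, iy0 = max(word[0], span[0]), max(word[1], span[1])
    ix1, iy1 = min(word[2], span[2]), min(word[3], span[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (word[2] - word[0]) * (word[3] - word[1])
    return inter / area if area > 0 else 0.0


def decode_flags(flags: int) -> List[str]:
    """Decode the italic and bold bits of a PyMuPDF span flags integer."""
    names = []
    if flags & 2:
        names.append("italic")
    if flags & 16:
        names.append("bold")
    return names


def align(words: List[Dict], spans: List[Dict], min_cov: float = 0.5) -> List[Dict]:
    """Attach font flags to each word from the span that covers it best.

    words: hypothetical pdfplumber-style records {"text", "box"}.
    spans: hypothetical PyMuPDF-style records {"box", "flags"}.
    Words without a sufficiently overlapping span get an empty flag list.
    """
    out = []
    for w in words:
        best = max(spans, key=lambda s: coverage(w["box"], s["box"]), default=None)
        matched = best is not None and coverage(w["box"], best["box"]) >= min_cov
        out.append({**w, "flags": decode_flags(best["flags"]) if matched else []})
    return out


words = [
    {"text": "Hello", "box": (10, 10, 40, 20)},
    {"text": "world", "box": (45, 10, 80, 20)},
    {"text": "stray", "box": (200, 200, 230, 210)},
]
spans = [
    {"box": (9, 9, 41, 21), "flags": 16},
    {"box": (44, 9, 81, 21), "flags": 2},
]
aligned = align(words, spans)
```

Coverage of the word box is used rather than intersection-over-union because one span may contain several words, which would make the union large and the score misleadingly low.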

[2024-03-26]:

Added de-macro preprocessing, which cleans up LaTeX code by expanding some of the simple custom macros defined by \newcommand and by inlining the files referenced by \input into the main file. This should increase the success rate of parsing.
Relaxed the conservative parsing strategy for LatexGroupNode. We previously did not color-code these nodes because we were concerned that doing so would break the arguments of unknown macros. We now include any LatexGroupNode longer than 20 characters in the color annotation.
Added a black-font PDF output. Color markup still changes the layout of the original paper to a greater or lesser extent, so to keep the annotation looking the same as the original document, we now also compile the "original document" in black font.
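
To make the idea of macro expansion concrete, here is a minimal sketch in Python. It is not how the de-macro tool works internally; it only handles the simplest case (a \newcommand with no arguments and no braces in its body), and the function name is hypothetical.

```python
import re

# Only the simplest form is matched: \newcommand{\name}{body}, with no
# arguments and no nested braces in the body. Everything else is left alone.
SIMPLE_NEWCOMMAND = re.compile(r"\\newcommand\{\\([A-Za-z]+)\}\{([^{}]*)\}")


def expand_simple_macros(tex: str) -> str:
    """Remove simple \\newcommand definitions and expand their uses in place."""
    macros = {m.group(1): m.group(2) for m in SIMPLE_NEWCOMMAND.finditer(tex)}
    tex = SIMPLE_NEWCOMMAND.sub("", tex)
    for name, body in macros.items():
        # The lookahead keeps \R from matching the start of \Real.
        use = re.compile(r"\\" + name + r"(?![A-Za-z])")
        # A function replacement avoids backslash-escape processing of the body.
        tex = use.sub(lambda _m, b=body: b, tex)
    return tex


source = "\\newcommand{\\R}{\\mathbb R}\n$x \\in \\R$ and \\Real"
expanded = expand_simple_macros(source)
```

Macros that take arguments, or whose bodies contain braces, are deliberately not matched by this sketch and pass through unchanged.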

[2024-04-08]:

LaTeXML is now installed in the AutoTeX container and is used to standardize the code in the tex column of the annotation.
Applied standardized annotation to equation and table annotation.
Refined annotation strategy for algorithm environments.

Main Purpose:

LaTeX Compilation: Compile LaTeX into PDF in a dockerized environment based on TeX Live 2023.
LaTeX Annotation: Add color labels to each token and figure in the LaTeX code to facilitate automatic extraction of the document layout.
Data Extraction: Extract fine-grained information about every token and figure, such as its type, position, and corresponding section in the compiled PDF document, and output it as a pandas DataFrame.
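
One way to picture the color-label idea is to give every token its own RGB color, so that a token found in the PDF can be traced back to its position in the source. The sketch below is an illustration under assumptions, not the scheme used in this repository: the index-to-color mapping, the function names and the choice to reserve index 0 for black are all hypothetical. The `\color[RGB]{r,g,b}` syntax is that of the xcolor package.

```python
from typing import Iterable, Tuple


def index_to_rgb(i: int) -> Tuple[int, int, int]:
    """Map a token index to a unique 24-bit RGB color."""
    if not 0 <= i < 256 ** 3:
        raise ValueError("index does not fit into 24 bits")
    return (i >> 16) & 255, (i >> 8) & 255, i & 255


def rgb_to_index(rgb: Tuple[int, int, int]) -> int:
    """Inverse of index_to_rgb: recover the token index from a color."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b


def colorize(tokens: Iterable[str]) -> str:
    """Wrap each token in its own xcolor group.

    Indices start at 1 so that plain black (0, 0, 0) stays free for
    text that was not annotated.
    """
    parts = []
    for i, tok in enumerate(tokens, start=1):
        r, g, b = index_to_rgb(i)
        parts.append("{\\color[RGB]{%d,%d,%d}%s}" % (r, g, b, tok))
    return " ".join(parts)


snippet = colorize(["Hello", "world"])
```

Reading the fill color of each word in the compiled PDF and passing it to `rgb_to_index` would then give back the token's index in the source.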

Prerequisites:

Docker
Python 3.8+

Usage:

git clone https://github.com/InsightsNet/texannotate.git
cd texannotate
pip install -r requirements.txt

For example, suppose you wish to annotate the LaTeX project for arXiv paper 1601.00978. First, fetch the sources for the project, then run the annotator:

mkdir downloaded
wget -O downloaded/1601.00978.tar.gz https://arxiv.org/e-print/1601.00978 --user-agent "Name <email>"
python main.py
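
If you prefer to fetch the sources from Python rather than with wget, the download step can be sketched with the standard library alone. The helper names below are hypothetical and not part of this repository; the URL pattern, the `downloaded` folder and the user-agent placeholder mirror the commands above. The sketch assumes new-style arXiv identifiers without a slash.

```python
from pathlib import Path
from urllib.request import Request, urlopen

ARXIV_EPRINT = "https://arxiv.org/e-print/"


def build_request(arxiv_id: str, user_agent: str) -> Request:
    """Prepare the request for an arXiv source package with an explicit user agent."""
    return Request(ARXIV_EPRINT + arxiv_id, headers={"User-Agent": user_agent})


def target_path(arxiv_id: str, out_dir: str = "downloaded") -> Path:
    """Where the package is stored, matching the wget -O path above."""
    return Path(out_dir) / f"{arxiv_id}.tar.gz"


def fetch_source(arxiv_id: str, user_agent: str, out_dir: str = "downloaded") -> Path:
    """Download the source package and return the path it was written to."""
    dest = target_path(arxiv_id, out_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(build_request(arxiv_id, user_agent)) as resp:
        dest.write_bytes(resp.read())
    return dest


# Usage (performs a network request):
# fetch_source("1601.00978", "Name <email>")
```

Replace the user-agent placeholder with your own name and contact address before calling `fetch_source`.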

Output Format

The tool outputs two pandas DataFrames for each input LaTeX source package. The Figures and Tokens DataFrame, described below, has 16 columns.

Figures and Tokens

reading_order	label	block_id	section_id	text	page_no	x0	y0	x1	y1	font	font_size	flags	tex	page_size	line_no
int	str	int	int	str	int	float	float	float	float	str	float	list	str	list	int

Each row is a figure or token extracted from the PDF. The integer reading_order starts from 0 and follows the author's writing order. If it is -1, the token is not content written by the author (e.g., watermarks and headers). label is the semantic structure label, which is one of: Abstract, Author, Caption, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title.
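
As a rough illustration of how this DataFrame might be consumed, the sketch below drops rows that are not author content, restores the reading order, and joins the text per label. The toy data and the helper names are made up for the example, and only a few of the columns listed above are included.

```python
import pandas as pd


def author_tokens(df: pd.DataFrame) -> pd.DataFrame:
    """Keep author-written rows (reading_order >= 0), sorted by reading order."""
    kept = df[df["reading_order"] >= 0]
    return kept.sort_values("reading_order").reset_index(drop=True)


def text_by_label(df: pd.DataFrame) -> dict:
    """Join the token text of each label in reading order."""
    ordered = author_tokens(df)
    return ordered.groupby("label", sort=False)["text"].agg(" ".join).to_dict()


# Toy stand-in for the Figures and Tokens output; the values are invented.
toy = pd.DataFrame(
    {
        "reading_order": [2, 0, -1, 1],
        "label": ["Paragraph", "Title", "Footer", "Paragraph"],
        "text": ["world", "Rainbow", "arXiv:1601.00978", "Hello"],
        "page_no": [1, 1, 1, 1],
    }
)

ordered = author_tokens(toy)
grouped = text_by_label(toy)
```

Filtering on `reading_order >= 0` relies on the convention described above that -1 marks content not written by the author.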

See example about the annotation of one paper.

Here's another example summarizing the details of the paper with an LLM.

Citation

This work was presented at The 2nd Workshop on Information Extraction from Scientific Publications (WIESP) @ IJCNLP-AACL 2023.

@inproceedings{duan-etal-2023-latex,
    title = "{L}a{T}e{X} Rainbow: Universal {L}a{T}e{X} to {PDF} Document Semantic {\&} Layout Annotation Framework",
    author = "Duan, Changxu  and
      Tan, Zhiyin  and
      Bartsch, Sabine",
    editor = "Ghosal, Tirthankar  and
      Grezes, Felix  and
      Allen, Thomas  and
      Lockhart, Kelly  and
      Accomazzi, Alberto  and
      Blanco-Cuaresma, Sergi",
    booktitle = "Proceedings of the Second Workshop on Information Extraction from Scientific Publications",
    month = nov,
    year = "2023",
    address = "Bali, Indonesia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.wiesp-1.8",
    doi = "10.18653/v1/2023.wiesp-1.8",
    pages = "56--67",
}

This work was also presented in non-archived form at the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS) @ EMNLP 2023; you can read our poster here.

Acknowledgments

The compilation service is a dockerized wrapper around the AutoTeX library used by arXiv to automatically compile submissions to arXiv. We modified part of it.
The code for the compilation service is essentially inherited from texcompile, and this repository was formerly a fork of it.
pylatexenc 3.0alpha is used to parse and traverse the LaTeX code.
pdfplumber is used to extract shapes and texts from PDF files.
de-macro parses and expands simple macro definitions.

TODO:

License

Apache 2.0.
