There are way too many arxiv papers, so I wrote a quick webapp that lets you search and sort through the mess in a pretty interface, similar to my pretty conference format.
It's super hacky and was written in 4 hours. I'll probably keep polishing it over time, but it already serves its purpose for me. The code uses the Arxiv API to download the most recent papers (as many as you want; I used the last 1100 papers over the last 3 months), then downloads the pdfs, extracts the text, and creates tfidf vectors for each paper. Lastly, a flask interface lets you search through and filter similar papers using the vectors.
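As a rough illustration of the first stage, here is a minimal sketch of how paged query URLs for the Arxiv API could be built. The endpoint and the `search_query`, `sortBy`, `sortOrder`, `start` and `max_results` parameters belong to the public Arxiv API. The function names, the page size and the choice of sort key are assumptions for this example and may not match what scrape.py actually does.

```python
from urllib.parse import urlencode

# Public Arxiv API endpoint.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(categories, start=0, max_results=100):
    """Return an Arxiv API URL listing recent papers in the given categories."""
    # Builds e.g. "cat:cs.CV OR cat:cs.CL OR cat:cs.LG"
    search_query = " OR ".join("cat:" + c for c in categories)
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",  # assumed sort key: most recently updated first
        "sortOrder": "descending",
        "start": start,               # offset used for paging
        "max_results": max_results,   # page size (assumed value)
    }
    return ARXIV_API + "?" + urlencode(params)


def page_urls(categories, total, page_size=100):
    """Yield one URL per page until `total` results are covered."""
    for start in range(0, total, page_size):
        yield build_query_url(categories, start, min(page_size, total - start))
```

Each URL returns an Atom XML feed, which is the kind of file feedparser is meant to process. For example, `list(page_urls(["cs.CV", "cs.CL", "cs.LG"], total=1100))` gives the pages needed to cover 1100 papers.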
The main functionality is a search feature. The most useful part is that you can click "sort by tfidf similarity to this", which ranks all papers by how similar they are to that one in terms of tfidf bigrams. I find this quite useful.
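To make the "sort by tfidf similarity" idea concrete, here is a small dependency-free sketch. The project itself relies on scikit-learn's tfidf vectorizer, so this is only an illustration of the technique, not the code in analyze.py. The whitespace tokenization, the smoothed idf formula and all function names are assumptions for this example.

```python
import math
from collections import Counter


def ngrams(text):
    """Lowercased unigrams plus bigrams of a whitespace-tokenized text."""
    words = text.lower().split()
    return words + [a + " " + b for a, b in zip(words, words[1:])]


def tfidf_vectors(docs):
    """Map each document to an L2-normalized {term: weight} dict."""
    counts = [Counter(ngrams(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Smoothed idf, so a term found in every document keeps a small weight.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two normalized sparse vectors (their cosine similarity)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def most_similar(vectors, query_index):
    """Indices of all documents, most similar to the query document first."""
    q = vectors[query_index]
    return sorted(range(len(vectors)), key=lambda i: cosine(q, vectors[i]), reverse=True)
```

Calling `most_similar(tfidf_vectors(texts), i)` returns every paper index ordered by similarity to paper `i`, with paper `i` itself first.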
This code is currently running live at www.arxiv-sanity.com/. Right now it's serving 10400 arxiv papers from cs.[CV|CL|LG] over the last ~3 years, and more will be added in time as I build this out.
You will need numpy, feedparser (to process xml files), scikit-learn (for the tfidf vectorizer), flask (for serving the results), and tornado (if you want to run the flask server in production), plus dateutil and scipy. Most of these are easy to get through pip, e.g.:
$ virtualenv env # optional: use virtualenv
$ source env/bin/activate # optional: use virtualenv
$ pip install feedparser # only if you want to scrape arxiv
$ pip install numpy
$ pip install scipy # needed for sparse arrays
$ pip install scikit-learn # tfidf vectorizer
$ pip install python-dateutil # only in serve.py for some date utils
$ pip install flask # only in serve.py
$ pip install tornado # only in serve.py
Requires reading code and getting hands dirty. Magic numbers throughout code.
1. Run scrape.py, which queries the most recent papers in Arxiv and dumps xml into the folder raw
2. Run parse_raw.py, which reads all xml files in raw and creates a pickle with all critical information called db.p
3. Run download_pdf.py, which iterates over all papers in the parsed pickle and downloads the papers into the folder pdf
4. Run parse_pdf_to_text.py to export all text from the pdfs to files in txt
5. Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves a tfidf.p pickle file.
6. Run thumb_pdf.py to export thumbnails of all pdfs to thumb
7. Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers
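The build steps above can be chained with a small driver script. This is only a sketch: the script names and their order come from the list, while `PIPELINE`, `pipeline_commands` and `run_pipeline` are hypothetical helpers that are not part of the repository. It assumes each script runs without arguments from the root folder.

```python
import subprocess
import sys

# Script names and order are taken from the steps listed above.
# serve.py is left out on purpose: it blocks, so start it separately.
PIPELINE = [
    "scrape.py",
    "parse_raw.py",
    "download_pdf.py",
    "parse_pdf_to_text.py",
    "analyze.py",
    "thumb_pdf.py",
]


def pipeline_commands(python=sys.executable or "python"):
    """Return one command (as an argument list) per pipeline step."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(run=subprocess.check_call):
    """Run every step in order; check_call raises on the first failing step."""
    for cmd in pipeline_commands():
        print("running:", " ".join(cmd))
        run(cmd)
```

Using `sys.executable` keeps the steps inside the active virtualenv. Since the scripts contain magic numbers, it is probably still worth reading each one before running the whole chain.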
If you'd like to browse the 10400 arxiv papers currently running in the demo, you can download the prebuilt database. This means you can skip steps 1-6 above and simply run the server (step 7). Here is the download link. Unzip it in the root folder and fire up flask with serve.py.
If you'd like to run this flask server online (e.g. AWS/Terminal), run it as python serve.py --prod.
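For readers curious what a production switch like this can look like, here is one common pattern: a flask (WSGI) app served through tornado when a `--prod` flag is set. How serve.py handles the flag internally is not described here, so treat this as an assumption-laden sketch. `parse_args` and `serve` are hypothetical names, and the port is assumed to stay at 5000.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --prod switches to the tornado server."""
    parser = argparse.ArgumentParser(description="serve the paper browser")
    parser.add_argument("--prod", action="store_true",
                        help="serve through tornado instead of flask's dev server")
    return parser.parse_args(argv)


def serve(app, prod=False, port=5000):
    """Serve a flask (WSGI) app, through tornado when prod is set."""
    if prod:
        # Imported lazily so tornado is only required in production.
        from tornado.httpserver import HTTPServer
        from tornado.ioloop import IOLoop
        from tornado.wsgi import WSGIContainer

        HTTPServer(WSGIContainer(app)).listen(port)
        IOLoop.current().start()  # blocks until the process is stopped
    else:
        app.run(port=port)  # flask's built-in development server
```

A script would call `serve(app, prod=parse_args().prod)`, where `app` is the flask application object. The lazy import mirrors the note in the requirements that tornado is only needed in production.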