DuraScrape

Introduction

DuraScrape is a web crawler that scrapes open-access journal articles from academic journal websites and saves them in a SQL database.

The first (and currently only) journal supported is the Journal of Neurophysiology.

Usage

The code below assumes that you have set up a PostgreSQL server and created a database called findings_db. You can use a different database name; just make sure you edit the name in the database.ini file to match.

from journal_scrape import JNeurophys
from dbclass import Database

# initialize the database
db = Database()

# create tables
# this creates two tables:
# * body (for full text)
# * metadata (for metadata and citation linking)
db.create_tables()

# create JNeurophys object
jn = JNeurophys()

# tell it to crawl the journal
jn.crawl_journal()

As the scraping progresses, the crawl_journal() method prints the URL of each article and reports whether or not it was successfully saved to the database.
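The database.ini file mentioned above holds the connection settings, including the database name. How dbclass.py actually reads it is not shown here, so the following is only a minimal sketch of one common approach, using Python's standard configparser module. The helper name read_db_config, the section name postgresql, and the key names in the comment are assumptions for illustration, not taken from the repository.

```python
from configparser import ConfigParser

# Hypothetical layout of database.ini (section and key names are assumptions):
#
#   [postgresql]
#   host = localhost
#   database = findings_db
#   user = postgres
#   password = secret


def read_db_config(path="database.ini", section="postgresql"):
    """Return the settings in `section` of an INI file as a dict of strings."""
    parser = ConfigParser()
    # ConfigParser.read() silently skips files it cannot open and returns
    # the list of files it did parse, so an empty list means failure.
    if not parser.read(path):
        raise FileNotFoundError(f"could not read {path}")
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] not found in {path}")
    return dict(parser.items(section))
```

If you rename the database, a dictionary like the one returned here is where the new name would show up, under whichever key the real file uses for it.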
