NDNP Batch Ingest Guide

Eben English edited this page Oct 15, 2019 · 5 revisions

NewspaperWorks provides a command-line rake task for batch ingest of digitized newspapers that conform to the National Digital Newspaper Program (NDNP) digitization specs.

How to run it

To invoke the rake task, run the following command from the root directory of your application:

$ rake newspaper_works:ingest_ndnp -- --path=/path/to/your/ndnp/batch

In addition to path, the rake task also accepts arguments for admin_set, depositor, and visibility, as in:

$ rake newspaper_works:ingest_ndnp -- --path=/path/to/your/ndnp/batch --admin_set=admin_set/default --depositor=DEPOSITOR_EMAIL --visibility=open

What it does

When run, the rake task will:

  1. Create a NewspaperTitle object for each publication in the batch
  2. Create a NewspaperContainer object for each reel
  3. Iterate over the directories in the batch, creating NewspaperIssue and NewspaperPage objects for each issue and page
  4. Attach existing page-level derivatives (ALTO, PDF, etc.) to the NewspaperPage objects
  5. Index OCR text to Solr for full-text searching
  6. Create a word-coordinate JSON derivative file to facilitate page-image search hit highlighting
  7. Compile an issue-level PDF from the page files and attach it as the primary file of each NewspaperIssue object
  8. Add metadata to the created objects from the corresponding XML manifest files in the batch. (See mapping.)

Notes:

  • If a NewspaperTitle object with the LCCN in the batch already exists, objects will be associated with the existing NewspaperTitle.
  • If no admin_set is specified, the default AdminSet (admin_set/default) will be used.
  • If no depositor is specified, objects will have a depositor value of User.batch_user.user_key by default.
  • If no visibility is specified, objects will have a visibility value of open by default.
  • A log file of the batch process will be output to your application's log/ingest.log.
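
The defaults listed above can be made explicit when scripting the ingest. The following is a minimal sketch, assuming bash; the function name ndnp_ingest_cmd is hypothetical and not part of NewspaperWorks. It only prints the command line, filling in the documented defaults for admin_set and visibility. The depositor flag is left out because its default (User.batch_user.user_key) is resolved inside the application.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NewspaperWorks): prints the rake
# command with the documented defaults filled in when arguments are omitted.
#   $1 = batch path (required)
#   $2 = admin_set  (optional, defaults to admin_set/default)
#   $3 = visibility (optional, defaults to open)
ndnp_ingest_cmd() {
  local path="$1"
  local admin_set="${2:-admin_set/default}"
  local visibility="${3:-open}"
  printf 'rake newspaper_works:ingest_ndnp -- --path=%s --admin_set=%s --visibility=%s\n' \
    "$path" "$admin_set" "$visibility"
}
```

Calling ndnp_ingest_cmd /path/to/your/ndnp/batch prints the same command that the rake task would effectively run with no optional arguments. While an ingest is running, its progress can be followed with tail -f log/ingest.log from the application directory.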

Prerequisites

The ingest script makes the following assumptions:

  1. You have a set of files organized according to the NDNP batch file and directory structure specs.
  2. In the directory specified in the path argument, there is a batch.xml file that provides a listing of issues in the batch.

For examples of NDNP batches, see http://chroniclingamerica.loc.gov/data/batches/ or newspaper_works_fixtures.
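
Because the ingest assumes a batch.xml file in the batch directory, it can be worth checking for it before starting a long run. The following is a minimal sketch, assuming bash; the function name check_ndnp_batch is hypothetical and not part of NewspaperWorks, and it only checks that the file exists, not that its contents are valid.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of NewspaperWorks).
# Returns 0 only if the given directory exists and contains batch.xml.
check_ndnp_batch() {
  local batch_dir="$1"
  if [ ! -d "$batch_dir" ]; then
    echo "Not a directory: $batch_dir" >&2
    return 1
  fi
  if [ ! -f "$batch_dir/batch.xml" ]; then
    echo "No batch.xml found in: $batch_dir" >&2
    return 1
  fi
  return 0
}
```

One way to use it is to chain it in front of the rake task, so that the ingest only starts when the check passes:

$ check_ndnp_batch /path/to/your/ndnp/batch && rake newspaper_works:ingest_ndnp -- --path=/path/to/your/ndnp/batch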