This repository has been archived by the owner on May 3, 2023. It is now read-only.

notch8/derivative-rodeo


Derivative::Rodeo

Welcome to the rodeo! The goal of Derivative::Rodeo is to provide interfaces and processing for files.

The fully public-facing methods of Derivative::Rodeo are module methods on the Derivative::Rodeo module. An associated Derivative::Rodeo spec file covers those methods and is intended to be the place for "feature specs."

Overview

The conceptual logic of Derivative::Rodeo is:

  • Use the file I have locally…
  • Else pull to local the file from a remote source…
  • Else generate a local version…
  • Demand a local copy of the file and proceed to the next step.

The above is encoded in Derivative::Rodeo::Process.
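
A minimal sketch of that fall-through logic, assuming storage can be modeled as plain Hashes. Every name here (`process_derivative`, `LocalDemandError`, the `generator` lambda) is hypothetical and is not taken from Derivative::Rodeo::Process itself:

```ruby
# Hypothetical sketch; none of these names come from the Derivative::Rodeo codebase.
class LocalDemandError < StandardError; end

# Storage is modeled as a Hash of derivative name => file contents.
def process_derivative(name, local:, remote:, generator:)
  unless local.key?(name)
    if remote.key?(name)
      local[name] = remote[name]        # pull to local
    else
      generated = generator.call(name)  # generate local (nil means "could not generate")
      local[name] = generated unless generated.nil?
    end
  end

  # Demand a local copy before proceeding to the next step.
  raise LocalDemandError, "expected a local copy of #{name.inspect}" unless local.key?(name)

  local[name]
end

local = { original: "pdf-bytes" }
remote = { hocr: "<html/>" }
generator = ->(name) { name == :thumbnail ? "thumb-bytes" : nil }

process_derivative(:original, local: local, remote: remote, generator: generator)  # already local
process_derivative(:hocr, local: local, remote: remote, generator: generator)      # pulled from remote
process_derivative(:thumbnail, local: local, remote: remote, generator: generator) # generated
```

Asking for a derivative that is neither local, remote, nor generatable raises `LocalDemandError` in this sketch, which mirrors the "demand a local copy" step.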

We start from a Derivative::Rodeo::Manifest::PreProcess, which consists of:

  • a parent identifier
  • an original filename
  • a set of named derivatives; each named derivative might have a path to a "known" already existing file.

We process the original manifest in an Arena. During processing we might spawn multiple "child" processes from one derivative; for example, when splitting a PDF into one image per page. Each of those page images would then have its own Derivative::Rodeo::Manifest::Derived for further processing.
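
To make the shape of those manifests concrete, here is a hedged sketch using plain Structs. The struct names, field names, and the `spawn_page_manifests` helper are assumptions for illustration; the real classes under Derivative::Rodeo::Manifest may look quite different:

```ruby
# Hypothetical data shapes; not the gem's actual manifest classes.
PreProcessManifest = Struct.new(:parent_identifier, :original_filename, :derivatives, keyword_init: true)
DerivedManifest = Struct.new(:parent_identifier, :original_filename, :derived_filename, :derivatives, keyword_init: true)

# Spawn one derived manifest per page image produced by splitting a PDF.
def spawn_page_manifests(manifest, page_paths, derivatives:)
  page_paths.map do |path|
    DerivedManifest.new(
      parent_identifier: manifest.parent_identifier,
      original_filename: manifest.original_filename,
      derived_filename: File.basename(path),
      derivatives: derivatives
    )
  end
end

manifest = PreProcessManifest.new(
  parent_identifier: "work-123",
  original_filename: "newspaper.pdf",
  # A named derivative maps to a known existing path, or nil when it must be generated.
  derivatives: { hocr: nil, thumbnail: "/tmp/newspaper.thumbnail.png" }
)

pages = spawn_page_manifests(
  manifest,
  ["/tmp/pages/page-1.tiff", "/tmp/pages/page-2.tiff"],
  derivatives: { hocr: nil }
)
```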

Diagrams

Conceptual Diagram

“This ain’t my first rodeo.” (American slang for “I’m prepared for what comes next.”)

The Derivative::Rodeo orchestrates moving data from place to place, ensuring that the requisite files exist at each stage.


The PlantUML Text for the Conceptual Diagram
@startuml
!theme amiga

component "Pre-Process Arena" {
	() "Local" as pre_process_local
	() "Remote" as pre_process_remote
	control Processor as pre_processor
	pre_processor -- pre_process_local
	pre_processor -- pre_process_remote
}

cloud "Original Storage" as original_storage

cloud "Processing Storage" as processing_storage


component "Ingest Arena" {
	() "Remote" as ingest_remote
	() "Local" as ingest_local
	control Processor as ingest_processor
	ingest_processor -- ingest_remote
	ingest_processor -- ingest_local
}

folder "Ingest\nFile\nSystem" as ingest_storage

original_storage --> pre_process_remote
pre_process_local --> processing_storage
processing_storage --> ingest_remote
ingest_local --> ingest_storage

@enduml

Process Diagram

This is the logical flow chart of the Derivative::Rodeo::Process; it demonstrates the low-level processing task of a single derivative.


The PlantUML Text for the Process Diagram
@startuml
!theme amiga

start

if (derivative local?) then (yes)

elseif (derivative remote?) then (yes)
	:pull to local;
else
	:generate local;
endif

if (demand local exists?) then (yes)


else (no)
	:raise exception;
	stop
endif
:enqueue next;
@enduml

Interaction with SpaceStone

SpaceStone is an AWS Lambda ecosystem that SoftServ has used in the preliminary work of pre-processing derivatives in a specific use-case. The following diagram shows the conceptual interaction of the Derivative::Rodeo and SpaceStone.


The PlantUML Text for the Interaction with SpaceStone
@startuml
!theme amiga

actor Instigator as instigator

queue "AWS::SQS" as sqs

package SpaceStone {
	control Invoker as invoker
}

package "Derivative::Rodeo" as dr {
	control Process as process
}

instigator -right-> invoker : upload CSV\nof manifests
sqs -right-> invoker : pull message
invoker -right-> process : send message
process --> sqs : put message
@enduml
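
The message loop in that diagram can be sketched with in-memory stand-ins. This is only an illustration: the Array replaces AWS SQS, the lambda replaces the process, and the `chain` of step names is invented for the example:

```ruby
# Hypothetical in-memory stand-ins; the real system uses AWS SQS and Lambda.
queue = [] # stands in for the SQS queue
chain = %i[original mime_type hocr].freeze # invented step names

# Stand-in for the process: handle one step, return the next message (or nil).
process = lambda do |message|
  next_step = chain[chain.index(message[:step]) + 1]
  next_step && message.merge(step: next_step)
end

# Stand-in for the invoker: pull a message, send it to the process, and
# let the process put the follow-up message back on the queue.
handled = []
queue.push({ manifest: "work-123/newspaper.pdf", step: :original })
until queue.empty?
  message = queue.shift                    # pull message
  handled << message[:step]
  next_message = process.call(message)     # send message
  queue.push(next_message) if next_message # put message
end
```

The loop ends once the last step in the chain returns no follow-up message.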

Interaction with Hyrax Ingest

Hyrax exposes the concept of the Hyrax::DerivativeService; a configurable end-point. Hyrax has a default service Hyrax::FileSetDerivativesService that assumes it will create all derivatives and then assign them to the FileSet.

In the NewspaperWorks gem and IIIF Print gem, the Samvera community introduced different derivative services, in part to expand on the default functionality.

One challenge of these implementations is that they assume that the ingest process simultaneously creates the derivative and assigns the derivative.

The Newman Numismatic Portal introduced the idea of pre-processing the derivatives and splicing into the processes to circumvent some of the derivative generation.

With all of that in mind, here's the diagram for the Interaction with Hyrax Ingest.


The PlantUML Text for the Interaction with Hyrax Ingest
  @startuml
  !theme amiga
  !pragma useVerticalIf on
  start
  :Hyrax::DerivativeService;
  if (Derivative::Rodeo::DerivativeService.valid?) then (yes)
	  :read_from_rodeo;
	  :write_to_fedora;
	  stop
  elseif (Hyrax::FileSetDerivativesService.valid?) then (yes)
	  :generate_derivative;
	  :write_to_fedora;
	  stop
  else (no)
	  stop
  endif
  @enduml
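
The fall-through in that flow can be sketched as "use the first service that reports itself valid." The `Service` struct and `derivative_steps` helper below are hypothetical stand-ins, not the Hyrax or Derivative::Rodeo classes:

```ruby
# Hypothetical stand-in for a derivative service; not the Hyrax API.
Service = Struct.new(:name, :valid, :steps) do
  def valid?
    valid
  end
end

# Pick the first valid service and return its steps; do nothing if none is valid.
def derivative_steps(services)
  service = services.find(&:valid?)
  service ? service.steps : []
end

rodeo = Service.new("rodeo", true, %i[read_from_rodeo write_to_fedora])
default = Service.new("default", true, %i[generate_derivative write_to_fedora])

derivative_steps([rodeo, default])
```

Order matters in this sketch: the rodeo-backed service is listed first so that pre-processed derivatives are preferred over regenerating them.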

Deeper Dive

Inflection Points

There are inflection points that the Derivative::Rodeo considers:

  1. Spawning processes based on the MimeType step
  2. Spawning processes to split a PDF

These inflection points start a new Chain of processing. Because we're jumping from one processing concept to another, the step might not create an associated derivative file. However, we need to verify that the step completed.

The verification is done via Derivative::Rodeo::Arena#local_demand_path_for!, which delegates to the Derivative::Rodeo::Step::Base. In other words, the step that spawns a new chain has the opportunity to say whether things are in order. Is it perfect? No. But it's what we have, and we can improve on it from there.
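
One way to picture that delegation, as a hedged sketch: a file-producing step verifies that its file exists, while a spawning step verifies what it spawned. The class names, the `demand!` method, and the `local_demand_for!` helper are all invented here and do not reflect the gem's actual signatures:

```ruby
# Hypothetical sketch of delegating verification to the step.
class StepIncompleteError < StandardError; end

# A typical step is complete when its derivative file exists locally.
class FileStep
  def initialize(path)
    @path = path
  end

  def demand!
    raise StepIncompleteError, "missing #{@path}" unless File.exist?(@path)

    @path
  end
end

# A spawning step creates no file of its own, so it checks its spawned work instead.
class SplitPdfStep
  def initialize(expected_pages:, spawned:)
    @expected_pages = expected_pages
    @spawned = spawned
  end

  def demand!
    unless @spawned.size == @expected_pages
      raise StepIncompleteError, "spawned #{@spawned.size} of #{@expected_pages} pages"
    end

    :spawned
  end
end

# Stand-in for the arena: it does not decide, it asks the step.
def local_demand_for!(step)
  step.demand!
end

local_demand_for!(SplitPdfStep.new(expected_pages: 2, spawned: %w[page-1 page-2]))
```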

Configuration

There are two conceptual configuration points:

Let’s consider the following.

For one project I need to have two rodeos. The first rodeo is for pre-processing. The second rodeo is for ingesting the pre-processed files (see the Conceptual Diagram section). The storage and queue adapters will be different. For example, the pre-process local storage will likely be the ingest process’s remote storage. Both rodeos will likely have the same required steps for processing.

For another project, I will again need two rodeos. But I want different processing steps; for example I want to add steps to process a 3D model. I might configure the mime type step to sniff out the files that go into a 3D model and then spawn a new step.

For a third project, I again need two rodeos, but then I want to use a different process to determine the file’s mime type; perhaps instead of leveraging the Marcel gem, I leverage Fits and some XML parsing.

In other words, there are some assumptive configurations about a given rodeo:

  • What’s my logging
  • What’s my starting step
  • What’s my queue adapter
  • What are my storage adapters

And there are other assumptions based on those decisions. For an AWS SQS Queue Adapter we will likely need region information and even some low-level credentials that might go in ENV. For another cloud adapter those rules could be different.
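
As an illustration of the two-rodeo scenario, here is a sketch in which each rodeo carries its own configuration object. `RodeoConfig`, its fields, and the symbol values are assumptions made for this example; the gem's real configuration interface may differ:

```ruby
require "logger"

# Hypothetical per-rodeo configuration; not the gem's actual configuration API.
RodeoConfig = Struct.new(
  :logger, :starting_step, :queue_adapter, :adapter_options,
  :local_storage, :remote_storage,
  keyword_init: true
)

pre_process = RodeoConfig.new(
  logger: Logger.new($stdout),
  starting_step: :original,
  queue_adapter: :aws_sqs,
  # Adapter-specific settings; the region fallback is an arbitrary example value.
  adapter_options: { region: ENV.fetch("AWS_REGION", "us-east-1") },
  local_storage: :processing_storage,
  remote_storage: :original_storage
)

ingest = RodeoConfig.new(
  logger: Logger.new($stdout),
  starting_step: pre_process.starting_step, # same required steps for both rodeos
  queue_adapter: :inline,
  adapter_options: {},
  local_storage: :ingest_file_system,
  # The pre-process rodeo's local storage is the ingest rodeo's remote storage.
  remote_storage: pre_process.local_storage
)
```

Keeping the two configurations as separate objects, rather than one global setting, is what lets a single project run both rodeos side by side.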

Or perhaps we know we’re always working with monochrome images. In that case it’s unlikely we’d want to use the existing Hocr step as written; we can assume that we already have monochrome.

As I hope is evident, the Derivative::Rodeo is intended to provide a consistent interface for moving files and ensuring that the required and desired derivatives are part of that move. It is also intended to be something that we can incorporate into many projects with minimal customization of those projects, relying instead on configuration and building towards interfaces.

Note: The above describes an ideal state and there are identified chores to migrate configuration points to the more appropriate locations.

Note on Development Status

This is in active development and we're exploring the names and concepts as we build towards the technical requirements of several different projects. What does that mean? Look to the Derivative::Rodeo require section that has a large banner. Those are the stable named concepts. Below that level, things are somewhat in flux, in particular the Derivative::Rodeo::Manifest module.

Design Goals

Derivative::Rodeo is designed so that it can run within an application or as part of a distributed architecture (e.g. AWS Lambdas). Further, it is designed for extension and configuration through well-documented interfaces and modular boundaries.

It is also designed to provide insight into configuration and failures through custom exceptions and logging. It has a fail-early mindset: it first verifies that the desired derivatives don't create circular dependencies, then flattens those dependencies into a chain that we process one link at a time via Derivative::Rodeo::Process.
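
The "verify no circular dependencies, then flatten into a chain" idea can be sketched with Ruby's standard `TSort` module. This is not necessarily how the gem implements it; the `DerivativeDependencies` class and the derivative names are assumptions for illustration:

```ruby
require "tsort"

# Hypothetical sketch: maps each derivative name to its prerequisite names.
class DerivativeDependencies
  include TSort

  def initialize(prerequisites)
    @prerequisites = prerequisites
  end

  # TSort hook: visit every derivative.
  def tsort_each_node(&block)
    @prerequisites.each_key(&block)
  end

  # TSort hook: visit the prerequisites of one derivative.
  def tsort_each_child(node, &block)
    @prerequisites.fetch(node).each(&block)
  end

  # Flattened chain with prerequisites first; raises TSort::Cyclic on a cycle.
  def chain
    tsort
  end
end

dependencies = DerivativeDependencies.new({
  original: [],
  mime_type: [:original],
  monochrome: %i[original mime_type],
  hocr: [:monochrome]
})

chain = dependencies.chain
```

Because `tsort` raises `TSort::Cyclic` before any processing starts, a misconfigured set of derivatives fails early instead of partway through a run.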

Last, the test suite covers a significant portion of the code, exercising both unit tests and functional tests that can run on a developer's machine to help ensure the desired behavior.

Installation

Install the gem and add to the application's Gemfile by executing:

$ bundle add derivative-rodeo

If bundler is not being used to manage dependencies, install the gem by executing:

$ gem install derivative-rodeo

Dependencies

The list of dependencies is not reflective of the current state.

Usage

TODO: Write usage instructions here

Development

After checking out the repository, run bin/setup to install dependencies. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.

Tasks

  • Storage Adapters
    • Flesh out the FromManifest adapter for remote files
    • Add an AWS S3 Adapter; remembering that it could be used as either remote or local
  • Queue Adapters
  • Step work
    • Does it make sense to include fits? We’re gathering technical metadata for processing and eventual storage.
    • Video
    • Alto
    • Audio
    • Thumbnail
    • Text Extraction (Hydra Derivatives leverages SOLR’s text extraction; there’s pdftext to consider)
    • Tidy up the base derivative type; there are some more expressive methods I could adopt to reduce duplication (and introduction of errors).
    • What else?
  • Manifest; I have refactored towards specific manifests and need to revisit existing manifests
    • Create methods for the prerequisites
    • Demand the prerequisites as part of the generate
  • Work on PDF Splitting
    • In conversations with @orangewolf, we may want to OCR in batches instead of one file at a time
  • Integrate Derivative::Rodeo into IIIF Print.
    • Assign “local” file to Fedora S3 location
  • Process: At present the pre-process does not do anything with the locally demanded derivative

Derivative::Rodeo is positioned to be an alternative to Hydra::Derivatives.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-softserv/derivative-rodeo.