Derivative::Rodeo

Welcome to the rodeo! The goal of Derivative::Rodeo is to provide interfaces and processing for files.

The public-facing methods of Derivative::Rodeo are module methods on the Derivative::Rodeo module. An associated Derivative::Rodeo spec file covers those methods and is intended to be the place for "feature specs."

Overview

The conceptual logic of Derivative::Rodeo is:

Use the file I have locally…
Else pull to local the file from a remote source…
Else generate a local version…
Demand a local copy of the file and proceed to the next step.

The above is encoded in Derivative::Rodeo::Process.
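The four steps above can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: `ProcessSketch`, its `Storage` class, and the `generator` callable are hypothetical names introduced only for this example.

```ruby
# A hypothetical, in-memory sketch of the local / remote / generate / demand flow.
# None of these names come from the Derivative::Rodeo gem.
module ProcessSketch
  class MissingLocalFile < StandardError; end

  # Stand-in for a storage location, backed by a Hash of name => content.
  class Storage
    def initialize(files = {})
      @files = files
    end

    def exist?(name)
      @files.key?(name)
    end

    def read(name)
      @files.fetch(name)
    end

    def write(name, content)
      @files[name] = content
    end
  end

  # Returns the local content for `name`, trying each strategy in order.
  def self.call(name, local:, remote:, generator: nil)
    if local.exist?(name)
      # 1. Use the file I have locally.
    elsif remote.exist?(name)
      # 2. Else pull the file from the remote source to local.
      local.write(name, remote.read(name))
    elsif generator
      # 3. Else generate a local version.
      local.write(name, generator.call(name))
    end

    # 4. Demand a local copy before proceeding to the next step.
    raise MissingLocalFile, "no local copy of #{name}" unless local.exist?(name)

    local.read(name)
  end
end

local = ProcessSketch::Storage.new
remote = ProcessSketch::Storage.new("page-1.tiff" => "remote bytes")
ProcessSketch.call("page-1.tiff", local: local, remote: remote) # => "remote bytes"
```

The final `raise` is the important part of the sketch: whichever branch ran, the process only continues when a local copy actually exists.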

We start from a Derivative::Rodeo::Manifest::PreProcess, which is composed of:

a parent identifier
an original filename
a set of named derivatives; each named derivative might have a path to a "known", already existing file.

We process the original manifest in an Arena. During processing we might spawn multiple "child" processes from one derivative, for example when splitting a PDF into one image per page. Each of those page images would then have its own Derivative::Rodeo::Manifest::Derived for further processing.
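As a rough illustration of the data described above, the two manifests could be modelled as simple structs. The attribute names follow the list above; everything else (`ManifestSketch`, `derived_filename`, `split_pages`, the page filename pattern) is an assumption made for this sketch, and the gem's real Manifest classes may look quite different.

```ruby
# Hypothetical shapes for the two manifests; not the gem's actual classes.
module ManifestSketch
  # A parent identifier, an original filename, and named derivatives.
  # `derivatives` maps each name to a known path, or nil when none is known.
  PreProcess = Struct.new(:parent_identifier, :original_filename, :derivatives,
                          keyword_init: true)

  # A child manifest spawned from one derivative of the original.
  Derived = Struct.new(:parent_identifier, :original_filename, :derived_filename,
                       :derivatives, keyword_init: true)

  # Spawn one Derived manifest per page, as when splitting a PDF.
  def self.split_pages(manifest, page_count:, derivatives:)
    (1..page_count).map do |page|
      Derived.new(
        parent_identifier: manifest.parent_identifier,
        original_filename: manifest.original_filename,
        derived_filename: format("page-%04d.tiff", page), # assumed naming
        derivatives: derivatives.to_h { |name| [name, nil] }
      )
    end
  end
end

manifest = ManifestSketch::PreProcess.new(
  parent_identifier: "book-123",
  original_filename: "book.pdf",
  derivatives: { pdf_split: nil, thumbnail: "/tmp/book-thumbnail.png" }
)
pages = ManifestSketch.split_pages(manifest, page_count: 3, derivatives: [:monochrome, :hocr])
pages.size # => 3
```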

Diagrams

Conceptual Diagram :: The top-level concept of what the Derivative::Rodeo orchestrates.
Process Diagram :: The low-level diagram of how the Derivative::Rodeo::Process works.
Interaction with SpaceStone :: How the Derivative::Rodeo interacts with SpaceStone.
Interaction with Hyrax Ingest :: Leverage the Hyrax::DerivativeService plugins to override the default behavior.

Conceptual Diagram

“This ain’t my first rodeo.” (American slang for “I’m prepared for what comes next.”)

The Derivative::Rodeo orchestrates moving data from place to place and ensures that the requisite files exist at each stage.

The PlantUML Text for the Conceptual Diagram

@startuml
!theme amiga

component "Pre-Process Arena" {
	() "Local" as pre_process_local
	() "Remote" as pre_process_remote
	control Processor as pre_processor
	pre_processor -- pre_process_local
	pre_processor -- pre_process_remote
}

cloud "Original Storage" as original_storage

cloud "Processing Storage" as processing_storage


component "Ingest Arena" {
	() "Remote" as ingest_remote
	() "Local" as ingest_local
	control Processor as ingest_processor
	ingest_processor -- ingest_remote
	ingest_processor -- ingest_local
}

folder "Ingest\nFile\nSystem" as ingest_storage

original_storage --> pre_process_remote
pre_process_local --> processing_storage
processing_storage --> ingest_remote
ingest_local --> ingest_storage

@enduml

Process Diagram

This is the logical flow chart of the Derivative::Rodeo::Process; it demonstrates the low-level processing task of a single derivative.

The PlantUML Text for the Process Diagram

@startuml
!theme amiga

start

if (derivative local?) then (yes)

elseif (derivative remote?) then (yes)
	:pull to local;
else
	:generate local;
endif

if (demand local exists?) then (yes)


else (no)
	:raise exception;
	stop
endif
:enqueue next;
@enduml

Interaction with SpaceStone

SpaceStone is an AWS Lambda ecosystem that SoftServ has used in the preliminary work of pre-processing derivatives in a specific use-case. The following diagram shows the conceptual interaction of the Derivative::Rodeo and SpaceStone.

The PlantUML Text for the Interaction with SpaceStone

@startuml
!theme amiga

actor Instigator as instigator

queue "AWS::SQS" as sqs

package SpaceStone {
	control Invoker as invoker
}

package "Derivative::Rodeo" as dr {
	control Process as process
}

instigator -right-> invoker : upload CSV\nof manifests
sqs -right-> invoker : pull message
invoker -right-> process : send message
process --> sqs : put message
@enduml

Interaction with Hyrax Ingest

Hyrax exposes the concept of the Hyrax::DerivativeService, a configurable end-point. Hyrax has a default service, Hyrax::FileSetDerivativesService, that assumes it will create all derivatives and then assign them to the FileSet.

In the NewspaperWorks gem and IIIF Print gem, the Samvera community introduced different derivative services, in part to expand on the default functionality.

One challenge of these implementations is that they assume that the ingest process simultaneously creates the derivative and assigns the derivative.

The Newman Numismatic Portal introduced the idea of pre-processing the derivatives and splicing into the processes to circumvent some of the derivative generation.

With that background, here is the diagram for the Interaction with Hyrax Ingest.

The PlantUML Text for the Interaction with Hyrax Ingest

  @startuml
  !theme amiga
  !pragma useVerticalIf on
  start
  :Hyrax::DerivativeService;
  if (Derivative::Rodeo::DerivativeService.valid?) then (yes)
	  :read_from_rodeo;
	  :write_to_fedora;
	  stop
  elseif (Hyrax::FileSetDerivativesService.valid?) then (yes)
	  :generate_derivative;
	  :write_to_fedora;
	  stop
  else (no)
	  stop
  endif
  @enduml
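The decision in the diagram above amounts to "the first valid service wins." Here is a minimal sketch of that selection, using hypothetical stand-in classes rather than the real Hyrax or Derivative::Rodeo services; the `:pre_processed` and `:supported` keys are invented for the example.

```ruby
# Hypothetical stand-ins for the services in the diagram; not the real classes.
module DerivativeServiceSketch
  class RodeoService
    def initialize(file_set)
      @file_set = file_set
    end

    # Assumed rule: valid only when pre-processed derivatives are available.
    def valid?
      @file_set[:pre_processed] == true
    end

    def steps
      [:read_from_rodeo, :write_to_fedora]
    end
  end

  class DefaultService
    def initialize(file_set)
      @file_set = file_set
    end

    # Assumed rule: valid when the default service supports the file.
    def valid?
      @file_set[:supported] == true
    end

    def steps
      [:generate_derivative, :write_to_fedora]
    end
  end

  # Order matters: the rodeo service is consulted before the default one.
  SERVICES = [RodeoService, DefaultService].freeze

  # Returns the steps of the first valid service, or [] when none applies.
  def self.steps_for(file_set)
    service = SERVICES.map { |klass| klass.new(file_set) }.find(&:valid?)
    service ? service.steps : []
  end
end

DerivativeServiceSketch.steps_for({ pre_processed: true, supported: true })
# => [:read_from_rodeo, :write_to_fedora]
```

Placing the rodeo service ahead of the default is what lets pre-processed derivatives bypass generation during ingest.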

Deeper Dive

Inflection Points

There are inflection points that the Derivative::Rodeo considers:

Spawning processes based on the MimeType step
Spawning processes to split a PDF

These inflection points start a new Chain of processing. Because we're jumping from one processing concept to another, the step might not create an associated derivative file. However, we need to verify that the step completed.

The verification is done via Derivative::Rodeo::Arena#local_demand_path_for!, which delegates to the Derivative::Rodeo::Step::Base. In other words, the step that spawns a new chain has the opportunity to say whether things are in order. Is it perfect? No. But it's what we have, and we can improve on it from there.
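One way to picture this delegation: a base step demands that its file exists, while a spawning step overrides the check to vouch for the work it enqueued. The sketch below is an assumption about the shape of that idea only; `InflectionSketch`, `demand!`, and `enqueued_pages` are hypothetical names and not the gem's API.

```ruby
# Hypothetical illustration of a step deciding whether "things are in order."
module InflectionSketch
  class DemandFailed < StandardError; end

  # Default behavior: the step is complete when its derivative exists locally.
  class BaseStep
    def initialize(path: nil)
      @path = path
    end

    def demand!
      raise DemandFailed, "missing #{@path.inspect}" unless @path && File.exist?(@path)

      @path
    end
  end

  # A spawning step creates no derivative file of its own, so it verifies
  # that it enqueued child work instead of checking for a path.
  class SplitPdfStep < BaseStep
    def initialize(enqueued_pages:)
      super()
      @enqueued_pages = enqueued_pages
    end

    def demand!
      raise DemandFailed, "no pages were enqueued" if @enqueued_pages.empty?

      nil # nothing to hand to the next step; the children carry on
    end
  end
end
```

Calling `InflectionSketch::SplitPdfStep.new(enqueued_pages: []).demand!` raises `DemandFailed`, which mirrors the "demand a local copy" check for a step that has no file to show.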

Configuration

There are two conceptual configuration points:

Derivative::Rodeo::Configuration via the Derivative::Rodeo.config method.
The individual classes in the Derivative::Rodeo namespace via ActiveSupport's class_attribute.

Let’s consider the following.

For one project I need to have two rodeos. The first rodeo is for pre-processing. The second rodeo is for ingesting the pre-processed files (see the Conceptual Diagram section). The storage and queue adapters will be different. For example, the pre-process local storage will likely be the ingest process’s remote storage. Both rodeos will likely have the same required steps for processing.

For another project, I will again need two rodeos. But I want different processing steps; for example I want to add steps to process a 3D model. I might configure the mime type step to sniff out the files that go into a 3D model and then spawn a new step.

For a third project, I again need two rodeos, but then I want to use a different process to determine the file’s mime type; perhaps instead of leveraging the Marcel gem, I leverage Fits and some XML parsing.

In other words, there are some assumptive configurations about a given rodeo:

What’s my logging
What’s my starting step
What’s my queue adapter
What are my storage adapters
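To make the four questions concrete, here is a sketch of two rodeos for one project, configured side by side. It uses a plain Struct rather than Derivative::Rodeo::Configuration or class_attribute; the attribute names and adapter symbols are hypothetical, and the storage names are borrowed from the Conceptual Diagram.

```ruby
require "logger"

# Hypothetical configuration object; attribute names mirror the questions above.
RodeoConfigSketch = Struct.new(:logger, :starting_step, :queue_adapter,
                               :remote_storage, :local_storage,
                               keyword_init: true)

# First rodeo: pre-processing pulls from original storage.
pre_process = RodeoConfigSketch.new(
  logger: Logger.new($stdout),
  starting_step: :mime_type,       # assumed step name
  queue_adapter: :sqs,             # assumed adapter name
  remote_storage: :original_storage,
  local_storage: :processing_storage
)

# Second rodeo: ingest reads what pre-processing wrote.
ingest = RodeoConfigSketch.new(
  logger: Logger.new($stdout),
  starting_step: pre_process.starting_step, # same required steps
  queue_adapter: :inline,                   # assumed adapter name
  remote_storage: pre_process.local_storage,
  local_storage: :ingest_storage
)

ingest.remote_storage == pre_process.local_storage # => true
```

The adapters differ between the two rodeos while the starting step is shared, which is the first project scenario described above.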

And there are other assumptions based on those decisions. For an AWS SQS Queue Adapter we will likely need region information and even some low-level credentials that might go in ENV. For another cloud adapter those rules could be different.

Perhaps we know we’re always working with monochrome images. In that case it’s unlikely we’d want to use the existing Hocr step as written; we can assume that we already have a monochrome image.

As I hope is evident, the Derivative::Rodeo is intended to provide a consistent interface for moving files and ensuring that the required and desired derivatives are part of that move. It is also intended to be something that we can incorporate into many projects with minimal customization of those projects, relying instead on configuration and building towards interfaces.

Note: The above describes an ideal state and there are identified chores to migrate configuration points to the more appropriate locations.

Note on Development Status

This is in active development, and we're exploring the names and concepts as we build towards the technical requirements of several different projects. What does that mean? Look to the Derivative::Rodeo require section that has a large banner. Those are the stable named concepts. Below that level, things are somewhat in flux, in particular regarding the Derivative::Rodeo::Manifest module.

Design Goals

Derivative::Rodeo is designed in such a way that it can run within an application or as part of a distributed architecture (e.g. AWS Lambdas). Further, it is designed for extension and configuration through well-documented interfaces and modular boundaries.

It is also designed to provide insight into configuration and failures through custom exceptions and logging. It has a fail-early mindset: it first verifies that the desired derivatives don't create circular dependencies, then flattens those dependencies into a chain which we process one link at a time via Derivative::Rodeo::Process.
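The "verify no cycles, then flatten into a chain" idea can be illustrated with Ruby's standard-library TSort module. This is a sketch of the technique, not a claim about how the gem implements it; `DependencySketch` and the dependency names are hypothetical.

```ruby
require "tsort"

# Hypothetical dependency map: each derivative lists what it needs first.
class DependencySketch
  include TSort

  def initialize(dependencies)
    @dependencies = dependencies
  end

  # TSort hook: iterate over every node in the graph.
  def tsort_each_node(&block)
    @dependencies.each_key(&block)
  end

  # TSort hook: iterate over the prerequisites of one node.
  def tsort_each_child(node, &block)
    @dependencies.fetch(node, []).each(&block)
  end

  # Flattens the graph into a chain with prerequisites first.
  # Raises TSort::Cyclic when two or more derivatives depend on each other.
  def chain
    tsort
  end
end

sketch = DependencySketch.new({
  hocr: [:monochrome],
  monochrome: [:original],
  original: []
})
sketch.chain # => [:original, :monochrome, :hocr]
```

Failing early here means a circular configuration such as `{ a: [:b], b: [:a] }` raises `TSort::Cyclic` before any file is touched.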

Last, the test suite covers a significant portion of the code, exercising both unit tests and functional tests that can run on a developer's machine to help ensure the desired behavior.

Installation

Install the gem and add to the application's Gemfile by executing:

$ bundle add derivative-rodeo

If bundler is not being used to manage dependencies, install the gem by executing:

$ gem install derivative-rodeo

Dependencies

The list of dependencies is not reflective of the current state.

Tesseract-ocr
LibreOffice
ghostscript
poppler-utils
ImageMagick
- ImageMagick policy XML may need to be more permissive in both resources and source media types allowed.
libcurl3
libgbm1

Usage

TODO: Write usage instructions here

Development

After checking out the repository, run bin/setup to install dependencies. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.

Tasks

Derivative::Rodeo is positioned to be an alternative to Hydra::Derivatives.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-softserv/derivative-rodeo.
