Welcome to the rodeo! The goal of Derivative::Rodeo is to provide interfaces and processing for files.
The fully public-facing methods of Derivative::Rodeo are module methods on the Derivative::Rodeo module. There is an associated Derivative::Rodeo spec file for those methods, which is intended to be a place for "feature specs."
The conceptual logic of Derivative::Rodeo is:
- Use the file I have locally…
- Else pull to local the file from a remote source…
- Else generate a local version…
- Demand a local copy of the file and proceed to the next step.
The above is encoded in Derivative::Rodeo::Process.
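The fallback order above can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: the SketchProcess class, its keyword arguments, and the MissingLocalFileError name are all hypothetical, and directories stand in for the local and remote storage adapters.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical sketch; none of these names come from the gem's API.
class SketchProcess
  class MissingLocalFileError < StandardError; end

  def initialize(local_dir:, remote_dir:, generator:)
    @local_dir = local_dir
    @remote_dir = remote_dir
    @generator = generator
  end

  # Returns the local path for the named file, or raises when no strategy produced it.
  def call(name)
    local_path = File.join(@local_dir, name)
    unless File.exist?(local_path) # use the file I have locally
      remote_path = File.join(@remote_dir, name)
      if File.exist?(remote_path)
        FileUtils.cp(remote_path, local_path) # else pull to local
      else
        @generator.call(local_path) # else generate a local version
      end
    end
    # Demand a local copy before proceeding to the next step.
    raise MissingLocalFileError, "expected #{local_path} to exist" unless File.exist?(local_path)

    local_path
  end
end

Dir.mktmpdir do |dir|
  local = File.join(dir, "local")
  remote = File.join(dir, "remote")
  FileUtils.mkdir_p([local, remote])
  File.write(File.join(remote, "page.txt"), "from remote")

  process = SketchProcess.new(
    local_dir: local,
    remote_dir: remote,
    generator: ->(path) { File.write(path, "generated") }
  )
  puts File.read(process.call("page.txt"))  # prints "from remote"
  puts File.read(process.call("thumb.txt")) # prints "generated"
end
```

The final existence check is what makes the flow fail early: a generator that silently produces nothing surfaces as an exception instead of a missing file further down the chain.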
We start from a Derivative::Rodeo::Manifest::PreProcess, which is composed of:
- a parent identifier
- an original filename
- a set of named derivatives; each named derivative might have a path to a "known" already existing file.
We process the original manifest in an Arena. During processing we might spawn multiple "child" processes from one derivative; for example, splitting a PDF into one image per page. Each of those page images would then have its own Derivative::Rodeo::Manifest::Derived for further processing.
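To make the shape of that data concrete, here is a small sketch using plain Structs. The struct names, field names, and the split_into_pages helper are hypothetical stand-ins; the real manifest classes may carry different attributes.

```ruby
# Hypothetical sketch; these structs only mirror the description above,
# they are not the gem's manifest classes.
PreProcessSketch = Struct.new(:parent_identifier, :original_filename, :derivatives, keyword_init: true)
DerivedSketch = Struct.new(:original, :derived_filename, :derivatives, keyword_init: true)

# Splitting a PDF spawns one derived manifest per page image; each page
# then has its own set of derivatives to process.
def split_into_pages(manifest, page_count)
  basename = File.basename(manifest.original_filename, ".*")
  (1..page_count).map do |page|
    DerivedSketch.new(
      original: manifest,
      derived_filename: "#{basename}-page-#{page}.tiff",
      derivatives: { hocr: nil }
    )
  end
end

original = PreProcessSketch.new(
  parent_identifier: "work-123",
  original_filename: "report.pdf",
  # Each named derivative maps to a known path, or nil when nothing exists yet.
  derivatives: { pdf_split: nil, thumbnail: "/known/path/report-thumb.png" }
)

split_into_pages(original, 3).each do |page_manifest|
  puts "#{page_manifest.original.parent_identifier}: #{page_manifest.derived_filename}"
end
```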
- Conceptual Diagram :: The top-level concept of what the Derivative::Rodeo orchestrates.
- Process Diagram :: The low-level diagram of how the Derivative::Rodeo::Process works.
- Interaction with SpaceStone :: How the Derivative::Rodeo interacts with SpaceStone.
- Interaction with Hyrax Ingest :: Leverage the Hyrax::DerivativeService plugins to override the default behavior.
“This ain’t my first rodeo.” (an American idiom meaning “I’m prepared for what comes next.”)
The Derivative::Rodeo orchestrates moving data from place to place, ensuring that at each stage the requisite files exist.
The PlantUML Text for the Conceptual Diagram
@startuml
!theme amiga
component "Pre-Process Arena" {
() "Local" as pre_process_local
() "Remote" as pre_process_remote
control Processor as pre_processor
pre_processor -- pre_process_local
pre_processor -- pre_process_remote
}
cloud "Original Storage" as original_storage
cloud "Processing Storage" as processing_storage
component "Ingest Arena" {
() "Remote" as ingest_remote
() "Local" as ingest_local
control Processor as ingest_processor
ingest_processor -- ingest_remote
ingest_processor -- ingest_local
}
folder "Ingest\nFile\nSystem" as ingest_storage
original_storage --> pre_process_remote
pre_process_local --> processing_storage
processing_storage --> ingest_remote
ingest_local --> ingest_storage
@enduml
This is the logical flow chart of the Derivative::Rodeo::Process; it demonstrates the low-level processing task of a single derivative.
The PlantUML Text for the Process Diagram
@startuml
!theme amiga
start
if (derivative local?) then (yes)
elseif (derivative remote?) then (yes)
:pull to local;
else
:generate local;
endif
if (demand local exists?) then (yes)
else (no)
:raise exception;
stop
endif
:enqueue next;
@enduml
SpaceStone is an AWS Lambda ecosystem that SoftServ has used in the preliminary work of pre-processing derivatives in a specific use-case. The following diagram shows the conceptual interaction of the Derivative::Rodeo and SpaceStone.
The PlantUML Text for the Interaction with SpaceStone
@startuml
!theme amiga
actor Instigator as instigator
queue "AWS::SQS" as sqs
package SpaceStone {
control Invoker as invoker
}
package "Derivative::Rodeo" as dr {
control Process as process
}
instigator -right-> invoker : upload CSV\nof manifests
sqs -right-> invoker : pull message
invoker -right-> process : send message
process --> sqs : put message
@enduml
Hyrax exposes the concept of the Hyrax::DerivativeService, a configurable endpoint. Hyrax has a default service, Hyrax::FileSetDerivativesService, that assumes it will create all derivatives and then assign them to the FileSet.
In the NewspaperWorks gem and IIIF Print gem, the Samvera community introduced different derivative services, in part to expand on the default functionality.
One challenge of these implementations is that they assume that the ingest process simultaneously creates the derivative and assigns the derivative.
The Newman Numismatic Portal introduced the idea of pre-processing the derivatives and splicing into the processes to circumvent some of the derivative generation.
With all of that, here's the diagram for the Interaction with Hyrax Ingest.
The PlantUML Text for the Interaction with Hyrax Ingest
@startuml
!theme amiga
!pragma useVerticalIf on
start
:Hyrax::DerivativeService;
if (Derivative::Rodeo::DerivativeService.valid?) then (yes)
:read_from_rodeo;
:write_to_fedora;
stop
elseif (Hyrax::FileSetDerivativesService.valid?) then (yes)
:generate_derivative;
:write_to_fedora;
stop
else (no)
stop
endif
@enduml
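The decision in that diagram is a "first valid service wins" lookup. The sketch below illustrates only that selection logic; both classes, the hash standing in for a FileSet, and the supported mime types are hypothetical stand-ins and do not reproduce the Hyrax or Derivative::Rodeo services.

```ruby
# Hypothetical stand-ins; neither class is the real Hyrax or Derivative::Rodeo service.
class RodeoDerivativeServiceSketch
  def initialize(file_set)
    @file_set = file_set
  end

  # Valid only when a pre-processed derivative is already waiting in the rodeo.
  def valid?
    @file_set.fetch(:pre_processed, false)
  end

  def create_derivatives
    %i[read_from_rodeo write_to_fedora]
  end
end

class DefaultDerivativeServiceSketch
  # Assumed list, for illustration only.
  SUPPORTED = %w[application/pdf image/tiff].freeze

  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    SUPPORTED.include?(@file_set[:mime_type])
  end

  def create_derivatives
    %i[generate_derivative write_to_fedora]
  end
end

# Order matters: the rodeo-backed service is asked first.
SERVICES_SKETCH = [RodeoDerivativeServiceSketch, DefaultDerivativeServiceSketch].freeze

# Returns the first valid service instance, or nil when no service applies.
def select_service(file_set)
  SERVICES_SKETCH.map { |service| service.new(file_set) }.find(&:valid?)
end

service = select_service({ pre_processed: true, mime_type: "image/tiff" })
puts service.create_derivatives.inspect # prints [:read_from_rodeo, :write_to_fedora]
```

Because the rodeo-backed service is listed first, a file with pre-processed derivatives never reaches the generating service; everything else falls through to the default behavior.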
There are inflection points that the Derivative::Rodeo considers:
- Spawning processes based on the MimeType step
- Spawning processes to split a PDF
These inflection points start a new Chain of processing. Because we're jumping from one processing concept to another, the step might not create an associated derivative file. However, we need to verify that the step completed.
The verification is done via Derivative::Rodeo::Arena#local_demand_path_for!, which delegates to the Derivative::Rodeo::Step::Base. In other words, the step that spawns a new chain has the opportunity to say if things are in order. Is it perfect? No. But it's what we have, and we can improve on it from there.
There are two conceptual configuration points:
- Derivative::Rodeo::Configuration via the Derivative::Rodeo.config method.
- The individual classes in the Derivative::Rodeo namespace via ActiveSupport's class_attribute.
Let’s consider the following.
For one project I need to have two rodeos. The first rodeo is for pre-processing. The second rodeo is for ingesting the pre-processed files (see the Conceptual Diagram section). The storage and queue adapters will be different. For example, the pre-process local storage will likely be the ingest process’s remote storage. Both rodeos will likely have the same required steps for processing.
For another project, I will again need two rodeos. But I want different processing steps; for example I want to add steps to process a 3D model. I might configure the mime type step to sniff out the files that go into a 3D model and then spawn a new step.
For a third project, I again need two rodeos, but then I want to use a different process to determine the file’s mime type; perhaps instead of leveraging the Marcel gem, I leverage Fits and some XML parsing.
In other words, there are some assumptive configurations about a given rodeo:
- What’s my logging
- What’s my starting step
- What’s my queue adapter
- What are my storage adapters
And there are other assumptions based on those decisions. For an AWS SQS Queue Adapter we will likely need region information and even some low-level credentials that might go in ENV. For another cloud adapter those rules could be different.
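As a rough illustration of the "two rodeos" scenario, the sketch below models one rodeo's configuration as a Struct and derives the ingest rodeo from the pre-process rodeo. Everything here is hypothetical: the struct, its attribute names, the adapter symbols, and the ingest_config_from helper are assumptions for illustration, not the gem's configuration API.

```ruby
require "logger"

# Hypothetical sketch; these attribute and adapter names are assumptions,
# not the gem's actual configuration API.
RodeoConfigSketch = Struct.new(
  :logger, :starting_step, :queue_adapter, :local_storage, :remote_storage,
  keyword_init: true
)

# Builds the ingest rodeo from the pre-process rodeo: same logger and steps,
# different adapters. The pre-process local storage becomes the ingest remote storage.
def ingest_config_from(pre_process)
  RodeoConfigSketch.new(
    logger: pre_process.logger,
    starting_step: pre_process.starting_step,
    queue_adapter: :inline,
    local_storage: :ingest_file_system,
    remote_storage: pre_process.local_storage
  )
end

pre_process = RodeoConfigSketch.new(
  logger: Logger.new($stdout),
  starting_step: :mime_type,
  queue_adapter: :aws_sqs,
  local_storage: :processing_storage,
  remote_storage: :original_storage
)

ingest = ingest_config_from(pre_process)
puts ingest.remote_storage # prints processing_storage
```

Keeping each rodeo's settings in its own object, rather than in one global, is what allows two rodeos with different adapters to coexist in one project.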
Perhaps we know we’re always working with monochrome images; in that case it’s unlikely we’d want to use the existing Hocr step as written, because we can assume that we already have monochrome.
As I hope is evident, the Derivative::Rodeo is intended to provide a consistent interface for moving files and ensuring that the required and desired derivatives are part of that move. It is also intended to be something that we can incorporate into many projects with minimal customization of those projects, relying instead on configuration and building towards interfaces.
Note: The above describes an ideal state and there are identified chores to migrate configuration points to the more appropriate locations.
This is in active development and we're exploring the names and concepts as we build towards the technical requirements of several different projects. What does that mean? Look to the Derivative::Rodeo require section that has a large banner. Those are the stable named concepts. Below that level, things are somewhat in flux, in particular regarding the Derivative::Rodeo::Manifest module.
Derivative::Rodeo is designed in such a way that it can run within an application or as part of a distributed architecture (e.g. AWS Lambdas). Further, it is designed for extension and configuration through well-documented interfaces and modular boundaries.
It is also designed to provide insight into configuration and failures through custom exceptions and logging. It has a fail-early mindset: it first verifies that the desired derivatives don't create circular dependencies, then flattens those dependencies into a chain which we process one link at a time via Derivative::Rodeo::Process.
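One way to picture the "verify, then flatten" idea is a topological sort. The sketch below uses Ruby's standard-library TSort module, which raises TSort::Cyclic on a circular dependency. This is a stand-in technique chosen for illustration; the gem may build its chain differently, and the class and derivative names here are hypothetical.

```ruby
require "tsort"

# Hypothetical sketch; the class and the derivative names are illustrative only.
class DependencyGraphSketch
  include TSort

  # dependencies maps each derivative to the derivatives it requires.
  # Every required derivative must also appear as a key.
  def initialize(dependencies)
    @dependencies = dependencies
  end

  def tsort_each_node(&block)
    @dependencies.each_key(&block)
  end

  def tsort_each_child(node, &block)
    @dependencies.fetch(node).each(&block)
  end
end

chain = DependencyGraphSketch.new(
  {
    original: [],
    mime_type: [:original],
    monochrome: [:mime_type],
    hocr: [:monochrome]
  }
).tsort
puts chain.inspect # prints [:original, :mime_type, :monochrome, :hocr]

begin
  DependencyGraphSketch.new({ a: [:b], b: [:a] }).tsort
rescue TSort::Cyclic => e
  puts "circular dependency detected: #{e.message}"
end
```

Sorting before any file is touched means a misconfigured set of derivatives fails at setup time rather than partway through processing.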
Last, the test suite covers a significant portion of the code, exercising both unit tests and functional tests that can run on a developer's machine to help ensure the desired behavior.
Install the gem and add to the application's Gemfile by executing:
$ bundle add derivative-rodeo
If bundler is not being used to manage dependencies, install the gem by executing:
$ gem install derivative-rodeo
The list of dependencies is not reflective of the current state.
- Tesseract-ocr
- LibreOffice
- ghostscript
- poppler-utils
- ImageMagick
- ImageMagick policy XML may need to be more permissive in both resources and source media types allowed.
- libcurl3
- libgbm1
TODO: Write usage instructions here
After checking out the repository, run bin/setup to install dependencies. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.
- Storage Adapters
- Flesh out the FromManifest adapter for remote files
- Add an AWS S3 Adapter; remembering that it could be used as either remote or local
- Queue Adapters
- Add an AWS SQS Adapter (see https://github.com/scientist-softserv/space_stone)
- Step work
- Does it make sense to include fits? We’re gathering technical metadata for processing and eventual storage.
- Video
- Alto
- Audio
- Thumbnail
- Text Extraction (Hydra Derivatives leverages SOLR’s text extraction; there’s pdftext to consider)
- Tidy up the base derivative type; there are some more expressive methods I could adopt to reduce duplication (and introduction of errors).
- What else?
- Manifest; I have refactored towards specific manifests and need to revisit existing manifests
- Create methods for the prerequisites
- Demand the prerequisites as part of the generate
- Work on PDF Splitting
- In conversations with @orangewolf, we may want to OCR in batches instead of one file at a time
- Integrate Derivative::Rodeo into IIIF Print.
- Assign “local” file to Fedora S3 location
- Process: At present the pre-process does not do anything with the locally demanded derivative
- Ingest Process: Follows the same logic as Derivative::Rodeo::Process, but moves the derivative into the FileSet. Note: because "original" is a derivative, we will need this processing at the Derivative::Rodeo::Step level
Derivative::Rodeo is positioned to be an alternative to Hydra::Derivatives.
Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-softserv/derivative-rodeo.