Home
CORB is a Java tool designed for bulk content reprocessing of documents stored in MarkLogic. In a nutshell, CORB works off a list of documents in a database and performs operations against those documents. CORB operations can include generating a report across all documents, manipulating individual documents, or a combination of both. CORB stands for COntent Reprocessing in Bulk and is a multi-threaded workhorse tool at your disposal.
CORB was originally developed by Michael Blakeley and submitted to the Open Source Community (OSC). CORB 1.0 provided basic functionality for selecting documents and using multiple threads to apply an XQuery or JavaScript module against them. In 2014, CORB2 was released by Bhagat Bandlamudi to the OSC. CORB2 extends CORB by adding considerable new functionality. This wiki is written to aid users of CORB2 but will refer to it in general by using the term ‘CORB’. This wiki is up to date as of CORB 2.1.0.
CORB requires a list of URIs to work against, and building that list may involve a search to select the relevant documents. When using MarkLogic, it is always faster to search against the database’s in-memory indices than to open each document to determine whether it matches. Because the initial selection of documents is a single-threaded process, the XQuery or JavaScript module that generates the list should return as quickly and efficiently as possible, without opening documents (filtering). At times, this may necessitate ‘casting a wider net’ and selecting more documents than a report or transformation actually needs. Once the list has been generated, however, the documents it contains can be worked on concurrently by as many threads as the server can support. At that stage, it is no longer necessary to avoid opening documents, and work can be performed at will.
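The two stages described above can be sketched in plain Java. This is only an illustration of the select-then-process pattern, not CORB's own code: the class `SelectThenProcessSketch`, its methods, and the `/docs/doc-N.xml` URIs are hypothetical stand-ins, and no MarkLogic connection is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: mimics the select-then-process pattern, not CORB's real classes.
class SelectThenProcessSketch {

    // Stage 1 (single-threaded): build the complete URI list up front.
    // In CORB this is the job of the selector module; here the URIs are made up.
    static List<String> selectUris(int count) {
        List<String> uris = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            uris.add("/docs/doc-" + i + ".xml");
        }
        return uris;
    }

    // Hypothetical per-document work; a real job would run a module against the URI.
    static String process(String uri) {
        return uri + " -> processed";
    }

    // Stage 2 (multi-threaded): hand each URI to a fixed-size thread pool.
    // Results are collected in submission order, so they line up with the URI list.
    static List<String> processAll(List<String> uris, int threadCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String uri : uris) {
                Callable<String> task = () -> process(uri);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> uris = selectUris(5);
        for (String result : processAll(uris, 4)) {
            System.out.println(result);
        }
    }
}
```

The point of the sketch is the shape of the work: the selection step runs once on a single thread, so it should stay cheap, while the per-document step is where the thread count pays off.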
As mentioned, CORB can be used to generate a report against a specific set of documents in a database. The report is written to disk as text, optionally using a delimiter to produce a comma-separated values (CSV) or pipe-separated values (PSV) format.
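As a rough illustration of what a delimited report amounts to, the sketch below joins each row's fields with a chosen delimiter and writes one line per row. The class `DelimitedReportSketch`, the column names, and the sample values are assumptions made for this example; it does not show how CORB itself writes its output, and it does not quote or escape fields that happen to contain the delimiter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hand-rolled delimited report writer, not CORB's implementation.
class DelimitedReportSketch {

    // Join one row's fields with the delimiter ("," for CSV, "|" for PSV).
    // Fields containing the delimiter are NOT escaped in this sketch.
    static String toLine(List<String> fields, String delimiter) {
        return String.join(delimiter, fields);
    }

    // Write every row as one line of text in the target file.
    static void writeReport(Path file, List<List<String>> rows, String delimiter)
            throws IOException {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(toLine(row, delimiter));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical report rows: a header followed by one document.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("uri", "title"),
                Arrays.asList("/docs/doc-1.xml", "First"));
        Path out = Files.createTempFile("report", ".psv");
        writeReport(out, rows, "|");
        for (String line : Files.readAllLines(out, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

Switching between CSV and PSV is then only a matter of the delimiter argument.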
Another major use of CORB is to perform bulk data transforms. Documents can be manipulated in any way desired before being stored back to the database. It is often desirable to also generate a report of the changes made to those documents, and CORB supports this as well.
While CORB is a Java program, all of the data selection and transformation is performed by XQuery or JavaScript modules, which need to be customized for the specific task at hand.
Specific use cases and their implementations will be provided on subsequent wiki pages.
Read about...