swissbib-metafacture-commands

Plugin with additional Metafacture commands used in linked-swissbib workflows

Build

To use the plugins with a standalone instance of Metafacture, you have to build a "fat JAR". To do so, run the following commands:

# Clone standalone instance of Metafacture
git clone https://github.com/linked-swissbib/mfWorkflows
# Clone this repository
git clone https://github.com/linked-swissbib/swissbib-metafacture-commands
cd swissbib-metafacture-commands
# Build fat jar
./gradlew clean shadow # For *nix-OSes, otherwise use gradlew.bat
# Move fat jar to plugins folder of mfWorkflows
mv build/libs/swissbibMF-plugins-1.1-all.jar ../mfWorkflows/plugins

Docker

There is an experimental Docker image available which provides a standalone Metafacture instance including the linked-swissbib plugins.

docker pull sschuepbach/mfrunner-sb-5

For further instructions see here

Tests

Only a few unit tests are available so far (hopefully there will be more in the near future). To run them, type:

./gradlew clean check

List of commands

The commands are divided into several categories:

  • Decoders:
    • decode-json: Parses JSON
    • decode-ntriples: Parses Ntriples-encoded records
  • Pipe:
    • encode-esbulk: Encodes data as JSON-LD or in a special format suitable for bulk indexing in Elasticsearch
    • encode-neo4j: Encodes data as CSV files suitable for batch uploads to a Neo4j database
    • encode-ntriples: Encodes data as Ntriples
    • ext-filter: Extends the default filter command in Flux by providing a parameter to implement a "filter not" mechanism
    • itemerase-es: Deletes items which belong to a certain bibliographicResource
    • lookup-es: Filters out records whose identifier already exists in an Elasticsearch index
    • split-entities: Splits entities into individual records.
    • update-es-id: Identifies partially modified documents by comparing them to an Elasticsearch index.
  • Writers:
    • index-esbulk: Uses the bulk mechanisms of Elasticsearch to index records
    • index-neo4j: Indexes nodes and relationships in Neo4j
    • write-csv: Serialises data as CSV file with optional header.
    • write-esbulk: Writes records as JSON files which comply with the requirements of the Bulk API of Elasticsearch.
    • write-kafka: Acts as a producer in a Kafka cluster.
    • write-neo4j: Writes CSV files for batch uploading to a new Neo4j database.
    • write-rdf-1line: Writes RDF-XML files, one line per record.
    • write-socket: Sets up a socket server.
  • Source:
    • read-kafka: Acts as a Kafka Consumer for Metafacture
    • open-multi-http: Opens HTTP resources in a "paging" manner, e.g. to fetch data in chunks from a database
  • Record Splitters:
    • read-json-object: Reads in a JSON file and splits it at the end of the root object / array.
  • Morph Functions:
    • AuthorHash: Creates a hash value for authors based on different MARC fields.
    • ItemHash: Creates a hash value for items based on different MARC fields.

AuthorHash

Creates a hash value for authors based on different MARC fields.

Resources:

decode-json

Parses JSON. Preferably used in conjunction with read-json-object

decode-ntriples

Parses Ntriples-encoded records.

Example: linked-swissbib "EnrichedLine"

encode-esbulk

Encodes records for bulk uploading to Elasticsearch.

  • Implementation: org.swissbib.linked.mf.pipe.ESBulkEncoder
  • In: org.culturegraph.mf.framework.StreamReceiver
  • Out: java.lang.String
  • Options:
    • avoidMergers: If set to true, fields with same keys are modelled as separate inner objects instead of having their values merged (Boolean; default: false)
    • header: Should header for ES bulk be written (Boolean; default: true)? Warning: Setting this parameter to false will result in an invalid Bulk format!
    • escapeChars: Escapes prohibited characters in JSON strings (Boolean; default: true)
    • index: Index name of records
    • type: Type name of records

Example: linked-swissbib "Baseline"
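
To illustrate what the header option controls: the Elasticsearch Bulk API expects every document line to be preceded by an action line. The following Python sketch builds such a pair. The function name to_bulk_entry and the sample values are hypothetical, and the exact header emitted by encode-esbulk may look different.

```python
import json

def to_bulk_entry(doc_id, doc, index, doc_type, header=True):
    """Return one bulk entry: an optional action line followed by the document line."""
    lines = []
    if header:
        # Action/metadata line as expected by the Elasticsearch Bulk API
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
    lines.append(json.dumps(doc))
    # Every line of a bulk body, including the last one, ends with a newline
    return "\n".join(lines) + "\n"

# Hypothetical index and type names, for illustration only
entry = to_bulk_entry("123", {"title": "Example"}, index="testindex", doc_type="resource")
print(entry, end="")
```

Without the action line (header=False) only the document itself remains, which is why the result is no longer a valid bulk body.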

encode-neo4j

Encodes records as CSV files for batch uploading to a new Neo4j database. Because the headers of the CSV files are hardcoded, the command is not ready to be used in a broader context.

Example: Graph visualisation of the GND

encode-ntriples

Encodes data as Ntriples

Example: Libadmin entries as Ntriples
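
For readers unfamiliar with the target format, here is a minimal Python sketch of what a single N-Triples line looks like. It is not the encoder's actual implementation; the function names and the sample IRIs are hypothetical, and only the most common escapes are handled.

```python
def escape_literal(value):
    """Escape the characters that must not appear unescaped in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def to_ntriple(subject, predicate, obj, obj_is_iri=False):
    """Serialise one statement as a single N-Triples line."""
    object_term = f"<{obj}>" if obj_is_iri else f'"{escape_literal(obj)}"'
    return f"<{subject}> <{predicate}> {object_term} ."

# Hypothetical subject IRI, for illustration only
print(to_ntriple("http://example.org/library/1",
                 "http://purl.org/dc/terms/title",
                 "Example library"))
```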

ext-filter

Extends the default filter command in Flux by providing a parameter to implement a "filter not" mechanism

Example: Show record ids which don't have a title (MARC field 245$a)
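
The "filter not" idea can be sketched in a few lines of Python. This is only a conceptual model: in Flux the condition is defined elsewhere in the workflow, whereas here a plain predicate function stands in for it, and the record layout and all names are hypothetical.

```python
def ext_filter(records, predicate, filter_not=False):
    """Yield records that match predicate, or those that do not if filter_not is True."""
    for record in records:
        matched = bool(predicate(record))
        if matched != filter_not:
            yield record

# Hypothetical records: one dict per record, keyed by MARC field and subfield
records = [
    {"id": "001", "245a": "A title"},
    {"id": "002"},
    {"id": "003", "245a": ""},
]

def has_title(record):
    """True if the record carries a non-empty 245$a."""
    return bool(record.get("245a"))

# "Filter not": keep only the records without a title
for record in ext_filter(records, has_title, filter_not=True):
    print(record["id"])
```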

handle-marcxml-sb

Directly transforms MARC-XML fields to CSV rows like record-id,field,indicator1,indicator2,subfield,value

Example: 1:1 transformation of MARC-XML to CSV
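
The row layout described above can be illustrated with a short Python sketch. It is not the handler's code. It assumes that the record id comes from control field 001 and that only data fields are emitted; the sample record and the function name are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">123456</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title</subfield>
    <subfield code="c">Jane Doe</subfield>
  </datafield>
</record>"""

def marcxml_to_rows(xml_text):
    """Yield (record-id, field, indicator1, indicator2, subfield, value) tuples."""
    record = ET.fromstring(xml_text)
    # Assumption: the record id is taken from control field 001
    record_id = record.findtext("marc:controlfield[@tag='001']",
                                default="", namespaces=MARC_NS)
    for datafield in record.findall("marc:datafield", MARC_NS):
        for subfield in datafield.findall("marc:subfield", MARC_NS):
            yield (record_id, datafield.get("tag"), datafield.get("ind1"),
                   datafield.get("ind2"), subfield.get("code"), subfield.text or "")

# Write the rows as CSV to an in-memory buffer and print them
buffer = io.StringIO()
csv.writer(buffer).writerows(marcxml_to_rows(SAMPLE))
print(buffer.getvalue(), end="")
```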

handle-marcxml-sru

Handles MARC-XML files received from the SRU interface of Swissbib

Example: Workflow which queries the Swissbib SRU interface and filters, transforms and dumps the results to a CSV file

index-esbulk

Indexes records in Elasticsearch.

  • Implementation: org.swissbib.linked.mf.pipe.ESBulkIndexer
  • In: java.lang.Object
  • Out: java.lang.Void
  • Options:
    • esClustername: Elasticsearch cluster name
    • recordsPerUpload: Number of records per single bulk upload
    • esNodes: Elasticsearch nodes. Nodes are separated by #

Example: linked-swissbib "Baseline"
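
The following Python sketch shows how the esNodes and recordsPerUpload options could be interpreted. It assumes that each node is written as host:port, which the option description does not spell out, and the helper names are hypothetical. No Elasticsearch client is involved.

```python
from itertools import islice

def parse_nodes(es_nodes):
    """Split an esNodes value such as 'host1:9300#host2:9300' into (host, port) pairs."""
    pairs = []
    for node in es_nodes.split("#"):
        # Assumption: every node is given as host:port
        host, _, port = node.partition(":")
        pairs.append((host, int(port)))
    return pairs

def batches(records, records_per_upload):
    """Group records into lists of at most records_per_upload items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, records_per_upload))
        if not batch:
            return
        yield batch

print(parse_nodes("localhost:9300#es2.example.org:9300"))
for batch in batches(["r1", "r2", "r3", "r4", "r5"], 2):
    print(batch)
```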

index-neo4j

Indexes fields in Neo4j. Because the selection of the fields which are to be indexed is hardcoded, the benefit of this command outside our admittedly narrow scope is somewhat limited.

ItemHash

Creates a hash value for items based on different MARC fields.

Resource: Morph definition which uses the item hash generator

itemerase-es

Deletes items which belong to a certain bibliographicResource. Recommended for internal use only. Intended for use with the tracking framework of linked-swissbib.

lookup-es

Filters out records whose identifier already exists in an Elasticsearch index. Intended for use with the tracking framework of linked-swissbib.

open-multi-http

Opens HTTP resources in a "paging" manner, e.g. to fetch data in chunks from a database. You have to define two variable parts in the URL: ${cs}, which sets the chunk size, and ${pa}, which sets the offset.

  • Implementation: org.swissbib.linked.mf.source.MultiHttpOpener
  • In: java.lang.String
  • Out: java.lang.Reader
  • Options:
    • accept: The accept header in the form type/subtype, e.g. text/plain.
    • encoding: The encoding is used to encode the output and is passed as Accept-Charset to the http connection.
    • lowerBound: Initial offset
    • upperBound: Limit
    • chunkSize: Number of documents to be downloaded in a single retrieval

Example: Workflow which queries the Swissbib SRU interface and filters, transforms and dumps the results to a CSV file
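
To make the paging mechanism more tangible, here is a Python sketch that expands a URL template containing ${cs} and ${pa}. It assumes that the offset starts at lowerBound and advances by chunkSize for as long as it stays below upperBound; the actual boundary handling of the command may differ. The endpoint and the function name are hypothetical.

```python
from string import Template

def paged_urls(url_template, lower_bound, upper_bound, chunk_size):
    """Yield one URL per chunk, substituting ${cs} (chunk size) and ${pa} (offset)."""
    template = Template(url_template)
    # Assumption: the offset starts at lower_bound and grows by chunk_size
    # for as long as it stays below upper_bound
    for offset in range(lower_bound, upper_bound, chunk_size):
        yield template.substitute(cs=chunk_size, pa=offset)

# Hypothetical endpoint, for illustration only
for url in paged_urls("http://localhost:8080/records?limit=${cs}&offset=${pa}", 0, 250, 100):
    print(url)
```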

read-kafka

Acts as a Kafka consumer for Metafacture

read-json-object

Reads in a JSON file and splits it at the end of the root object / array. Preferably used in conjunction with decode-json

Example: libadmin entries as Ntriples
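
One possible reading of "splits it at the end of the root object / array" is that a stream of consecutive root-level JSON values is cut into one string per value. The Python sketch below follows that reading using the standard library; it is an interpretation, not a port of the command, and the function name is hypothetical.

```python
import json

def split_root_values(text):
    """Yield the source text of each top-level JSON object or array, in order."""
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # Skip whitespace between root values
        if text[position].isspace():
            position += 1
            continue
        start = position
        # raw_decode returns the parsed value and the index just past its end
        _, position = decoder.raw_decode(text, start)
        yield text[start:position]

for chunk in split_root_values('{"a": 1} [1, 2]\n{"b": {"c": 2}}'):
    print(chunk)
```

Each chunk can then be handed to a JSON parser on its own, which matches the suggested pairing with decode-json.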

split-entities

Splits entities into individual records.

Example: linked-swissbib "Baseline"

update-es-id

Identifies partially modified documents by comparing them to an Elasticsearch index. The command is tailored to the so-called baseline workflow of linked-swissbib, so it is unlikely to be useful for other purposes.

write-csv

Serialises data as CSV file with optional header

  • Implementation: org.swissbib.linked.mf.writer.ContinuousCsvWriter
  • In: java.lang.String
  • Out: java.lang.Void
  • Options:
    • compression: Sets the compression mode
    • continuousFile: Boolean. If set to true, the header is only written to the first file.
    • encoding: Sets the encoding used by the underlying writer
    • filenamePostfix: By default the filename consists of a zero-filled sequential number with six digits. Sets a postfix for this number.
    • filenamePrefix: By default the filename consists of a zero-filled sequential number with six digits. Sets a prefix for this number.
    • filetype: File ending
    • footer: Sets the footer which is output after the last object
    • header: Sets the header which is output before the first object
    • linesPerFile: Number of lines written to one file
    • path: Path to directory with CSV files
    • separator: Sets the separator which is output between objects

Examples: Workflow which queries the Swissbib SRU interface and filters, transforms and dumps the results to a CSV file
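
How the file naming and splitting options interact can be sketched in Python as follows. Several details are assumptions rather than documented behaviour: that filetype is given without a leading dot, that the header does not count towards linesPerFile, and that the header is repeated in every file unless continuousFile is set. All names are hypothetical.

```python
def csv_filename(sequence, prefix="", postfix="", filetype="csv"):
    """Build a file name around a zero-filled, six-digit sequence number."""
    return f"{prefix}{sequence:06d}{postfix}.{filetype}"

def split_into_files(lines, lines_per_file, header=None, continuous_file=False):
    """Return one list of lines per output file."""
    files = []
    for start in range(0, len(lines), lines_per_file):
        chunk = list(lines[start:start + lines_per_file])
        # With continuous_file=True only the first file receives the header
        if header is not None and (start == 0 or not continuous_file):
            chunk.insert(0, header)
        files.append(chunk)
    return files

print(csv_filename(7, prefix="out_", postfix="_v1"))
print(split_into_files(["a,1", "b,2", "c,3"], 2, header="key,value", continuous_file=True))
```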

write-esbulk

Writes records as JSON files which comply with the requirements of the Bulk API of Elasticsearch.

  • Implementation: org.swissbib.linked.mf.writer.ESBulkWriter
  • In: java.lang.Object
  • Out: java.lang.Void
  • Options:
    • compress: Should files be .gz-compressed? (Default is true)
    • filePrefix: Prefix for file names
    • fileSize: Number of records in one file
    • jsonCompliant: Should files be JSON compliant (Boolean; default: false)? Warning: Setting this parameter to true will result in an invalid Bulk format!
    • outDir: Root directory for output
    • subdirSize: Number of files in one subdirectory (Default: 300)
    • type: Type name of records (will only be attached to filename)

Example: linked-swissbib "Baseline"

write-kafka

Acts as a producer in a Kafka cluster.

  • Implementation: org.swissbib.linked.mf.writer.KafkaWriter
  • In: java.lang.Object
  • Out: java.lang.Void
  • Options:
    • host: Hostname of Kafka cluster (required)
    • port: Port of Kafka cluster (required)
    • topic: Name of Kafka topic (required)

Example: A very small example of using the Kafka consumer

write-neo4j

Writes CSV files for batch uploading to a new Neo4j database. Intended to be used in conjunction with index-neo4j.

  • Implementation: org.swissbib.linked.mf.writer.NeoWriter
  • In: java.lang.Object
  • Out: java.lang.Void
  • Options:
    • csvDir: Path to the output directory
    • csvFileLength: Number of records in one dedicated CSV file
    • batchWriteSize: Maximal number of records of the same category

Example: Graph visualisation of the GND

write-rdf-1line

Writes RDF-XML files, one line per record.

  • Implementation: org.swissbib.linked.mf.writer.SingleLineWriterRDFXml
  • In: java.lang.Object
  • Out: java.lang.Void
  • Options:
    • usecontributor: "true", "false"
    • rootTag: XML root tag
    • extension: File extension for output files
    • compress: Should output files be compressed? ("true", "false")
    • baseOutDir: Base directory for output files
    • outFilePrefix: Prefix for output files
    • fileSize: Number of records in one file
    • subDirSize: Number of records in one subdirectory
    • type: Concept / type name

Example: Deprecated linked-swissbib "baseline" for bibliographicResource documents (use resourceTransformation.rdfXml.flux)

write-socket

Sets up a socket server

Example: Stream MARC-XML to socket