The Cooperative's long-term technological objective is a platform that will support a continuously expanding, curated corpus of reliable biographical descriptions of people, linked to the historical records that are the primary evidence for understanding their lives and work and providing contextual understanding of those records. Building and curating a reliable social-document corpus will require a nuanced combination of computer processing and human identity verification and editing. During the pilot phase of the Cooperative, the R&D infrastructure is being thoroughly transformed into a maintenance platform. From a technical perspective, this means transitioning from a multistep, human-mediated batch process to an integrated, transaction-based platform. The infrastructure under development will automate the flow of data into and out of the different processing steps by interconnecting the processing components, with events in one component triggering related events in another. For example, the addition of a new descriptive record will automatically update both a graph database and the indexed data in the History Research Tool. This coordinated architecture will support both batch ingest of data and human editing to verify identities, and will refine and augment the descriptions over time.
We employ a LAMP stack (with PostgreSQL in place of MySQL) for efficiency of coding; flexibility, enabled by the very large number of available software modules; ease of maintenance; and clarity of software architecture. The result will be a lightweight, easy-to-administer software stack.
The high-level architecture uses a scalable, distributable client-server model. Two clients, created and hosted by the Cooperative, will interact with the back-end server: a graphical web user interface (HTML) and a REST API (JSON). The WebUI client will support the Cooperative's editing user interface. The REST API client will allow ArchivesSpace and other approved clients to interact with the server programmatically; it will provide features such as viewing and editing descriptive records, as well as batch processing of data.
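As a hypothetical illustration of the JSON interaction style described above, a client request to read a descriptive record might resemble the following sketch. The command name, field names, ARK value, and response shape are all assumptions for illustration, not the Cooperative's published API:

```python
import json

# Hypothetical request envelope for the SNAC REST API (JSON).
# The command name, fields, and response shape are illustrative assumptions.
request = {
    "command": "read",                          # operation requested by the client
    "constellationid": "ark:/99166/w6example",  # hypothetical ARK identifier
    "apikey": "user-supplied-key",              # hypothetical credential
}

def handle(req):
    """Stub server handler: return a minimal JSON-style response."""
    if req.get("command") == "read":
        return {"result": "success",
                "constellation": {"id": req["constellationid"],
                                  "nameEntries": []}}
    return {"result": "error", "message": "unknown command"}

# Round-trip through JSON, as a real client-server exchange would.
response = handle(json.loads(json.dumps(request)))
```

The same envelope pattern (a command plus its arguments, answered by a result status plus a payload) would apply to editing and batch operations.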
The server-side architecture will consist of a number of modules addressing different primary functions. The storage medium of the server will be a PostgreSQL Data Maintenance Store (DMS). The DMS will contain all of the descriptive and maintenance data for each EAC-CPF data file.
Other major server-side components are the Identity Reconciliation Engine, the Data Validation Engine, the Workflow Controller, and a Neo4J Graph Database. The Workflow Controller will coordinate communication among the server-side components through internal programming APIs, and a Server API (JSON) will facilitate communication with the WebUI and REST API clients.
A PostgreSQL Data Maintenance Store (DMS) will represent the storage foundation of the SNAC technology platform. The DMS will store "identity constellations" that will represent all of the data contained in the EAC-CPF instances, with each instance represented by an Identity Constellation (IC). Additional control data will be stored with each IC to facilitate transaction tracking and management, and fine-grained version control.
In the R&D processing workflow, EAC-CPF instances were placed in a read-only directory as the primary data store, and only a small number of components (name strings) of each EAC-CPF XML-encoded instance were loaded into a PostgreSQL database, solely for matching purposes. To support dynamic manual editing of the EAC-CPF instances, the entirety of each EAC-CPF instance will instead be parsed into PostgreSQL tables as Identity Constellations[1]. Each IC will retain all of the EAC-CPF data, as well as additional control data that will facilitate transaction tracking and version control. The DMS will also store editor authorization privileges, editor work histories (e.g., edit status on individual Identity Constellations), and local controlled vocabularies (e.g., occupations, functions, subjects, and geographic names). Finally, the DMS will store workflow management data and aid the server in report generation.
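The version-control idea can be sketched minimally as follows, assuming (hypothetically) that each revision of an Identity Constellation is stored as a new versioned record rather than overwriting the previous one; the class names, fields, and sample data are illustrative, not the DMS schema:

```python
from dataclasses import dataclass, field

@dataclass
class ICVersion:
    """One stored revision of an Identity Constellation (hypothetical model)."""
    version: int
    data: dict    # parsed EAC-CPF content (name entries, dates, relations, ...)
    editor: str   # who made the change, recorded for transaction tracking

@dataclass
class IdentityConstellation:
    """An IC with its full revision history, as the DMS might keep it."""
    ark: str
    versions: list = field(default_factory=list)

    def revise(self, data, editor):
        """Append a new version instead of overwriting; enables rollback."""
        self.versions.append(ICVersion(len(self.versions) + 1, data, editor))

    def current(self):
        """The latest revision is the authoritative state of the IC."""
        return self.versions[-1]

ic = IdentityConstellation("ark:/99166/w6example")
ic.revise({"nameEntry": "Smith, Jane"}, editor="batch-ingest")
ic.revise({"nameEntry": "Smith, Jane", "birthDate": "1850"}, editor="editor1")
```

Keeping every revision, rather than only the latest state, is what allows fine-grained version control and auditable editor work histories.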
Identity Constellation Diagrams:
A major focus of the SNAC R&D has been identity reconciliation. A fundamental human activity in the development of knowledge is identifying unique "real world" entities (for example, a particular person or a specific book) and recording facts about the observed entity that, taken together, uniquely distinguish that entity from all others. Establishing the identity of a person, for example, involves examining the available evidence, including the existing knowledge base, and recording the facts associated with him or her: names used by and for the person, dates and places of birth and death, occupations, and so on. Establishing identities is an ongoing, cumulative activity that both leverages existing established identities and establishes new ones. Identity reconciliation is the process by which an encountered identity is compared against established identities: if no match is found, the encountered identity is itself contributed to the established base of identities; if a match is found, any new data associated with the encountered identity is merged into the existing identity description.
With the emergence of Linked Open Data (LOD) and the opportunity it presents to interconnect distributed sets of information, new names for entities are introduced, namely the URIs that provide globally unique identifiers for entities. To exploit the opportunity presented by LOD, it is necessary to include these URIs in the reconciliation process. SNAC assigns its own identifiers (ARKs) because doing so is essential to effectively managing the identities throughout processing and maintenance. Even if this were not essential for managing the workflow, the majority of the identities in SNAC will not be found in other sources such as VIAF, and thus the SNAC identifiers and the associated data that establish each identity are likely to be unique, at least in the near term[2]. For those identities that do overlap with VIAF, SNAC processing takes advantage of the VIAF reconciliation process to associate VIAF's identifier as well as identifiers for Wikipedia and WorldCat Identities.
While the R&D matching was based on the name string alone, Cooperative identity reconciliation will be based on full ICs, that is, the name string plus additional information (evidence) that sufficiently establishes the uniqueness of an identity. Match scoring will be based on comparing Identity Constellations and determining which properties within each constellation (name, life dates, place of birth, place of death, relations to other identities, etc.) match or closely match; each match test will result in an assigned score. A major factor in reliable matching, for computers or humans, is the evidence available for each identity: sparse evidence in the compared identities decreases the probability of making a reliable match or non-match, while dense evidence supports both reliable matches and reliable non-matches. Based on the scoring, two reconciliation outcomes will be presented: reliable matches and possible matches. Reliable non-matches, and match scores that fall below both thresholds, will not be flagged. Possible matches are comparisons that are neither reliably matches nor reliably non-matches but have sufficient similarity to warrant further human investigation and possible resolution. The Identity Reconciliation module will primarily employ the DMS and ElasticSearch. Ground-truth data (human-reviewed and verified matches and non-matches) will be used to test and refine the matching algorithms in order to optimize the scoring.
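The property-by-property scoring described above can be sketched as a weighted comparison. The properties, weights, and thresholds below are illustrative assumptions, not SNAC's actual algorithm:

```python
# Hypothetical weighted scorer for two Identity Constellations, modeled as
# dicts of properties. Weights and thresholds are illustrative assumptions.
WEIGHTS = {"name": 0.4, "birthDate": 0.2, "deathDate": 0.2, "relations": 0.2}
RELIABLE, POSSIBLE = 0.8, 0.5   # assumed score thresholds

def score(ic_a, ic_b):
    """Sum the weights of properties present in both ICs that agree."""
    total = 0.0
    for prop, weight in WEIGHTS.items():
        if prop in ic_a and prop in ic_b and ic_a[prop] == ic_b[prop]:
            total += weight
    return total

def classify(s):
    """Map a score to the outcomes presented to editors."""
    if s >= RELIABLE:
        return "reliable match"
    if s >= POSSIBLE:
        return "possible match"
    return "not flagged"

dense = {"name": "Smith, Jane", "birthDate": "1850", "deathDate": "1920"}
sparse = {"name": "Smith, Jane", "birthDate": "1850"}
```

Note how the sketch reflects the evidence-density point: comparing the dense record with itself clears the reliable threshold, while the sparse record, lacking a death date, can only reach a possible match.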
The Identity Reconciliation Engine will be used both for batch ingest and to assist human editors. When EAC-CPF instances are extracted and assembled from existing archival descriptions (EAD-encoded finding aids, MARC21 records, or existing non-standard archival authority records) and ingested into the DMS, the Identity Reconciliation module will be invoked to identify reliable and possible matches. The results of the evaluation will be available to editors through the Editing User Interface to assist them in verifying identities. When editors create new identity descriptions or revise existing ones, the Identity Reconciliation module will be invoked to give the editors feedback on likely and potential matches that might otherwise be overlooked using human-only authority control techniques.
Developing the Editing User Interface (EUI) is a primary objective of the two-year pilot. Pilot members will be engaged in providing feedback on the iterative development of the EUI in order to ensure that all editorial tasks are supported, that the order in which such tasks are performed is supported by the workflow, and that fundamental transactions (revising, merging, and splitting identity descriptions) are optimally supported. These activities and the findings that result from them will inform the development of the maintenance platform. When the underlying data maintenance platform is in place, development of a prototype EUI will commence, informed by the activities described above. As the EUI becomes functional, the pilot participants will transition to iteratively testing and using it to perform editing tasks to ensure that the essential functions are supported and that the tasks are logical and efficient. Those functions of the EUI that overlap with the History Research Tool (HRT) will employ a common interface. The bulk of the EUI will be based on HTML, CSS, JavaScript, and WebUI server-side PHP.
Sample interaction diagrams:
Neo4J (a graph database) will be used to store a subset of each Identity Constellation from the DMS. The data in Neo4J will support two services: serving graph data to drive the social-document network graphs in the HRT, and providing LOD through a SPARQL endpoint and RDF exports for third-party consumption. The LOD will be exposed in a variety of forms: EAC-CPF XML, RDF/XML, JSON-LD, and others. The data in the Neo4J database will be coordinated with the data in the DMS through the Workflow Controller, keeping Neo4J current as ICs are added, removed, or revised.
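As a rough illustration of the JSON-LD form, a single identity might serialize as follows; the ARK URI, labels, and choice of FOAF and SKOS properties are hypothetical, not the Cooperative's actual export profile:

```python
import json

# Hypothetical JSON-LD serialization of one SNAC identity, using FOAF and
# SKOS terms. The ARK, labels, and property choices are illustrative only.
doc = {
    "@context": {
        "foaf": "http://xmlns.com/foaf/0.1/",
        "skos": "http://www.w3.org/2004/02/skos/core#",
    },
    "@id": "http://n2t.net/ark:/99166/w6example",   # hypothetical ARK URI
    "@type": "foaf:Person",
    "foaf:name": "Smith, Jane",
    "skos:prefLabel": "Smith, Jane, 1850-1920",
}

serialized = json.dumps(doc, indent=2)
```

Because JSON-LD is plain JSON with an `@context` mapping prefixes to vocabulary URIs, third parties can consume it either as ordinary JSON or as RDF.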
There is currently no ontology for archival description, and thus the classes and properties used in exposing graph data as RDF are drawn from existing, well-known, and widely used ontologies and vocabularies: Friend of a Friend (FOAF), OWL, SKOS, the Europeana Data Model (EDM), the RDA Group 2 Element Vocabulary, Schema.org, and Dublin Core elements and terms[3]. In the long term, it should be noted, the International Council on Archives' Experts Group on Archival Description is developing an ontology, Records in Contexts (RiC), for archival entities and the description thereof[4]. The SNAC Cooperative will transition to the ICA RiC semantics when they become available.
Integration of the server components will be based on a Workflow Controller (WC). The WC will interact with clients via the server-side JSON API and will invoke the required functions through calls to the component subsystems: Identity Reconciliation, Data Validation, Authorization, the DMS (via a database connector), and Neo4J.
The REST API client will make server functions (WC actions) available to approved third parties, giving them access to server-provided services. A simple example might be a dedicated MARC21-to-EAC-CPF converter, in which a MARC21 record is uploaded, its data extracted and transformed into EAC-CPF, and the result returned in a single transaction. Another example is saving an identity record, in which the data is written to PostgreSQL, EAC-CPF is exported to and indexed by XTF, and the Neo4J database is updated. The three steps will be sequenced by the thin middleware component.
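The three-step save transaction can be sketched as a controller that runs each step in order. The step functions below are stubs standing in for real PostgreSQL, XTF, and Neo4J calls, and the function names are assumptions for illustration:

```python
# Hypothetical sketch of the Workflow Controller sequencing the three steps
# of a "save identity record" transaction. Each step is a stub; a real
# implementation would talk to PostgreSQL, XTF, and Neo4J respectively.
log = []

def write_to_postgres(record):
    log.append("postgres")          # 1. persist the IC in the DMS

def export_and_index(record):
    log.append("xtf")               # 2. export EAC-CPF and index it in XTF

def update_graph(record):
    log.append("neo4j")             # 3. refresh the Neo4J subset

def save_identity(record):
    """Run the three steps in order; a real WC would also handle failures
    partway through (e.g., by rolling back or retrying)."""
    for step in (write_to_postgres, export_and_index, update_graph):
        step(record)
    return log

save_identity({"ark": "ark:/99166/w6example", "nameEntry": "Smith, Jane"})
```

Centralizing this sequencing in one thin component keeps the DMS, the search index, and the graph database consistent with one another without each client having to know about all three stores.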
Footnotes

1. PostgreSQL is a widely used and supported open-source, SQL-standards-compliant relational database management system. Using PostgreSQL as the maintenance platform for the authoritative EAC-CPF descriptions will ensure data integrity and provide robust performance for the large quantity of data (current and anticipated) in SNAC.

2. 24.8% of SNAC identities match VIAF identities.

3. Among the RDF vocabularies considered was BIBFRAME, an initiative led by the Library of Congress to replace the MARC21 format using graph technologies, specifically the W3C Resource Description Framework (RDF). BIBFRAME aspires to be "content standard independent" and to accommodate library, museum, and archival description. BIBFRAME development is still in the early stages, and most of the development work centers on accommodating data currently in MARC21. It is unclear at this stage whether BIBFRAME will attempt to accommodate the data in EAC-CPF.

4. "Toward an International Conceptual Model for Archival Description: A Preliminary Report from the International Council on Archives' Experts Group on Archival Description," The American Archivist (Chicago: SAA) 76/2 (Fall/Winter 2013): 566–583. With Gretchen Gueguen, Vitor Manoel Marques da Fonseca, and Claire Sibille-de Grimoüard. Also available at http://www.ica.org/13851/egad-resources/egad-resources.html.