Commit 7c1a323f by Tom Laudeman

Merge branch 'master' into tom

parents 15cd5ff2 474d0de7
@@ -2,15 +2,13 @@
### Introduction
The long-term technological objective for the Cooperative is a [platform](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/SNAC%20Cooperative%20Interaction.pdf) that will support a continuously expanding, curated corpus of reliable biographical descriptions of people linked to and providing contextual understanding of the historical records that are the primary evidence for understanding their lives and work. Building and curating a reliable social-document corpus will require a nuanced combination of computer processing and human identity verification and editing. During the pilot phase of the Cooperative, the R&D infrastructure is being thoroughly transformed to a maintenance platform. From a technical perspective, this means transitioning from a multistep human-mediated batch process to an integrated transaction-based platform. The infrastructure under development will automate the flow of data into and out of the different processing steps by interconnecting the processing components, with events taking place in one component triggering related events in another. For example, the addition of a new descriptive record will lead to automatic updating of a graph database and the indexed data in the History Research Tool. This coordinated architecture will support both the batch ingest of data and human editing of the data to verify identities, and will refine and augment the descriptions over time.
For more diagrams and documents, visit the [documentation repository](http://gitlab.iath.virginia.edu/snac/Public-Documentation/tree/master).
The long-term technological objective for the Cooperative is a [platform](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/Originals/SNAC%20Cooperative%20Interaction.pdf) that will support a continuously expanding, curated corpus of reliable biographical descriptions of people linked to and providing contextual understanding of the historical records that are the primary evidence for understanding their lives and work. Building and curating a reliable social-document corpus will require a nuanced combination of computer processing and human identity verification and editing. During the pilot phase of the Cooperative, the R&D infrastructure is being thoroughly transformed to a maintenance platform. From a technical perspective, this means transitioning from a multistep human-mediated batch process to an integrated transaction-based platform. The infrastructure under development will automate the flow of data into and out of the different processing steps by interconnecting the processing components, with events taking place in one component triggering related events in another. For example, the addition of a new descriptive record will lead to automatic updating of a graph database and the indexed data in the History Research Tool. This coordinated architecture will support both the batch ingest of data and human editing of the data to verify identities, and will refine and augment the descriptions over time.
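The event-triggered flow described above, where adding a descriptive record leads to updates of the graph database and the search index, can be pictured with a minimal publish/subscribe sketch. This is only an illustration under stated assumptions: the `EventBus` class, the `record_added` event name, and the in-memory lists standing in for the graph database and the History Research Tool index are all hypothetical and are not taken from the SNAC code base.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Hypothetical in-process event bus; not part of the SNAC platform."""

    def __init__(self) -> None:
        # Maps an event name to the handlers registered for it.
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        # Every component subscribed to the event is notified in turn.
        for handler in self._handlers[event]:
            handler(payload)


# Stand-ins for the graph database and the indexed search data (assumptions).
graph_nodes: List[str] = []
search_index: List[str] = []

bus = EventBus()
bus.subscribe("record_added", lambda record: graph_nodes.append(record["id"]))
bus.subscribe("record_added", lambda record: search_index.append(record["name"]))

# Adding one descriptive record triggers both downstream updates.
bus.publish("record_added", {"id": "ic-1", "name": "Jane Doe"})
```

A production system would more likely use a message queue or database triggers than an in-process bus; the sketch only shows the coupling pattern of one event fanning out to several components.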
### Technology Architecture Overview
We employ a LAMP stack (with PostgreSQL) for efficiency of coding; flexibility enabled by a very large number of available software modules; ease of maintenance; and clarity of software architecture. The result will be a lightweight and easy-to-administer software stack.
The [high-level architecture](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/SNAC%20Server%20Architecture.pdf) uses a scalable and distributable client-server model. Two clients will be created and hosted by the Cooperative to interact with the back-end server: a graphical web user interface (HTML) and a RESTful API (JSON). The WebUI client will support the Cooperative’s editing user interface. The REST API client will allow ArchivesSpace and other approved clients to interact programmatically with the server; it will provide features such as viewing and editing descriptive records, as well as batch processing of data.
The [high-level architecture](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/Originals/SNAC%20Server%20Architecture.pdf) uses a scalable and distributable client-server model. Two clients will be created and hosted by the Cooperative to interact with the back-end server: a graphical web user interface (HTML) and a RESTful API (JSON). The WebUI client will support the Cooperative’s editing user interface. The REST API client will allow ArchivesSpace and other approved clients to interact programmatically with the server; it will provide features such as viewing and editing descriptive records, as well as batch processing of data.
The server-side architecture will consist of a number of modules addressing different primary functions. The storage medium of the server will be a PostgreSQL Data Maintenance Store (DMS). The DMS will contain all of the descriptive and maintenance data for each EAC-CPF data file.
@@ -18,14 +16,14 @@ Other major server-side components are the Identity Reconciliation Engine, the D
### Data Maintenance Store
A PostgreSQL Data Maintenance Store (DMS) will represent the storage foundation of the SNAC technology platform. The DMS will store "identity constellations" that will represent all of the data contained in the EAC-CPF instances, with each instance represented by an [Identity Constellation (IC)](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Identity%20Constellation.pdf). Additional control data will be stored with each IC to facilitate transaction tracking and management, and fine-grained version control.
A PostgreSQL Data Maintenance Store (DMS) will represent the storage foundation of the SNAC technology platform. The DMS will store "identity constellations" that will represent all of the data contained in the EAC-CPF instances, with each instance represented by an [Identity Constellation (IC)](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/Originals/IC_Overview.pdf). Additional control data will be stored with each IC to facilitate transaction tracking and management, and fine-grained version control.
In the R&D processing workflow, EAC-CPF instances were placed in a read-only directory as the primary data store. A small number of select components (name strings) of each EAC-CPF XML-encoded instance were loaded into a PostgreSQL database only for matching purposes. In order to support dynamic manual editing of the EAC-CPF instances, the entirety of each EAC-CPF instance will be parsed into PostgreSQL tables as Identity Constellations[^1]. Each IC will retain all of the EAC-CPF data, as well as additional control data that will facilitate transaction tracking and version control. The DMS will also store editor authorization privileges, editor work histories (e.g., edit status on individual identity constellations), and local controlled vocabularies (e.g., occupations, functions, subjects, and geographic names). The DMS will store workflow management data and aid the server in report generation.
Identity Constellation Diagrams:
* [Identity Constellation Overview](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Identity%20Constellation.pdf)
* [Identity Constellation Relations](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Constellation%20Relations.pdf)
* [Identity Constellation Overview](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/Originals/IC_Overview.pdf)
* [Identity Constellation Relations](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/Originals/Constellation%20Relations.pdf)
### Identity Reconciliation
@@ -33,9 +31,9 @@ A major focus of the SNAC R&D has been on identity reconciliation. A fundamental
With the emergence of Linked Open Data (LOD) and the opportunity it presents to interconnect distributed sets of information, new names for entities are introduced, namely the URIs used to provide globally unique identifiers for entities. In order to exploit the opportunity presented by LOD, it is necessary to include these URIs in the reconciliation process. SNAC assigns its own identifiers (ARKs) because doing so is essential to effectively managing the identities throughout processing and maintenance. Even if this were not essential for managing the workflow, the majority of the identities in SNAC will not be found in other sources such as VIAF, and thus the SNAC identifiers and associated data that establish the identity are likely to be unique, at least in the near term[^2]. For those identities that do overlap with VIAF, SNAC processing takes advantage of the VIAF reconciliation process to associate VIAF’s identifier as well as identifiers for Wikipedia and WorldCat Identities.
While the R&D matching was based on the name string alone, Cooperative [Identity Reconciliation](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Identity%20Reconciliation%20Engine.pdf) will be based on ICs, that is, the name string and additional information (evidence) that sufficiently establishes the uniqueness of an identity. The determination of match scoring will be based on comparing identity constellations and identifying which properties within each constellation (name, life dates, place of birth, place of death, relations to other identities, etc.) match or closely match, and each match test will result in an assigned score. A major factor in reliable matching, for computers or humans, is the available evidence for each identity. Sparse evidence in compared identities will decrease the probability of making a reliable match or non-match. Conversely, dense evidence supports both reliable matches and non-matches. Based on the scoring, two reconciliation outcomes will be presented: reliable matches and possible matches. Reliable non-matches and match scores that fall below the threshold of reliable and possible will not be flagged. Possible matches will be employed to suggest comparisons that are not reliably matches or non-matches but have sufficient similarity to suggest further human investigation and possible resolution. The Identity Reconciliation module will primarily employ the DMS and ElasticSearch. Ground-truth data, human-reviewed and verified matches and non-matches, will be used in testing and refining the matching algorithms in order to optimize the scoring.
While the R&D matching was based on the name string alone, Cooperative [Identity Reconciliation](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/Originals/IR_Engine.svg) will be based on ICs, that is, the name string and additional information (evidence) that sufficiently establishes the uniqueness of an identity. The determination of match scoring will be based on comparing identity constellations and identifying which properties within each constellation (name, life dates, place of birth, place of death, relations to other identities, etc.) match or closely match, and each match test will result in an assigned score. A major factor in reliable matching, for computers or humans, is the available evidence for each identity. Sparse evidence in compared identities will decrease the probability of making a reliable match or non-match. Conversely, dense evidence supports both reliable matches and non-matches. Based on the scoring, two reconciliation outcomes will be presented: reliable matches and possible matches. Reliable non-matches and match scores that fall below the threshold of reliable and possible will not be flagged. Possible matches will be employed to suggest comparisons that are not reliably matches or non-matches but have sufficient similarity to suggest further human investigation and possible resolution. The Identity Reconciliation module will primarily employ the DMS and ElasticSearch. Ground-truth data, human-reviewed and verified matches and non-matches, will be used in testing and refining the matching algorithms in order to optimize the scoring.
The [Identity Reconciliation Engine](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Identity%20Reconciliation%20Engine.pdf) will be used both for batch ingest and to assist human editors. When EAC-CPF instances are extracted and assembled from existing archival descriptions (EAD-encoded finding aids, MARC21, or existing non-standard archival authority records) and ingested into the DMS, the Identity Reconciliation module will be invoked to identify reliable matches and possible matches. The results of the evaluation will be available to editors through the Editing User Interface to assist them in verifying identities. When editors create new identity descriptions or revise existing descriptions, the Identity Reconciliation module will be invoked to provide the editors with feedback on likely and potential matches that might otherwise be overlooked when employing human-only authority control techniques.
The [Identity Reconciliation Engine](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/Originals/IR_Engine.svg) will be used both for batch ingest and to assist human editors. When EAC-CPF instances are extracted and assembled from existing archival descriptions (EAD-encoded finding aids, MARC21, or existing non-standard archival authority records) and ingested into the DMS, the Identity Reconciliation module will be invoked to identify reliable matches and possible matches. The results of the evaluation will be available to editors through the Editing User Interface to assist them in verifying identities. When editors create new identity descriptions or revise existing descriptions, the Identity Reconciliation module will be invoked to provide the editors with feedback on likely and potential matches that might otherwise be overlooked when employing human-only authority control techniques.
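The property-by-property scoring and the two flagged outcomes (reliable and possible matches) described in this section can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the `Constellation` fields, the weights, the thresholds, and the exact-match comparison are hypothetical, and the planned engine is described as also handling close matches and as being tuned against ground-truth data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Constellation:
    """Hypothetical, heavily reduced stand-in for an Identity Constellation."""
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    birth_place: Optional[str] = None


# Illustrative weights and thresholds (assumptions, not SNAC's actual values).
WEIGHTS = {"name": 50, "birth_year": 20, "death_year": 20, "birth_place": 10}
RELIABLE_THRESHOLD = 80
POSSIBLE_THRESHOLD = 50


def normalize(value: object) -> str:
    return str(value).strip().lower()


def match_score(a: Constellation, b: Constellation) -> int:
    """Sum the weights of the properties on which both constellations agree."""
    total = 0
    for prop, weight in WEIGHTS.items():
        left, right = getattr(a, prop), getattr(b, prop)
        if left is None or right is None:
            continue  # sparse evidence: a missing property adds nothing
        if normalize(left) == normalize(right):
            total += weight
    return total


def classify(a: Constellation, b: Constellation) -> Optional[str]:
    """Return 'reliable', 'possible', or None when the pair is not flagged."""
    score = match_score(a, b)
    if score >= RELIABLE_THRESHOLD:
        return "reliable"
    if score >= POSSIBLE_THRESHOLD:
        return "possible"
    return None


dense = Constellation("Jane Doe", 1850, 1920, "Boston")
sparse = Constellation("jane doe")
other = Constellation("John Smith", 1850, 1920)
```

With these assumed values, `classify(dense, dense)` is flagged as reliable, while `classify(dense, sparse)` is only a possible match because the name is the sole shared evidence. That mirrors the point above that sparse evidence lowers the chance of a reliable determination.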
### Editing User Interface
@@ -43,9 +41,9 @@ Developing the Editing User Interface (EUI) is a primary objective of the two-ye
Sample interaction diagrams:
* [Edit and save](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/User%20Interaction/SNAC%20Edit%20Flow%20Diagram.pdf)
* [Edit with permission error](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/User%20Interaction/SNAC%20Edit%20Flow%20Diagram%202.pdf)
* [Multiple simultaneous edits](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/User%20Interaction/SNAC%20Edit%20Flow%20Diagram%203.pdf)
* [Edit and save](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/System%20Interaction%20Diagrams/SNAC%20Edit%20Flow%20Diagram.pdf)
* [Edit with permission error](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/System%20Interaction%20Diagrams/SNAC%20Edit%20Flow%20Diagram%202.pdf)
* [Multiple simultaneous edits](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/System%20Interaction%20Diagrams/SNAC%20Edit%20Flow%20Diagram%203.pdf)
### Graph Data Store – Visualizations and Exposure of RDF/LOD
......