checkpoint

6fca1af6 · twl8n · b5500926 · 6fca1af6
Commit 6fca1af6 authored Aug 18, 2015 by twl8n

Showing with 0 additions and 126 deletions

introduction.md tat_requirements/introduction.md +0 -126

--- a/tat_requirements/introduction.md
+++ b/tat_requirements/introduction.md
@@ -376,130 +376,4 @@ Increase exposure of SNAC to general public by leveraging partnerships.  Suggest
 display Wikipedia content in SNAC biographical area and work with Wikipedia to allow for links to SNAC at the
 bottom of all applicable identities. This would serve to escalate and drive traffic to SNAC.
-#### Match/Merge
-Notes: This looks like the current match merge, not the new manual process, or the process based on the IR API.
-The match/merge process has three major data input streams, library authority records, EAC-CPF documents from
-the EAC-CPF extract/create system, and an ARK identifier minter.
-First, a copy of the Virtual International Authority File (VIAF) is indexed as a reference source to aid in
-the record matching process. In addition to authorized name headings from multiple international sources, the
-VIAF data contains biographical data and links to bibliographic records which will be included in the output
-documents. Then, the EAC-CPF documents from the extract/create process are serially processed against the VIAF and
-each other to discover and rate potential matches between records. In this phase of processing, matches are
-noted in a database.
-After the matching phase identifies incoming EAC-CPF to merge, a new set of EAC-CPF records is
-generated. This works by running through all the matches in that database, then reading in the EAC-CPF input
-files, and finally outputting new EAC-CPF records that merge the source EAC-CPF with any information found
-in VIAF. ARK identifiers are also assigned. This architecture allows for incrementally processing more
-un-merged EAC-CPF documents before. It also allows matches to be adjusted in the database, or alterations to
-be made on the un-merged EAC-CPF documents, after which the merged records can be regenerated.
-Cheshire, PostgreSQL, and Python are the predominant technologies used in the generation of the XML documents
-created by this process.
-[link to the merge output spec]
-This involves processing that compares the derived EAC-CPF records against one another to identify identical
-names. Because names for entities may not match exactly or the same name string may be used for more than one
-entity, contextual information from the finding aids is also used to evaluate the probability that closely and
-exactly matching strings designate the same entity.[1] For matches that have a high degree of probability, the
-EAC-CPF records will be merged, retaining variations in the name entries where these occur, and retaining
-links to the finding aids from which the name or name variant was derived. When no identical names exist, an
-additional matching stage compares the names from the input EAC-CPF records against authority records in the
-Virtual International Authority File (VIAF). Contextual information (dates, inferred dates, etc.) is used to
-enhance the accuracy of the matching. Matched VIAF records are merged with the input derived EAC-CPF records,
-with authoritative or preferred forms of names recorded, and a union set of alternative names from the various
-VIAF contributors will also be incorporated into the EAC-CPF records. When exact matching and VIAF matching
-fail, then we attempt to find close variants using Ngram (approximate spelling) matching. In addition,
-contextual information, when available, is used to assess the likelihood of the records actually being the
-same. Records that may be for the same entity but the available contextual information is insufficient to make
-a confident match will be flagged for human review (as "May be same as"). While these records will be flagged
-for human review, the current prototype does not provide facilities to manually merge records. The current
-policy governing matching is to err on the side of not merging rather than merging without strong evidence.
-The resulting set of interrelated EAC-CPF records will represent the creators and related entities extracted
-from EAD-encoded finding aids, with a subset of the records enhanced with entries from matching VIAF
-records. The EAC-CPF records will thus represent a large set of archival authority records, related with one
-another and to the archival records descriptions from which they were derived. This record set will then be
-used to build a prototype corporate body, person, and family name and biographical/historical access system.
-In the current system all input records and potential matches are
-recorded in a relational database with the following structure:
-* * * * *
-[1] Using contextual information in determining that two or more records
-represent the same entity has been successful in matching and merging
-authority records in an international context. See Rick Bennett,
-Christina Hengel-Dittrich, Edward T. O'Neill, and Barbara B. Tillett
-VIAF (Virtual International Authority File): Linking Die Deutsche
-Bibliothek and Library of Congress Name Authority File:
-http://www.ifla.org/IV/ifla72/papers/123-Bennett-en.pdf
-![Screen Shot 2014-06-22 at 3.08.12 PM.png](images/image00.png)
-The current processing steps are summarized in the following
-diagram:
-![Slide1.jpg](images/image01.jpg)
-#### Discovery/Dissemination
-#### Prototype research tool
-The main data inputs for the prototype research tool are the merged
-EAC-CPF documents produced in the match/merge system. Some other
-supplemental data sources, such as DBpedia and the Digital Public
-Library of America, are also consulted during the indexing process.
-A pre-indexing phase is run on the merged EAC-CPF documents. During
-pre-indexing, name headings and Wikipedia links are extracted, and
-then used to look for possible related links and data in supplemental
-sources. The output of the pre-indexing phase consists of XML documents
-recording the supplemental data.
-Once the supplemental XML files are generated, two types of indexes are
-created, which serve as the input to the web site. The first
-index created runs across all documents and provides access to the full
-text and specific facets of metadata extracted from the documents.
-Additionally, the XML structure of each document is indexed as a
-performance optimization that allows for transformations to be
-efficiently applied to large XML documents.
-The public interface to the prototype research tool utilizes the index
-across all documents to enable full text, metadata, and faceted searches
-of the merged EAC-CPF documents. Once a search is completed and a
-specific merged EAC-CPF document is selected for display, the index of
-the XML document structure is used to quickly transform the merged
-document into an HTML presentation for the end user.
-In the SNAC 1 prototype a graph database was created after the full text
-indexing was complete. The graph database was used to power
-relationship visualizations and an API used to dynamically integrate
-links to SNAC into archival description access systems. This graph
-database was then converted into linked data, which was loaded into a
-SPARQL endpoint. This step has not yet been implemented in the SNAC 2
-prototype. Because the merged EAC-CPF documents are of higher quality
-for the SNAC 2 prototype, the graph extraction process is no longer
-dependent on the full text index being complete, so it could run in
-parallel with pre-indexing and indexing.
-XTF is the main technology used to power public access to the merged
-EAC-CPF records. XTF integrates Lucene for indexing and Saxon for XML
-transformation, making heavy use of XSLT for customization and display
-of search results and the merged documents. EAC-CPF and search results
-are transformed to HTML5 and JSON for consumption by the end users' web
-browser. Multiple JavaScript and CSS3 libraries and technologies are
-used in the production of the "front end" code for the website. Google
-Analytics is used to measure use of the site. Wercker, Middleman, and
-Bower are used to build the front end code for the site.
-This technical architecture
-[links to code]
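
The removed Match/Merge text describes falling back to Ngram (approximate spelling) matching and flagging uncertain pairs as "May be same as" rather than merging them. The following is a minimal sketch of that idea, not the SNAC implementation: the function names, the use of character trigrams with Jaccard overlap, the thresholds, and the treatment of dates as plain strings are all assumptions made for illustration.

```python
def ngrams(name, n=3):
    """Return the set of character n-grams of a normalized name."""
    # Lowercase and collapse whitespace so trivial differences do not matter.
    text = " ".join(name.lower().split())
    if not text:
        return set()
    # Pad both ends so the first and last characters get their own n-grams.
    pad = " " * (n - 1)
    padded = pad + text + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard overlap of the two names' n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def classify(name_a, name_b, dates_a=None, dates_b=None,
             merge_at=0.9, review_at=0.6):
    """Hypothetical decision rule: merge only with strong name AND context evidence.

    Thresholds are illustrative assumptions, not values from the source.
    """
    score = similarity(name_a, name_b)
    # Contextual evidence here is just "both records carry identical dates".
    context_agrees = dates_a is not None and dates_a == dates_b
    if score >= merge_at and context_agrees:
        return "merge"
    if score >= review_at:
        # Close names without enough context are left for human review.
        return "may_be_same_as"
    return "no_match"
```

Under this rule a pair of identical name strings with no dates is still only flagged for review, which mirrors the stated policy of erring on the side of not merging without strong evidence.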