Commit 02b45b2b by Sarah Wells
parents 2ca502de 5944fa85
Gap analysis
------------
#### Authors
Tom Laudeman, Technical lead, University of Virginia, Institute for
Advanced Technology in the Humanities
[twl8n@virginia.edu](mailto:twl8n@virginia.edu)
Brian Tingle, Technical Lead for Digital Special Collections, California
Digital Library
Rachael Hu, User Experience Design Manager, California Digital Library
Ray R. Larson, U.C. Berkeley - School of Information
Robbie Hott
#### Organization of documenatation
[Plan](plan.md) (External, broad view roadmap)
[Co-op Background](co-op_background.md) (This document)
[Introduction ](introduction.md) (Was an introduction, but shaping up to be requirements part one)
[Requirements](requirements.md) (Requirements part two, includes requirements from Rachael's spreadsheets)
#### Introduction to SNAC
Social Networks and Archival Context (SNAC) is a Mellon-funded project
to aid end-user researchers in discovering, locating, and using
distributed historical record descriptions, especially as relates to
corporate bodies, persons, and families (CPF). These descriptions are
often in finding aids, and they often exist in electronic format. They
are distributed across many geographical locations and many networks.
SNAC brings all this data together in a central system, while retaining
links to the original descriptions. Critically, SNAC attempts to merge
descriptions for the same [matching?] CPF identities, linking those
descriptions to a single authoritative name.
^[[a]](#cmnt1)^^[[b]](#cmnt2)^
We have an existing system (SNAC one?) and need additional work to get
to a new system (SNAC 3?), so part of this document is gap analysis. The
scope of this document is to outline technical specifications and
requirements for a production system for the Cooperative
phase^[[c]](#cmnt3)^ of SNAC. This production system will handle
ingestion, processing, matching/merging, discovery, and dissemination of
archival descriptions that are submitted and added to the Cooperative.  
#### Evaluation of Existing Technical Architecture
##### Overview
This section describes the existing technical architecture, and later
moving on to describe the required functionality for the production
system for the Cooperative.
Many of the archival records that are ingested in SNAC are Encoded
Archival Context - Corporate bodies, Persons and Families (EAC-CPF,
hereafter CPF) records. EAC-CPF is an XML schema endorsed as a standard
by the Society of American Archivists. We speak of CPF descriptions in
the sense of a “computer record”: often a single text file and not a
“record” in the archival sense.
“Linked data” technology related to the Resource Description Framework
is also employed to manage some controlled vocabularies in the project.
The current system consists of three main components: extraction,
match/merge, discovery. Extraction consists of extracting data from
incoming archival description records (primarily EAD, MARC21 and some
other unique formats), to create CPF descriptions. Match/merge is to
process the CPF descriptions in search of name matches and to merge
well-matched descriptions. The resulting data set includes merged
descriptions and descriptions with no matches (called singletons), all
in a single database. Discovery is discovery and dissemination of the
data via a web application.
The production system will have two additional components: maintenance
and administration. Maintenance includes manual corrections, such as
correcting data within a description, splitting incorrect merges,
merging descriptions for the same CPF identity, and description embargo
(embargo hides descriptions from public view for either technical or
administrative reasons). Administration is the typical management of
users, accounts, and reporting on the state of the system.
The first two phases of data processing are extraction, and match/merge.
A database of descriptions, both merged and unmerged is the end
result^[[d]](#cmnt4)^. The process of ingesting extracted data and
merging will continue for the life of the project. An extensive
web-based search engine lets users discover descriptions.
We use the term “merged” loosely when applied to the automated system
since the final database may contain descriptions which should be
merged, but which a computer is unable to reliably determine.  We take a
conservative approach, preferring to only merge descriptions that a
computer program can accurately distinguish.^[[e]](#cmnt5)^ Even so,
some descriptions will have been incorrectly merged, and thus the need
for a (future) maintenance system that allows manually splitting of
descriptions, among other things.
Both Extraction and Match/merge are script based, batch processing,
semi-automatic processes managed entirely by software engineers.
Discovery and Maintenance are both web applications with extensive
public user interfaces intended for researchers. Administration is done
mostly via a non-public web application.
Extraction and match/merge are well developed, although we have some
planned improvements. Discovery is well developed, but existing features
are being refined, and adding new features is on-going. Maintenance and
administration have not yet been created and must be written from the
ground up.
#### Current State of the System
CPF description generation is done at the University of Virginia’s
Institute for Advanced Technology in the Humanities (IATH). IATH handles
the CPF data extraction and hosts servers for data processing and the
SNAC prototype web site. Data processing, XTF indexing (for the
discovery interface), and web hosting take place on a Linux server with
24 CPUs and 94 GB of RAM connected to a 1Gbit network switch. This
server is administered by the IATH sysadmin team. \
Collections of archival description computer descriptions in a variety
of formats are extracted into CPF format XML. This process involves
writing XSLT scripts that extract and transform input descriptions, and
create CPF files as output. The current state of the extraction is a
collection of XSLT scripts supplemented by Perl scripts. The input files
are XML with large numbers of files in EAD, MARC XML, and British
Library XML, as well as several smaller data sets.  A large XSLT code
library is shared among most of the extractions. Each type of extraction
builds a generic internal data structure, which is serialized as EAC-CPF
XML output. The XSLT takes into account various descriptive practices in
the input data, and reformats as necessary to create a single type of
normative CPF output. The complexity of this task centers around the
large number of small differences in descriptive practice. Currently
more than 3 million CPF computer descriptions have been created. The
XSLT processor is Saxon 9 HE, which is the free “home edition” of Saxon.
Saxon implements XSLT 2.0. There are a small number of Perl scripts that
integrate the XSLT into a pipeline, automating tasks such as chunking
data sets into sizes that won’t exceed computer memory.
The current state of the match/merge is (filled in by Yiming/Ray/Sara,
initially a one or two paragraph overview with more detail added later
as necessary).
Overview of Brian’s UI and programming for the SNAC2 XTF discovery tool
(add this to another item if there is an umbrella section more
appropriate).
Is XTF the only discovery tool we will offer? Will SNAC be fully indexed
by Google and Bing?
TK The involvement of the UC Berkeley I School includes the development,
testing and modification of the matching and merging components of the
SNAC system. The current system, described in more detail below, takes
the EAC-CPF records derived from the various source institutions and
compares the names and associated information (especial dates) to
identify the records that likely describe the same person,
organization, or family. The process involves not only comparison across
input records, but also comparison with information from the Virtual
International Authority File, and approximate matching for these records
as well.
Rachael has several user studiess. The results of these studies are... The implications of these studies
are...
The current system uses a fairly loose software development process. Source code is maintained on a Linux
server which is managed by standard practices as relate to hardware, software, network, user accounts, back
up, and so on. All the data resides on the server. Source code is managed by version control systems. The
amount of quality assurance and testing has been increasing over time, as well as documentation, and
management aspects such as release process. All tools currently used are open source, and the code written for
SNAC is open source. We have begun to formalize feature request and issue tracking. The development process
is agile in that there are frequent small changes that are committed to the version control, and the code is
nearly always in a working state.
#### Processing Pipeline
TK Describe algorithmic portions, and add a section for new features.
#### Extraction
There are currently several CPF extraction software pipelines: MARC21,
British Library, Smithsonian Agency History, New York State Archives,
Smithsonian Joseph Henry, Smithsonian Field Books, and EAD from nearly
60 institutions.
The first step in adding new records to the SNAC database is to convert
incoming data into EAC-CPF XML.  One EAC-CPF record is created for each
successfully extracted reference to an identity from an archival source.
The processing also allows for some degree of remediation of data
quality issues and serves to normalize the data into a common format.
 Scripting data transformation processes is a significant task that
often requires close communications with data contributors and
customizations to accommodate local practices of the contributors.
Creating an extraction is a complex process since we must deal with
variances in local descriptive practice. The MARC21 tools have been made
available as a web interface and this demonstrates the feasibility of
moving more of the processing responsibility to data donors. If we are
optimistic, we hope that EAD-to-CPF extraction and all other types of
future extractions can be turned into donor-driven tools. Specifically,
we create the tools and then deploy them as web applications and/or
desktop applications. Web hosted extraction tools allow us to leverage
the power of our servers and programmers so that data donors do not need
a large computing infrastructure in order to participate. In any case,
data must be validated before ingest into the match/merge processing.
XSLT and perl are the predominate technologies used in the generation of
the XML documents created by this process.  The code architecture
focuses on reusability of modular routines to facilitate maintenance of
the customizations needed accommodate the diversity of data sources.
Code, sample data, and documentation are in Github. The pipeline is
being run on a server, but the hardware requirements are minimal enough
that most laptop computers could run the extraction. The system requires
unix-like features of Linux, MacOS, or cygwin (for MS Windows). The XSTL
engine is Saxon 9.x HE which is the free, public version of Saxon.
#### Match/Merge
The match/merge process has three major data input streams, library authority records, EAC-CPF documents from
the EAC-CPF extract/create system, and an ARK identifier minter.
First, a copy of the Virtual International Authority File (VIAF) is indexed as a reference source to aid in
the record matching process. In addition to authorized name headings from multiple international sources, the
VIAF data contains biographical data and links to bibliographic records which will be included in the output
documents. Then, the EAC-CPF from the extract/create process are serially processed against the VIAF and
each other to discover and rate potential matches between records. In this phase of processing, matches are
noted in a database.
After the matching phase identifies incoming EAC-CPF to merge, a new set of EAC-CPF records are
generated. This works by running through all the matches in that database, then reading in the EAC-CPF input
files, and finally outputting a new EAC-CPF records that merges the source EAC-CPF with any information found
in VIAF. ARK identifiers are also assigned. This architecture allows for incrementally processing more
un-merged EAC-CPF documents before. It also allows matches to be adjusted in the database, or alterations to
be made on the un-merged EAC-CPF documents, and the merge records can be regenerated.
Cheshire, postgreSQL, and python are the predominate technologies used in the generation of the XML documents
created by this process.
[link to the merge output spec]
This involves processing that compares the derived EAC-CPF records against one another to identify identical
names. Because names for entities may not match exactly or the same name string may be used for more than one
entity, contextual information from the finding aids is also used to evaluate the probability that closely and
exactly matching strings designate the same entity.[1] For matches that have a high degree of probability, the
EAC-CPF records will be merged, retaining variations in the name entries where these occur, and retaining
links to the finding aids from which the name or name variant was derived. When no identical names exist, an
additional matching stage compares the names from the input EAC-CPF records against authority records in the
Virtual International Authority File (VIAF). Contextual information (dates, inferred dates, etc.) is used to
enhance the accuracy of the matching. Matched VIAF records are merged with the input derived EAC-CPF records,
with authoritative or preferred forms of names recorded, and a union set of alternative names from the various
VIAF contributors, will also be incorporated into the EAC-CPF records. When exact matching and VIAF matching
fail, then we attempt to find close variants using Ngram (approximate spelling) matching. In addition
contextual information, when available is used assess the likelihood of the records actually being the
same. Records that may be for the same entity but the available contextual information is insufficient to make
a confident match will be flagged for human review (as "May be same as"). While these records will be flagged
for human review, the current prototype does not provide facilities to manually merge records. The current
policy governing matching is to err on the side of not merging rather than merging without strong evidence.
The resulting set of interrelated EAC-CPF records will represent the creators and related entities extracted
from EAD-encoded finding aids, with a subset of the records enhanced with entries from matching VIAF
records. The EAC-CPF records will thus represent a large set of archival authority records, related with one
another and to the archival records descriptions from which they were derived. This record set will then be
used to build a prototype corporate body, person, and family name and biographical/historical access system.
In the current system all input records, and potential matches are
recorded in a relational database with the following structure:
* * * * *
[1] Using contextual information in determining that two or more records
represent the same entity has been successful in matching and merging
authority records in an international context. See Rick Bennett,
Christina Hengel-Dittrich, Edward T. O'Neill, and Barbara B. Tillett
VIAF (Virtual International Authority File): Linking Die Deutsche
Bibliothek and Library of Congress Name Authority File:
http://www.ifla.org/IV/ifla72/papers/123-Bennett-en.pdf
![Screen Shot 2014-06-22 at 3.08.12 PM.png](images/image00.png)
The the current processing steps are summarized in the following
diagram:
![Slide1.jpg](images/image01.jpg)
#### Discovery/Dissemination
#### Prototype research tool^[[f]](#cmnt6)^
The main data input for the prototype research tool are the merged
EAC-CPF documents produced in the match/merge system. Some other
supplemental data sources, such as dbpedia and the Digital Public
Library of America are also consulted during the indexing process.
A pre-indexing phase is run on the merged EAC-CPF documents. During
pre-processing, name headings and wikipedia links are extracted, and
then used to look for possible related links and data in supplemental
sources. The output of the pre-indexing phase consists of XML documents
recording supplemental.
Once the supplemental XML files are generated, two types of indexes are
created to power which serve as the input to the web site. The first
index created runs across all documents and provides access to the full
text and specific facets of metadata extracted from the documents.
Additionally, the XML structure of each document is indexed as a
performance optimization that allows for transformations to be
efficiently applied to large XML documents.
The public interface to the prototype research tool utilizes the index
across all documents to enable full text, metadata, and faceted searches
of the merged EAC-CPF documents. Once a search is completed, and a
specific merged EAC-CPF document is selected for display; the index of
the XML document structure is used to quickly transform the merged
document into an HTML presentation for the end user.
In the SNAC1 prototype a graph database was created after the full text
indexing was complete. The graph database was used to power
relationship visualizations and an API used to dynamically integrate
links to SNAC into archival description access systems. This graph
database was then converted into linked data, which was loaded into a
SQARQL endpoint. This step has not yet been implemented in the SNAC 2
prototype. Because the merged EAC-CPF documents are of higher quality
for the SNAC 2 prototype, the graph extraction process is no longer
dependent on the full text index being complete, so it could run in
parallel with pre-indexing and indexing.
XTF is the main technology used to power public access to the merged
EAC-CPF records. XTF integrates lucene for indexing and saxon for XML
transformation, making heavy use of XSLT for customization and display
of search results and the merged documents. EAC-CPF and search results
are transformed to HTML5 and JSON for consumption by the end users' web
browser. Multiple javascript and CSS3 libraries and technologies are
used in the production of the "front end" code for the website. Google
analytics is used to measure use of the site. Werker, middleman, and
bower used to build the front end code for the site.
This technical architecture
[links to code]
#### Gap analysis
This is gap analysis between the current and SNAC2-prototype. Perhaps
this should be in the Required and Planned Functionality below.
Data maintenance
----------------
#### Data maintenance
A goal of the pilot phase it to demonstrate cooperative maintenance of
the data resource.  The prototype does not have robust support for
......@@ -57,354 +403,28 @@ not be run in a “clustered” mode; must scale up, not scale out
    •    Cheshire II does not have a Open Source Initiative certified
license
Pilot phase architecture
------------------------
Alternative 1^[[h]](#cmnt8)^
----------------------------
The most expeditious way to launch a pilot phase would be to leave the
basic technical architecture of the prototype in place, and to focus
initial energies into establishing policies and procedures that work
within the constraints of this architecture.  Two key systems that would
need to be set up for this approach to work are a customer relationship
management (CRM) system and ticketed help desk.
Customer relationship management systems have historically be used as a
sales support tool.  Information on current and potential customers,
including contact information and institutional affiliation, are
maintained in a database.  All pilot members institutions and designated
contacts should be entered into a CRM system for the pilot phase.  All
correspondence, call, contracts and agreements with accepted and
potential pilot phase members should be logged or stored in the CRM
system.^[[i]](#cmnt9)^
The CRM system should support or integrate with a help desk that issues
work ticket numbers. ^[[j]](#cmnt10)^ Any addition or change in the
maintained corpus of merged EAC-CPF records will require a work ticket
number.  Expectations for response times for issued tickets should be
established, clearly communicated, and measured for compliance.  A
customer service manager^[[k]](#cmnt11)^ will actively monitor the queue
of work tickets pending.  An operations manual will be maintained so
that the customer service manager or any additional first tier support
staff will be able to handle a set of ticket types.  If a procedure for
the request is not yet documented in the operations manual - or if the
manual indicates this is a task for second tier - then the ticket will
be escalated to the second tier support programmer.  The second tier
support programmer will have the technical skills to manipulate the
technical infrastructure; such as through editing XML files or directly
altering the database.  The second tier support programmer would also be
responsible for performing data extraction and normalization of non
EAC-CPF data sources processed during the pilot phase.^[[l]](#cmnt12)^ 
The volume and type of tickets will help establish priorities for
establishing procedures that can be automated for first tier support and
for future phases that do not require pilot members to contact the help
desk and obtain work tickets.
An automated way to establish a new identity should established early in
the pilot phase, so that participants can mint a new ARK identifier
without creating a work ticket.  Initially, a work ticket would still be
generated once the participant was ready to submit the new record though
the match/merge process.
Given the importance of maintaining links from the merged EAC-CPF record
to related resources, a link harvesting protocol should be developed
early in the pilot phase.  When a pilot phase participant identifies a
match in SNAC with a name they have in a collection description; the
link harvesting protocol would specify how to publish that link in their
HTML display of their collection description or through some other
mechanism (perhaps through an extension to the sitemap protocol, along
the lines of how ResourceSync works).  Procedures would be established
to then notify SNAC to harvest links from the participant, and the SNAC
“related collections” section would be automatically updated.  Such
updates would be based on a “linked data” technology rather than the
submission of XML files.
Alternative 2
-------------
Pure XML architecture for edits (edit the merged EAC-CPF records, maybe
with something like xEAC and with the merged files in revision control.
 This might make export from the match/merge challenging)
Alternative 3
-------------
Pure RDF architecture
Current State Conclusion (All, Daniel, Tom)
-------------------------------------------
The current systems functions well enough for researchers and other
stakeholders to see large data sets fully processed. These systems will
benefit from additional work to make them more mature in the usual ways
that software develops: robustness, testing and QA, documentation,
examples, consistent API. Most of the current software will be used in
the production product.
Required and Planned Functionality (All authors)
================================================
(We need to break out each item into UI functionality, and API
functionality.)
Expanded CPF schema requirements
--------------------------------
Provenance and history of each element/attribute.
Unique ID per element of CPF if that element is editable.
Version control on a per-element basis.
Expanded Database Schema
------------------------
The current database (Postgres) is sufficient for the current project
only. It will expand, and the expansion will probably be fairly
dramatic. We need to determine what tables and fields are necessary to
support additional functions. Each section of this document may need a
“data” section, or else this database schema section needs to address
every functional and UI aspect of all APIs that have anything to do with
the database.
Each field within CPF may (will?) need provenance meta data. Likewise
many fields in the database may need data for provenance.
The database needs audit trail ability to a fairly granular (field)
level. Audit is a new table at the very least. It seems likely that
nearly every table will gain some audit related fields.
Will database records be versioned? How is that handled? Seems like it
may be done via versioning table and some interesting joins. We need to
evaluate the various standard methods for database internal versioning.
CPF record has links to a “watch” table so users can watch each record,
and can watch for certain types of changes. Need UI for the watch
system. Need an API for the watch system.
Need a user table, group table, probably a group permission table so
that permissions are hard code with groups. We also want to allow
several permissions per group. Need UI for user, group, and
group-permission management.
If we create a generalized workflow system (as opposed to an ad-hoc
linked set of reports) then we need workflow tables. The tables would
establish workflow paths, necessary permissions, and would be linked to
users and groups.
Need fields to deal with delete/embargo. This may be best implemented
via a trigger or perhaps a view. By making what appear to be simple
SELECTs through a view, the view can exclude deleted records. We must
think about how using a view (or trigger) will effect UPDATE and INSERT.
Ideally the view is transparent. Is there some clever way we can
restrict access to the original table only via the view?
Need record lock on some types of records. This lock needs to be honored
by several modules, so like “delete”, lock might best be implemented via
a view and we \*only\* access the table in question via the view.
If there are different levels of review for different elements in the
record, then we need extra granularity in the workflow or the edited
record info to know the type of record edited apropos of workflow
variations.
If there different reviewers for different parts of the record, then
workflow data (and workflow configuration) needs to be able to notify
multiple people, and would have to get multiple reviewer approvals
before moving to the next phase of the workflow.
Institutional affiliation is probably common enough to want a field in
the user table, as opposed to creating a group for each institution. The
group is perhaps more generalized and could behave identical (or almost
identical) to a field (with controlled vocabulary) in the user table.
Make sure we can write a query (report) to count numbers of records
based type of edit, institution of the editor, and number of holdings.
If we want to be able to quickly count some CPF element such as outgoing
links from CPF to a given institution, then we should put those CPF
values into the SQL database, as meta data for the CPF record.
What is: How many referral links to EAC records that they created?
Be able to count record views, record downloads. Institutional dashboard
reports need the ability to group-by user, or even filter to a specific
user.
Reporting needs to help managers verify performance metrics. This
assumes that all changes have a date/timestamp. Once workflow and
process decisions are set, performance requirements for users such as
load/performance (how many updates and changes to records can be handled
at once), search response time, edit time (outside of review workflow),
and update times need to be set.
Effort reporting to allow SNAC and participants to communicate to others
the actual level of effort involved. This sounds like a report with time
span and numbers of records handled in various ways. SNAC might use this
when going from pilot into production so that everyone knows what effort
will be required for X number of records/actions (of whatever action
type).
Time/activity reporting could allow us to assess viability, utility, and
efficiency of maintenance system processes.
Similar reports might be generated to evaluate the discovery interface.
Something akin to how much time was required to access a certain number
of records. Rachael said: Assess viability of access funtionality-
performance time, available features, and ease of use.
We could try to report on the amount of training necessary before a new
user was able to work independently in each of various areas (content
input, review, etc.)
Introduction to Planned Functionality
-------------------------------------
The current system works, but is somewhat skeletal. It requires careful
attention from the developers to run the data processing pipelines. It
lacks administrative controls and reporting. Existing software
development process follows modern agile practices, but the some
processes are weak or incomplete. The research tools are somewhat
rudimentary. It needs infrastructure where domain experts can correct
and update merged authority descriptions.
The functional requirements below specify in detail all of the
capabilities of the new [production?] system. A separate section about
user interface (UI) specifies the visual/functional aspects of the UI
and includes discussion of the user experience (UX). Some of the
functional requirements exist only to support actions of the UI, and
UI-related functions should exist in their own independent API.
Software development, processes, and project management
-------------------------------------------------------
Choices for programming languages, operating system, databases, version
control, and various related tools and practices are based on extensive
experience of the developer community, and a complex set of requirements
for the coding process. Current best practices are agile development
using practices that allow programmers wide leeway for implementation
while still keeping the processes manageable.
Test-driven development ideally means automated testing, with careful
attention to regression testing. It takes some extra time up front to
write the tests. Each test is small, and corresponds to small sections
of code where  both code and text can be quickly created. In this way,
the software is kept in a working state with only brief downtimes during
feature creation or bug fixes. Large programs are made up of
intentionally small functions each of which is tested by a small
automated test.
Regression testing refers to verifying that old bugs do not reappear.
Every bug fix has a corresponding test, even if the function in question
did not originally have a test for the bug. Each new bug needs a new
test. Bugs frequently reappear, especially in complex sections of code.
Source code version control is vital to both development process, and to
the release process. During development, frequent small changes are
checked-in to the version control, along with a meaningful comment. The
history of the code can be tracked. This occasionally helps to
understand how bugs come into existence. In the Git system, the history
command is “blame”, a bit of programmer dark humor where the history is
used to know who to blame for a bug (or any undesirable feature).
Moving code into Quality Assurance (QA) and then into the production
environment are both integral with source code management. Many version
control systems allow tagging a release with a name. The collected
source code files are marked as a named (virtual) collection, and can be
used to update a QA area. Human testing and review happens in QA. After
QA we have release. Depending on the nature of the system release can be
quite complex with many parties needing to be notified, and coordination
across groups of developers, sysadmin, managers, support staff, and
customers. Agile development tends towards small, seamless releases on a
frequent (weekly or monthly) basis where communication is primarily via
update of electronic documentation. The process needs to assure that
fixes and new features are documented. The system must have tools to see
the current version of the system with its change log, as well as
comparing that to previous releases. All of these are integrated with
change management.
Bug reporting and feature requests fall (broadly speaking) into the
category of change management. Typically a small group of senior
developers and stakeholders review the bug/feature tracking system to
assign priorities, clarify, and investigate. There are good
off-the-shelf systems for tracking bugs and feature requests, so we have
several choices. This process happens almost as frequently as the
features/bug fix coding work of the developers. That means on-going,
more or less continuous review of fix/features requests every few days,
depending on how independent the developers are. Agile applies to
everyone on the project. Ideal change management is not onerous. As
tasks are completed, someone (developers) update feature status with “in
progress”, “completed” and so on. There might be additional status
updates from QA and release, but SNAC probably isn’t large enough to
justify anything too complex.
QA and Related Tests for Test-driven Development (Tom, Brian, Ray)
------------------------------------------------------------------
The data extraction pipelines manage massive amounts of data, and
visually checking descriptions for bugs would be inefficient if not
infeasible. The MARC extraction process is verified by just over 100
quality assurance descriptions. The output produced from each
description is checked for some specific value that confirms that the
code is working correctly and historical bugs have not reappeared. The
EAD extraction has a set of QA files, but the output verification is not
yet automated. A variety of file counts and measures of various sorts
are performed to verify that descriptions have all been processed. All
CPF output is validated against the Relax NG schema. Processing log
files are checked for a variety of error messages. Settings used for
each run are recorded in documentation maintained with the output files.
The source code is stored in a Subversion repository.
Our disaster recovery processes must be carefully documented.
The match/merge process is validated by …
Required new features
---------------------
The majority of new features will be in two areas: the maintenance
system, and the administration system. None of this code exists. The
maintenance system has a web UI and a server-based back end that
interacts with the same database used by the match-merge. The
maintenance system also requires an authentication system (login) that
allows us to manage the extensive collaborative efforts. The current
processing of data is accomplished only on servers at the command line,
and is handled directly by project programmers. In the new maintenance
system, that will be driven by content experts via a web site, and
therefore must expect the issues of authentication and authorization
inherent in collaborative data manipulation web applications.
The system will require reports. These will cover broad classes of
issues related to managing resources, usage statistics, administration,
maintenance, and some reports for end user researchers.
(Fill in prose introducing the other subsystems such as reporting)
One important aspect of the project is long-term viability and
preservation. We should be able to export all data and metadata in
standard formats. Part of the API should cover export facilities so that
over time we can easily add new export features to support emerging
standards.
The ability to export all the data for preservation purposes also gives
us the ability to offer bulk data downloads to researchers and
collaborating peer institutions.
Documentation (all authors)
---------------------------
Every aspect of the system requires documentation. Most visible to the
public is the user interface for discovery. Maintenance will be
complicated, and our processes are somewhat novel, so this will need to
be extensive, well illustrated with screenshots, and carefully tested.
Documentation intended for developers might be somewhat sparse by
comparison, but will be critical to the on-going software development
process. All the databases, operating system, httpd and other servers
need complete documentation of installation, configuration, deployment,
starting, stopping, and emergency procedures.
It is probably wise to choose a wiki-like documentation system at the
outset of the project.
#### Pilot phase architecture
The software will consist of a web application written in PHP, and using PostgreSQL as the data
storage. Granual work flow wil managed by a work flow engine. Hand off of tasks will be managed by a series of
semaphores.
An Identity Reconciliation (IR) API will test similarity between two identities, and allow us to search the
database for matches. The IR API will make concrete our concept of "identity" as a collection of data fields.
CPF records will be linked to relevant recources. Brian suggest actively harvesting links through an extension
to the sitemap protocol along the lines of ResourceSync. Brian notes that the updates are based on linked
data, not submission of XML files.
#### Current State Conclusion
The current systems functions well enough for researchers and other stakeholders to see large data sets fully
processed. These systems will benefit from additional work to make them more mature in the usual ways that
software develops: robustness, testing and QA, documentation, examples, consistent API. Most of the current
software will be used in the production product.
TAT Functional Requirements
#### TAT Functional Requirements
[Plan](plan.md) (Read this first. External, broad view roadmap)
[Co-op Background](co-op_background.md) (Currrent state and SNAC background)
[Introduction ](introduction.md) (This document. The technical requirements part one)
[Requirements](requirements.md) (Tech requirements part two, includes requirements from Rachael's spreadsheets)
#### Introduction to Planned Functionality
The functional requirements below specify in detail all of the
capabilities of the new [production?] system. A separate section about
user interface (UI) specifies the visual/functional aspects of the UI
and includes discussion of the user experience (UX). Some of the
functional requirements exist only to support actions of the UI, and
UI-related functions should exist in their own independent API.
#### Software development, processes, and project management
Choices for programming languages, operating system, databases, version
control, and various related tools and practices are based on extensive
experience of the developer community, and a complex set of requirements
for the coding process. Current best practices are agile development
using practices that allow programmers wide leeway for implementation
while still keeping the processes manageable.
Test-driven development ideally means automated testing, with careful
attention to regression testing. It takes some extra time up front to
write the tests. Each test is small, and corresponds to small sections
of code where both code and text can be quickly created. In this way,
the software is kept in a working state with only brief downtimes during
feature creation or bug fixes. Large programs are made up of
intentionally small functions each of which is tested by a small
automated test.
Regression testing refers to verifying that old bugs do not reappear.
Every bug fix has a corresponding test, even if the function in question
did not originally have a test for the bug. Each new bug needs a new
test. Bugs frequently reappear, especially in complex sections of code.
Source code version control is vital to both development process, and to
the release process. During development, frequent small changes are
checked-in to the version control, along with a meaningful comment. The
history of the code can be tracked. This occasionally helps to
understand how bugs come into existence. In the Git system, the history
command is “blame”, a bit of programmer dark humor where the history is
used to know who to blame for a bug (or any undesirable feature).
Moving code into Quality Assurance (QA) and then into the production
environment are both integral with source code management. Many version
control systems allow tagging a release with a name. The collected
source code files are marked as a named (virtual) collection, and can be
used to update a QA area. Human testing and review happens in QA. After
QA we have release. Depending on the nature of the system release can be
quite complex with many parties needing to be notified, and coordination
across groups of developers, sysadmin, managers, support staff, and
customers. Agile development tends towards small, seamless releases on a
frequent (weekly or monthly) basis where communication is primarily via
update of electronic documentation. The process needs to assure that
fixes and new features are documented. The system must have tools to see
the current version of the system with its change log, as well as
comparing that to previous releases. All of these are integrated with
change management.
Bug reporting and feature requests fall (broadly speaking) into the
category of change management. Typically a small group of senior
developers and stakeholders review the bug/feature tracking system to
assign priorities, clarify, and investigate. There are good
off-the-shelf systems for tracking bugs and feature requests, so we have
several choices. This process happens almost as frequently as the
features/bug fix coding work of the developers. That means on-going,
more or less continuous review of fix/features requests every few days,
depending on how independent the developers are. Agile applies to
everyone on the project. Ideal change management is not onerous. As
tasks are completed, someone (developers) update feature status with "in
progress", "completed” and so on. There might be additional status
updates from QA and release, but SNAC probably isn't large enough to
justify anything too complex.
#### QA and Related Tests for Test-driven Development
The data extraction pipelines manage massive amounts of data, and
visually checking descriptions for bugs would be inefficient if not
infeasible. The MARC extraction process is verified by just over 100
quality assurance descriptions. The output produced from each
description is checked for some specific value that confirms that the
code is working correctly and historical bugs have not reappeared. The
EAD extraction has a set of QA files, but the output verification is not
yet automated. A variety of file counts and measures of various sorts
are performed to verify that descriptions have all been processed. All
CPF output is validated against the Relax NG schema. Processing log
files are checked for a variety of error messages. Settings used for
each run are recorded in documentation maintained with the output files.
The source code is stored in a Subversion repository.
Our disaster recovery processes must be carefully documented.
The match/merge process is validated by …
#### Documentation
System documentation is in http://gitlab.iath.virginia.edu in markdown files.
Every aspect of the system requires documentation. Most visible to the public is the user interface for
discovery. Maintenance will be complicated, and our processes are somewhat novel, so this will need to be
extensive, well illustrated with screenshots, and carefully tested.
Documentation intended for developers might be somewhat sparse by comparison, but will be critical to the
on-going software development process. All the databases, operating system, httpd and other servers need
complete documentation of installation, configuration, deployment, starting, stopping, and emergency
procedures.
#### Required new features
The majority of new features will be in two areas: the maintenance
system, and the administration system. None of this code exists. The
maintenance system has a web UI and a server-based back end that
interacts with the same database used by the match-merge. The
maintenance system also requires an authentication system (login) that
allows us to manage the extensive collaborative efforts. The current
processing of data is accomplished only on servers at the command line,
and is handled directly by project programmers. In the new maintenance
system, that will be driven by content experts via a web site, and
therefore must expect the issues of authentication and authorization
inherent in collaborative data manipulation web applications.
The system will require reports. These will cover broad classes of
issues related to managing resources, usage statistics, administration,
maintenance, and some reports for end user researchers.
#### Authors
- Web application (architect: Robbie)
The web application is a wrapper for all the APIs. It can have an API of it own, or not. It handles all http
requests, validating the data, deciding what needs to be done, doing real work, and handing some output back
to the user. Typically the output is HTML, but we are already planning for file downloads, and JSON data as
output from REST API calls.
Tom Laudeman, Technical lead, University of Virginia, Institute for
Advanced Technology in the Humanities
[twl8n@virginia.edu](mailto:twl8n@virginia.edu)
- Data validation API
Brian Tingle, Technical Lead for Digital Special Collections, California
Digital Library
Data from the web browser needs sanity checking and untainting before being handed to the rest of the
application. Initially the data validation API can consist of nothing more than untaining input from the
browser. We can add various checks and tests. We need to decide if the validation API can reject data, and if
it can, then it needs to interact with the work flow engine, the actual work flow, and whatever messaging
system we use to display messages to end users.
Rachael Hu, User Experience Design Manager, California Digital Library
- Identitiy Reconciliation (aka IR) (architect: Robbie)
Ray R. Larson, U.C. Berkeley - School of Information
This API uses many aspects of identity, testing each against a target population of other identities. The
final anwser is a floating point number giving a match strength. IR has two modes of operation. Mode one
compares two identities and returns a match strength. Mode two compares a single identity againast the entire
database returning match strength. Mode two is somewhat unclear.
Robbie Hott
- workflow manager (Tom)
#### Organization of documenatation
Every action the application can perform is part of the work flow. The names of these actions along with names
of their requisites are organized into a work flow table. The work flow engine does not know how to do real
work, but it does know the names of the functions which do the real work. A new feature (aka function, task)
is added to the application, by adding its name to the work flow, and creating a function of the same name in
the application. Likewise, requistes are determined by boolean functions, and every requisite must have a
matching function known to the work flow engine. The work flow enforces role-based behavior by testing the
requisites. The workflow engine exists, but needs to be ported from Perl to PHP, and the work flow data should
be stored in the SQL database.
[Plan](plan.md) (external, broad view roadmap)
- Support for work history and task staging.
[Introduction (this document) ](introduction.md)
Editing consists of several stages of work that may be performed by different people and/or different
roles. We need database tables to support saving of work state data. Create a prototype table schema so we can
think about this problem and create a functional spec.
[Requirements](requirements.md)
For an edit we need the CPF id, user id, timedate stamp, bitfield or work flow tags, optional user notes. For
search we need: user id, search string, timedate stamp.
co-op background
- SQL schema (Robbie, Tom)
All data is stored in a SQL database. Details are given elsewhere.
- Controlled vocabulary subsystem or API [Tag system](#controlled-vocabularies-and-tag-system)
We need controlled vocabulary for several data fields. This system handles all aspects of all controlled vocabularies.
- CPF to SQL parser (Robbie)
The input for the application is CPF files. These files need to be parsed into data fields and input into the
SQL database. This application exists, but needs some additional functionality.
- Name serialization tool, selectable pre-configured formats
Outputting name strings based on name data fields in the database is a tricky problem. There are several
output formats. The name serialization deals with this issue.
- Name string parser
Names in CPF files are currently strings. The CPF <part> element has been imported into the SQL database as a
string, but data needs require individual name components. Parsing names is a tricky problem, but several
parsers exist. We need to integrate one or more parsers, and perhaps tweak those parsers to handle the SNAC names.
- Date parser
We have several date parsers, but none are fully comprehensive. We can use the existing parsers, but they need
to be integrated into a single, comprehensive parser.
- CPF record edit, edit each field
Record editing on the server is handled by a collection of functions. The specifications for this may evolve
in parallel to the code. We know that each field needs to be changed, but the details of work flow and data
validation have not been determined. Work flow and validation are both likely to change as the SNAC policies
evolve. There are UI requirements for editing.
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
Record splitting requires a set of functions and UI requirements documented elsewhere.
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
Record merge requires a set of functions and UI requirements documented elsewhere.
- Object architecture, coding style, class template (architect Robbie)
We will have a specific architecture of the web application, and of the classes and objects involved.
- UI widgets, mostly off the shelf, some custom written. We need to have UI edit/chooser widget for search and
select of large numbers of options, such as the Joseph Henry cpfRelations that contain some 22K
entries. Also need to list all fields which might have large numbers of values. In fact, part of the meta
data for every field is "number of possible entries/reapeat values" or whatever that's called. From a
software architecture perspective, the answer is 0, 1, infinite.
One important aspect of the project is long-term viability and preservation. We should be able to export all
data and metadata in standard formats. Part of the API should cover export facilities so that over time we can
easily add new export features to support emerging standards.
The ability to export all the data for preservation purposes also gives us the ability to offer bulk data
downloads to researchers and collaborating peer institutions.
#### Web application overview
Some aspects of the web app aren't yet clear, so there are details to be worked out, and some large-ish
concepts to clarify. I'm guessing we will agree on most things, and one of us or the other will just concede
on stuff where we don't agree.
Requirements:
- expose an http accessible API that is viable for wget, browser <form>, and Ajax calls.
- Supported input format depends on the complexity of the requested operation.
- Public functions require no authentication. Everything else must include authentication data.
Internal flow:
1. validate the inputs.
1. Somehow slice and dice the CGI params of the REST call into an abstracted request we can pass to the
internal API. I suppose that the external and internal APIs are very similar, but we almost certainly need
some level of symbolic reference aka abstraction. Each REST call has its requisite data. Some data is as
simple as a record id, and some will be fairly interesting json data structures.
1. The web app API does the tasks specified by the REST request and the work flow engine's directions.
1. Every http request must go through the work flow engine so that the work flow is validated and managed.
1. Every web app has a work flow, but people mostly just cobble that together with a bunch of implied
functionality using conditionals and side-effect-full function calls. In our code, the internal API is
100% work flow agnostic.
1. I can explain this in more detail, but it makes a huge improvement in the structure of the application.
1. Create the output data object if it wasn't created by the functions doing the work.
1. Pass the output data to a rendering function (or module) to be rendered into the appropriate output format:
html, text, xml, etc. and sent to stdout, or returned as an http file download. JSON probably doesn't need to
be rendered since JSON is "data" and not "presentation".
The work flow engine relies on functions that read application data and return booleans so that the
work flow engine can detect the application's relevant state. I guess that sounds confusing because the work
flow engine has state, and the application has state. Those two types of state are vastly different and only
related to each other in that the work flow engine can detect the application's state. The internal API of the
web app has no idea that the work flow engine even exists. And the work flow engine knows what work needs to
be done, but has no idea how it will be done. This is a very lovely separation of concerns.
#### Web application output via template
A well known, easy, powerful method of creating presntation output is to use an template module. Templating
separates business logic from presentation logic, thus following an MVC model. Our business logic is our work
flow and related function calls. Presentation is our UI, and the work flow engine has no idea that a UI exists,
let alone how to create it. Curiously, the presentation logic knows how to create the presentation rendering,
but has no idea what it does or what it interacts with. This is another example of strong separation of
concerns.
A simple hello world text template with a single variable world = "world" would be:
```
Hello [% world %]!
```
Or a simple HTML version:
```
<html><body>Hello [% world %]!</body></html>
```
That example is based on the Template Tookit http://www.template-toolkit.org/ for which there is a Perl
module, and a Python module. Template modules are fairly common, so I'm almost certain we will have several to
choose from in PHP.
Choosing our own select software modules, including a template module, is better than being locked into a
large, cumbersome web framework. In general, web frameworks have issues:
- difficult to work with
- no useful functionality that isn't more easily found in another software module
- the often break MVC
- generally make debugging nearly impossible
We can do much better by selecting a few modules to create a lightweight quasi-framework that is perfectly matched to our
needs.
Once the internal API completes its work, we will have output data. Output data is passed to a rendering
layer that relies on the template module. The only code that knows anything about rendering is the rendering
layer. To all the non-rendering code, there is only "output data" which does conform to a standard structure
(almost certainly an output data object). The rendering layer takes the output object, and the requested format
of the output (text, html, pdf, xml, etc.) to create the output. Happily, "rendering" is generally a single
function call. We create a template object, call its "render" method with two arguments:
1. template file name,
2. the output data object.
Default behavior is to write the output to stdout, but the render method can also
return the output in a variable so we can create an http download.
Templates are human created static files containing placeholders. The template engine fills in the placeholders with
values from relevant parts of the output data. Clearly, the output data object and the template must share a
object/property naming convention. The template engine functionality has single value fields, looping over
input lists, and if statement branching based on input. But that's pretty much it. No work is done in the
template that is not directly concerned with filling in placeholders, not even formatting (in the sense of
rounding numbers, capitalizing strings, or adding html tags). Templates are valid documents of the output
type, except in rare cases. The attached template is well-formed XML.
The web app needs a file download output option as well as output to stdout.
#### Data background
The data is in a SQL database. Every piece of data is in a separate field to the extent that is practical.
Data is organized into fields (columns) records (rows) and tables. Fields related to each other are in the
same table. Every record has a unique, permananent, numerical id often called a "key" or "primary key". For
the SNAC Co-op we have decided that records are never overwritten during update. This is somewhat unusual, but
not unheard of. An update operation creates a new record identical to the old record except for updated
fields. All old records are available for viewing via special interface. The old records are invisible to
operations that are intellectually acting on "current" data.
#### What is "normal form" and what informs the database schema design?
Edgar F. "Ted" Codd created 12 rules (revised with a 13th rule) to clarify the Relational Database Management
System (RDBMS).
https://en.wikipedia.org/wiki/Edgar_F._Codd
Breaking any of these rules weakens data integrity and the ability of the system to manage the data. An RDBMS
is not merely a bucket of data, but an entire eco-system for the management of data and data related
activities. Before Codd's work, databases were managed on an ad-hoc basis as collections of files with
links. It was a mess. Data was lost. Only the DBA knew how to find the data, and access methods could be very
different for data in different locations. Accessing data could also be extremely slow. In addition to
assuring the integrity of data, as well as managing it, relational database systems are very fast.
https://en.wikipedia.org/wiki/Codd%27s_12_rules
The "R" in RDBMS is "relational" and Codd invented the relational model of data. Key to relational data
modeling is "normal form".
https://en.wikipedia.org/wiki/Database_normalization
The RDBMS world generally uses third normal form. Lower levels of normalization create additional work for
data operations. Higher forms rarely show any improvements. The key concept of normalization is that a datum
only exists in one place. In the RDBMS world where SQL implements relational algebra, normal form is both
convenient and natural. In other venues such as paper ledgers, data stored in flat files, or in spreadsheets,
normal form can seem awkward.
#### Edit architecture requirements
Daniel proposes a plan (which implies important requirements) that human
edits are applied to a serialized description, and after the first human
edit, the description is always maintained inside the system in the
serialized form. Prior to edits, a description consists internally of
one or more CPF records which are serialized in real time via a specific
blending algorithm for display/viewing. The edit UI displays the
serialized description as it would be viewed in the public discovery web
page. After the first human edit there is no further need to serialize,
so we would disable serializing. (If serializing is disabled after human
edits, does this impact any other real-time rendering features or
formatting that are part of the serializing process? If so, these
processes must also be applied to the post-human-edit description in
real time.)
Prior to human edits, merged records can be algorithmically split by the
computer, assuming we write code to perform such a split. After human
edit, a description split must be performed by a human. Daniel proposes
that all previous versions can be viewed (read-only) during the
human-mediated split operation so the human can refer back to previous
information.
After human edits, rollback only applies to human edited versions. There
is a fire-break where rollback cannot cross from human edits back to
machine-merged descriptions.
All data is stored in the database as separate tables and fields. In theory, we can consider mixed markup, but
Brad Westbrook sugests we avoid mixed markup. From a data perspective, mixed markup is not a good
practice. Data is data, and the database schema can be modified to accomodate necessary data formats. How the
data is displayed is very much a separate issue.
Prior to human edits, merged records can be algorithmically split by the computer, assuming we write code to
perform such a split. After human edit, a split must be performed by a human. It is a requirement that all
previous versions can be viewed (read-only) during the human-mediated split operation so the human can refer
back to previous information.
After human edits, rollback only applies to human edited versions. There is a fire-break where rollback cannot
cross from human edits back to machine-merged descriptions. The policy group needs to supply policy
requirements for the tech folks to implement.
The broad requirements for the application are: edit data, split records, merge records. Secondary features to
make the system useful include: work flow enforcement, search, reporting (including "watch" features),
administration, authorization (data privileges).
#### Expanded CPF schema requirements
- Provenance and history of each element/attribute.
- add this to the schema
- Unique ID per element of CPF if that element is editable.
- we have a unique id per record, and only one field of each type per unique id, so this is covered.
- Version control on a per-element basis.
- already done, but Tom wants to consider an alternative implementation
#### Expanded Database Schema
The database schema has been rewritten to capture all the data in CPF files, as well as meet the various data requirements.
Each field within CPF may (will?) need provenance meta data. Likewise many fields in the database may need
data for provenance. This has not been done, and the developers need policy on provenance, as well as
examples. There seems to be little or no mention of provenance in Rachael's UI requirements.
The new schema has full versions of all records for all time. If not implemented, this is planned. The version
table records each table name, record id, user id who modified, and time datestamp. No changes were made to
existing tables, although existing tables may have gotten a field to distinguish old from current
records. The implementation may change.
Every record has a unique id. The watch system is a query run on some schedule (daily, hourly, ?) that checks
to see if a watched record has changed. CPF record has links to a “watch” table so users can watch each
record, and can watch for certain types of changes. Need UI for the watch system. Need an API for the watch
system.
Need a user table, group (role) table, probably a group permission table so that permissions are hard code
with groups. We also want to allow several permissions per group. Need UI for user, group, and
group-permission management.
We have created a generalized workflow system (as opposed to an ad-hoc linked set of reports). There is a work
flow state table which needs to be moved into the database.
Need fields to deal with delete/embargo. This may be best implemented via a trigger or perhaps a view. By
making what appear to be simple SELECTs through a view, the view can exclude deleted records. We must think
about how using a view (or trigger) will effect UPDATE and INSERT. Ideally the view is transparent. Is there
some clever way we can restrict access to the original table only via the view?
Need record lock on some types of records. This lock needs to be honored by several modules, so like “delete”,
lock might best be implemented via a view and we \*only\* access the table in question via the view.
If there are different levels of review for different elements in the record, then we need extra granularity
in the workflow or the edited record info to know the type of record edited apropos of workflow variations.
If there different reviewers for different parts of the record, then workflow data (and workflow
configuration) needs to be able to notify multiple people, and would have to get multiple reviewer approvals
before moving to the next phase of the workflow.
Institutional affiliation is probably common enough to want a field in the user table, as opposed to creating
a group for each institution. The group is perhaps more generalized and could behave identical (or almost
identical) to a field (with controlled vocabulary) in the user table.
Make sure we can write a query (report) to count numbers of records based type of edit, institution of the
editor, and number of holdings.
If we want to be able to quickly count some CPF element such as outgoing links from CPF to a given
institution, then we should put those CPF values into the SQL database, as meta data for the CPF record.
What is: How many referral links to EAC records that they created?
Be able to count record views, record downloads. Institutional dashboard reports need the ability to group-by
user, or even filter to a specific user.
Reporting needs to help managers verify performance metrics. This assumes that all changes have a
date/timestamp. Once workflow and process decisions are set, performance requirements for users such as
load/performance (how many updates and changes to records can be handled at once), search response time, edit
time (outside of review workflow), and update times need to be set.
Effort reporting to allow SNAC and participants to communicate to others the actual level of effort
involved. This sounds like a report with time span and numbers of records handled in various ways. SNAC might
use this when going from pilot into production so that everyone knows what effort will be required for X
number of records/actions (of whatever action type).
Time/activity reporting could allow us to assess viability, utility, and efficiency of maintenance system
processes.
Similar reports might be generated to evaluate the discovery interface. Something akin to how much time was
required to access a certain number of records. Rachael said: Assess viability of access funtionality-
performance time, available features, and ease of use.
We could try to report on the amount of training necessary before a new user was able to work independently in
each of various areas (content input, review, etc.)
#### Merge and watch
If a file is being watched, and that file is part of an description
(merged or single) then the watch will apply to the results of human
edits, regardless of which part of the description was modified. It is
possible for someone to wish to track a biogHist, but that biogHist
could be completely removed in lieu of an improved and updated
description. We do not track individual elements in CPF. We only track
an entire description, regardless the watcher’s motivation. The original
motivation for watching might no longer exist after an edit, and if so,
the watcher can simply disable their watch. After each edit, all
watchers will get a notification. The watch does not apply to any single
field, but to the entire description, and therefore also to future
descriptions which result from merging.
What happens to a watch on a merged description which is subsequently
split? Does the watch apply to both split descriptions or to neither
description? Perhaps is it best to disable the watch, and inform the
watcher to re-apply to watch a specific record, along with links and
helpful info to make it easy to add the new watch.
#### Brian’s API docs need to be merged in or otherwise referred to:
Note: Ask Robbie what the database architecture is to support merged records.
Users may "watch" an identity. If a file is being watched, and that file is part of an description (merged or
single) then the watch will apply to the results of human edits, regardless of which part of the description
was modified. It is possible for someone to wish to track a biogHist, but that biogHist could be completely
removed in lieu of an improved and updated description. We do not track individual elements in CPF. We only
track an entire description, regardless the watcher's motivation. The original motivation for watching might
no longer exist after an edit, and if so, the watcher can simply disable their watch. After each edit, all
watchers will get a notification. The watch does not apply to any single field, but to the entire description,
and therefore also to future descriptions which result from merging.
What happens to a watch on a merged description which is subsequently split? Does the watch apply to both
split descriptions or to neither description? Perhaps is it best to disable the watch, and inform the watcher
to re-apply to watch a specific record, along with links and helpful info to make it easy to add the new
watch.
#### Brian's API docs need to be merged in or otherwise referred to:
[https://gist.github.com/tingletech/4a3fc5f59e5af3054286](https://www.google.com/url?q=https%3A%2F%2Fgist.github.com%2Ftingletech%2F4a3fc5f59e5af3054286&sa=D&sntz=1&usg=AFQjCNEJeJexryBtHbvLw-WtFYjxP4VwlQ)
### Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
#### Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
Consider implementing linked data standard for relationship links
instead of having to download an entire document of links (as it is
configured now.)
Discuss. What is "as it is configured now"? Consider implementing linked data standard for relationship links
instead of having to download an entire document of links (as it is configured now.)
Sort by common subject headings across all of SNAC - right now SNAC has
Discuss. This seems to be the controlled vocabulary issue. Sort by common subject headings across all of SNAC - right now SNAC has
subject headings that have been applied locally without common practice
across the entire corpus.
Sort by holdings location. Sort by identity's activity location. Sort
and visualize a person through time (show dates for events in a person
or organization's lifetime). Sort and visualize an agency or
organization as it changes over time.
We probably need to build our own holdings authority.
We need to write code to get accurate holdings info from WorldCat records. All the other repositories will
have be handled on a case-by-case basis. Sort by holdings location. Sort by identity's activity location. Sort
and visualize a person through time (show dates for events in a person or organization's lifetime). Sort and
visualize an agency or organization as it changes over time.
Continue to develop and refine context widget.
Sort collection links. Add weighting to understand which collections
have more material directly related to identity. (How is this best
handled programmatically or as an input by contributors- maybe both?).
Increase exposure of SNAC to general public by leveraging partnerships.
Suggested agreement with Wikipedia to display Wikipedia content in SNAC
biographical area and work with Wikipedia to allow for links to SNAC at
the bottom of all applicable identities. This would serve to escalate
and  drive traffic to SNAC.
Introduction (Tom, Ray, Sara, Brian, Daniel)
============================================
Social Networks and Archival Context (SNAC) is a Mellon-funded project
to aid end-user researchers in discovering, locating, and using
distributed historical record descriptions, especially as relates to
corporate bodies, persons, and families (CPF). These descriptions are
often in finding aids, and they often exist in electronic format. They
are distributed across many geographical locations and many networks.
SNAC brings all this data together in a central system, while retaining
links to the original descriptions. Critically, SNAC attempts to merge
descriptions for the same [matching?] CPF identities, linking those
descriptions to a single authoritative name.
^[[a]](#cmnt1)^^[[b]](#cmnt2)^
We have an existing system (SNAC one?) and need additional work to get
to a new system (SNAC 3?), so part of this document is gap analysis. The
scope of this document is to outline technical specifications and
requirements for a production system for the Cooperative
phase^[[c]](#cmnt3)^ of SNAC. This production system will handle
ingestion, processing, matching/merging, discovery, and dissemination of
archival descriptions that are submitted and added to the Cooperative.  
Evaluation of Existing Technical Architecture
=============================================
Overview (All authors)
----------------------
This section describes the existing technical architecture, and later
moving on to describe the required functionality for the production
system for the Cooperative.
Many of the archival records that are ingested in SNAC are Encoded
Archival Context - Corporate bodies, Persons and Families (EAC-CPF,
hereafter CPF) records. EAC-CPF is an XML schema endorsed as a standard
by the Society of American Archivists. We speak of CPF descriptions in
the sense of a “computer record”: often a single text file and not a
“record” in the archival sense.
“Linked data” technology related to the Resource Description Framework
is also employed to manage some controlled vocabularies in the project.
The current system consists of three main components: extraction,
match/merge, discovery. Extraction consists of extracting data from
incoming archival description records (primarily EAD, MARC21 and some
other unique formats), to create CPF descriptions. Match/merge is to
process the CPF descriptions in search of name matches and to merge
well-matched descriptions. The resulting data set includes merged
descriptions and descriptions with no matches (called singletons), all
in a single database. Discovery is discovery and dissemination of the
data via a web application.
The production system will have two additional components: maintenance
and administration. Maintenance includes manual corrections, such as
correcting data within a description, splitting incorrect merges,
merging descriptions for the same CPF identity, and description embargo
(embargo hides descriptions from public view for either technical or
administrative reasons). Administration is the typical management of
users, accounts, and reporting on the state of the system.
The first two phases of data processing are extraction, and match/merge.
A database of descriptions, both merged and unmerged is the end
result^[[d]](#cmnt4)^. The process of ingesting extracted data and
merging will continue for the life of the project. An extensive
web-based search engine lets users discover descriptions.
We use the term “merged” loosely when applied to the automated system
since the final database may contain descriptions which should be
merged, but which a computer is unable to reliably determine.  We take a
conservative approach, preferring to only merge descriptions that a
computer program can accurately distinguish.^[[e]](#cmnt5)^ Even so,
some descriptions will have been incorrectly merged, and thus the need
for a (future) maintenance system that allows manually splitting of
descriptions, among other things.
Both Extraction and Match/merge are script based, batch processing,
semi-automatic processes managed entirely by software engineers.
Discovery and Maintenance are both web applications with extensive
public user interfaces intended for researchers. Administration is done
mostly via a non-public web application.
Extraction and match/merge are well developed, although we have some
planned improvements. Discovery is well developed, but existing features
are being refined, and adding new features is on-going. Maintenance and
administration have not yet been created and must be written from the
ground up.
Current State of the System (Tom, Ray, Sara, Brian, Daniel)
-----------------------------------------------------------
CPF description generation is done at the University of Virginia’s
Institute for Advanced Technology in the Humanities (IATH). IATH handles
the CPF data extraction and hosts servers for data processing and the
SNAC prototype web site. Data processing, XTF indexing (for the
discovery interface), and web hosting take place on a Linux server with
24 CPUs and 94 GB of RAM connected to a 1Gbit network switch. This
server is administered by the IATH sysadmin team. \
Collections of archival description computer descriptions in a variety
of formats are extracted into CPF format XML. This process involves
writing XSLT scripts that extract and transform input descriptions, and
create CPF files as output. The current state of the extraction is a
collection of XSLT scripts supplemented by Perl scripts. The input files
are XML with large numbers of files in EAD, MARC XML, and British
Library XML, as well as several smaller data sets.  A large XSLT code
library is shared among most of the extractions. Each type of extraction
builds a generic internal data structure, which is serialized as EAC-CPF
XML output. The XSLT takes into account various descriptive practices in
the input data, and reformats as necessary to create a single type of
normative CPF output. The complexity of this task centers around the
large number of small differences in descriptive practice. Currently
more than 3 million CPF computer descriptions have been created. The
XSLT processor is Saxon 9 HE, which is the free “home edition” of Saxon.
Saxon implements XSLT 2.0. There are a small number of Perl scripts that
integrate the XSLT into a pipeline, automating tasks such as chunking
data sets into sizes that won’t exceed computer memory.
The current state of the match/merge is (filled in by Yiming/Ray/Sara,
initially a one or two paragraph overview with more detail added later
as necessary).
Overview of Brian’s UI and programming for the SNAC2 XTF discovery tool
(add this to another item if there is an umbrella section more
appropriate).
Is XTF the only discovery tool we will offer? Will SNAC be fully indexed
by Google and Bing?
TK The involvement of the UC Berkeley I School includes the development,
testing and modification of the matching and merging components of the
SNAC system. The current system, described in more detail below, takes
the EAC-CPF records derived from the various source institutions and
compares the names and associated information (especial dates) to
identify the records that likely describe the same person,
organization, or family. The process involves not only comparison across
input records, but also comparison with information from the Virtual
International Authority File, and approximate matching for these records
as well.
TK The involvement of CDL includes … (Brian)
TK We have several extant user studies UI/UX … (Rachael, on-going)
TK The results of these studies are … (Rachael, on-going)
TK The technical implications of these studies are … (Rachael, on-going)
The current system uses a fairly loose software development process.
Source code is primarily maintained on a Linux server which is managed
by standard practices as relate to hardware, software, network, user
accounts, back up, and so on. All the data resides on the server. Source
code is managed by version control systems. The amount of quality
assurance and testing has been increasing over time, as well as
documentation, and management aspects such as release process. All tools
currently used are open source, and the code written for SNAC is open
source. We have begun to formalize feature request and issue tracking.
The development process is agile in that there are frequent small
changes that are committed to the version control, and the code is
nearly always in a working state.
### Processing Pipeline (Ray, Yiming, Sara)
TK Describe algorithmic portions, and add a section for new features.
Extraction (Tom, Daniel)
------------------------
There are currently several CPF extraction software pipelines: MARC21,
British Library, Smithsonian Agency History, New York State Archives,
Smithsonian Joseph Henry, Smithsonian Field Books, and EAD from nearly
60 institutions.
The first step in adding new records to the SNAC database is to convert
incoming data into EAC-CPF XML.  One EAC-CPF record is created for each
successfully extracted reference to an identity from an archival source.
The processing also allows for some degree of remediation of data
quality issues and serves to normalize the data into a common format.
 Scripting data transformation processes is a significant task that
often requires close communications with data contributors and
customizations to accommodate local practices of the contributors.
Creating an extraction is a complex process since we must deal with
variances in local descriptive practice. The MARC21 tools have been made
available as a web interface and this demonstrates the feasibility of
moving more of the processing responsibility to data donors. If we are
optimistic, we hope that EAD-to-CPF extraction and all other types of
future extractions can be turned into donor-driven tools. Specifically,
we create the tools and then deploy them as web applications and/or
desktop applications. Web hosted extraction tools allow us to leverage
the power of our servers and programmers so that data donors do not need
a large computing infrastructure in order to participate. In any case,
data must be validated before ingest into the match/merge processing.
XSLT and perl are the predominate technologies used in the generation of
the XML documents created by this process.  The code architecture
focuses on reusability of modular routines to facilitate maintenance of
the customizations needed accommodate the diversity of data sources.
Code, sample data, and documentation are in Github. The pipeline is
being run on a server, but the hardware requirements are minimal enough
that most laptop computers could run the extraction. The system requires
unix-like features of Linux, MacOS, or cygwin (for MS Windows). The XSTL
engine is Saxon 9.x HE which is the free, public version of Saxon.
Match/Merge (Brian, Yiming, Ray)
--------------------------------
The match/merge process has three major data input streams, library
authority records, EAC-CPF documents from the EAC-CPF extract/create
system, and an ARK identifier minter.
First, a copy of the Virtual International Authority File (VIAF) is
indexed as a reference source to aid in the record matching process.  In
addition to authorized name headings from multiple international
sources, the VIAF data contains biographical data and links to
bibliographic records which will be included in the output documents.  
Then, the EAC-CPF from the extract/create process are serially processed
against the VIAF and each other to discover and rate potential matches
between records.  In this phase of processing, matches are noted in a
database.
After the matching phase identifies incoming EAC-CPF to merge, a new set
of EAC-CPF records are generated.  This works by running through all the
matches in that database, then reading in the EAC-CPF input files, and
finally outputting a new EAC-CPF records that merges the source EAC-CPF
with any information found in VIAF.  ARK identifiers are also assigned.
This architecture allows for incrementally processing more un-merged
EAC-CPF documents before. It also allows matches to be adjusted in the
database, or alterations to be made on the un-merged EAC-CPF documents,
and the merge records can be regenerated.
Cheshire, postgreSQL, and python are the predominate technologies used
in the generation of the XML documents created by this process.
[link to the merge output spec]
This involves processing that compares the derived EAC-CPF records
against one another to identify identical names. Because names for
entities may not match exactly or the same name string may be used for
more than one entity, contextual information from the finding aids is
also used to evaluate the probability that closely and exactly matching
strings designate the same entity.[1] For matches that have a high
degree of probability, the EAC-CPF records will be merged, retaining
variations in the name entries where these occur, and retaining links to
the finding aids from which the name or name variant was derived. When
no identical names exist, an additional matching stage compares the
names from the input EAC-CPF records against authority records in the
Virtual International Authority File (VIAF). Contextual information
(dates, inferred dates, etc.) is used to enhance the accuracy of the
matching. Matched VIAF records are merged with the input derived EAC-CPF
records, with authoritative or preferred forms of names recorded, and a
union set of alternative names from the various VIAF contributors, will
also be incorporated into the EAC-CPF records. When exact matching and
VIAF matching fail, then we attempt to find close variants using Ngram
(approximate spelling) matching. In addition contextual information,
when available is used assess the likelihood of the records actually
being the same. Records that may be for the same entity but the
available contextual information is insufficient to make a confident
match will be flagged for human review (as “May be same as”). While
these records will be flagged for human review, the current prototype
does not provide facilities to manually merge records. The current
policy governing matching is to err on the side of not merging rather
than merging without strong evidence.
The resulting set of interrelated EAC-CPF records will represent the
creators and related entities extracted from EAD-encoded finding aids,
with a subset of the records enhanced with entries from matching VIAF
records. The EAC-CPF records will thus represent a large set of archival
authority records, related with one another and to the archival records
descriptions from which they were derived. This record set will then be
used to build a prototype corporate body, person, and family name and
biographical/historical access system.
In the current system all input records, and potential matches are
recorded in a relational database with the following structure:
* * * * *
[1] Using contextual information in determining that two or more records
represent the same entity has been successful in matching and merging
authority records in an international context. See Rick Bennett,
Christina Hengel-Dittrich, Edward T. O'Neill, and Barbara B. Tillett
VIAF (Virtual International Authority File): Linking Die Deutsche
Bibliothek and Library of Congress Name Authority File:
http://www.ifla.org/IV/ifla72/papers/123-Bennett-en.pdf
![Screen Shot 2014-06-22 at 3.08.12 PM.png](images/image00.png)
The the current processing steps are summarized in the following
diagram:
![Slide1.jpg](images/image01.jpg)
Discovery/Dissemination (Brian, Rachael)
----------------------------------------
Prototype research tool^[[f]](#cmnt6)^
--------------------------------------
The main data input for the prototype research tool are the merged
EAC-CPF documents produced in the match/merge system.  Some other
supplemental data sources, such as dbpedia and the Digital Public
Library of America are also consulted during the indexing process.
A pre-indexing phase is run on the merged EAC-CPF documents.  During
pre-processing, name headings and wikipedia links are extracted, and
then used to look for possible related links and data in supplemental
sources. The output of the pre-indexing phase consists of XML documents
recording supplemental.
Once the supplemental XML files are generated, two types of indexes are
created to power which serve as the input to the web site.  The first
index created runs across all documents and provides access to the full
text and specific facets of metadata extracted from the documents.
 Additionally, the XML structure of each document is indexed as a
performance optimization that allows for transformations to be
efficiently applied to large XML documents.
The public interface to the prototype research tool utilizes the index
across all documents to enable full text, metadata, and faceted searches
of the merged EAC-CPF documents.  Once a search is completed, and a
specific merged EAC-CPF document is selected for display; the index of
the XML document structure is used to quickly transform the merged
document into an HTML presentation for the end user.
In the SNAC1 prototype a graph database was created after the full text
indexing was complete.  The graph database was used to power
relationship visualizations and an API used to dynamically integrate
links to SNAC into archival description access systems. This graph
database was then converted into linked data, which was loaded into a
SQARQL endpoint. This step has not yet been implemented in the SNAC 2
prototype.  Because the merged EAC-CPF documents are of higher quality
for the SNAC 2 prototype, the graph extraction process is no longer
dependent on the full text index being complete, so it could run in
parallel with pre-indexing and indexing.
XTF is the main technology used to power public access to the merged
EAC-CPF records.  XTF integrates lucene for indexing and saxon for XML
transformation, making heavy use of XSLT for customization and display
of search results and the merged documents.  EAC-CPF and search results
are transformed to HTML5 and JSON for consumption by the end users’ web
browser.  Multiple javascript and CSS3 libraries and technologies are
used in the production of the “front end” code for the website.  Google
analytics is used to measure use of the site.  Werker, middleman, and
bower used to build the front end code for the site.
This technical architecture
[links to code]
Sort collection links. Add weighting to understand which collections have more material directly related to
identity. (How is this best handled programmatically or as an input by contributors- maybe both?).
Increase exposure of SNAC to general public by leveraging partnerships. Suggested agreement with Wikipedia to
display Wikipedia content in SNAC biographical area and work with Wikipedia to allow for links to SNAC at the
bottom of all applicable identities. This would serve to escalate and drive traffic to SNAC.
plan.md
--------
plan.md Big questions
plan.md Overview and order of work
plan.md Code we write
plan.md Controlled vocabularies and tag system
plan.md Code we use off the shelf
co-op_background.md
-----
Authors
Organization of documenatation
Introduction to SNAC
Evaluation of Existing Technical Architecture
Overview
Current State of the System
Processing Pipeline
Extraction
Match/Merge
Discovery/Dissemination
Prototype research tool
Gap analysis
Data maintenance
Pilot phase architecture
Current State Conclusion
introduction.md
--------
TAT Functional Requirements
Introduction to Planned Functionality
Software development, processes, and project management
QA and Related Tests for Test-driven Development
Documentation
Required new features
Web application overview
Web application output via template
Data background
What is "normal form" and what informs the database schema design?
Edit architecture requirements
Expanded CPF schema requirements
Expanded Database Schema
Merge and watch
Brian’s API docs need to be merged in or otherwise referred to:
Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
requirements.md
----
List of requirements
Requirements from Rachael's spreadsheet
List of Application Programmer Interfaces (APIs)
Maintenance Functionality
Functionality for Discovery
User interface for Discovery
Functionality for Splitting
User interface for Splitting
Functionality for Merging
User interface for Merging
Functionality for Editing
User interface for Editing
Admin Client for Maintenance System
User Management
Web Application Administration
Reports
System Administration
Community Contributions
Ability to Open/Close the Site during Maintenance
Sandbox for Training, perhaps as a clone of the QA system?
These documents are organized in the following order:
[plan.md](plan.md) gives the big overview.
[plan.md](plan.md) Big overview.
[introduction.md](introduction.md) covers the current state and gives some historical background.
[outline.md](outline.md) An outline of sections in the documents
[co-op_background.md](co-op_background.md) covers broad expectations for the co-op software.
[co-op_background.md](co-op_background.md) Broad expectations for the co-op software.
[requirements.md](requirements.md) are the technical requirements.
[introduction.md](introduction.md) Requirements part one
[requirements.md](requirements.md) Requirements part two, includes tech requirements from Rachael's spreadsheets
Requirements from Rachael's spreadsheet
---
- Programmers contribute some time to help with technology side of the gap analysis of institutional capability
- We need a concrete plan for persistent IDs.
- We need to manage base HREF stubs that are combined with persistent IDs to form working URLs. Ideally, all
the URLs could be composed via a format string (printf), so we could just store the ID, HREF stub, and
format string and be done with it. However, some URLs have interesting issues that require code and thus
exceed the abilities of normal format strings. We can certainly roll out an early version with format
strings, and add some clever functions later as necessary.
#### List of requirements
- Do we need any additional requirements for related name linking?
- Clarify: the co-op version 1 is not going to support bulk data ingest
- Clarify: the co-op version 1 is not going to support bi-directional data exchange and update
- Do we need full delete? For example, a CPF contains something illegal and must be fully deleted. How do we
delete from backups? Are either of these even required by policy?
- Are we assuming that data from the web browser has been sanity checked before hitting the server? Does the
server need to cache edit data prior to writing the data to the cpf database? For example, what if someone
enters "19th century" in a date field? It isn't valid, but we need to save their work.
- We need to sanity check any links we create, especially links back into SNAC.
- Don't forget the X-to-CPF field mapping documentation, and this ties in to the "CPF data contributor's"
guide (below)
- We need the "CPF data contributor's" guide.
- What authority work will we be doing?
- What authority data from other sources do we cache locally?
- Create detailed functional requirements for controlled vocabularies, and a detailed implementation
specification.
- Clarify: versioning is per-record, not per-field.
- Need a watch/notification API. It needs a canonical name. Is there an off-the-shelf event monitor that will
easily integrate with the web REST API and work flow manager?
- Clarify: Are we integrating SNAC and ArchiveSpace in co-op version 1? Will ArchiveSpace have to use our REST API?
- How is embargo implemented at the database level? What are the requirements for embargo?
- Clarify / verify: Technical review vs content review is handled by a combination of roles and work flow.
- Reports: Where are we keeping the Big List of All Reports?
- Clarify: row 43, (unclear) Consider implementing inked data standard for relationship links
instead of having to download an entire document of links, as it is configured now.
- Search: need the Big List of Search Facets, and someone needs to verify that Elastic Search can do facets.
- Does co-op version 1 have a timeline visualization? Does it have a "sort by timeline"? What does it mean to
sort by timeline?
- Clarify: What is a context widget? - row 52, Continue to develop and refine context widget. (technical
requirements unclear)
- Clarify: we need requirements for citations, and details about where they integrate with the rest of the
system.
List of requirements
---
This is the definitive list of all requirements. Anything the application needs to do must be in this
list. Each item and group of items is explained in detail later in the document. Being a "list", this includes
......@@ -157,9 +90,100 @@ only sufficient detail to disambiguate items.
- data integrity testing
#### Requirements from Rachael's spreadsheet
- Programmers contribute some time to help with technology side of the gap analysis of institutional capability
- We need a concrete plan for persistent IDs.
- We need to manage base HREF stubs that are combined with persistent IDs to form working URLs. Ideally, all
the URLs could be composed via a format string (printf), so we could just store the ID, HREF stub, and
format string and be done with it. However, some URLs have interesting issues that require code and thus
exceed the abilities of normal format strings. We can certainly roll out an early version with format
strings, and add some clever functions later as necessary.
- Do we need any additional requirements for related name linking, or more accurately identity linking? Each
identity has and ARK which is a persistent ID with an assciated URL. Use cases for identity links:
1. SNAC links one identity to another internally based on relations between identities
1. SNAC links to itself as a name authority
1. SNAC links to external identities
1. SNAC links to external archival resources
2. External resources link to SNAC as an authority. (Tom asks: is SNAC also an archival resource?)
- Clarify: the co-op version 1 is not going to support bulk data ingest
- Clarify: the co-op version 1 is not going to support bi-directional data exchange and update
- Do we need full delete? For example, a CPF contains something illegal and must be fully deleted. How do we
delete from backups? Are either of these even required by policy?
- Are we assuming that data from the web browser has been sanity checked before hitting the server? (Yes, by
the data validation API)
- Does the server need to save temporary edit data prior to writing the data to the cpf database? For example, what if
someone enters "19th century" in a date field? It isn't valid, but we need to save their work. (Yes, we need to save invalid user input, and give the user a useful message for each type of data validation failure.)
- We need to sanity check any links we create, especially links back into SNAC.
- Don't forget (to create) the X-to-CPF field mapping documentation, and this ties in to the "CPF data contributor's"
guide (Below)
- We need the "CPF data contributor's" guide.
- What authority work will we be doing?
- For example, holding institution ISIL identifier, name, address, contact person, etc.
- What authority data from other sources do we cache locally?
- We need examples of this, as well as a process to manage those resources. It is important to know where
the data came from, technically how it was acquired, the date we acquired it, and some methods of
updating the current local cache. This implies that all external data we use has internal persistent
ids.
- Create detailed functional requirements for controlled vocabularies, and a detailed implementation
specification.
- Clarify: versioning is per-record, not per-field.
- Need a watch/notification API. It needs a canonical name. Is there an off-the-shelf event monitor that will
easily integrate with the web REST API and work flow manager?
- We can write our own status and staging API. It only requires modest SQL schema work. Most of the necessary data is already planned for other features. For example, records can be locked by a user, we know who has the lock, we need administrative functions for unlocking and transfering locks, the work flow explicitly lays out the process for each user interaction with the application.
- Clarify: Are we integrating SNAC and ArchiveSpace in co-op version 1? Will ArchiveSpace have to use our REST API?
- How is embargo implemented at the database level? What are the requirements for embargo?
- Clarify / verify: Technical review vs content review is handled by a combination of roles and work flow.
- Reports: Where are we keeping the Big List of All Reports?
- Clarify: row 43, (unclear) Consider implementing inked data standard for relationship links
instead of having to download an entire document of links, as it is configured now.
- Search: need the Big List of Search Facets, and someone needs to verify that Elastic Search can do facets.
- Does co-op version 1 have a timeline visualization? Does it have a "sort by timeline"?
- What does it mean to sort by timeline?
- Clarify: What is a context widget? - row 52, Continue to develop and refine context widget. (technical
requirements unclear)
- Clarify: we need requirements for citations, and details about where they integrate with the rest of the
system.
#### List of Application Programmer Interfaces (APIs)
List of Application Programmer Interfaces (APIs)
----
The following include both direct programming language intefaces, and REST interfaces. We need to determine
which (REST/direct) is available for each. Modifying data should probably go through authorization and should
......@@ -183,8 +207,8 @@ only public interface.
- record watching (REST?)
Maintenance Functionality (All authors)
---------------------------------------
#### Maintenance Functionality
Maintenance falls into four areas: discover, split, merge, and edit.
......@@ -216,7 +240,7 @@ trail, and there are no destructive changes. For example, there is no
public view. Updated descriptions will be subject to version control so
changes can be rolled back.
### Functionality for Discovery
#### Functionality for Discovery
The discovery tools for maintenance may be somewhat different from the
normal discovery tools for scholarly research. We have a standard
......@@ -230,13 +254,9 @@ Users will have individual accounts, so we can enable a search history,
internal bookmarks, and various saved reports (assuming faceted search
where it could take many mouse clicks to accrete a specific search).
###
### User interface for Discovery (Brian, Rachael)
#### User interface for Discovery
###
### Functionality for Splitting^[[m]](#cmnt13)^^[[n]](#cmnt14)^ (Tom, Danial, all authors)
#### Functionality for Splitting^[[m]](#cmnt13)^^[[n]](#cmnt14)^ 
Keeping in mind that our descriptions are authoritative, and will be
referenced via persistent identifier (ARK), it will be necessary to
......@@ -339,9 +359,9 @@ To review split:
20. admin function to view locked descriptions by user,
21. choose one of my locked descriptions to continue work.
### User interface for Splitting (Tom, Daniel, Rachael, others)
#### User interface for Splitting
### Functionality for Merging (Tom, Daniel, all authors)
#### Functionality for Merging
We need to allow our experts to merge descriptions. This may be far more
common than splitting since the automated pipeline was designed to only
......@@ -408,11 +428,9 @@ To review merging:
19. locks and hides original,
20. makes merged description publically visible.
### User interface for Merging (Rachael, Tom, Daniel, others)
###
#### User interface for Merging
### Functionality for Editing
#### Functionality for Editing
Modifications we expect include but are not limited to: spelling
corrections, date corrections, editing or expanding biographical data,
......@@ -421,12 +439,11 @@ and correcting relations between descriptions. Metadata such as the URL
of the original finding aid may also be updated. The maintenance system
also needs to support bulk data edits of several types.
### User interface for Editing (Rachael, Tom, Daniel, others)
#### User interface for Editing
Admin Client for Maintenance System
-----------------------------------
#### Admin Client for Maintenance System
### User Management (Tom, Brian)
#### User Management
Authentication is validating user logins to the system. Authorization is
the related aspect of controlling which parts of the system users may
......@@ -564,8 +581,8 @@ These users need an admin dashboard with corresponding reports. We may
need to have sub-institution accounts and that gets tricky because we
don’t want to be mixed up in internal institutional politics.
Web Application Administration
------------------------------
#### Web Application Administration
System administration will be required for the web application and the
server hosting the web site. This is well understood from a technical
......@@ -574,8 +591,8 @@ command line accounts involved, and server configuration. This aspect of
administration integrates with versioning, backup, and software
releases.
Reports ^[[s]](#cmnt19)^^[[t]](#cmnt20)^(Tom, Brian, Rachael, Brad)
-------------------------------------------------------------------
#### Reports ^[[s]](#cmnt19)^^[[t]](#cmnt20)^
While the web interface is the primary public face of SNAC, many other
views of the data and meta data are necessary, especially for admins and
......@@ -589,8 +606,8 @@ adding reporting and business intelligence.
(How much detail do we want about reports? Maybe just half a dozen
examples?)
System Administration (Tom, Brian)
----------------------------------
#### System Administration
This is boilerplate server administration, for the most part.
Preservation of original material may not be necessary. Since our data
......@@ -622,8 +639,8 @@ correcting file systems (ZFS, Btrfs) may not be production quality at
this time. If we want to deploy an anti-bit-rot technology, our
alternative may be limited to using Par2/Quick Par/Parchive.
Community Contributions (All authors)
-------------------------------------
#### Community Contributions
Researcher interface/functionality including public facing discovery and
dissemination (All, especially Brian)
......@@ -651,8 +668,8 @@ logic that we will support with UI, code, and database tables/fields.
Many reports will be limited certain roles. Admin users will likely be
heavy report users.
Ability to Open/Close the Site during Maintenance (Tom, Brian)
--------------------------------------------------------------
#### Ability to Open/Close the Site during Maintenance
If the product has a “closed for maintenance” feature,
^[[x]](#cmnt24)^this ability would be available to admins, even though
......@@ -681,8 +698,8 @@ to our software architecture. The more elegant approach is to use one of
several system architectures that  keep a small system front end always
running.
Sandbox for Training, perhaps as a clone of the QA system? (All authors)
------------------------------------------------------------------------
#### Sandbox for Training, perhaps as a clone of the QA system?
TK
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment