Merge branch 'master' of http://gitlab.iath.virginia.edu/snac/Documentation

02b45b2b · Sarah Wells · 2ca502de · 5944fa85 · 02b45b2b · 02b45b2b
Commit 02b45b2b authored Aug 20, 2015 by Sarah Wells
5 changed files
--- a/tat_requirements/co-op_background.md
+++ b/tat_requirements/co-op_background.md
-Gap analysis
------------
+
+#### Authors
+
+
+Tom Laudeman, Technical lead, University of Virginia, Institute for
+Advanced Technology in the Humanities
+[twl8n@virginia.edu](mailto:twl8n@virginia.edu)
+
+Brian Tingle, Technical Lead for Digital Special Collections, California
+Digital Library
+
+Rachael Hu, User Experience Design Manager, California Digital Library
+
+Ray R. Larson, U.C. Berkeley - School of Information
+
+Robbie Hott
+
+#### Organization of documentation
+
+[Plan](plan.md) (External, broad view roadmap)
+
+[Co-op Background](co-op_background.md)  (This document) 
+
+[Introduction](introduction.md) (Was an introduction, but shaping up to be requirements part one)
+
+[Requirements](requirements.md) (Requirements part two, includes requirements from Rachael's spreadsheets)
+
+#### Introduction to SNAC
+
+Social Networks and Archival Context (SNAC) is a Mellon-funded project
+to aid end-user researchers in discovering, locating, and using
+distributed historical record descriptions, especially as relates to
+corporate bodies, persons, and families (CPF). These descriptions are
+often in finding aids, and they often exist in electronic format. They
+are distributed across many geographical locations and many networks.
+SNAC brings all this data together in a central system, while retaining
+links to the original descriptions. Critically, SNAC attempts to merge
+descriptions for the same [matching?] CPF identities, linking those
+descriptions to a single authoritative name.
+^[[a]](#cmnt1)^^[[b]](#cmnt2)^
+
+We have an existing system (SNAC one?) and need additional work to get
+to a new system (SNAC 3?), so part of this document is gap analysis. The
+scope of this document is to outline technical specifications and
+requirements for a production system for the Cooperative
+phase^[[c]](#cmnt3)^ of SNAC. This production system will handle
+ingestion, processing, matching/merging, discovery, and dissemination of
+archival descriptions that are submitted and added to the Cooperative.  
+
+#### Evaluation of Existing Technical Architecture
+
+
+##### Overview
+
+This section describes the existing technical architecture, then moves
+on to describe the required functionality for the production system for
+the Cooperative.
+
+Many of the archival records that are ingested in SNAC are Encoded
+Archival Context - Corporate bodies, Persons and Families (EAC-CPF,
+hereafter CPF) records. EAC-CPF is an XML schema endorsed as a standard
+by the Society of American Archivists. We speak of CPF descriptions in
+the sense of a “computer record”: often a single text file and not a
+“record” in the archival sense.
+
+“Linked data” technology related to the Resource Description Framework
+is also employed to manage some controlled vocabularies in the project.
+
+The current system consists of three main components: extraction,
+match/merge, discovery. Extraction consists of extracting data from
+incoming archival description records (primarily EAD, MARC21 and some
+other unique formats), to create CPF descriptions. Match/merge is to
+process the CPF descriptions in search of name matches and to merge
+well-matched descriptions. The resulting data set includes merged
+descriptions and descriptions with no matches (called singletons), all
+in a single database. Discovery covers search and dissemination of the
+data via a web application.
+
+The production system will have two additional components: maintenance
+and administration. Maintenance includes manual corrections, such as
+correcting data within a description, splitting incorrect merges,
+merging descriptions for the same CPF identity, and description embargo
+(embargo hides descriptions from public view for either technical or
+administrative reasons). Administration is the typical management of
+users, accounts, and reporting on the state of the system.
+
+The first two phases of data processing are extraction and match/merge.
+A database of descriptions, both merged and unmerged, is the end
+result^[[d]](#cmnt4)^. The process of ingesting extracted data and
+merging will continue for the life of the project. An extensive
+web-based search engine lets users discover descriptions.
+
+We use the term “merged” loosely when applied to the automated system
+since the final database may contain descriptions which should be
+merged, but which a computer cannot reliably identify as matches. We take a
+conservative approach, preferring to merge only descriptions that a
+computer program can confidently match.^[[e]](#cmnt5)^ Even so,
+some descriptions will have been incorrectly merged, hence the need
+for a (future) maintenance system that allows manual splitting of
+descriptions, among other things.
+
+Both Extraction and Match/merge are script-based, semi-automatic batch
+processes managed entirely by software engineers.
+Discovery and Maintenance are both web applications with extensive
+public user interfaces intended for researchers. Administration is done
+mostly via a non-public web application.
+
+Extraction and match/merge are well developed, although we have some
+planned improvements. Discovery is well developed, but existing features
+are being refined, and new features continue to be added. Maintenance and
+administration have not yet been created and must be written from the
+ground up.
+
+#### Current State of the System
+
+CPF description generation is done at the University of Virginia’s
+Institute for Advanced Technology in the Humanities (IATH). IATH handles
+the CPF data extraction and hosts servers for data processing and the
+SNAC prototype web site. Data processing, XTF indexing (for the
+discovery interface), and web hosting take place on a Linux server with
+24 CPUs and 94 GB of RAM connected to a 1Gbit network switch. This
+server is administered by the IATH sysadmin team.
+
+Collections of archival descriptions (computer records) in a variety
+of formats are extracted into CPF format XML. This process involves
+writing XSLT scripts that extract and transform input descriptions, and
+create CPF files as output. The current state of the extraction is a
+collection of XSLT scripts supplemented by Perl scripts. The input files
+are XML with large numbers of files in EAD, MARC XML, and British
+Library XML, as well as several smaller data sets.  A large XSLT code
+library is shared among most of the extractions. Each type of extraction
+builds a generic internal data structure, which is serialized as EAC-CPF
+XML output. The XSLT takes into account various descriptive practices in
+the input data, and reformats as necessary to create a single type of
+normative CPF output. The complexity of this task centers around the
+large number of small differences in descriptive practice. Currently
+more than 3 million CPF computer descriptions have been created. The
+XSLT processor is Saxon 9 HE, which is the free “home edition” of Saxon.
+Saxon implements XSLT 2.0. There are a small number of Perl scripts that
+integrate the XSLT into a pipeline, automating tasks such as chunking
+data sets into sizes that won’t exceed computer memory.
+
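The chunking task mentioned above can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the project's code: the actual pipeline uses Perl scripts, and the function name `chunk_files`, the batch size, and the file names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_files(paths: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size paths."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(paths)
    while True:
        # islice pulls the next chunk_size items without loading everything.
        batch = list(islice(iterator, chunk_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Hypothetical input file names; a real run would list a data directory.
    files = [f"ead_{n:04d}.xml" for n in range(7)]
    for number, batch in enumerate(chunk_files(files, 3), start=1):
        print(number, batch)
```

Each batch could then be handed to one XSLT run, so that memory use is bounded by the batch size rather than by the size of the whole data set.
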
+The current state of the match/merge is (filled in by Yiming/Ray/Sara,
+initially a one or two paragraph overview with more detail added later
+as necessary).
+
+Overview of Brian’s UI and programming for the SNAC2 XTF discovery tool
+(add this to another item if there is an umbrella section more
+appropriate).
+
+Is XTF the only discovery tool we will offer? Will SNAC be fully indexed
+by Google and Bing?
+
+TK The involvement of the UC Berkeley I School includes the development,
+testing and modification of the matching and merging components of the
+SNAC system. The current system, described in more detail below, takes
+the EAC-CPF records derived from the various source institutions and
+compares the names and associated information (especially dates) to
+identify the records that likely describe the same person,
+organization, or family. The process involves not only comparison across
+input records, but also comparison with information from the Virtual
+International Authority File, and approximate matching for these records
+as well.
+
+Rachael has several user studies. The results of these studies are... The implications of these studies
+are...
+
+The current system uses a fairly loose software development process.  Source code is maintained on a Linux
+server which is managed by standard practices as relate to hardware, software, network, user accounts, back
+up, and so on. All the data resides on the server. Source code is managed by version control systems. The
+amount of quality assurance and testing has been increasing over time, as well as documentation, and
+management aspects such as release process. All tools currently used are open source, and the code written for
+SNAC is open source. We have begun to formalize feature request and issue tracking.  The development process
+is agile in that there are frequent small changes that are committed to the version control, and the code is
+nearly always in a working state.
+
+#### Processing Pipeline
+
+TK Describe algorithmic portions, and add a section for new features.
+
+#### Extraction
+
+There are currently several CPF extraction software pipelines: MARC21,
+British Library, Smithsonian Agency History, New York State Archives,
+Smithsonian Joseph Henry, Smithsonian Field Books, and EAD from nearly
+60 institutions.
+
+The first step in adding new records to the SNAC database is to convert
+incoming data into EAC-CPF XML.  One EAC-CPF record is created for each
+successfully extracted reference to an identity from an archival source.
+The processing also allows for some degree of remediation of data
+quality issues and serves to normalize the data into a common format.
+ Scripting data transformation processes is a significant task that
+often requires close communications with data contributors and
+customizations to accommodate local practices of the contributors.
+
+Creating an extraction is a complex process since we must deal with
+variances in local descriptive practice. The MARC21 tools have been made
+available as a web interface and this demonstrates the feasibility of
+moving more of the processing responsibility to data donors. If we are
+optimistic, we hope that EAD-to-CPF extraction and all other types of
+future extractions can be turned into donor-driven tools. Specifically,
+we create the tools and then deploy them as web applications and/or
+desktop applications. Web hosted extraction tools allow us to leverage
+the power of our servers and programmers so that data donors do not need
+a large computing infrastructure in order to participate. In any case,
+data must be validated before ingest into the match/merge processing.
+
+XSLT and Perl are the predominant technologies used in the generation of
+the XML documents created by this process.  The code architecture
+focuses on reusability of modular routines to facilitate maintenance of
+the customizations needed to accommodate the diversity of data sources.
+
+Code, sample data, and documentation are in GitHub. The pipeline is
+being run on a server, but the hardware requirements are minimal enough
+that most laptop computers could run the extraction. The system requires
+Unix-like features of Linux, MacOS, or Cygwin (for MS Windows). The XSLT
+engine is Saxon 9.x HE, which is the free, public version of Saxon.
+
+
+#### Match/Merge
+
+The match/merge process has three major data input streams: library authority records, EAC-CPF documents from
+the EAC-CPF extract/create system, and an ARK identifier minter.
+
+First, a copy of the Virtual International Authority File (VIAF) is indexed as a reference source to aid in
+the record matching process. In addition to authorized name headings from multiple international sources, the
+VIAF data contains biographical data and links to bibliographic records which will be included in the output
+documents.   Then, the EAC-CPF records from the extract/create process are serially processed against the VIAF and
+each other to discover and rate potential matches between records. In this phase of processing, matches are
+noted in a database.
+
+After the matching phase identifies incoming EAC-CPF records to merge, a new set of EAC-CPF records is
+generated. This works by running through all the matches in that database, then reading in the EAC-CPF input
+files, and finally outputting new EAC-CPF records that merge the source EAC-CPF with any information found
+in VIAF. ARK identifiers are also assigned.  This architecture allows for incrementally processing more
+un-merged EAC-CPF documents. It also allows matches to be adjusted in the database, or alterations to
+be made on the un-merged EAC-CPF documents, after which the merged records can be regenerated.
+
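The idea of keeping matches in a database and regenerating merged records from them can be sketched as follows. This is only an illustration under stated assumptions: the `matches` table, its columns, and the record identifiers are hypothetical (the real schema is not reproduced here), and Python's built-in `sqlite3` stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical matches table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE matches ("
        "record_id TEXT PRIMARY KEY, identity_id TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO matches VALUES (?, ?)",
        [("ead-001", "id-A"), ("marc-017", "id-A"), ("bl-204", "id-B")],
    )
    return conn


def merged_groups(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Group source record ids by the identity they were matched to."""
    groups: Dict[str, List[str]] = defaultdict(list)
    rows = conn.execute(
        "SELECT identity_id, record_id FROM matches "
        "ORDER BY identity_id, record_id"
    )
    for identity_id, record_id in rows:
        groups[identity_id].append(record_id)
    return dict(groups)


if __name__ == "__main__":
    conn = build_demo_db()
    print(merged_groups(conn))
    # Adjust one match by hand, then regenerate the groups from the table.
    conn.execute(
        "UPDATE matches SET identity_id = 'id-C' WHERE record_id = 'marc-017'"
    )
    print(merged_groups(conn))
```

Because the merged output is derived entirely from the match table and the un-merged inputs, correcting a row and rerunning the grouping is enough to produce a corrected set of merged records.
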
+Cheshire, PostgreSQL, and Python are the predominant technologies used in the generation of the XML documents
+created by this process.
+
+[link to the merge output spec]
+
+This involves processing that compares the derived EAC-CPF records against one another to identify identical
+names. Because names for entities may not match exactly or the same name string may be used for more than one
+entity, contextual information from the finding aids is also used to evaluate the probability that closely and
+exactly matching strings designate the same entity.[1] For matches that have a high degree of probability, the
+EAC-CPF records will be merged, retaining variations in the name entries where these occur, and retaining
+links to the finding aids from which the name or name variant was derived. When no identical names exist, an
+additional matching stage compares the names from the input EAC-CPF records against authority records in the
+Virtual International Authority File (VIAF). Contextual information (dates, inferred dates, etc.) is used to
+enhance the accuracy of the matching. Matched VIAF records are merged with the input derived EAC-CPF records,
+with authoritative or preferred forms of names recorded; a union set of alternative names from the various
+VIAF contributors will also be incorporated into the EAC-CPF records. When exact matching and VIAF matching
+fail, we attempt to find close variants using Ngram (approximate spelling) matching. In addition,
+contextual information, when available, is used to assess the likelihood that the records actually describe the
+same entity. Records that may describe the same entity, but for which the available contextual information is insufficient to make
+a confident match, will be flagged for human review (as "May be same as"). While these records will be flagged
+for human review, the current prototype does not provide facilities to manually merge records. The current
+policy governing matching is to err on the side of not merging rather than merging without strong evidence.
+
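The approximate-matching stage and the conservative merge policy described above can be sketched in Python. This is an assumption, not the Cheshire implementation: character trigrams with Jaccard similarity are one common form of n-gram matching, and the function names and both thresholds (`merge_at`, `review_at`) are hypothetical values chosen for illustration.

```python
from typing import Set


def ngrams(name: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a normalized, padded name."""
    normalized = " ".join(name.lower().split())
    padded = f" {normalized} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two names' n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def classify(a: str, b: str, dates_agree: bool,
             merge_at: float = 0.85, review_at: float = 0.6) -> str:
    """Merge only on strong evidence; otherwise flag or reject."""
    score = similarity(a, b)
    if score >= merge_at and dates_agree:
        return "merge"
    if score >= review_at:
        # Similar names without enough context go to human review.
        return "may be same as"
    return "no match"


if __name__ == "__main__":
    print(classify("Smith, John", "Smith, John", dates_agree=True))
    print(classify("Smith, John", "Smith, Jon", dates_agree=True))
    print(classify("Smith, John", "Zhang, Wei", dates_agree=True))
```

Requiring both a high name score and agreeing contextual data before merging reflects the stated policy of erring on the side of not merging.
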
+The resulting set of interrelated EAC-CPF records will represent the creators and related entities extracted
+from EAD-encoded finding aids, with a subset of the records enhanced with entries from matching VIAF
+records. The EAC-CPF records will thus represent a large set of archival authority records, related with one
+another and to the archival records descriptions from which they were derived. This record set will then be
+used to build a prototype corporate body, person, and family name and biographical/historical access system.
+
+In the current system all input records, and potential matches are
+recorded in a relational database with the following structure:
+
+* * * * *
+
+[1] Using contextual information in determining that two or more records
+represent the same entity has been successful in matching and merging
+authority records in an international context. See Rick Bennett,
+Christina Hengel-Dittrich, Edward T. O'Neill, and Barbara B. Tillett
+VIAF (Virtual International Authority File): Linking Die Deutsche
+Bibliothek and Library of Congress Name Authority File:
+http://www.ifla.org/IV/ifla72/papers/123-Bennett-en.pdf
+
+![Screen Shot 2014-06-22 at 3.08.12 PM.png](images/image00.png)
+
+The current processing steps are summarized in the following
+diagram:
+
+![Slide1.jpg](images/image01.jpg)
+
+
+#### Discovery/Dissemination
+
+
+#### Prototype research tool^[[f]](#cmnt6)^
+
+
+The main data input for the prototype research tool is the set of merged
+EAC-CPF documents produced in the match/merge system. Some other
+supplemental data sources, such as DBpedia and the Digital Public
+Library of America, are also consulted during the indexing process.
+
+A pre-indexing phase is run on the merged EAC-CPF documents. During
+pre-indexing, name headings and Wikipedia links are extracted, and
+then used to look for possible related links and data in supplemental
+sources. The output of the pre-indexing phase consists of XML documents
+recording the supplemental data.
+
+Once the supplemental XML files are generated, two types of indexes are
+created, which serve as the input to the web site. The first
+index created runs across all documents and provides access to the full
+text and specific facets of metadata extracted from the documents.
+Additionally, the XML structure of each document is indexed as a
+performance optimization that allows for transformations to be
+efficiently applied to large XML documents.
+
+The public interface to the prototype research tool utilizes the index
+across all documents to enable full text, metadata, and faceted searches
+of the merged EAC-CPF documents. Once a search is completed and a
+specific merged EAC-CPF document is selected for display, the index of
+the XML document structure is used to quickly transform the merged
+document into an HTML presentation for the end user.
+
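To make the first kind of index more concrete, here is a toy Python sketch of a combined full-text and facet index. It is not how XTF or Lucene work internally; the class name `TinyIndex`, the facet name `entity_type`, and the sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


class TinyIndex:
    """A toy inverted index with exact-value facets."""

    def __init__(self) -> None:
        self.terms: Dict[str, Set[str]] = defaultdict(set)
        self.facets: Dict[str, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def add(self, doc_id: str, text: str, facets: Dict[str, str]) -> None:
        """Index every lowercased token and every facet value of a document."""
        for token in text.lower().split():
            self.terms[token.strip(".,;:()")].add(doc_id)
        for name, value in facets.items():
            self.facets[name][value].add(doc_id)

    def search(self, query: str,
               facet_filter: Optional[Dict[str, str]] = None) -> List[str]:
        """Return sorted ids of documents containing all query tokens."""
        tokens = query.lower().split()
        if not tokens:
            return []
        hits = set.intersection(*(self.terms.get(t, set()) for t in tokens))
        for name, value in (facet_filter or {}).items():
            hits &= self.facets.get(name, {}).get(value, set())
        return sorted(hits)


if __name__ == "__main__":
    index = TinyIndex()
    index.add("doc1", "Joseph Henry, physicist", {"entity_type": "person"})
    index.add("doc2", "Henry family papers", {"entity_type": "family"})
    print(index.search("henry"))
    print(index.search("henry", {"entity_type": "family"}))
```

A full-text query narrows the candidate set first, and each facet filter then intersects that set with the documents carrying the chosen facet value.
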
+In the SNAC1 prototype a graph database was created after the full text
+indexing was complete. The graph database was used to power
+relationship visualizations and an API used to dynamically integrate
+links to SNAC into archival description access systems. This graph
+database was then converted into linked data, which was loaded into a
+SPARQL endpoint. This step has not yet been implemented in the SNAC 2
+prototype. Because the merged EAC-CPF documents are of higher quality
+for the SNAC 2 prototype, the graph extraction process is no longer
+dependent on the full text index being complete, so it could run in
+parallel with pre-indexing and indexing.
+
+XTF is the main technology used to power public access to the merged
+EAC-CPF records. XTF integrates Lucene for indexing and Saxon for XML
+transformation, making heavy use of XSLT for customization and display
+of search results and the merged documents. EAC-CPF and search results
+are transformed to HTML5 and JSON for consumption by the end user's web
+browser. Multiple JavaScript and CSS3 libraries and technologies are
+used in the production of the "front end" code for the website. Google
+Analytics is used to measure use of the site. Wercker, Middleman, and
+Bower are used to build the front end code for the site.
+
+This technical architecture
+
+[links to code]
+
+
+#### Gap analysis

 This is gap analysis between the current and SNAC2-prototype. Perhaps
 this should be in the Required and Planned Functionality below.

-Data maintenance
----------------
+#### Data maintenance

 A goal of the pilot phase is to demonstrate cooperative maintenance of
 the data resource.  The prototype does not have robust support for
@@ -57,354 +403,28 @@ not be run in a “clustered” mode; must scale up, not scale out
     •    Cheshire II does not have a Open Source Initiative certified
 license

-Pilot phase architecture
------------------------
-
-Alternative 1^[[h]](#cmnt8)^
----------------------------
-
-The most expeditious way to launch a pilot phase would be to leave the
-basic technical architecture of the prototype in place, and to focus
-initial energies into establishing policies and procedures that work
-within the constraints of this architecture.  Two key systems that would
-need to be set up for this approach to work are a customer relationship
-management (CRM) system and ticketed help desk.
-
-Customer relationship management systems have historically been used as a
-sales support tool.  Information on current and potential customers,
-including contact information and institutional affiliation, is
-maintained in a database.  All pilot member institutions and designated
-contacts should be entered into a CRM system for the pilot phase.  All
-correspondence, calls, contracts and agreements with accepted and
-potential pilot phase members should be logged or stored in the CRM
-system.^[[i]](#cmnt9)^
-
-The CRM system should support or integrate with a help desk that issues
-work ticket numbers. ^[[j]](#cmnt10)^ Any addition or change in the
-maintained corpus of merged EAC-CPF records will require a work ticket
-number.  Expectations for response times for issued tickets should be
-established, clearly communicated, and measured for compliance.  A
-customer service manager^[[k]](#cmnt11)^ will actively monitor the queue
-of work tickets pending.  An operations manual will be maintained so
-that the customer service manager or any additional first tier support
-staff will be able to handle a set of ticket types.  If a procedure for
-the request is not yet documented in the operations manual - or if the
-manual indicates this is a task for second tier - then the ticket will
-be escalated to the second tier support programmer.  The second tier
-support programmer will have the technical skills to manipulate the
-technical infrastructure; such as through editing XML files or directly
-altering the database.  The second tier support programmer would also be
-responsible for performing data extraction and normalization of non
-EAC-CPF data sources processed during the pilot phase.^[[l]](#cmnt12)^ 
-The volume and type of tickets will help establish priorities for
-establishing procedures that can be automated for first tier support and
-for future phases that do not require pilot members to contact the help
-desk and obtain work tickets.
-
-An automated way to establish a new identity should be established early in
-the pilot phase, so that participants can mint a new ARK identifier
-without creating a work ticket.  Initially, a work ticket would still be
-generated once the participant was ready to submit the new record through
-the match/merge process.
-
-Given the importance of maintaining links from the merged EAC-CPF record
-to related resources, a link harvesting protocol should be developed
-early in the pilot phase.  When a pilot phase participant identifies a
-match in SNAC with a name they have in a collection description; the
-link harvesting protocol would specify how to publish that link in their
-HTML display of their collection description or through some other
-mechanism (perhaps through an extension to the sitemap protocol, along
-the lines of how ResourceSync works).  Procedures would be established
-to then notify SNAC to harvest links from the participant, and the SNAC
-“related collections” section would be automatically updated.  Such
-updates would be based on a “linked data” technology rather than the
-submission of XML files.
-
-Alternative 2
-------------
-
-Pure XML architecture for edits (edit the merged EAC-CPF records, maybe
-with something like xEAC and with the merged files in revision control.
- This might make export from the match/merge challenging)
-
-Alternative 3
-------------
-
-Pure RDF architecture
-
-Current State Conclusion (All, Daniel, Tom)
-------------------------------------------
-
-The current systems function well enough for researchers and other
-stakeholders to see large data sets fully processed. These systems will
-benefit from additional work to make them more mature in the usual ways
-that software develops: robustness, testing and QA, documentation,
-examples, consistent API. Most of the current software will be used in
-the production product.
-
-Required and Planned Functionality (All authors)
-================================================
-
-(We need to break out each item into UI functionality, and API
-functionality.)
-
-Expanded CPF schema requirements
--------------------------------
-
-Provenance and history of each element/attribute.
-
-Unique ID per element of CPF if that element is editable.
-
-Version control on a per-element basis.
-
-Expanded Database Schema
------------------------
-
-The current database (Postgres) is sufficient for the current project
-only. It will expand, and the expansion will probably be fairly
-dramatic. We need to determine what tables and fields are necessary to
-support additional functions. Each section of this document may need a
-“data” section, or else this database schema section needs to address
-every functional and UI aspect of all APIs that have anything to do with
-the database.
-
-Each field within CPF may (will?) need provenance meta data. Likewise
-many fields in the database may need data for provenance.
-
-The database needs audit trail ability to a fairly granular (field)
-level. Audit is a new table at the very least. It seems likely that
-nearly every table will gain some audit related fields.
-
-Will database records be versioned? How is that handled? Seems like it
-may be done via versioning table and some interesting joins. We need to
-evaluate the various standard methods for database internal versioning.
-
-CPF record has links to a “watch” table so users can watch each record,
-and can watch for certain types of changes. Need UI for the watch
-system. Need an API for the watch system.
-
-Need a user table, group table, probably a group permission table so
-that permissions are hard coded with groups. We also want to allow
-several permissions per group. Need UI for user, group, and
-group-permission management.
-
-If we create a generalized workflow system (as opposed to an ad-hoc
-linked set of reports) then we need workflow tables. The tables would
-establish workflow paths, necessary permissions, and would be linked to
-users and groups.
-
-Need fields to deal with delete/embargo. This may be best implemented
-via a trigger or perhaps a view. By making what appear to be simple
-SELECTs through a view, the view can exclude deleted records. We must
-think about how using a view (or trigger) will affect UPDATE and INSERT.
-Ideally the view is transparent. Is there some clever way we can
-restrict access to the original table only via the view?
-
-Need record lock on some types of records. This lock needs to be honored
-by several modules, so like “delete”, lock might best be implemented via
-a view and we \*only\* access the table in question via the view.
-
-If there are different levels of review for different elements in the
-record, then we need extra granularity in the workflow or the edited
-record info to know the type of record edited apropos of workflow
-variations.
-
-If there are different reviewers for different parts of the record, then
-workflow data (and workflow configuration) needs to be able to notify
-multiple people, and would have to get multiple reviewer approvals
-before moving to the next phase of the workflow.
-
-Institutional affiliation is probably common enough to want a field in
-the user table, as opposed to creating a group for each institution. The
-group is perhaps more generalized and could behave identically (or almost
-identically) to a field (with controlled vocabulary) in the user table.
-
-Make sure we can write a query (report) to count numbers of records
-based on type of edit, institution of the editor, and number of holdings.
-
-If we want to be able to quickly count some CPF element such as outgoing
-links from CPF to a given institution, then we should put those CPF
-values into the SQL database, as meta data for the CPF record.
-
-What is: How many referral links to EAC records that they created?
-
-Be able to count record views, record downloads. Institutional dashboard
-reports need the ability to group-by user, or even filter to a specific
-user.
-
-Reporting needs to help managers verify performance metrics. This
-assumes that all changes have a date/timestamp. Once workflow and
-process decisions are set, performance requirements for users such as
-load/performance (how many updates and changes to records can be handled
-at once), search response time, edit time (outside of review workflow),
-and update times need to be set.
-
-Effort reporting to allow SNAC and participants to communicate to others
-the actual level of effort involved. This sounds like a report with time
-span and numbers of records handled in various ways. SNAC might use this
-when going from pilot into production so that everyone knows what effort
-will be required for X number of records/actions (of whatever action
-type).
-
-Time/activity reporting could allow us to assess viability, utility, and
-efficiency of maintenance system processes.
-
-Similar reports might be generated to evaluate the discovery interface.
-Something akin to how much time was required to access a certain number
-of records. Rachael said: Assess viability of access functionality:
-performance time, available features, and ease of use.
-
-We could try to report on the amount of training necessary before a new
-user was able to work independently in each of various areas (content
-input, review, etc.)
-
-Introduction to Planned Functionality
-------------------------------------
-
-The current system works, but is somewhat skeletal. It requires careful
-attention from the developers to run the data processing pipelines. It
-lacks administrative controls and reporting. Existing software
-development process follows modern agile practices, but some
-processes are weak or incomplete. The research tools are somewhat
-rudimentary. It needs infrastructure where domain experts can correct
-and update merged authority descriptions.
-
-The functional requirements below specify in detail all of the
-capabilities of the new [production?] system. A separate section about
-user interface (UI) specifies the visual/functional aspects of the UI
-and includes discussion of the user experience (UX). Some of the
-functional requirements exist only to support actions of the UI, and
-UI-related functions should exist in their own independent API.
-
-Software development, processes, and project management
-------------------------------------------------------
-
-Choices for programming languages, operating system, databases, version
-control, and various related tools and practices are based on extensive
-experience of the developer community, and a complex set of requirements
-for the coding process. Current best practices are agile development
-using practices that allow programmers wide leeway for implementation
-while still keeping the processes manageable.
-
-Test-driven development ideally means automated testing, with careful
-attention to regression testing. It takes some extra time up front to
-write the tests. Each test is small, and corresponds to small sections
-of code where both code and test can be quickly created. In this way,
-the software is kept in a working state with only brief downtimes during
-feature creation or bug fixes. Large programs are made up of
-intentionally small functions each of which is tested by a small
-automated test.
-
-Regression testing refers to verifying that old bugs do not reappear.
-Every bug fix has a corresponding test, even if the function in question
-did not originally have a test for the bug. Each new bug needs a new
-test. Bugs frequently reappear, especially in complex sections of code.
-
-Source code version control is vital to both development process, and to
-the release process. During development, frequent small changes are
-checked-in to the version control, along with a meaningful comment. The
-history of the code can be tracked. This occasionally helps to
-understand how bugs come into existence. In the Git system, the history
-command is “blame”, a bit of programmer dark humor where the history is
-used to know who to blame for a bug (or any undesirable feature).
-
-Moving code into Quality Assurance (QA) and then into the production
-environment are both integral with source code management. Many version
-control systems allow tagging a release with a name. The collected
-source code files are marked as a named (virtual) collection, and can be
-used to update a QA area. Human testing and review happens in QA. After
-QA we have release. Depending on the nature of the system release can be
-quite complex with many parties needing to be notified, and coordination
-across groups of developers, sysadmin, managers, support staff, and
-customers. Agile development tends towards small, seamless releases on a
-frequent (weekly or monthly) basis where communication is primarily via
-update of electronic documentation. The process needs to assure that
-fixes and new features are documented. The system must have tools to see
-the current version of the system with its change log, as well as
-comparing that to previous releases. All of these are integrated with
-change management.
-
-Bug reporting and feature requests fall (broadly speaking) into the
-category of change management. Typically a small group of senior
-developers and stakeholders review the bug/feature tracking system to
-assign priorities, clarify, and investigate. There are good
-off-the-shelf systems for tracking bugs and feature requests, so we have
-several choices. This process happens almost as frequently as the
-features/bug fix coding work of the developers. That means on-going,
-more or less continuous review of fix/features requests every few days,
-depending on how independent the developers are. Agile applies to
-everyone on the project. Ideal change management is not onerous. As
-tasks are completed, someone (developers) update feature status with “in
-progress”, “completed” and so on. There might be additional status
-updates from QA and release, but SNAC probably isn’t large enough to
-justify anything too complex.
-
-QA and Related Tests for Test-driven Development (Tom, Brian, Ray)
------------------------------------------------------------------
-
-The data extraction pipelines manage massive amounts of data, and
-visually checking descriptions for bugs would be inefficient if not
-infeasible. The MARC extraction process is verified by just over 100
-quality assurance descriptions. The output produced from each
-description is checked for some specific value that confirms that the
-code is working correctly and historical bugs have not reappeared. The
-EAD extraction has a set of QA files, but the output verification is not
-yet automated. A variety of file counts and measures of various sorts
-are performed to verify that descriptions have all been processed. All
-CPF output is validated against the Relax NG schema. Processing log
-files are checked for a variety of error messages. Settings used for
-each run are recorded in documentation maintained with the output files.
-The source code is stored in a Subversion repository.
-
-Our disaster recovery processes must be carefully documented.
-
-The match/merge process is validated by …
-
-Required new features
---------------------
-
-The majority of new features will be in two areas: the maintenance
-system, and the administration system. None of this code exists. The
-maintenance system has a web UI and a server-based back end that
-interacts with the same database used by the match-merge. The
-maintenance system also requires an authentication system (login) that
-allows us to manage the extensive collaborative efforts. The current
-processing of data is accomplished only on servers at the command line,
-and is handled directly by project programmers. In the new maintenance
-system, that will be driven by content experts via a web site, and
-therefore must expect the issues of authentication and authorization
-inherent in collaborative data manipulation web applications.
-
-The system will require reports. These will cover broad classes of
-issues related to managing resources, usage statistics, administration,
-maintenance, and some reports for end user researchers.
-
-(Fill in prose introducing the other subsystems such as reporting)
-
-One important aspect of the project is long-term viability and
-preservation. We should be able to export all data and metadata in
-standard formats. Part of the API should cover export facilities so that
-over time we can easily add new export features to support emerging
-standards.
-
-The ability to export all the data for preservation purposes also gives
-us the ability to offer bulk data downloads to researchers and
-collaborating peer institutions.
-
-Documentation (all authors)
---------------------------
-
-Every aspect of the system requires documentation. Most visible to the
-public is the user interface for discovery. Maintenance will be
-complicated, and our processes are somewhat novel, so this will need to
-be extensive, well illustrated with screenshots, and carefully tested.
-
-Documentation intended for developers might be somewhat sparse by
-comparison, but will be critical to the on-going software development
-process. All the databases, operating system, httpd and other servers
-need complete documentation of installation, configuration, deployment,
-starting, stopping, and emergency procedures.
-
-It is probably wise to choose a wiki-like documentation system at the
-outset of the project.
+#### Pilot phase architecture
+
+
+The software will consist of a web application written in PHP, using PostgreSQL for data
+storage. Granular work flow will be managed by a work flow engine. Hand off of tasks will be managed by a series of
+semaphores. 
+
+An Identity Reconciliation (IR) API will test similarity between two identities, and allow us to search the
+database for matches. The IR API will make concrete our concept of "identity" as a collection of data fields.
+
+CPF records will be linked to relevant resources. Brian suggests actively harvesting links through an extension
+to the sitemap protocol along the lines of ResourceSync. Brian notes that the updates are based on linked
+data, not submission of XML files.
+
+
+#### Current State Conclusion
+
+
+The current systems function well enough for researchers and other stakeholders to see large data sets fully
+processed. These systems will benefit from additional work to make them more mature in the usual ways that
+software develops: robustness, testing and QA, documentation, examples, consistent API. Most of the current
+software will be used in the production product.
+
+

--- a/tat_requirements/introduction.md
+++ b/tat_requirements/introduction.md
-TAT Functional Requirements
+#### TAT Functional Requirements
+
+
+[Plan](plan.md) (Read this first. External, broad view roadmap)
+
+[Co-op Background](co-op_background.md)  (Current state and SNAC background)
+
+[Introduction ](introduction.md) (This document. The technical requirements part one)
+
+[Requirements](requirements.md) (Tech requirements part two, includes requirements from Rachael's spreadsheets)
+
+#### Introduction to Planned Functionality
+
+The functional requirements below specify in detail all of the
+capabilities of the new [production?] system. A separate section about
+user interface (UI) specifies the visual/functional aspects of the UI
+and includes discussion of the user experience (UX). Some of the
+functional requirements exist only to support actions of the UI, and
+UI-related functions should exist in their own independent API.
+
+#### Software development, processes, and project management
+
+
+Choices for programming languages, operating system, databases, version
+control, and various related tools and practices are based on extensive
+experience of the developer community, and a complex set of requirements
+for the coding process. Current best practices are agile development
+using practices that allow programmers wide leeway for implementation
+while still keeping the processes manageable.
+
+Test-driven development ideally means automated testing, with careful
+attention to regression testing. It takes some extra time up front to
+write the tests. Each test is small, and corresponds to small sections
+of code where both code and test can be quickly created. In this way,
+the software is kept in a working state with only brief downtimes during
+feature creation or bug fixes. Large programs are made up of
+intentionally small functions each of which is tested by a small
+automated test.
+
+Regression testing refers to verifying that old bugs do not reappear.
+Every bug fix has a corresponding test, even if the function in question
+did not originally have a test for the bug. Each new bug needs a new
+test. Bugs frequently reappear, especially in complex sections of code.
+
+Source code version control is vital to both the development process and
+the release process. During development, frequent small changes are
+checked in to version control, along with a meaningful comment. The
+history of the code can be tracked. This occasionally helps to
+understand how bugs come into existence. In the Git system, the per-line
+history command is “blame”, a bit of programmer dark humor where the history is
+used to know who to blame for a bug (or any undesirable feature).
+
+Moving code into Quality Assurance (QA) and then into the production
+environment are both integral with source code management. Many version
+control systems allow tagging a release with a name. The collected
+source code files are marked as a named (virtual) collection, and can be
+used to update a QA area. Human testing and review happens in QA. After
+QA we have release. Depending on the nature of the system, release can be
+quite complex with many parties needing to be notified, and coordination
+across groups of developers, sysadmin, managers, support staff, and
+customers. Agile development tends towards small, seamless releases on a
+frequent (weekly or monthly) basis where communication is primarily via
+update of electronic documentation. The process needs to assure that
+fixes and new features are documented. The system must have tools to see
+the current version of the system with its change log, as well as
+comparing that to previous releases. All of these are integrated with
+change management.
+
+Bug reporting and feature requests fall (broadly speaking) into the
+category of change management. Typically a small group of senior
+developers and stakeholders review the bug/feature tracking system to
+assign priorities, clarify, and investigate. There are good
+off-the-shelf systems for tracking bugs and feature requests, so we have
+several choices. This process happens almost as frequently as the
+features/bug fix coding work of the developers. That means on-going,
+more or less continuous review of fix/features requests every few days,
+depending on how independent the developers are. Agile applies to
+everyone on the project. Ideal change management is not onerous. As
+tasks are completed, someone (usually a developer) updates the feature status to "in
+progress", "completed", and so on. There might be additional status
+updates from QA and release, but SNAC probably isn't large enough to
+justify anything too complex.
+
+#### QA and Related Tests for Test-driven Development
+
+
+The data extraction pipelines manage massive amounts of data, and
+visually checking descriptions for bugs would be inefficient if not
+infeasible. The MARC extraction process is verified by just over 100
+quality assurance descriptions. The output produced from each
+description is checked for some specific value that confirms that the
+code is working correctly and historical bugs have not reappeared. The
+EAD extraction has a set of QA files, but the output verification is not
+yet automated. A variety of file counts and measures of various sorts
+are performed to verify that descriptions have all been processed. All
+CPF output is validated against the Relax NG schema. Processing log
+files are checked for a variety of error messages. Settings used for
+each run are recorded in documentation maintained with the output files.
+The source code is stored in a Subversion repository.
+
+Our disaster recovery processes must be carefully documented.
+
+The match/merge process is validated by …
+
+#### Documentation
+
+System documentation is in http://gitlab.iath.virginia.edu in markdown files.
+
+Every aspect of the system requires documentation. Most visible to the public is the user interface for
+discovery. Maintenance will be complicated, and our processes are somewhat novel, so this will need to be
+extensive, well illustrated with screenshots, and carefully tested.
+
+Documentation intended for developers might be somewhat sparse by comparison, but will be critical to the
+on-going software development process. All the databases, operating system, httpd and other servers need
+complete documentation of installation, configuration, deployment, starting, stopping, and emergency
+procedures.
+
+#### Required new features
+
+
+The majority of new features will be in two areas: the maintenance
+system, and the administration system. None of this code exists. The
+maintenance system has a web UI and a server-based back end that
+interacts with the same database used by the match-merge. The
+maintenance system also requires an authentication system (login) that
+allows us to manage the extensive collaborative efforts. The current
+processing of data is accomplished only on servers at the command line,
+and is handled directly by project programmers. In the new maintenance
+system, that will be driven by content experts via a web site, and
+therefore must expect the issues of authentication and authorization
+inherent in collaborative data manipulation web applications.
+
+The system will require reports. These will cover broad classes of
+issues related to managing resources, usage statistics, administration,
+maintenance, and some reports for end user researchers.

-#### Authors
+- Web application (architect: Robbie)

+The web application is a wrapper for all the APIs. It can have an API of its own, or not. It handles all http
+requests, validating the data, deciding what needs to be done, doing real work, and handing some output back
+to the user. Typically the output is HTML, but we are already planning for file downloads, and JSON data as
+output from REST API calls. 

-Tom Laudeman, Technical lead, University of Virginia, Institute for
-Advanced Technology in the Humanities
-[twl8n@virginia.edu](mailto:twl8n@virginia.edu)
+- Data validation API

-Brian Tingle, Technical Lead for Digital Special Collections, California
-Digital Library
+Data from the web browser needs sanity checking and untainting before being handed to the rest of the
+application. Initially the data validation API can consist of nothing more than untainting input from the
+browser. We can add various checks and tests. We need to decide if the validation API can reject data, and if
+it can, then it needs to interact with the work flow engine, the actual work flow, and whatever messaging
+system we use to display messages to end users.
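
A minimal sketch of the "untaint only" starting point, written in Python for brevity rather than the planned PHP. The field names and patterns in `RULES` are hypothetical, and returning rejected fields separately is only one possible answer to the open question of whether the validation API may reject data.

```python
import re

# Hypothetical whitelist: field name -> pattern the whole value must match.
RULES = {
    "record_id": re.compile(r"\d{1,12}"),
    "search": re.compile(r"[\w ,.'-]{1,200}"),
}


def untaint(params):
    """Split raw request parameters into (clean, rejected) dictionaries.

    A value is clean only if its field is known and the stripped value
    matches the field's pattern in full. Everything else is rejected.
    """
    clean, rejected = {}, {}
    for key, value in params.items():
        rule = RULES.get(key)
        stripped = value.strip()
        if rule is not None and rule.fullmatch(stripped):
            clean[key] = stripped
        else:
            rejected[key] = value
    return clean, rejected
```

Only the `clean` dictionary would be handed to the rest of the application; what to do with `rejected` is a work flow and messaging decision.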

-Rachael Hu, User Experience Design Manager, California Digital Library
+- Identity Reconciliation (aka IR) (architect: Robbie)

-Ray R. Larson, U.C. Berkeley - School of Information
+This API uses many aspects of identity, testing each against a target population of other identities. The
+final answer is a floating point number giving a match strength. IR has two modes of operation. Mode one
+compares two identities and returns a match strength. Mode two compares a single identity against the entire
+database, returning match strengths. Mode two is somewhat unclear.
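
To make the two modes concrete, here is a rough Python sketch. It is not the IR algorithm: the field names in `FIELDS` are hypothetical, the plain string similarity from the standard library's `difflib` stands in for whatever per-field tests IR ends up using, and the unweighted average is an assumption.

```python
from difflib import SequenceMatcher

# Hypothetical identity fields; the real list is not yet specified.
FIELDS = ("name", "dates", "place")


def match_strength(a, b):
    """Mode one: compare two identities, return a strength in [0.0, 1.0]."""
    scores = []
    for field in FIELDS:
        left, right = a.get(field), b.get(field)
        if left is None or right is None:
            continue  # a field missing on either side is skipped, not penalized
        scores.append(SequenceMatcher(None, left.lower(), right.lower()).ratio())
    return sum(scores) / len(scores) if scores else 0.0


def rank_matches(identity, population, limit=5):
    """Mode two: compare one identity against a population, strongest first."""
    scored = [(match_strength(identity, other), other) for other in population]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

In this reading, mode two is simply mode one applied to every candidate and sorted, which would need an index or pre-filter to be practical against the full database.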

-Robbie Hott
+- workflow manager (Tom)

-#### Organization of documenatation
+Every action the application can perform is part of the work flow. The names of these actions along with names
+of their requisites are organized into a work flow table. The work flow engine does not know how to do real
+work, but it does know the names of the functions which do the real work. A new feature (aka function, task)
+is added to the application, by adding its name to the work flow, and creating a function of the same name in
+the application. Likewise, requisites are determined by boolean functions, and every requisite must have a
+matching function known to the work flow engine. The work flow enforces role-based behavior by testing the
+requisites. The workflow engine exists, but needs to be ported from Perl to PHP, and the work flow data should
+be stored in the SQL database.
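
The table-driven idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the existing Perl engine: the task and requisite names are hypothetical, and a list stands in for the work flow table that would live in the SQL database.

```python
# Application functions: they do the real work and know nothing about the engine.
def edit_record(ctx):
    ctx["log"].append("edit_record")


def request_login(ctx):
    ctx["log"].append("request_login")


# Requisites: boolean functions that read application state.
def is_logged_in(ctx):
    return ctx.get("user") is not None


def is_editor(ctx):
    return "editor" in ctx.get("roles", ())


# Work flow table: (task name, names of requisites that must all be true).
WORK_FLOW = [
    ("edit_record", ["is_logged_in", "is_editor"]),
    ("request_login", []),  # fallback when nothing above is permitted
]

# The engine only knows functions by name.
TASKS = {f.__name__: f for f in (edit_record, request_login)}
REQUISITES = {f.__name__: f for f in (is_logged_in, is_editor)}


def run_work_flow(ctx):
    """Run the first task whose requisites all hold and return its name."""
    for task_name, requisite_names in WORK_FLOW:
        if all(REQUISITES[name](ctx) for name in requisite_names):
            TASKS[task_name](ctx)
            return task_name
    raise LookupError("no task is permitted in the current state")
```

Adding a feature then means adding one row to the table and one function of the same name, and role-based behavior falls out of the requisites.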

-[Plan](plan.md) (external, broad view roadmap)
+- Support for work history and task staging. 

-[Introduction (this document) ](introduction.md)
+Editing consists of several stages of work that may be performed by different people and/or different
+roles. We need database tables to support saving of work state data. Create a prototype table schema so we can
+think about this problem and create a functional spec.

-[Requirements](requirements.md)
+For an edit we need: CPF id, user id, timestamp, bitfield or work flow tags, optional user notes. For
+search we need: user id, search string, timestamp.
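
As a starting point for the prototype schema, the two field lists above could look like this. The sketch uses Python's built-in SQLite driver so it can run anywhere; the table names, column names, and the `DRAFT`/`REVIEWED` flags are hypothetical, and the PostgreSQL version would use its own types.

```python
import sqlite3

SCHEMA = """
CREATE TABLE edit_history (
    id         INTEGER PRIMARY KEY,
    cpf_id     INTEGER NOT NULL,
    user_id    INTEGER NOT NULL,
    edited_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    flow_tags  INTEGER NOT NULL DEFAULT 0,  -- bitfield of work flow states
    notes      TEXT                          -- optional user notes
);
CREATE TABLE search_history (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    search_string TEXT    NOT NULL,
    searched_at   TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""

# Hypothetical work flow flags stored in the bitfield.
DRAFT, REVIEWED = 1, 2


def open_history_db():
    """Create an in-memory database holding the prototype tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_edit(conn, cpf_id, user_id, flow_tags=0, notes=None):
    """Save one stage of editing work and return the new row id."""
    cur = conn.execute(
        "INSERT INTO edit_history (cpf_id, user_id, flow_tags, notes) "
        "VALUES (?, ?, ?, ?)",
        (cpf_id, user_id, flow_tags, notes),
    )
    return cur.lastrowid
```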

-co-op background
+- SQL schema (Robbie, Tom)

+All data is stored in a SQL database. Details are given elsewhere.
+    
+- Controlled vocabulary subsystem or API [Tag system](#controlled-vocabularies-and-tag-system)
+
+We need controlled vocabulary for several data fields. This system handles all aspects of all controlled vocabularies.
+
+- CPF to SQL parser (Robbie)
+
+The input for the application is CPF files. These files need to be parsed into data fields and input into the
+SQL database. This application exists, but needs some additional functionality.
+
+- Name serialization tool, selectable pre-configured formats
+
+Outputting name strings based on name data fields in the database is a tricky problem. There are several
+output formats. The name serialization deals with this issue.
+
+- Name string parser
+
+Names in CPF files are currently strings. The CPF <part> element has been imported into the SQL database as a
+string, but data needs require individual name components. Parsing names is a tricky problem, but several
+parsers exist. We need to integrate one or more parsers, and perhaps tweak those parsers to handle the SNAC names.
+
+- Date parser
+
+We have several date parsers, but none are fully comprehensive. We can use the existing parsers, but they need
+to be integrated into a single, comprehensive parser.
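
One simple way to integrate several partial parsers is to try them in order and report which one succeeded. The Python sketch below assumes three toy parsers that are stand-ins for the existing ones; mapping a bare year to January 1 is a simplification for illustration only.

```python
import re
from datetime import date, datetime


def parse_iso(text):
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_long(text):
    # English month names, assuming the default C locale.
    return datetime.strptime(text, "%B %d, %Y").date()


def parse_year_only(text):
    if not re.fullmatch(r"\d{4}", text):
        raise ValueError(text)
    return date(int(text), 1, 1)


# Order matters: the most specific parsers come first.
PARSERS = (parse_iso, parse_long, parse_year_only)


def parse_date(text):
    """Return (parser name, date) from the first parser that accepts text,
    or None if no parser can handle it."""
    text = text.strip()
    for parser in PARSERS:
        try:
            return parser.__name__, parser(text)
        except ValueError:
            continue
    return None
```

Recording the parser name alongside the result makes it easier to audit which date forms still fall through to `None`.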
+
+- CPF record edit, edit each field
+
+Record editing on the server is handled by a collection of functions. The specifications for this may evolve
+in parallel to the code. We know that each field needs to be changed, but the details of work flow and data
+validation have not been determined. Work flow and validation are both likely to change as the SNAC policies
+evolve. There are UI requirements for editing.
+
+- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
+
+Record splitting requires a set of functions and UI requirements documented elsewhere.
+
+- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
+
+Record merge requires a set of functions and UI requirements documented elsewhere.
+
+- Object architecture, coding style, class template (architect Robbie)
+
+We will have a specific architecture of the web application, and of the classes and objects involved.
+
+- UI widgets, mostly off the shelf, some custom written. We need to have UI edit/chooser widget for search and
+  select of large numbers of options, such as the Joseph Henry cpfRelations that contain some 22K
+  entries. Also need to list all fields which might have large numbers of values. In fact, part of the meta
+  data for every field is "number of possible entries/repeat values" or whatever that's called. From a
+  software architecture perspective, the answer is 0, 1, infinite.
+
+One important aspect of the project is long-term viability and preservation. We should be able to export all
+data and metadata in standard formats. Part of the API should cover export facilities so that over time we can
+easily add new export features to support emerging standards.
+
+The ability to export all the data for preservation purposes also gives us the ability to offer bulk data
+downloads to researchers and collaborating peer institutions.
+
+#### Web application overview
+
+Some aspects of the web app aren't yet clear, so there are details to be worked out, and some large-ish
+concepts to clarify. I'm guessing we will agree on most things, and one of us or the other will just concede
+on stuff where we don't agree.
+
+Requirements:
+
+- Expose an http accessible API that is viable for wget, browser <form>, and Ajax calls.
+
+- Supported input format depends on the complexity of the requested operation. 
+
+- Public functions require no authentication. Everything else must include authentication data.
+
+Internal flow: 
+
+1. validate the inputs. 
+
+1. Somehow slice and dice the CGI params of the REST call into an abstracted request we can pass to the
+internal API. I suppose that the external and internal APIs are very similar, but we almost certainly need
+some level of symbolic reference aka abstraction. Each REST call has its requisite data. Some data is as
+simple as a record id, and some will be fairly interesting json data structures.
+
+1. The web app API does the tasks specified by the REST request and the work flow engine's directions.
+
+  1. Every http request must go through the work flow engine so that the work flow is validated and managed.
+    
+  1. Every web app has a work flow, but people mostly just cobble that together with a bunch of implied
+    functionality using conditionals and side-effect-full function calls. In our code, the internal API is
+    100% work flow agnostic.
+
+  1. I can explain this in more detail, but it makes a huge improvement in the structure of the application.
+
+1. Create the output data object if it wasn't created by the functions doing the work.
+
+1. Pass the output data to a rendering function (or module) to be rendered into the appropriate output format:
+html, text, xml, etc. and sent to stdout, or returned as an http file download. JSON probably doesn't need to
+be rendered since JSON is "data" and not "presentation".
+
+The work flow engine relies on functions that read application data and return booleans so that the
+work flow engine can detect the application's relevant state. I guess that sounds confusing because the work
+flow engine has state, and the application has state. Those two types of state are vastly different and only
+related to each other in that the work flow engine can detect the application's state. The internal API of the
+web app has no idea that the work flow engine even exists. And the work flow engine knows what work needs to
+be done, but has no idea how it will be done. This is a very lovely separation of concerns.
+
+#### Web application output via template
+
+A well known, easy, powerful method of creating presentation output is to use a template module. Templating
+separates business logic from presentation logic, thus following an MVC model. Our business logic is our work
+flow and related function calls. Presentation is our UI, and the work flow engine has no idea that a UI exists,
+let alone how to create it. Curiously, the presentation logic knows how to create the presentation rendering,
+but has no idea what it does or what it interacts with. This is another example of strong separation of
+concerns.
+
+A simple hello world text template with a single variable world = "world" would be:
+
+```
+Hello [% world %]!
+```
+
+Or a simple HTML version:
+
+```
+<html><body>Hello [% world %]!</body></html>
+```
+
+That example is based on the Template Toolkit http://www.template-toolkit.org/ for which there is a Perl
+module, and a Python module. Template modules are fairly common, so I'm almost certain we will have several to
+choose from in PHP.
+
+Choosing our own select software modules, including a template module, is better than being locked into a
+large, cumbersome web framework. In general, web frameworks have issues: 
+
+- difficult to work with
+
+- no useful functionality that isn't more easily found in another software module 
+
+- they often break MVC
+
+- generally make debugging nearly impossible
+
+We can do much better by selecting a few modules to create a lightweight quasi-framework that is perfectly matched to our
+needs.
+
+Once the internal API completes its work, we will have output data. Output data is passed to a rendering
+layer that relies on the template module. The only code that knows anything about rendering is the rendering
+layer. To all the non-rendering code, there is only "output data" which does conform to a standard structure
+(almost certainly an output data object). The rendering layer takes the output object, and the requested format
+of the output (text, html, pdf, xml, etc.) to create the output. Happily, "rendering" is generally a single
+function call. We create a template object, call its "render" method with two arguments: 
+
+1. template file name,
+
+2. the output data object. 
+
+Default behavior is to write the output to stdout, but the render method can also
+return the output in a variable so we can create an http download.
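
The render call can be illustrated with a toy placeholder filler in Python. This is not Template Toolkit and not the PHP module we will eventually pick: it only handles single-value `[% name %]` placeholders (no loops or if statements), does no HTML escaping, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches placeholders such as [% world %] or [%world%].
PLACEHOLDER = re.compile(r"\[%\s*(\w+)\s*%\]")


def render(template_text, data):
    """Fill placeholders from data and return the result as a string.

    A placeholder with no matching key raises KeyError, so a mismatch between
    the template and the output data naming convention fails loudly.
    """
    return PLACEHOLDER.sub(lambda m: str(data[m.group(1)]), template_text)


def render_file(template_path, data):
    """Same as render(), but takes a template file name."""
    text = Path(template_path).read_text(encoding="utf-8")
    return render(text, data)
```

Because the result comes back in a variable, the caller decides whether to write it to stdout or return it as an http download.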
+
+Templates are human created static files containing placeholders. The template engine fills in the placeholders with
+values from relevant parts of the output data. Clearly, the output data object and the template must share a
+an object/property naming convention. The template engine functionality has single value fields, looping over
+input lists, and if statement branching based on input. But that's pretty much it. No work is done in the
+template that is not directly concerned with filling in placeholders, not even formatting (in the sense of
+rounding numbers, capitalizing strings, or adding html tags). Templates are valid documents of the output
+type, except in rare cases. The attached template is well-formed XML.
+
+The web app needs a file download output option as well as output to stdout.
+
+
+#### Data background
+
+The data is in a SQL database. Every piece of data is in a separate field to the extent that is practical.
+Data is organized into fields (columns), records (rows), and tables. Fields related to each other are in the
+same table. Every record has a unique, permanent, numerical id often called a "key" or "primary key". For
+the SNAC Co-op we have decided that records are never overwritten during update. This is somewhat unusual, but
+not unheard of. An update operation creates a new record identical to the old record except for updated
+fields. All old records are available for viewing via a special interface. The old records are invisible to
+operations that are intellectually acting on "current" data.
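
A small sketch of "update without overwrite", using Python's built-in SQLite driver. The layout is an assumption, not the SNAC schema: the `name` table, the `record_id`/`version` columns, and the `name_current` view are all hypothetical, chosen only to show old rows surviving while "current" queries see the latest version.

```python
import sqlite3


def open_versioned_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE name (
            row_id     INTEGER PRIMARY KEY,  -- unique per stored version
            record_id  INTEGER NOT NULL,     -- stable id of the logical record
            version    INTEGER NOT NULL,
            value      TEXT    NOT NULL,
            UNIQUE (record_id, version)
        );
        -- Operations on "current" data read this view and never see old rows.
        CREATE VIEW name_current AS
            SELECT n.* FROM name n
            WHERE n.version = (SELECT MAX(version) FROM name
                               WHERE record_id = n.record_id);
    """)
    return conn


def update_name(conn, record_id, new_value):
    """'Update' by inserting a new version; older rows are never modified."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM name WHERE record_id = ?",
        (record_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO name (record_id, version, value) VALUES (?, ?, ?)",
        (record_id, latest + 1, new_value),
    )
    return latest + 1
```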
+
+#### What is "normal form" and what informs the database schema design?
+
+Edgar F. "Ted" Codd created 12 rules (revised with a 13th rule) to clarify the Relational Database Management
+System (RDBMS). 
+
+https://en.wikipedia.org/wiki/Edgar_F._Codd
+
+Breaking any of these rules weakens data integrity and the ability of the system to manage the data. An RDBMS
+is not merely a bucket of data, but an entire eco-system for the management of data and data related
+activities. Before Codd's work, databases were managed on an ad-hoc basis as collections of files with
+links. It was a mess. Data was lost. Only the DBA knew how to find the data, and access methods could be very
+different for data in different locations. Accessing data could also be extremely slow. In addition to
+assuring the integrity of data, as well as managing it, relational database systems are very fast.
+
+https://en.wikipedia.org/wiki/Codd%27s_12_rules
+
+The "R" in RDBMS is "relational" and Codd invented the relational model of data. Key to relational data
+modeling is "normal form".
+
+https://en.wikipedia.org/wiki/Database_normalization
+
+The RDBMS world generally uses third normal form. Lower levels of normalization create additional work for
+data operations. Higher forms rarely show any improvements. The key concept of normalization is that a datum
+only exists in one place. In the RDBMS world where SQL implements relational algebra, normal form is both
+convenient and natural. In other venues such as paper ledgers, data stored in flat files, or in spreadsheets,
+normal form can seem awkward.

 #### Edit architecture requirements

-Daniel proposes a plan (which implies important requirements) that human
-edits are applied to a serialized description, and after the first human
-edit, the description is always maintained inside the system in the
-serialized form. Prior to edits, a description consists internally of
-one or more CPF records which are serialized in real time via a specific
-blending algorithm for display/viewing. The edit UI displays the
-serialized description as it would be viewed in the public discovery web
-page. After the first human edit there is no further need to serialize,
-so we would disable serializing. (If serializing is disabled after human
-edits, does this impact any other real-time rendering features or
-formatting that are part of the serializing process? If so, these
-processes must also be applied to the post-human-edit description in
-real time.)
-
-Prior to human edits, merged records can be algorithmically split by the
-computer, assuming we write code to perform such a split. After human
-edit, a description split must be performed by a human. Daniel proposes
-that all previous versions can be viewed (read-only) during the
-human-mediated split operation so the human can refer back to previous
-information.
-
-After human edits, rollback only applies to human edited versions. There
-is a fire-break where rollback cannot cross from human edits back to
-machine-merged descriptions.
+All data is stored in the database as separate tables and fields. In theory, we can consider mixed markup, but
+Brad Westbrook suggests we avoid mixed markup. From a data perspective, mixed markup is not a good
+practice. Data is data, and the database schema can be modified to accommodate necessary data formats. How the
+data is displayed is very much a separate issue. 
+
+Prior to human edits, merged records can be algorithmically split by the computer, assuming we write code to
+perform such a split. After human edit, a split must be performed by a human. It is a requirement that all
+previous versions can be viewed (read-only) during the human-mediated split operation so the human can refer
+back to previous information.
+
+After human edits, rollback only applies to human edited versions. There is a fire-break where rollback cannot
+cross from human edits back to machine-merged descriptions. The policy group needs to supply policy
+requirements for the tech folks to implement.
+
+The broad requirements for the application are: edit data, split records, merge records. Secondary features to
+make the system useful include: work flow enforcement, search, reporting (including "watch" features),
+administration, authorization (data privileges).
+
+#### Expanded CPF schema requirements
+
+- Provenance and history of each element/attribute.
+  - add this to the schema
+
+- Unique ID per element of CPF if that element is editable.
+  - we have a unique id per record, and only one field of each type per unique id, so this is covered.
+  
+- Version control on a per-element basis.
+  - already done, but Tom wants to consider an alternative implementation
+  
+  
+#### Expanded Database Schema
+
+The database schema has been rewritten to capture all the data in CPF files, as well as meet the various data requirements.
+
+
+Each field within CPF may (will?) need provenance metadata. Likewise many fields in the database may need
+data for provenance. This has not been done, and the developers need policy on provenance, as well as
+examples. There seems to be little or no mention of provenance in Rachael's UI requirements.
+
+The new schema keeps full versions of all records for all time. Where this is not yet implemented, it is planned. The version
+table records each table name, record id, the id of the user who made the modification, and a timestamp. No changes were made to
+existing tables, although existing tables may have gotten a field to distinguish old from current
+records. The implementation may change.
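A minimal sketch of the version table described above, using Python's built-in `sqlite3` as a stand-in for the project database. The table `name_entry`, the columns `is_current` and `modified_at`, and the function `save_version` are hypothetical names chosen for illustration; the text only states that the version table holds table name, record id, user id, and a timestamp, and that existing tables may carry a field separating old from current records.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE name_entry (              -- hypothetical data table
    id         INTEGER PRIMARY KEY,
    record_id  INTEGER NOT NULL,       -- stable id shared by all versions
    value      TEXT NOT NULL,
    is_current INTEGER NOT NULL DEFAULT 1  -- distinguishes old from current
);
CREATE TABLE version (                 -- one row per modification
    id          INTEGER PRIMARY KEY,
    table_name  TEXT NOT NULL,
    record_id   INTEGER NOT NULL,
    user_id     INTEGER NOT NULL,
    modified_at TEXT NOT NULL
);
""")

def save_version(conn, record_id, value, user_id):
    """Retire the current row, insert the new one, and log the change."""
    with conn:  # all three statements commit or roll back together
        conn.execute(
            "UPDATE name_entry SET is_current = 0 "
            "WHERE record_id = ? AND is_current = 1",
            (record_id,),
        )
        conn.execute(
            "INSERT INTO name_entry (record_id, value) VALUES (?, ?)",
            (record_id, value),
        )
        conn.execute(
            "INSERT INTO version (table_name, record_id, user_id, modified_at) "
            "VALUES (?, ?, ?, ?)",
            ("name_entry", record_id, user_id,
             datetime.now(timezone.utc).isoformat()),
        )

save_version(conn, 1, "Jefferson, Thomas", 10)
save_version(conn, 1, "Jefferson, Thomas, 1743-1826", 11)

current = conn.execute(
    "SELECT value FROM name_entry WHERE record_id = 1 AND is_current = 1"
).fetchall()
history = conn.execute(
    "SELECT user_id FROM version WHERE record_id = 1 ORDER BY id"
).fetchall()
```

Old rows are never overwritten in this sketch, so every earlier version stays available for read-only viewing; whether the real schema works this way is an open implementation choice.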
+
+Every record has a unique id. The watch system is a query run on some schedule (daily, hourly, ?) that checks
+to see if a watched record has changed. Each CPF record has links to a “watch” table so users can watch each
+record, and can watch for certain types of changes. Need UI for the watch system. Need an API for the watch
+system.
+
+Need a user table, group (role) table, probably a group permission table so that permissions are attached
+to groups. We also want to allow several permissions per group. Need UI for user, group, and
+group-permission management.
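One possible shape for the user, group, and group-permission tables, sketched with Python's built-in `sqlite3`. All table names, column names, permission strings, and the helper `has_permission` are assumptions for illustration only; the text does not fix any of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_user  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE app_group (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE user_group (              -- a user may belong to many groups
    user_id  INTEGER NOT NULL REFERENCES app_user(id),
    group_id INTEGER NOT NULL REFERENCES app_group(id),
    PRIMARY KEY (user_id, group_id)
);
CREATE TABLE group_permission (        -- several permissions per group
    group_id   INTEGER NOT NULL REFERENCES app_group(id),
    permission TEXT NOT NULL,
    PRIMARY KEY (group_id, permission)
);
INSERT INTO app_user  VALUES (1, 'editor_a'), (2, 'reader_b');
INSERT INTO app_group VALUES (1, 'editors'), (2, 'readers');
INSERT INTO user_group VALUES (1, 1), (1, 2), (2, 2);
INSERT INTO group_permission VALUES (1, 'edit'), (1, 'merge'), (2, 'view');
""")

def has_permission(conn, user_id, permission):
    """Return True if any of the user's groups grants the permission."""
    row = conn.execute(
        "SELECT 1 FROM user_group ug "
        "JOIN group_permission gp ON gp.group_id = ug.group_id "
        "WHERE ug.user_id = ? AND gp.permission = ? LIMIT 1",
        (user_id, permission),
    ).fetchone()
    return row is not None
```

Because permissions live in rows rather than in code, a management UI only needs to insert and delete rows in `user_group` and `group_permission`.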
+
+We have created a generalized workflow system (as opposed to an ad-hoc linked set of reports). There is a work
+flow state table which needs to be moved into the database. 
+
+Need fields to deal with delete/embargo. This may be best implemented via a trigger or perhaps a view. By
+making what appear to be simple SELECTs through a view, the view can exclude deleted records. We must think
+about how using a view (or trigger) will affect UPDATE and INSERT. Ideally the view is transparent. Is there
+some clever way we can restrict access to the original table only via the view?
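The view idea above can be sketched as follows, again with Python's built-in `sqlite3` standing in for the real database. The names `cpf_raw`, `cpf`, and `is_deleted` are hypothetical. SQLite views are read-only by default, so this sketch uses `INSTEAD OF` triggers to route INSERT, UPDATE, and DELETE on the view to the underlying table; other databases handle writable views differently, so treat this as one option rather than the design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cpf_raw (                 -- underlying table, not used directly
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    is_deleted INTEGER NOT NULL DEFAULT 0
);
-- Application code reads and writes only through this view.
CREATE VIEW cpf AS
    SELECT id, name FROM cpf_raw WHERE is_deleted = 0;

CREATE TRIGGER cpf_insert INSTEAD OF INSERT ON cpf
BEGIN
    INSERT INTO cpf_raw (name) VALUES (NEW.name);
END;
CREATE TRIGGER cpf_update INSTEAD OF UPDATE ON cpf
BEGIN
    UPDATE cpf_raw SET name = NEW.name WHERE id = OLD.id;
END;
-- DELETE on the view only flags the row, so nothing is physically removed.
CREATE TRIGGER cpf_delete INSTEAD OF DELETE ON cpf
BEGIN
    UPDATE cpf_raw SET is_deleted = 1 WHERE id = OLD.id;
END;
""")

conn.execute("INSERT INTO cpf (name) VALUES ('Washington, George')")
conn.execute("INSERT INTO cpf (name) VALUES ('Adams, John')")
conn.execute("UPDATE cpf SET name = 'Adams, John, 1735-1826' WHERE id = 2")
conn.execute("DELETE FROM cpf WHERE id = 1")

visible = conn.execute("SELECT id, name FROM cpf ORDER BY id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM cpf_raw").fetchone()[0]
```

On the question of restricting access to the original table: SQLite has no per-table privileges, but in a server database such as PostgreSQL the usual approach is to revoke privileges on the base table from the application role and grant them on the view only.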
+
+Need record lock on some types of records. This lock needs to be honored by several modules, so like “delete”,
+lock might best be implemented via a view and we \*only\* access the table in question via the view.
+
+If there are different levels of review for different elements in the record, then we need extra granularity
+in the workflow or the edited record info to know the type of record edited apropos of workflow variations.
+
+If there are different reviewers for different parts of the record, then workflow data (and workflow
+configuration) needs to be able to notify multiple people, and would have to get multiple reviewer approvals
+before moving to the next phase of the workflow.
+
+Institutional affiliation is probably common enough to want a field in the user table, as opposed to creating
+a group for each institution. The group is perhaps more generalized and could behave identically (or almost
+identical) to a field (with controlled vocabulary) in the user table.
+
+Make sure we can write a query (report) to count numbers of records based on type of edit, institution of the
+editor, and number of holdings.
+
+If we want to be able to quickly count some CPF element such as outgoing links from CPF to a given
+institution, then we should put those CPF values into the SQL database, as metadata for the CPF record.
+
+What is: How many referral links to EAC records that they created?
+
+Be able to count record views, record downloads. Institutional dashboard reports need the ability to group-by
+user, or even filter to a specific user.
+
+Reporting needs to help managers verify performance metrics. This assumes that all changes have a
+date/timestamp. Once workflow and process decisions are set, performance requirements for users such as
+load/performance (how many updates and changes to records can be handled at once), search response time, edit
+time (outside of review workflow), and update times need to be set.
+
+Effort reporting to allow SNAC and participants to communicate to others the actual level of effort
+involved. This sounds like a report with time span and numbers of records handled in various ways. SNAC might
+use this when going from pilot into production so that everyone knows what effort will be required for X
+number of records/actions (of whatever action type).
+
+Time/activity reporting could allow us to assess viability, utility, and efficiency of maintenance system
+processes.
+
+Similar reports might be generated to evaluate the discovery interface. Something akin to how much time was
+required to access a certain number of records. Rachael said: Assess viability of access functionality:
+performance time, available features, and ease of use.
+
+We could try to report on the amount of training necessary before a new user was able to work independently in
+each of various areas (content input, review, etc.)

 #### Merge and watch

-If a file is being watched, and that file is part of an description
-(merged or single) then the watch will apply to the results of human
-edits, regardless of which part of the description was modified. It is
-possible for someone to wish to track a biogHist, but that biogHist
-could be completely removed in lieu of an improved and updated
-description. We do not track individual elements in CPF. We only track
-an entire description, regardless the watcher’s motivation. The original
-motivation for watching might no longer exist after an edit, and if so,
-the watcher can simply disable their watch. After each edit, all
-watchers will get a notification. The watch does not apply to any single
-field, but to the entire description, and therefore also to future
-descriptions which result from merging.
-
-What happens to a watch on a merged description which is subsequently
-split? Does the watch apply to both split descriptions or to neither
-description? Perhaps is it best to disable the watch, and inform the
-watcher to re-apply to watch a specific record, along with links and
-helpful info to make it easy to add the new watch.
-
-#### Brian’s API docs need to be merged in or otherwise referred to:
+Note: Ask Robbie what the database architecture is to support merged records.
+
+Users may "watch" an identity. If a file is being watched, and that file is part of a description (merged or
+single) then the watch will apply to the results of human edits, regardless of which part of the description
+was modified. It is possible for someone to wish to track a biogHist, but that biogHist could be completely
+removed in favor of an improved and updated description. We do not track individual elements in CPF. We only
+track an entire description, regardless of the watcher's motivation. The original motivation for watching might
+no longer exist after an edit, and if so, the watcher can simply disable their watch. After each edit, all
+watchers will get a notification. The watch does not apply to any single field, but to the entire description,
+and therefore also to future descriptions which result from merging.
+
+What happens to a watch on a merged description which is subsequently split? Does the watch apply to both
+split descriptions or to neither description? Perhaps it is best to disable the watch, and inform the watcher
+to re-apply to watch a specific record, along with links and helpful info to make it easy to add the new
+watch.
+
+#### Brian's API docs need to be merged in or otherwise referred to:

 [https://gist.github.com/tingletech/4a3fc5f59e5af3054286](https://www.google.com/url?q=https%3A%2F%2Fgist.github.com%2Ftingletech%2F4a3fc5f59e5af3054286&sa=D&sntz=1&usg=AFQjCNEJeJexryBtHbvLw-WtFYjxP4VwlQ)

-### Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
+#### Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:

-Consider implementing linked data standard for relationship links
-instead of having to download an entire document of links (as it is
-configured now.)
+Discuss. What is "as it is configured now"? Consider implementing linked data standard for relationship links
+instead of having to download an entire document of links (as it is configured now.)

-Sort by common subject headings across all of SNAC - right now SNAC has
+Discuss. This seems to be the controlled vocabulary issue. Sort by common subject headings across all of SNAC - right now SNAC has
 subject headings that have been applied locally without common practice
 across the entire corpus.

-Sort by holdings location. Sort by identity's activity location. Sort
-and visualize a person through time (show dates for events in a person
-or organization's lifetime). Sort and visualize an agency or
-organization as it changes over time.
+We probably need to build our own holdings authority.
+
+We need to write code to get accurate holdings info from WorldCat records. All the other repositories will
+have to be handled on a case-by-case basis. Sort by holdings location. Sort by identity's activity location. Sort
+and visualize a person through time (show dates for events in a person or organization's lifetime). Sort and
+visualize an agency or organization as it changes over time.

 Continue to develop and refine context widget.

-Sort collection links. Add weighting to understand which collections
-have more material directly related to identity. (How is this best
-handled programmatically or as an input by contributors- maybe both?).
-
-Increase exposure of SNAC to general public by leveraging partnerships.
-Suggested agreement with Wikipedia to display Wikipedia content in SNAC
-biographical area and work with Wikipedia to allow for links to SNAC at
-the bottom of all applicable identities. This would serve to escalate
-and  drive traffic to SNAC.
-
-Introduction (Tom, Ray, Sara, Brian, Daniel)
-============================================
-
-Social Networks and Archival Context (SNAC) is a Mellon-funded project
-to aid end-user researchers in discovering, locating, and using
-distributed historical record descriptions, especially as relates to
-corporate bodies, persons, and families (CPF). These descriptions are
-often in finding aids, and they often exist in electronic format. They
-are distributed across many geographical locations and many networks.
-SNAC brings all this data together in a central system, while retaining
-links to the original descriptions. Critically, SNAC attempts to merge
-descriptions for the same [matching?] CPF identities, linking those
-descriptions to a single authoritative name.
-^[[a]](#cmnt1)^^[[b]](#cmnt2)^
-
-We have an existing system (SNAC one?) and need additional work to get
-to a new system (SNAC 3?), so part of this document is gap analysis. The
-scope of this document is to outline technical specifications and
-requirements for a production system for the Cooperative
-phase^[[c]](#cmnt3)^ of SNAC. This production system will handle
-ingestion, processing, matching/merging, discovery, and dissemination of
-archival descriptions that are submitted and added to the Cooperative.  
-
-Evaluation of Existing Technical Architecture
-=============================================
-
-Overview (All authors)
----------------------
-
-This section describes the existing technical architecture, and later
-moving on to describe the required functionality for the production
-system for the Cooperative.
-
-Many of the archival records that are ingested in SNAC are Encoded
-Archival Context - Corporate bodies, Persons and Families (EAC-CPF,
-hereafter CPF) records. EAC-CPF is an XML schema endorsed as a standard
-by the Society of American Archivists. We speak of CPF descriptions in
-the sense of a “computer record”: often a single text file and not a
-“record” in the archival sense.
-
-“Linked data” technology related to the Resource Description Framework
-is also employed to manage some controlled vocabularies in the project.
-
-The current system consists of three main components: extraction,
-match/merge, discovery. Extraction consists of extracting data from
-incoming archival description records (primarily EAD, MARC21 and some
-other unique formats), to create CPF descriptions. Match/merge is to
-process the CPF descriptions in search of name matches and to merge
-well-matched descriptions. The resulting data set includes merged
-descriptions and descriptions with no matches (called singletons), all
-in a single database. Discovery is discovery and dissemination of the
-data via a web application.
-
-The production system will have two additional components: maintenance
-and administration. Maintenance includes manual corrections, such as
-correcting data within a description, splitting incorrect merges,
-merging descriptions for the same CPF identity, and description embargo
-(embargo hides descriptions from public view for either technical or
-administrative reasons). Administration is the typical management of
-users, accounts, and reporting on the state of the system.
-
-The first two phases of data processing are extraction, and match/merge.
-A database of descriptions, both merged and unmerged is the end
-result^[[d]](#cmnt4)^. The process of ingesting extracted data and
-merging will continue for the life of the project. An extensive
-web-based search engine lets users discover descriptions.
-
-We use the term “merged” loosely when applied to the automated system
-since the final database may contain descriptions which should be
-merged, but which a computer is unable to reliably determine.  We take a
-conservative approach, preferring to only merge descriptions that a
-computer program can accurately distinguish.^[[e]](#cmnt5)^ Even so,
-some descriptions will have been incorrectly merged, and thus the need
-for a (future) maintenance system that allows manually splitting of
-descriptions, among other things.
-
-Both Extraction and Match/merge are script based, batch processing,
-semi-automatic processes managed entirely by software engineers.
-Discovery and Maintenance are both web applications with extensive
-public user interfaces intended for researchers. Administration is done
-mostly via a non-public web application.
-
-Extraction and match/merge are well developed, although we have some
-planned improvements. Discovery is well developed, but existing features
-are being refined, and adding new features is on-going. Maintenance and
-administration have not yet been created and must be written from the
-ground up.
-
-Current State of the System (Tom, Ray, Sara, Brian, Daniel)
-----------------------------------------------------------
-
-CPF description generation is done at the University of Virginia’s
-Institute for Advanced Technology in the Humanities (IATH). IATH handles
-the CPF data extraction and hosts servers for data processing and the
-SNAC prototype web site. Data processing, XTF indexing (for the
-discovery interface), and web hosting take place on a Linux server with
-24 CPUs and 94 GB of RAM connected to a 1Gbit network switch. This
-server is administered by the IATH sysadmin team. \
-
-Collections of archival description computer descriptions in a variety
-of formats are extracted into CPF format XML. This process involves
-writing XSLT scripts that extract and transform input descriptions, and
-create CPF files as output. The current state of the extraction is a
-collection of XSLT scripts supplemented by Perl scripts. The input files
-are XML with large numbers of files in EAD, MARC XML, and British
-Library XML, as well as several smaller data sets.  A large XSLT code
-library is shared among most of the extractions. Each type of extraction
-builds a generic internal data structure, which is serialized as EAC-CPF
-XML output. The XSLT takes into account various descriptive practices in
-the input data, and reformats as necessary to create a single type of
-normative CPF output. The complexity of this task centers around the
-large number of small differences in descriptive practice. Currently
-more than 3 million CPF computer descriptions have been created. The
-XSLT processor is Saxon 9 HE, which is the free “home edition” of Saxon.
-Saxon implements XSLT 2.0. There are a small number of Perl scripts that
-integrate the XSLT into a pipeline, automating tasks such as chunking
-data sets into sizes that won’t exceed computer memory.
-
-The current state of the match/merge is (filled in by Yiming/Ray/Sara,
-initially a one or two paragraph overview with more detail added later
-as necessary).
-
-Overview of Brian’s UI and programming for the SNAC2 XTF discovery tool
-(add this to another item if there is an umbrella section more
-appropriate).
-
-Is XTF the only discovery tool we will offer? Will SNAC be fully indexed
-by Google and Bing?
-
-TK The involvement of the UC Berkeley I School includes the development,
-testing and modification of the matching and merging components of the
-SNAC system. The current system, described in more detail below, takes
-the EAC-CPF records derived from the various source institutions and
-compares the names and associated information (especial dates) to
-identify the records that likely describe the same person,
-
-organization, or family. The process involves not only comparison across
-input records, but also comparison with information from the Virtual
-International Authority File, and approximate matching for these records
-as well.
-
-TK The involvement of CDL includes … (Brian)
-
-TK We have several extant user studies UI/UX … (Rachael, on-going)
-
-TK The results of these studies are … (Rachael, on-going)
-
-TK The technical implications of these studies are … (Rachael, on-going)
-
-The current system uses a fairly loose software development process.
-Source code is primarily maintained on a Linux server which is managed
-by standard practices as relate to hardware, software, network, user
-accounts, back up, and so on. All the data resides on the server. Source
-code is managed by version control systems. The amount of quality
-assurance and testing has been increasing over time, as well as
-documentation, and management aspects such as release process. All tools
-currently used are open source, and the code written for SNAC is open
-source. We have begun to formalize feature request and issue tracking.
-The development process is agile in that there are frequent small
-changes that are committed to the version control, and the code is
-nearly always in a working state.
-
-### Processing Pipeline (Ray, Yiming, Sara)
-
-TK Describe algorithmic portions, and add a section for new features.
-
-Extraction (Tom, Daniel)
------------------------
-
-There are currently several CPF extraction software pipelines: MARC21,
-British Library, Smithsonian Agency History, New York State Archives,
-Smithsonian Joseph Henry, Smithsonian Field Books, and EAD from nearly
-60 institutions.
-
-The first step in adding new records to the SNAC database is to convert
-incoming data into EAC-CPF XML.  One EAC-CPF record is created for each
-successfully extracted reference to an identity from an archival source.
-The processing also allows for some degree of remediation of data
-quality issues and serves to normalize the data into a common format.
- Scripting data transformation processes is a significant task that
-often requires close communications with data contributors and
-customizations to accommodate local practices of the contributors.
-
-Creating an extraction is a complex process since we must deal with
-variances in local descriptive practice. The MARC21 tools have been made
-available as a web interface and this demonstrates the feasibility of
-moving more of the processing responsibility to data donors. If we are
-optimistic, we hope that EAD-to-CPF extraction and all other types of
-future extractions can be turned into donor-driven tools. Specifically,
-we create the tools and then deploy them as web applications and/or
-desktop applications. Web hosted extraction tools allow us to leverage
-the power of our servers and programmers so that data donors do not need
-a large computing infrastructure in order to participate. In any case,
-data must be validated before ingest into the match/merge processing.
-
-XSLT and perl are the predominate technologies used in the generation of
-the XML documents created by this process.  The code architecture
-focuses on reusability of modular routines to facilitate maintenance of
-the customizations needed accommodate the diversity of data sources.
-
-Code, sample data, and documentation are in Github. The pipeline is
-being run on a server, but the hardware requirements are minimal enough
-that most laptop computers could run the extraction. The system requires
-unix-like features of Linux, MacOS, or cygwin (for MS Windows). The XSTL
-engine is Saxon 9.x HE which is the free, public version of Saxon.
-
-Match/Merge (Brian, Yiming, Ray)
--------------------------------
-
-The match/merge process has three major data input streams, library
-authority records, EAC-CPF documents from the EAC-CPF extract/create
-system, and an ARK identifier minter.
-
-First, a copy of the Virtual International Authority File (VIAF) is
-indexed as a reference source to aid in the record matching process.  In
-addition to authorized name headings from multiple international
-sources, the VIAF data contains biographical data and links to
-bibliographic records which will be included in the output documents.  
-Then, the EAC-CPF from the extract/create process are serially processed
-against the VIAF and each other to discover and rate potential matches
-between records.  In this phase of processing, matches are noted in a
-database.
-
-After the matching phase identifies incoming EAC-CPF to merge, a new set
-of EAC-CPF records are generated.  This works by running through all the
-matches in that database, then reading in the EAC-CPF input files, and
-finally outputting a new EAC-CPF records that merges the source EAC-CPF
-with any information found in VIAF.  ARK identifiers are also assigned.
-This architecture allows for incrementally processing more un-merged
-EAC-CPF documents before. It also allows matches to be adjusted in the
-database, or alterations to be made on the un-merged EAC-CPF documents,
-and the merge records can be regenerated.
-
-Cheshire, postgreSQL, and python are the predominate technologies used
-in the generation of the XML documents created by this process.
-
-[link to the merge output spec]
-
-This involves processing that compares the derived EAC-CPF records
-against one another to identify identical names. Because names for
-entities may not match exactly or the same name string may be used for
-more than one entity, contextual information from the finding aids is
-also used to evaluate the probability that closely and exactly matching
-strings designate the same entity.[1] For matches that have a high
-degree of probability, the EAC-CPF records will be merged, retaining
-variations in the name entries where these occur, and retaining links to
-the finding aids from which the name or name variant was derived. When
-no identical names exist, an additional matching stage compares the
-names from the input EAC-CPF records against authority records in the
-Virtual International Authority File (VIAF). Contextual information
-(dates, inferred dates, etc.) is used to enhance the accuracy of the
-matching. Matched VIAF records are merged with the input derived EAC-CPF
-records, with authoritative or preferred forms of names recorded, and a
-union set of alternative names from the various VIAF contributors, will
-also be incorporated into the EAC-CPF records. When exact matching and
-VIAF matching fail, then we attempt to find close variants using Ngram
-(approximate spelling) matching. In addition contextual information,
-when available is used assess the likelihood of the records actually
-being the same. Records that may be for the same entity but the
-available contextual information is insufficient to make a confident
-match will be flagged for human review (as “May be same as”). While
-these records will be flagged for human review, the current prototype
-does not provide facilities to manually merge records. The current
-policy governing matching is to err on the side of not merging rather
-than merging without strong evidence.
-
-The resulting set of interrelated EAC-CPF records will represent the
-creators and related entities extracted from EAD-encoded finding aids,
-with a subset of the records enhanced with entries from matching VIAF
-records. The EAC-CPF records will thus represent a large set of archival
-authority records, related with one another and to the archival records
-descriptions from which they were derived. This record set will then be
-used to build a prototype corporate body, person, and family name and
-biographical/historical access system.
-
-In the current system all input records, and potential matches are
-recorded in a relational database with the following structure:
-
-* * * * *
-
-[1] Using contextual information in determining that two or more records
-represent the same entity has been successful in matching and merging
-authority records in an international context. See Rick Bennett,
-Christina Hengel-Dittrich, Edward T. O'Neill, and Barbara B. Tillett
-VIAF (Virtual International Authority File): Linking Die Deutsche
-Bibliothek and Library of Congress Name Authority File:
-http://www.ifla.org/IV/ifla72/papers/123-Bennett-en.pdf
-
-![Screen Shot 2014-06-22 at 3.08.12 PM.png](images/image00.png)
-
-The the current processing steps are summarized in the following
-diagram:
-
-![Slide1.jpg](images/image01.jpg)
-
-Discovery/Dissemination (Brian, Rachael)
----------------------------------------
-
-Prototype research tool^[[f]](#cmnt6)^
--------------------------------------
-
-The main data input for the prototype research tool are the merged
-EAC-CPF documents produced in the match/merge system.  Some other
-supplemental data sources, such as dbpedia and the Digital Public
-Library of America are also consulted during the indexing process.
-
-A pre-indexing phase is run on the merged EAC-CPF documents.  During
-pre-processing, name headings and wikipedia links are extracted, and
-then used to look for possible related links and data in supplemental
-sources. The output of the pre-indexing phase consists of XML documents
-recording supplemental.
-
-Once the supplemental XML files are generated, two types of indexes are
-created to power which serve as the input to the web site.  The first
-index created runs across all documents and provides access to the full
-text and specific facets of metadata extracted from the documents.
- Additionally, the XML structure of each document is indexed as a
-performance optimization that allows for transformations to be
-efficiently applied to large XML documents.
-
-The public interface to the prototype research tool utilizes the index
-across all documents to enable full text, metadata, and faceted searches
-of the merged EAC-CPF documents.  Once a search is completed, and a
-specific merged EAC-CPF document is selected for display; the index of
-the XML document structure is used to quickly transform the merged
-document into an HTML presentation for the end user.
-
-In the SNAC1 prototype a graph database was created after the full text
-indexing was complete.  The graph database was used to power
-relationship visualizations and an API used to dynamically integrate
-links to SNAC into archival description access systems. This graph
-database was then converted into linked data, which was loaded into a
-SQARQL endpoint. This step has not yet been implemented in the SNAC 2
-prototype.  Because the merged EAC-CPF documents are of higher quality
-for the SNAC 2 prototype, the graph extraction process is no longer
-dependent on the full text index being complete, so it could run in
-parallel with pre-indexing and indexing.
-
-XTF is the main technology used to power public access to the merged
-EAC-CPF records.  XTF integrates lucene for indexing and saxon for XML
-transformation, making heavy use of XSLT for customization and display
-of search results and the merged documents.  EAC-CPF and search results
-are transformed to HTML5 and JSON for consumption by the end users’ web
-browser.  Multiple javascript and CSS3 libraries and technologies are
-used in the production of the “front end” code for the website.  Google
-analytics is used to measure use of the site.  Werker, middleman, and
-bower used to build the front end code for the site.
-
-This technical architecture
-
-[links to code]
+Sort collection links. Add weighting to understand which collections have more material directly related to
+identity. (How is this best handled programmatically or as an input by contributors- maybe both?).
+
+Increase exposure of SNAC to the general public by leveraging partnerships. Suggested agreement with Wikipedia to
+display Wikipedia content in SNAC biographical area and work with Wikipedia to allow for links to SNAC at the
+bottom of all applicable identities. This would serve to drive traffic to SNAC.
+

--- a/tat_requirements/outline.md
+++ b/tat_requirements/outline.md
+
+plan.md
+--------
+
+plan.md Big questions
+
+plan.md Overview and order of work
+
+plan.md Code we write
+
+plan.md Controlled vocabularies and tag system 
+
+plan.md Code we use off the shelf
+
+co-op_background.md
+-----
+
+Authors
+
+Organization of documentation
+
+Introduction to SNAC
+
+Evaluation of Existing Technical Architecture
+
+Overview
+
+Current State of the System
+
+Processing Pipeline
+
+Extraction
+
+Match/Merge
+
+Discovery/Dissemination
+
+Prototype research tool
+
+Gap analysis
+
+Data maintenance
+
+Pilot phase architecture
+
+Current State Conclusion
+
+
+introduction.md
+--------
+
+TAT Functional Requirements
+
+Introduction to Planned Functionality
+
+Software development, processes, and project management
+
+QA and Related Tests for Test-driven Development
+
+Documentation
+
+Required new features
+
+Web application overview
+
+Web application output via template
+
+Data background
+
+What is "normal form" and what informs the database schema design?
+
+Edit architecture requirements
+
+Expanded CPF schema requirements
+
+Expanded Database Schema
+
+Merge and watch
+
+Brian’s API docs need to be merged in or otherwise referred to:
+
+Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
+
+
+requirements.md
+----
+
+List of requirements
+
+Requirements from Rachael's spreadsheet
+
+List of Application Programmer Interfaces (APIs)
+
+Maintenance Functionality
+
+Functionality for Discovery
+
+User interface for Discovery
+
+Functionality for Splitting
+
+User interface for Splitting
+
+Functionality for Merging
+
+User interface for Merging
+
+Functionality for Editing
+
+User interface for Editing
+
+Admin Client for Maintenance System
+
+User Management
+
+Web Application Administration
+
+Reports
+
+System Administration
+
+Community Contributions
+
+Ability to Open/Close the Site during Maintenance
+
+Sandbox for Training, perhaps as a clone of the QA system?
+
--- a/tat_requirements/readme.md
+++ b/tat_requirements/readme.md

 These documents are organized in the following order:

-[plan.md](plan.md) gives the big overview.
+[plan.md](plan.md) Big overview.

-[introduction.md](introduction.md) covers the current state and gives some historical background.
+[outline.md](outline.md) An outline of sections in the documents

-[co-op_background.md](co-op_background.md) covers broad expectations for the co-op software.
+[co-op_background.md](co-op_background.md) Broad expectations for the co-op software.

-[requirements.md](requirements.md) are the technical requirements.
+[introduction.md](introduction.md) Requirements part one
+
+[requirements.md](requirements.md) Requirements part two, includes tech requirements from Rachael's spreadsheets



--- a/tat_requirements/requirements.md
+++ b/tat_requirements/requirements.md
-Requirements from Rachael's spreadsheet
---

- Programmers contribute some time to help with technology side of the gap analysis of institutional capability
-
- We need a concrete plan for persistent IDs.
-
-  - We need to manage base HREF stubs that are combined with persistent IDs to form working URLs. Ideally, all
-    the URLs could be composed via a format string (printf), so we could just store the ID, HREF stub, and
-    format string and be done with it. However, some URLs have interesting issues that require code and thus
-    exceed the abilities of normal format strings. We can certainly roll out an early version with format
-    strings, and add some clever functions later as necessary.
+#### List of requirements

- Do we need any additional requirements for related name linking?
-
- Clarify: the co-op version 1 is not going to support bulk data ingest
-
- Clarify: the co-op version 1 is not going to support bi-directional data exchange and update
-
- Do we need full delete? For example, a CPF contains something illegal and must be fully deleted. How do we
-  delete from backups? Are either of these even required by policy?
-  
- Are we assuming that data from the web browser has been sanity checked before hitting the server? Does the
-  server need to cache edit data prior to writing the data to the cpf database? For example, what if someone
-  enters "19th century" in a date field? It isn't valid, but we need to save their work.
-  
- We need to sanity check any links we create, especially links back into SNAC.
-
- Don't forget the X-to-CPF field mapping documentation, and this ties in to the "CPF data contributor's"
-  guide (below)
-
- We need the "CPF data contributor's" guide.
-
- What authority work will we be doing?
-
- What authority data from other sources do we cache locally?
-
- Create detailed functional requirements for controlled vocabularies, and a detailed implementation
-  specification.
-  
- Clarify: versioning is per-record, not per-field. 
-
- Need a watch/notification API. It needs a canonical name. Is there an off-the-shelf event monitor that will
-  easily integrate with the web REST API and work flow manager?
-  
- Clarify: Are we integrating SNAC and ArchiveSpace in co-op version 1? Will ArchiveSpace have to use our REST API?
-
- How is embargo implemented at the database level? What are the requirements for embargo?
-
- Clarify / verify: Technical review vs content review is handled by a combination of roles and work flow.
-
- Reports: Where are we keeping the Big List of All Reports? 
-
- Clarify: row 43, (unclear) Consider implementing inked data standard for relationship links
-  instead of having to download an entire document of links, as it is configured now.
-  
- Search: need the Big List of Search Facets, and someone needs to verify that Elastic Search can do facets.
-
- Does co-op version 1 have a timeline visualization? Does it have a "sort by timeline"? What does it mean to
-  sort by timeline?
-  
- Clarify: What is a context widget? - row 52, Continue to develop and refine context widget. (technical
-  requirements unclear)
-  
- Clarify: we need requirements for citations, and details about where they integrate with the rest of the
-  system.
-
-
-List of requirements
---

 This is the definitive list of all requirements. Anything the application needs to do must be in this
 list. Each item and group of items is explained in detail later in the document. Being a "list", this includes
@@ -157,9 +90,100 @@ only sufficient detail to disambiguate items.

 - data integrity testing

+#### Requirements from Rachael's spreadsheet
+
+
+- Programmers contribute some time to help with technology side of the gap analysis of institutional capability
+
+- We need a concrete plan for persistent IDs.
+
+  - We need to manage base HREF stubs that are combined with persistent IDs to form working URLs. Ideally, all
+    the URLs could be composed via a format string (printf), so we could just store the ID, HREF stub, and
+    format string and be done with it. However, some URLs have interesting issues that require code and thus
+    exceed the abilities of normal format strings. We can certainly roll out an early version with format
+    strings, and add some clever functions later as necessary.
+
+- Do we need any additional requirements for related name linking, or more accurately identity linking? Each
+  identity has an ARK, which is a persistent ID with an associated URL. Use cases for identity links:
+  
+  1. SNAC links one identity to another internally based on relations between identities
+  
+  1. SNAC links to itself as a name authority
+  
+  1. SNAC links to external identities
+  
+  1. SNAC links to external archival resources
+  
+  2. External resources link to SNAC as an authority. (Tom asks: is SNAC also an archival resource?)
+
+- Clarify: the co-op version 1 is not going to support bulk data ingest
+
+- Clarify: the co-op version 1 is not going to support bi-directional data exchange and update
+
+- Do we need full delete? For example, a CPF contains something illegal and must be fully deleted. How do we
+  delete from backups? Is either of these even required by policy?
+  
+- Are we assuming that data from the web browser has been sanity checked before hitting the server? (Yes, by
+  the data validation API) 
+  
+- Does the server need to save temporary edit data prior to writing the data to the cpf database? For example, what if
+  someone enters "19th century" in a date field? It isn't valid, but we need to save their work. (Yes, we need to save invalid user input, and give the user a useful message for each type of data validation failure.)
+  
+- We need to sanity check any links we create, especially links back into SNAC.
+
+- Don't forget (to create) the X-to-CPF field mapping documentation, and this ties in to the "CPF data contributor's"
+  guide (Below)
+
+- We need the "CPF data contributor's" guide.
+
+- What authority work will we be doing? 
+
+    - For example, holding institution ISIL identifier, name, address, contact person, etc.
+
+- What authority data from other sources do we cache locally? 
+
+    - We need examples of this, as well as a process to manage those resources. It is important to know where
+      the data came from, technically how it was acquired, the date we acquired it, and some methods of
+      updating the current local cache. This implies that all external data we use has internal persistent
+      ids.
+
+- Create detailed functional requirements for controlled vocabularies, and a detailed implementation
+  specification.
+  
+- Clarify: versioning is per-record, not per-field. 
+
+- Need a watch/notification API. It needs a canonical name. Is there an off-the-shelf event monitor that will
+  easily integrate with the web REST API and work flow manager? 
+  
+    - We can write our own status and staging API. It only requires modest SQL schema work, because most of the necessary data is already planned for other features. For example, records can be locked by a user, we know who has the lock, we need administrative functions for unlocking and transferring locks, and the work flow explicitly lays out the process for each user interaction with the application.
+  
+- Clarify: Are we integrating SNAC and ArchivesSpace in co-op version 1? Will ArchivesSpace have to use our REST API?
+
+- How is embargo implemented at the database level? What are the requirements for embargo?
+
+- Clarify / verify: Technical review vs content review is handled by a combination of roles and work flow.
+
+- Reports: Where are we keeping the Big List of All Reports? 
+
+- Clarify: row 43, (unclear) Consider implementing linked data standard for relationship links
+  instead of having to download an entire document of links, as it is configured now.
+  
+- Search: need the Big List of Search Facets, and someone needs to verify that Elasticsearch can do facets.
+
+- Does co-op version 1 have a timeline visualization? Does it have a "sort by timeline"? 
+
+    - What does it mean to sort by timeline?
+  
+- Clarify: What is a context widget? - row 52, Continue to develop and refine context widget. (technical
+  requirements unclear)
+  
+- Clarify: we need requirements for citations, and details about where they integrate with the rest of the
+  system.
+
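The persistent ID item above describes storing an ID, an HREF stub, and a format string, with code reserved for URLs that a format string cannot express. A minimal sketch of that idea in Python follows. The names `UrlRule` and `build_url`, the example.org stubs, and the sample IDs are hypothetical illustrations, not part of any existing SNAC code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UrlRule:
    """One stored rule for turning a persistent ID into a working URL (hypothetical)."""
    href_stub: str                                # base HREF stub
    fmt: str = "{stub}/{pid}"                     # format string for the common case
    fixup: Optional[Callable[[str], str]] = None  # optional code for IDs a format string cannot handle


def build_url(rule: UrlRule, pid: str) -> str:
    """Compose a URL from the stub and persistent ID, applying the fixup first if one exists."""
    if rule.fixup is not None:
        pid = rule.fixup(pid)
    return rule.fmt.format(stub=rule.href_stub.rstrip("/"), pid=pid)


# Common case: the format string alone is enough.
simple = UrlRule(href_stub="http://example.org/resolve/")

# Special case: a made-up rule whose IDs must be lower-cased before use.
special = UrlRule(href_stub="http://example.org/view", fmt="{stub}?id={pid}", fixup=str.lower)
```

An early version could ship with `fixup` always left empty, and the function hook could be filled in later for the URLs that need it.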
+
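The item above about saving invalid user input (the "19th century" example) can be sketched as a validation step that never discards what the user typed. This is only an illustration: the function name `check_date_field`, the returned dictionary layout, and the assumption that date fields expect YYYY-MM-DD are all hypothetical.

```python
from datetime import date


def check_date_field(raw: str) -> dict:
    """Return the raw input plus either a parsed date or a message explaining the failure."""
    try:
        parsed = date.fromisoformat(raw.strip())
    except ValueError:
        # Keep the raw text so the user's work is saved, and explain what went wrong.
        return {
            "raw": raw,
            "valid": False,
            "value": None,
            "message": f"'{raw}' is not a date in YYYY-MM-DD form; your entry was kept as a draft.",
        }
    return {"raw": raw, "valid": True, "value": parsed, "message": ""}
```

The design choice is that the raw string is stored in both outcomes, so temporary edit data can be kept separately from the CPF database until it validates.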
+
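The watch/notification item above mentions record locks, knowing who holds a lock, and administrative functions for unlocking and transferring locks. A rough in-memory sketch of those operations is shown here; the class `RecordLocks` and its method names are hypothetical, and a real version would presumably live in SQL tables as the item suggests.

```python
from typing import Dict, List, Optional


class RecordLocks:
    """In-memory stand-in for a lock table keyed by record ID (hypothetical)."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def lock(self, record_id: str, user: str) -> bool:
        """Give the lock to user unless someone else already holds it."""
        holder = self._owner.setdefault(record_id, user)
        return holder == user

    def holder(self, record_id: str) -> Optional[str]:
        """Report who has the lock, or None if the record is unlocked."""
        return self._owner.get(record_id)

    def locked_by(self, user: str) -> List[str]:
        """List the records a user currently has locked."""
        return sorted(r for r, u in self._owner.items() if u == user)

    def admin_unlock(self, record_id: str) -> None:
        """Administrative release of a lock, whoever holds it."""
        self._owner.pop(record_id, None)

    def admin_transfer(self, record_id: str, new_user: str) -> None:
        """Administrative hand-off of an existing lock to another user."""
        if record_id not in self._owner:
            raise KeyError(record_id)
        self._owner[record_id] = new_user
```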
+#### List of Application Programmer Interfaces (APIs)

-List of Application Programmer Interfaces (APIs)
----

 The following include both direct programming language interfaces, and REST interfaces. We need to determine
 which (REST/direct) is available for each. Modifying data should probably go through authorization and should
@@ -183,8 +207,8 @@ only public interface.
 - record watching (REST?)


-Maintenance Functionality (All authors)
---------------------------------------
+#### Maintenance Functionality
+

 Maintenance falls into four areas: discover, split, merge, and edit.

@@ -216,7 +240,7 @@ trail, and there are no destructive changes. For example, there is no
 public view. Updated descriptions will be subject to version control so
 changes can be rolled back.

-### Functionality for Discovery
+#### Functionality for Discovery

 The discovery tools for maintenance may be somewhat different from the
 normal discovery tools for scholarly research. We have a standard
@@ -230,13 +254,9 @@ Users will have individual accounts, so we can enable a search history,
 internal bookmarks, and various saved reports (assuming faceted search
 where it could take many mouse clicks to accrete a specific search).

-### 
-
-### User interface for Discovery (Brian, Rachael)
+#### User interface for Discovery 

-### 
-
-### Functionality for Splitting^[[m]](#cmnt13)^^[[n]](#cmnt14)^ (Tom, Danial, all authors)
+#### Functionality for Splitting^[[m]](#cmnt13)^^[[n]](#cmnt14)^ 

 Keeping in mind that our descriptions are authoritative, and will be
 referenced via persistent identifier (ARK), it will be necessary to
@@ -339,9 +359,9 @@ To review split:
 20. admin function to view locked descriptions by user,
 21. choose one of my locked descriptions to continue work.

-### User interface for Splitting (Tom, Daniel, Rachael, others)
+#### User interface for Splitting

-### Functionality for Merging (Tom, Daniel, all authors)
+#### Functionality for Merging

 We need to allow our experts to merge descriptions. This may be far more
 common than splitting since the automated pipeline was designed to only
@@ -408,11 +428,9 @@ To review merging:
 19. locks and hides original,
 20. makes merged description publicly visible.

-### User interface for Merging (Rachael, Tom, Daniel, others)
-
-### 
+#### User interface for Merging

-### Functionality for Editing
+#### Functionality for Editing

 Modifications we expect include but are not limited to: spelling
 corrections, date corrections, editing or expanding biographical data,
@@ -421,12 +439,11 @@ and correcting relations between descriptions. Metadata such as the URL
 of the original finding aid may also be updated. The maintenance system
 also needs to support bulk data edits of several types.

-### User interface for Editing (Rachael, Tom, Daniel, others)
+#### User interface for Editing

-Admin Client for Maintenance System
-----------------------------------
+#### Admin Client for Maintenance System

-### User Management (Tom, Brian)
+#### User Management

 Authentication is validating user logins to the system. Authorization is
 the related aspect of controlling which parts of the system users may
@@ -564,8 +581,8 @@ These users need an admin dashboard with corresponding reports. We may
 need to have sub-institution accounts and that gets tricky because we
 don’t want to be mixed up in internal institutional politics.

-Web Application Administration
------------------------------
+#### Web Application Administration
+

 System administration will be required for the web application and the
 server hosting the web site. This is well understood from a technical
@@ -574,8 +591,8 @@ command line accounts involved, and server configuration. This aspect of
 administration integrates with versioning, backup, and software
 releases.

-Reports ^[[s]](#cmnt19)^^[[t]](#cmnt20)^(Tom, Brian, Rachael, Brad)
-------------------------------------------------------------------
+#### Reports ^[[s]](#cmnt19)^^[[t]](#cmnt20)^
+

 While the web interface is the primary public face of SNAC, many other
 views of the data and meta data are necessary, especially for admins and
@@ -589,8 +606,8 @@ adding reporting and business intelligence.
 (How much detail do we want about reports? Maybe just half a dozen
 examples?)

-System Administration (Tom, Brian)
----------------------------------
+#### System Administration
+

 This is boilerplate server administration, for the most part.
 Preservation of original material may not be necessary. Since our data
@@ -622,8 +639,8 @@ correcting file systems (ZFS, Btrfs) may not be production quality at
 this time. If we want to deploy an anti-bit-rot technology, our
 alternative may be limited to using Par2/Quick Par/Parchive.

-Community Contributions (All authors)
-------------------------------------
+#### Community Contributions
+

 Researcher interface/functionality including public facing discovery and
 dissemination (All, especially Brian)
@@ -651,8 +668,8 @@ logic that we will support with UI, code, and database tables/fields.
 Many reports will be limited to certain roles. Admin users will likely be
 heavy report users.

-Ability to Open/Close the Site during Maintenance (Tom, Brian)
--------------------------------------------------------------
+#### Ability to Open/Close the Site during Maintenance
+

 If the product has a “closed for maintenance” feature,
 ^[[x]](#cmnt24)^this ability would be available to admins, even though
@@ -681,8 +698,8 @@ to our software architecture. The more elegant approach is to use one of
 several system architectures that keep a small system front end always
 running.

-Sandbox for Training, perhaps as a clone of the QA system? (All authors)
------------------------------------------------------------------------
+#### Sandbox for Training, perhaps as a clone of the QA system?
+

 TK