Merge branch 'master', remote branch 'origin' into tom

6a225b08 · Tom Laudeman · 69a69749 · ea79f79c · 6a225b08 · 6a225b08
Commit 6a225b08 authored Sep 28, 2015 by Tom Laudeman
68 changed files
--- a/Perl-vs-Python.md
+++ b/Perl-vs-Python.md
--- a/Discussion/README.md
+++ b/Discussion/README.md
+# Technical Discussion
+This directory contains technical discussions and related notes between developers on the project.
--- a/Discussion/Relational Databases.md
+++ b/Discussion/Relational Databases.md
+# Discussion on Relational Databases
+#### What is "normal form" and what informs the database schema design?
+Edgar F. "Ted" Codd created 12 rules (revised with a 13th rule) to clarify the Relational Database Management
+System (RDBMS).
+https://en.wikipedia.org/wiki/Edgar_F._Codd
+Breaking any of these rules weakens data integrity and the ability of the system to manage the data. An RDBMS
+is not merely a bucket of data, but an entire ecosystem for the management of data and data-related
+activities. Before Codd's work, databases were managed on an ad-hoc basis as collections of files with
+links. It was a mess. Data was lost. Only the DBA knew how to find the data, and access methods could be very
+different for data in different locations. Accessing data could also be extremely slow. In addition to
+assuring the integrity of data and managing it, relational database systems are very fast.
+https://en.wikipedia.org/wiki/Codd%27s_12_rules
+The "R" in RDBMS is "relational" and Codd invented the relational model of data. Key to relational data
+modeling is "normal form".
+https://en.wikipedia.org/wiki/Database_normalization
+The RDBMS world generally uses third normal form. Lower levels of normalization create additional work for
+data operations. Higher forms rarely show any improvements. The key concept of normalization is that a datum
+only exists in one place. In the RDBMS world where SQL implements relational algebra, normal form is both
+convenient and natural. In other venues such as paper ledgers, data stored in flat files, or in spreadsheets,
+normal form can seem awkward.
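A minimal sketch of the "a datum only exists in one place" idea, using Python's built-in `sqlite3` module. The `institution` and `record` tables, their columns, and the sample rows are hypothetical and are not part of the SNAC schema. The institution name is stored once, so a single update corrects it for every record that refers to it.

```python
import sqlite3


def build_normalized_db():
    """Create a tiny normalized schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table institution (
            id integer primary key,
            name text not null
        );
        create table record (
            id integer primary key,
            title text not null,
            -- the record stores only the key, never a copy of the name
            institution_id integer not null references institution(id)
        );
        insert into institution (id, name) values (1, 'Exampel Library');
        insert into record (title, institution_id)
            values ('Letter A', 1), ('Letter B', 1);
    """)
    return conn


def record_listing(conn):
    """Join each record to its institution; the name is read from its single home."""
    rows = conn.execute(
        "select r.title, i.name from record as r, institution as i "
        "where r.institution_id = i.id order by r.title"
    )
    return rows.fetchall()


conn = build_normalized_db()
# Fix the misspelled name in one place; both records reflect the correction.
conn.execute("update institution set name = 'Example Library' where id = 1")
print(record_listing(conn))
```

In a flat file or spreadsheet the name would typically be repeated on every row, and each copy would have to be corrected separately.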
--- a/Discussion/Staffing Model.md
+++ b/Discussion/Staffing Model.md
+# Staffing Model (Brian's draft suggestions)
+Production of a cooperatively maintained, high-profile web site requires
+different types of technical and non-technical work.
+Operations Team
+-   Communications and interactions with end users and content owners,
+    from marketing to user support, assessment
+-   Manages help desk
+-   Support production web application infrastructure, including
+    monitoring, "on call" for first tier response to system monitors
+-   Batch ingest of new data sources
+-   Signs up and on-boards new pilot members
+-   Proactive content QA and remediation
+-   Work organized around issue queue / customer relationship management
+    system
+Main Artifact: Ticketing Issue tracker that automatically generates a
+ticket for an email to help@example.edu
+Development Team
+-   Create new features that deliver customer value
+-   Maintain tests for new features
+-   Second tier support of deployed features, developers on call for
+    their deployed code
+-   Deploy code to test, stage, and production environments
+-   Work organized around sprints
+Main Artifact: User story backlog that supports scoring stories by
+points,
+Research Team
+-   Conduct experiments with new algorithms and technologies
+-   Interoperation (and participation in the development) of relevant
+    domain specific standards and practices
+Main Artifact: Research Agenda, schemas and specifications (esp. merge
+spec)
--- a/Help-using-gitlab.md
+++ b/Help-using-gitlab.md
-### Info about Markdown
-Markdown is a markup language that is used in Gitlab for documentation text files. Markdown files have a .md extension and can be edited locally or online. However, for best results, we recommend editing files locally and then uploading them. There are good guides to the syntax [here](https://confluence.atlassian.com/stash/using-stash/markdown-syntax-guide) and [here](https://en.wikipedia.org/wiki/Markdown). 
-### Editing
-You can also edit markdown files locally, using a text editing application (such as TextEdit) or a word processing program (such as Word). However, be aware that some word processing programs may affect line breaks and formatting, which may change how information is displayed.
-You can edit markdown files from the Gitlab web site. From the Gitlab home page, click a project on the right
-side. On the project home page, click "Files" in the left navigation bar. Click a .md file. Click the "Edit"
-button on the right side. Update the text and when finished, enter a commit message below, and click the
-"Commit Changes" button.
-#### Markdown, local complete reference
-http://gitlab.iath.virginia.edu/help/markdown/markdown.md
-#### Markdown, same info, somewhat different format
-https://help.github.com/articles/markdown-basics/
-#### Github extensions to standard markdown:
-https://help.github.com/articles/github-flavored-markdown/
-#### Standard markdown notes:
-https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
 ### Working locally and version control
 The Git technology was created to track revisions to many files. Gitlab provides a web site with some ability

--- a/Help/Markdown.md
+++ b/Help/Markdown.md
+### Info about Markdown
+Markdown is a markup language that is used in Gitlab for documentation text files. Markdown files have a .md extension and can be edited locally or online. However, for best results, we recommend editing files locally and then uploading them. There are good guides to the syntax [here](https://confluence.atlassian.com/stash/using-stash/markdown-syntax-guide) and [here](https://en.wikipedia.org/wiki/Markdown).
+### Editing
+You can edit markdown files locally, rather than on the website.  One full-featured cross-platform Markdown editor is [Atom](http://atom.io).  After opening a file, pressing `Ctrl-Shift-M` for Win/Linux and `Cmd-Shift-M` for Mac will open a real-time preview of the markdown file.
+![Atom Screenshot](http://gitlab.iath.virginia.edu/snac/Documentation/raw/b39387646432816488537cce327f00e41aa79452/images/atom-screenshot.png "Screenshot of Atom editing Interface")
+You can also edit markdown files using a text editing application (such as TextEdit or Notepad). However, be aware that some word processing programs may affect line breaks and formatting, which may change how information is displayed.  
+You can edit markdown files from the Gitlab web site. From the Gitlab home page, click a project on the right
+side. On the project home page, click "Files" in the left navigation bar. Click a .md file. Click the "Edit"
+button on the right side. Update the text and when finished, enter a commit message below, and click the
+"Commit Changes" button.
+### Resources
+* [Gitlab markdown reference (local)](http://gitlab.iath.virginia.edu/help/markdown/markdown.md)
+* [Github markdown reference](https://help.github.com/articles/markdown-basics/)
+* [Github extensions to standard markdown](https://help.github.com/articles/github-flavored-markdown/)
+* [Alternate markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)
+* [Official Markdown documentation](http://daringfireball.net/projects/markdown/)
--- a/Help/README.md
+++ b/Help/README.md
+# SNAC Help
+This directory contains helpful links and documentation on how to use various pieces of the subsystem.  Specifically, it contains help files for Git, Gitlab, and the Markdown syntax.
--- a/move-bare-repo-to-gitlab
+++ b/move-bare-repo-to-gitlab
--- a/move-repo-to-gitlab.md
+++ b/move-repo-to-gitlab.md
--- a/tat_requirements/co-op_background.md
+++ b/tat_requirements/co-op_background.md
--- a/tat_requirements/tat_functional_requirements.md
+++ b/tat_requirements/tat_functional_requirements.md
--- a/Historical Documentation/README.md
+++ b/Historical Documentation/README.md
+# Historical Documentation
+This directory catalogs documentation related to previous iterations of the SNAC project.
--- a/SNAC-Algorithms-And-Use.md
+++ b/SNAC-Algorithms-And-Use.md
--- a/tat_requirements/images/image00.png
+++ b/tat_requirements/images/image00.png
--- a/tat_requirements/images/image01.jpg
+++ b/tat_requirements/images/image01.jpg
--- a/LICENSE
+++ b/LICENSE
+CC0 1.0 Universal
+Statement of Purpose
+The laws of most jurisdictions throughout the world automatically confer
+exclusive Copyright and Related Rights (defined below) upon the creator and
+subsequent owner(s) (each and all, an "owner") of an original work of
+authorship and/or a database (each, a "Work").
+Certain owners wish to permanently relinquish those rights to a Work for the
+purpose of contributing to a commons of creative, cultural and scientific
+works ("Commons") that the public can reliably and without fear of later
+claims of infringement build upon, modify, incorporate in other works, reuse
+and redistribute as freely as possible in any form whatsoever and for any
+purposes, including without limitation commercial purposes. These owners may
+contribute to the Commons to promote the ideal of a free culture and the
+further production of creative, cultural and scientific works, or to gain
+reputation or greater distribution for their Work in part through the use and
+efforts of others.
+For these and/or other purposes and motivations, and without any expectation
+of additional consideration or compensation, the person associating CC0 with a
+Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
+and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
+and publicly distribute the Work under its terms, with knowledge of his or her
+Copyright and Related Rights in the Work and the meaning and intended legal
+effect of CC0 on those rights.
+1. Copyright and Related Rights. A Work made available under CC0 may be
+protected by copyright and related or neighboring rights ("Copyright and
+Related Rights"). Copyright and Related Rights include, but are not limited
+to, the following:
+  i. the right to reproduce, adapt, distribute, perform, display, communicate,
+  and translate a Work;
+  ii. moral rights retained by the original author(s) and/or performer(s);
+  iii. publicity and privacy rights pertaining to a person's image or likeness
+  depicted in a Work;
+  iv. rights protecting against unfair competition in regards to a Work,
+  subject to the limitations in paragraph 4(a), below;
+  v. rights protecting the extraction, dissemination, use and reuse of data in
+  a Work;
+  vi. database rights (such as those arising under Directive 96/9/EC of the
+  European Parliament and of the Council of 11 March 1996 on the legal
+  protection of databases, and under any national implementation thereof,
+  including any amended or successor version of such directive); and
+  vii. other similar, equivalent or corresponding rights throughout the world
+  based on applicable law or treaty, and any national implementations thereof.
+2. Waiver. To the greatest extent permitted by, but not in contravention of,
+applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
+unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
+and Related Rights and associated claims and causes of action, whether now
+known or unknown (including existing as well as future claims and causes of
+action), in the Work (i) in all territories worldwide, (ii) for the maximum
+duration provided by applicable law or treaty (including future time
+extensions), (iii) in any current or future medium and for any number of
+copies, and (iv) for any purpose whatsoever, including without limitation
+commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
+the Waiver for the benefit of each member of the public at large and to the
+detriment of Affirmer's heirs and successors, fully intending that such Waiver
+shall not be subject to revocation, rescission, cancellation, termination, or
+any other legal or equitable action to disrupt the quiet enjoyment of the Work
+by the public as contemplated by Affirmer's express Statement of Purpose.
+3. Public License Fallback. Should any part of the Waiver for any reason be
+judged legally invalid or ineffective under applicable law, then the Waiver
+shall be preserved to the maximum extent permitted taking into account
+Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
+is so judged Affirmer hereby grants to each affected person a royalty-free,
+non transferable, non sublicensable, non exclusive, irrevocable and
+unconditional license to exercise Affirmer's Copyright and Related Rights in
+the Work (i) in all territories worldwide, (ii) for the maximum duration
+provided by applicable law or treaty (including future time extensions), (iii)
+in any current or future medium and for any number of copies, and (iv) for any
+purpose whatsoever, including without limitation commercial, advertising or
+promotional purposes (the "License"). The License shall be deemed effective as
+of the date CC0 was applied by Affirmer to the Work. Should any part of the
+License for any reason be judged legally invalid or ineffective under
+applicable law, such partial invalidity or ineffectiveness shall not
+invalidate the remainder of the License, and in such case Affirmer hereby
+affirms that he or she will not (i) exercise any of his or her remaining
+Copyright and Related Rights in the Work or (ii) assert any associated claims
+and causes of action with respect to the Work, in either case contrary to
+Affirmer's express Statement of Purpose.
+4. Limitations and Disclaimers.
+  a. No trademark or patent rights held by Affirmer are waived, abandoned,
+  surrendered, licensed or otherwise affected by this document.
+  b. Affirmer offers the Work as-is and makes no representations or warranties
+  of any kind concerning the Work, express, implied, statutory or otherwise,
+  including without limitation warranties of title, merchantability, fitness
+  for a particular purpose, non infringement, or the absence of latent or
+  other defects, accuracy, or the present or absence of errors, whether or not
+  discoverable, all to the greatest extent permissible under applicable law.
+  c. Affirmer disclaims responsibility for clearing rights of other persons
+  that may apply to the Work or any use thereof, including without limitation
+  any person's Copyright and Related Rights in the Work. Further, Affirmer
+  disclaims responsibility for obtaining any necessary consents, permissions
+  or other rights required for any use of the Work.
+  d. Affirmer understands and acknowledges that Creative Commons is not a
+  party to this document and has no duty or obligation with respect to this
+  CC0 or use of the Work.
+For more information, please see
+<http://creativecommons.org/publicdomain/zero/1.0/>
--- a/Data-submission-and-Bulk-Data-Standards.md
+++ b/Data-submission-and-Bulk-Data-Standards.md
--- a/Notes/Other-Requirements.md
+++ b/Notes/Other-Requirements.md
+Notes to merge with other requirements
+---
+Note for TAT functional requirements: we need a UI widget for searching very long fields, such as the Joseph Henry cpfRelations
+that contain some 22K entries. We also need to list all fields which might have large numbers of values. In fact, part of the metadata for
+every field is "number of possible entries/repeat values" or whatever that's called.
+This wiki serves as the documentation of the SNAC technical team as relates to Postgres and storage of data. Currently, we are documenting:
+* the schema and reasons behind the schema,
+* methods for handling versioning of eac-cpf documents, and
+* elastic search for postgres.
+* Need a data constraint and business logic API for validating data in the UI. This layer checks user inputs against some set of rules and when there is an issue it informs the user of problems and suggests solutions. Data should be saved regardless. We could allow "bad" data to go into the database (assumes text fields) and flag for later cleaning. That's ugly. Probably better to save inputs in some agnostic repo which could be json, frozen data structures, or name-value pairs, or even portable source format. The problem is that most often data validation is hard coded into UI JavaScript or host code. Validation really should be configurable separate from the display of data, workflow automation, and data storage. While our database needs to do certain rudimentary sanity checks, the database can't be expected to send messages all the way back up to the UI layer. Nor should the database be burdened with validation rules which are certain to be mutable. Ideally, the validation rules would work in the same state machine framework as the workflow automation API, and might even share some code.
+* Need a mechanism to lock records. Even mark-as-deleted records won't necessarily solve all our problems, so we might want records that are "live", but not editable for whatever reason. The ability to lock records and having the lock integrated across all the data-aware APIs is a good idea.
+* On the topic of data, we will have user/group/other and r/w permissions, and all data-aware APIs should be designed with that in mind.
+* "Literate programming": Using MVC and the state machine workflow automation probably meets (and exceeds) Knuth's ideas of literate programming. Look at his ideas and make sure we aren't missing any key concepts.
+* QA and testing need to have several layers, one of which is simple documentation. We should have code that examines data for various qualities, which when done in a comprehensive manner will test the data for all properties described in the requirements. As bugs are discovered and features added, this data testing code would expand. Code should be tested on several levels as well, and the tests plus comments in the tests constitute our full understanding of both data and code.
+* Entities (names) have ID values, ARKs and various kinds of persistent ids. Some subsystem needs to know how to gather various ID values, and how to generate URIs and URLs from those id values. All discovered ids need to be attached to the cpf identity in a table related_id. We will need to track the authority that issued the id, so we need an authority table. Perhaps it is best (as previously discussed) to create a CPF record for each authority, and use the CPF persistent ID as the authority identifier in the related_id table.
+```
+create table related_id (
+        ri_id serial primary key, -- auto-incrementing key (Postgres serial)
+        id_value text,            -- the identifier as issued, e.g. an ARK
+        uri text,
+        url text,
+        authority_id int -- fk to cpf.id?
+);
+```
+* Allow config of CPF output formats via web interface. For example, in the CPF generator, we can offer some format and config options such as name formats in <part> and/or <relationEntry>
+  - include 4 digit fromDate-toDate for person
+  - include dates for corporateBody
+  - use "fl." for active dates
+  - use "active" for active dates
+  - use explicit "b." and "d."
+  - only use "b." or "d." for single 4 digit dates
+  - enclose date in parentheses
+  - add comma between name and dates (applies only if there is a date)
+and so on. In theory that could be done for all kinds of CPF variations. We should have a single checkbox for "use most portable CPF formats", although I suspect the best data exchange format is not XML, but SQLite, SQL INSERT statements, or JSON.
+* Does our schema track who edited a record? If not, we should add user_id to the version table.
+* We should review the schema and create rules that human beings follow for primary key names and foreign key names. Two options are (1) using table_id for both the primary key and the foreign key, or (2) using table_id for the primary key and table_id_fk for the foreign key. There may be a better option.
+  * Option 1 is nice for people doing "join on", but I find "join on" syntax confusing, especially when combined with a where clause.
+  * Option 2 seems more natural, especially when there is a where clause. Having different field names means that field names are more likely to be unique, thus not requiring explicit table.field syntax. Option 2a will be used most of the time. Option 2b shows a three table join where table.field syntax is required.
+```
+-- 1
+select * from foo as a, bar as b where a.table_id=b.table_id;
+-- 2a
+select * from foo as a, bar as b where table_id=table_id_fk;
+-- or 2b
+select * from foo as a, bar as b, baz as c where a.table_id=b.table_id_fk and a.table_id=c.table_id_fk;
+```
+* Identity Reconciliation has a planned "why" feature to report on why a match matches. It would be nice to also report the negative: why didn't this match X? I'm not even sure that is possible, but something to think about.
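The CPF output format options listed above (dates included or not, "fl." versus "active", explicit "b." and "d.", parentheses, comma before dates) could be modeled as a small configuration object. The following is only a sketch: the `NameFormat` class, its field names, and the `format_name` function are hypothetical, and the names and dates are sample values for illustration.

```python
from dataclasses import dataclass


@dataclass
class NameFormat:
    """Hypothetical option set mirroring the checkboxes described above."""
    include_dates: bool = True
    active_label: str = "active"      # alternative: "fl."
    single_date_prefix: bool = False  # use "b." / "d." when only one date is known
    parenthesize: bool = False        # enclose dates in parentheses
    comma_before_dates: bool = True   # applies only if there is a date


def format_name(name, from_date=None, to_date=None, active=False, fmt=None):
    """Render a name heading with optional dates according to fmt."""
    fmt = fmt or NameFormat()
    if not fmt.include_dates or (from_date is None and to_date is None):
        return name
    if from_date is not None and to_date is not None:
        dates = f"{from_date}-{to_date}"
    elif active:
        # A single active date carries no birth/death prefix.
        dates = str(from_date if from_date is not None else to_date)
    elif from_date is not None:
        dates = f"b. {from_date}" if fmt.single_date_prefix else f"{from_date}-"
    else:
        dates = f"d. {to_date}" if fmt.single_date_prefix else f"-{to_date}"
    if active:
        dates = f"{fmt.active_label} {dates}"
    if fmt.parenthesize:
        dates = f"({dates})"
    separator = ", " if fmt.comma_before_dates else " "
    return f"{name}{separator}{dates}"


print(format_name("Henry, Joseph", 1797, 1878))
print(format_name("Henry, Joseph", 1797, 1878,
                  fmt=NameFormat(parenthesize=True, comma_before_dates=False)))
print(format_name("Smith, John", 1850, active=True,
                  fmt=NameFormat(active_label="fl.")))
```

Keeping the options in one object, separate from the rendering function, is one way to let a web form populate the configuration without touching the generator code.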
--- a/Notes/README.md
+++ b/Notes/README.md
+# Technical Notes
+This directory contains technical notes related to the SNAC project by developers.
--- a/Vocabulary-properties-and-ontologies.md
+++ b/Vocabulary-properties-and-ontologies.md
--- a/README.md
+++ b/README.md
-# Introduction to Documentation
+# SNAC Documentation
-The currently-being-revised TAT requirements are found in the [tat_requirements](tat_requirements).
+This repository contains all the documentation for the SNAC Web Application and related frameworks, engines, and pieces.  Specifically:
-The best place to start is the big, overall [plan](tat_requirements/plan.md).
+* The currently-being-revised Technical Requirements are found in the [Requirements Directory](Requirements).  
+* Formal Specifications for those requirements are in the [Specifications Directory](Specifications).
+* [Help](Help) on using Gitlab, Git, and the Markdown text format
+* [Documentation](Third Party Documentation) on third-party software and applications being used
+* [Historical Documentation](Historical Documentation) on previous iterations of SNAC
+* Technical [Discussions](Discussion) related to the SNAC project
+* [Notes](Notes) from the technical team.
-This is Gitlab, a work-alike clone of the Github web site, but installed locally on a SNAC server. Gitlab is a
+The best place to start is the big, overall [plan](plan.md) document, which describes the process going forward for defining requirements and specifications.
+This repository is stored in Gitlab, a work-alike clone of the Github web site, but installed locally on a SNAC server. Gitlab is a
 version control system with a suite of project management tools.
-Ideally we will all create documentation in markdown format (.md files). You may create and edit files from
+Ideally we will all create documentation in [markdown format](http://daringfireball.net/projects/markdown/) (.md files). You may create and edit files from
-the web interface here on gitlab, or download files and edit locally. You can also upload any file type using
+the web interface here on Gitlab, or download files and edit locally. You can also upload any file type using
 standard git commands, or use a Git graphical client (see below). Choose a relevant directory for your docs,
 or create a new directory as necessary.  
 Markdown files are simple text files, which makes them easy to edit and universally portable. Markdown has a
 limited set of conventions to denote headers, lists, URLs and so on. When uploaded to gitlab or github,
 markdown files are rendered into nicely styled HTML. Tools are available to convert markdown into .doc, .pdf,
-LaTex and other formats.
+LaTeX and other formats. For more information on Markdown, see [this guide](Help/Markdown.md).
-#### How to use gitlab and markdown
+#### Help Links
-[Help using gitlab](Help-using-gitlab.md)
+* [Git and Gitlab](Help/Git-and-Gitlab.md)
+* [Markdown](Help/Markdown.md)
---
+<p xmlns:dct="http://purl.org/dc/terms/" xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#" align="center">
+  <a rel="license"
+     href="http://creativecommons.org/publicdomain/zero/1.0/">
-Notes to merge with other requirements
+    <img src="http://i.creativecommons.org/p/zero/1.0/88x31.png" style="border-style: none;" alt="CC0" />
---
+  </a>
+  <br />
-Note for TAT functional requirements: need to have UI widget for search of very long fields, such as the Joseph Henry cpfRelations
+  To the extent possible under law,
-that contain some 22K entries. Also need to list all fields which migh have large numbers of values. In fact, part of the meta data for
+  <span resource="[_:publisher]" rel="dct:publisher">
-every field is "number of possible entries/reapeat values" or whatever that's called.This wiki serves as the documentation of the SNAC technical team as relates to Postgres and storage of data.  Currently, we are documenting:
+    <span property="dct:title">SNAC Cooperative</span></span>
-* the schema and reasons behind the schema, 
+  has waived all copyright and related or neighboring rights to
-* methods for handling versioning of eac-cpf documents, and
+  <span property="dct:title">SNAC Documentation</span>.
-* elastic search for postgres.
+This work is published from:
+<span property="vcard:Country" datatype="dct:ISO3166"
-* Need a data constraint and business logic API for validating data in the UI. This layer checks user inputs against some set of rules and when there is an issue it informs the user of problems and suggests solutions. Data should be saved regardless. We could allow "bad" data to go into the database (assumes text fields) and flag for later cleaning. That's ugly. Probably better to save inputs in some agnostic repo which could be json, frozen data structures, or name-value pairs, or even portable source format. The problem is that most often data validation is hard coded into UI JavaScript or host code. Validation really should be configurable separate from the display of data, workflow automation, and data storage. While our database needs to do certain rudimentary sanity checks, the database can't be expected to send messages all the way back up to the UI layer. Nor should the database be burdened with validation rules which are certain to be mutable. Ideally, the validation rules would work in the same state machine framework as the workflow automation API, and might even share some code.
+      content="US" about="[_:publisher]">
+  United States</span>.
-* Need a mechanism to lock records. Even mark-as-deleted records won't necessarily solve all our problems, so we might want records that are "live", but not editable for whatever reason. The ability to lock records and having the lock integrated across all the data-aware APIs is a good idea.
+</p>
\ No newline at end of file
-* On the topic of data, we will have user/group/other and r/w permissions, and all data-aware APIs should be designed with that in mind. 
-* "Literate programming" Using MCV and the state machine workflow automation probably meets (and exceeds) Knuth's ideas of Literate programming. Look at his idea and make sure we aren't missing any key concepts.
-* QA and testing needs to be several layers, one of which is simple documentation. We should have code that examines data for various qualities, which when done in a comprehensive manner will test the data for all properties described in the requirements. As bugs are discovered and features added, this data testing code would expand. Code should be tested on several levels as well, and the tests plus comments in the tests constitute our full understanding of both data and code.
-* Entities (names) have ID values, ARKs and various kinds of persistent ids. Some subsystem needs to know how to gather various ID values, and how to generate URIs and URLs from those id values. All discovered ids need to be attached to the cpf identity in a table related_id. We will need to track the authority that issued the id, so we need an authority table. Perhaps it is best (as previously discussed) to create a CPF record for each authority, and use the CPF persistent ID as the authority identifier in the related_id table. 
-```
-create table related_id (
-        ri_id auto primary key,
-        id_value text,
-        uri text,
-        url text,
-        authority_id int -- fk to cpf.id?
-);
-```
-* Allow config of CPF output formats via web interface. For example, in the CPF generator, we can offer some format and config options such as name formats in <part> and/or <relationEntry>
-  - include 4 digit fromDate-toDate for person
-  - include dates for corporateBody
-  - use "fl." for active dates
-  - use "active" for active dates
-  - use explicit "b." and "d."
-  - only use "b." or "d." for single 4 digit dates
-  - enclose date in parentheses
-  - add comma between name and dates (applies only if there is a date)
-and so on. In theory that could be done for all kinds of CPF variations. We should have a single checkbox for "use most portable CPF formats" although I suspect the best data exchange format is not XML, but SQlite, SQL INSERT statements, or json.
-* Does our schema track who edited a record? If not, we should add user_id to the version table.
-* We should review the schema and create rules that human beings follow for primary key names, and foreign key names. Two options are table_id and table_id_fk. There may be a better option
-  * Option 1 is nice for people doing "join on", but I find "join on" syntax confusing, especially when combined with a where clause
-  * Option 2 seems more natural, especially when there is a where clause. Having different field names means that field names are more likely to be unique, thus not requiring explicit table.field syntax. Option 2a will be used most of the time. Option 2b shows a three table join where table.field syntax is required.
-```
-- 1
-select * from foo as a, bar as b where a.table_id=b.table_id;
-- 2a
-select * from foo as a, bar as b where table_id=table_id_fk;
-- or 2b
-select * from foo as a, bar as b, baz as c where a.table_id=b.table_id_fk and a.table_id=c.table_id_fk;
-```
-* Identity Reconciliation has a planned "why" feature to report on why a match matches. It would be nice to also report the negative: why didn't this match X? I'm not even sure that is possible, but something to think about. 
--- a/Requirements/Generated Documents.md
+++ b/Requirements/Generated Documents.md
+# System-Generated Documents
+The following documents and data should be generated from the completed system.
+## Data Interoperability
+Data should be available to be downloaded in the following formats:
+* EAC-CPF XML
+    * Individual identity constellations should be downloadable as fully-formed EAC-CPF XML documents
+* Turtle Triples
+    * Subsets of the data, including the entire database, should be exportable as well-formed Turtle triples
+* RDF Triples
+    * Subsets of the data, including the entire database, should be exportable as well-formed RDF triples
+* JSON-LD
+    * Subsets of the data, not including the entire database, should be exportable as well-formed JSON-LD
+## System Reports
+While the web interface is the primary public face of SNAC, many other views of the data and metadata are
+necessary, especially for admins and governance. Those "views" are reports and will primarily be generated via
+integration of a third-party reporting package such as Jaspersoft Business Intelligence Suite, which is free,
+open source, and includes a full range of tools.
+For each user of the system, the following reports should be available for download:
+* List of records the user has edited
+* Number of records the user has edited
+For each holding institution, the following reports should be available for download:
+* Number of records the institution has edited
+* Number of records the institution has contributed
+* List of records the institution has contributed
+* List of records the institution has edited
+* List of individuals within the institution and the records edited by each person
+* List of records the institution has contributed with individuals who contributed to each record
+General reporting:
+* Number of participating holding institutions
+* Number of records edited per hour, day, month, year
+* Number of identity constellations available in the database
--- a/Requirements/Identity Reconciliation.md
+++ b/Requirements/Identity Reconciliation.md
+# Identity Reconciliation
--- a/Requirements/Internal Data Storage.md
+++ b/Requirements/Internal Data Storage.md
+# Internal Data Storage
+The data should be stored in a SQL database. Every piece of data is in a separate field to the extent that is practical.
+Data is organized into fields (columns), records (rows), and tables. Fields related to each other are in the
+same table. Every record has a unique, permanent, numerical id often called a "key" or "primary key". For
+the SNAC Co-op we have decided that records are never overwritten during update. An update operation creates a new record identical to the old record except for the updated
+fields. All old records are available for viewing via a special interface. The old records are invisible to
+operations that are intellectually acting on "current" data.
+Version history, including past versions of a field and record, users that made changes to that data, institution history, and timestamps must be kept in the internal data storage.
+Provenance of each element must be captured as well, including across merges and splits of identity constellations.
+The application must avoid storing mixed markup as much as possible. (Brad Westbrook suggests we avoid mixed markup.)
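The "never overwrite" rule above can be sketched with Python's standard `sqlite3` module. This is only an illustration under assumptions: the table name `name_entry`, its columns, and the use of an `(id, version)` pair are hypothetical and are not the SNAC schema.

```python
import sqlite3


def open_store():
    """Create an in-memory database with one hypothetical versioned table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE name_entry (
            id        INTEGER NOT NULL,  -- permanent record id
            version   INTEGER NOT NULL,  -- grows with every update
            user_id   INTEGER NOT NULL,  -- who made the change
            modified  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            full_name TEXT NOT NULL,
            PRIMARY KEY (id, version)
        )
        """
    )
    return conn


def update_name(conn, record_id, user_id, full_name):
    """Insert a new version of the record instead of overwriting the old row."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM name_entry WHERE id = ?",
        (record_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO name_entry (id, version, user_id, full_name) "
        "VALUES (?, ?, ?, ?)",
        (record_id, latest + 1, user_id, full_name),
    )
    return latest + 1


def current_name(conn, record_id):
    """Operations on "current" data see only the highest version."""
    row = conn.execute(
        "SELECT full_name FROM name_entry WHERE id = ? "
        "ORDER BY version DESC LIMIT 1",
        (record_id,),
    ).fetchone()
    return row[0] if row else None


def history(conn, record_id):
    """All versions, oldest first, for the read-only history view."""
    return conn.execute(
        "SELECT version, user_id, full_name FROM name_entry "
        "WHERE id = ? ORDER BY version",
        (record_id,),
    ).fetchall()
```

Keeping the user id and timestamp on every version row is one way to capture who changed what and when. Provenance across merges and splits would need more than this sketch shows.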
+## Captured actions on data
+Prior to human edits, merged records can be algorithmically split by the computer, assuming we write code to
+perform such a split. After human edit, a split must be performed by a human. It is a requirement that all
+previous versions can be viewed (read-only) during the human-mediated split operation so the human can refer
+back to previous information.
+After human edits, rollback only applies to human edited versions. There is a fire-break where rollback cannot
+cross from human edits back to machine-merged descriptions. The policy group needs to supply policy
+requirements for the tech folks to implement.
+The broad requirements for the application are: edit data, split records, merge records. Secondary features to
+make the system useful include: work flow enforcement, search, reporting (including "watch" features),
+administration, authorization (data privileges).
--- a/Requirements/Licensing.md
+++ b/Requirements/Licensing.md
+# Licensing and Copyright
+The documentation and code generated by the SNAC Cooperative must have license files and text associated with them.
+* [Documentation](#documentation)
+* [Code](#code)
+## Documentation
+All documentation must be assigned the Creative Commons Zero (CC0) license.  Its text is below:
+```
+CC0 1.0 Universal
+Statement of Purpose
+The laws of most jurisdictions throughout the world automatically confer
+exclusive Copyright and Related Rights (defined below) upon the creator and
+subsequent owner(s) (each and all, an "owner") of an original work of
+authorship and/or a database (each, a "Work").
+Certain owners wish to permanently relinquish those rights to a Work for the
+purpose of contributing to a commons of creative, cultural and scientific
+works ("Commons") that the public can reliably and without fear of later
+claims of infringement build upon, modify, incorporate in other works, reuse
+and redistribute as freely as possible in any form whatsoever and for any
+purposes, including without limitation commercial purposes. These owners may
+contribute to the Commons to promote the ideal of a free culture and the
+further production of creative, cultural and scientific works, or to gain
+reputation or greater distribution for their Work in part through the use and
+efforts of others.
+For these and/or other purposes and motivations, and without any expectation
+of additional consideration or compensation, the person associating CC0 with a
+Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
+and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
+and publicly distribute the Work under its terms, with knowledge of his or her
+Copyright and Related Rights in the Work and the meaning and intended legal
+effect of CC0 on those rights.
+1. Copyright and Related Rights. A Work made available under CC0 may be
+protected by copyright and related or neighboring rights ("Copyright and
+Related Rights"). Copyright and Related Rights include, but are not limited
+to, the following:
+  i. the right to reproduce, adapt, distribute, perform, display, communicate,
+  and translate a Work;
+  ii. moral rights retained by the original author(s) and/or performer(s);
+  iii. publicity and privacy rights pertaining to a person's image or likeness
+  depicted in a Work;
+  iv. rights protecting against unfair competition in regards to a Work,
+  subject to the limitations in paragraph 4(a), below;
+  v. rights protecting the extraction, dissemination, use and reuse of data in
+  a Work;
+  vi. database rights (such as those arising under Directive 96/9/EC of the
+  European Parliament and of the Council of 11 March 1996 on the legal
+  protection of databases, and under any national implementation thereof,
+  including any amended or successor version of such directive); and
+  vii. other similar, equivalent or corresponding rights throughout the world
+  based on applicable law or treaty, and any national implementations thereof.
+2. Waiver. To the greatest extent permitted by, but not in contravention of,
+applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
+unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
+and Related Rights and associated claims and causes of action, whether now
+known or unknown (including existing as well as future claims and causes of
+action), in the Work (i) in all territories worldwide, (ii) for the maximum
+duration provided by applicable law or treaty (including future time
+extensions), (iii) in any current or future medium and for any number of
+copies, and (iv) for any purpose whatsoever, including without limitation
+commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
+the Waiver for the benefit of each member of the public at large and to the
+detriment of Affirmer's heirs and successors, fully intending that such Waiver
+shall not be subject to revocation, rescission, cancellation, termination, or
+any other legal or equitable action to disrupt the quiet enjoyment of the Work
+by the public as contemplated by Affirmer's express Statement of Purpose.
+3. Public License Fallback. Should any part of the Waiver for any reason be
+judged legally invalid or ineffective under applicable law, then the Waiver
+shall be preserved to the maximum extent permitted taking into account
+Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
+is so judged Affirmer hereby grants to each affected person a royalty-free,
+non transferable, non sublicensable, non exclusive, irrevocable and
+unconditional license to exercise Affirmer's Copyright and Related Rights in
+the Work (i) in all territories worldwide, (ii) for the maximum duration
+provided by applicable law or treaty (including future time extensions), (iii)
+in any current or future medium and for any number of copies, and (iv) for any
+purpose whatsoever, including without limitation commercial, advertising or
+promotional purposes (the "License"). The License shall be deemed effective as
+of the date CC0 was applied by Affirmer to the Work. Should any part of the
+License for any reason be judged legally invalid or ineffective under
+applicable law, such partial invalidity or ineffectiveness shall not
+invalidate the remainder of the License, and in such case Affirmer hereby
+affirms that he or she will not (i) exercise any of his or her remaining
+Copyright and Related Rights in the Work or (ii) assert any associated claims
+and causes of action with respect to the Work, in either case contrary to
+Affirmer's express Statement of Purpose.
+4. Limitations and Disclaimers.
+  a. No trademark or patent rights held by Affirmer are waived, abandoned,
+  surrendered, licensed or otherwise affected by this document.
+  b. Affirmer offers the Work as-is and makes no representations or warranties
+  of any kind concerning the Work, express, implied, statutory or otherwise,
+  including without limitation warranties of title, merchantability, fitness
+  for a particular purpose, non infringement, or the absence of latent or
+  other defects, accuracy, or the present or absence of errors, whether or not
+  discoverable, all to the greatest extent permissible under applicable law.
+  c. Affirmer disclaims responsibility for clearing rights of other persons
+  that may apply to the Work or any use thereof, including without limitation
+  any person's Copyright and Related Rights in the Work. Further, Affirmer
+  disclaims responsibility for obtaining any necessary consents, permissions
+  or other rights required for any use of the Work.
+  d. Affirmer understands and acknowledges that Creative Commons is not a
+  party to this document and has no duty or obligation with respect to this
+  CC0 or use of the Work.
+For more information, please see
+<http://creativecommons.org/publicdomain/zero/1.0/>
+```
+## Code
+All code must be assigned the BSD 3-Clause license, including the copyright header for the Rector and Visitors of the University of Virginia, and
+the Regents of the University of California, as printed in the text below:
+```
+Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
+the Regents of the University of California
+All rights reserved.
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+1. Redistributions of source code must retain the above copyright notice, this
+list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright notice,
+this list of conditions and the following disclaimer in the documentation
+and/or other materials provided with the distribution.
+3. Neither the name of the copyright holder nor the names of its contributors
+may be used to endorse or promote products derived from this software without
+specific prior written permission.
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+```
--- a/Requirements/Mellon Proposal.md
+++ b/Requirements/Mellon Proposal.md
--- a/Requirements/Name Parser.md
+++ b/Requirements/Name Parser.md
--- a/Requirements/New Features.md
+++ b/Requirements/New Features.md
+# Required New Features
+The majority of new features will be in two areas: the maintenance
+system, and the administration system. None of this code exists. The
+maintenance system has a web UI and a server-based back end that
+interacts with the same database used by the match-merge. The
+maintenance system also requires an authentication system (login) that
+allows us to manage the extensive collaborative efforts. The current
+processing of data is accomplished only on servers at the command line,
+and is handled directly by project programmers. In the new maintenance
+system, that will be driven by content experts via a web site, and
+therefore must expect the issues of authentication and authorization
+inherent in collaborative data manipulation web applications.
+The system will require reports. These will cover broad classes of
+issues related to managing resources, usage statistics, administration,
+maintenance, and some reports for end user researchers.
+- Web application (architect: Robbie)
+The web application is a wrapper for all the APIs. It can have an API of its own, or not. It handles all HTTP
+requests, validating the data, deciding what needs to be done, doing real work, and handing some output back
+to the user. Typically the output is HTML, but we are already planning for file downloads, and JSON data as
+output from REST API calls.
+- Data validation API
+Data from the web browser needs sanity checking and untainting before being handed to the rest of the
+application. Initially the data validation API can consist of nothing more than untainting input from the
+browser. We can add various checks and tests. We need to decide if the validation API can reject data, and if
+it can, then it needs to interact with the work flow engine, the actual work flow, and whatever messaging
+system we use to display messages to end users.
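A minimal sketch of untainting, assuming a whitelist approach in which each field has one regular expression it must fully match. The field names, the patterns, and the `ValidationError` class are hypothetical. Whether validation may reject data at all is still an open question in the text above.

```python
import re
import unicodedata


class ValidationError(ValueError):
    """Raised when input cannot be untainted (hypothetical error type)."""


# Hypothetical per-field whitelist patterns; real rules are not yet decided.
FIELD_RULES = {
    "cpf_id": re.compile(r"[0-9]{1,12}"),
    # Any printable text up to 200 characters, no control characters.
    "search": re.compile(r"[^\x00-\x1f\x7f]{1,200}"),
}


def untaint(field, raw):
    """Return a cleaned value, or raise ValidationError if it fails the rule."""
    if field not in FIELD_RULES:
        raise ValidationError(f"unknown field: {field}")
    value = unicodedata.normalize("NFC", raw).strip()
    if not FIELD_RULES[field].fullmatch(value):
        raise ValidationError(f"invalid value for field: {field}")
    return value
```

Raising an exception is only one option. If the validation API is not allowed to reject data, the same function could return a flag for the work flow engine to act on.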
+- Identity Reconciliation (aka IR) (architect: Robbie)
+This API uses many aspects of identity, testing each against a target population of other identities. The
+final answer is a floating point number giving a match strength. IR has two modes of operation. Mode one
+compares two identities and returns a match strength. Mode two compares a single identity against the entire
+database returning match strength. Mode two is somewhat unclear.
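The two modes can be illustrated with a toy scorer. This is not the IR algorithm: it stands in `difflib.SequenceMatcher` from the Python standard library for the real tests, and the 0.8/0.2 weights, the `name`/`born`/`id` keys, and the ranked-list reading of mode two are all assumptions made for the sketch.

```python
from difflib import SequenceMatcher


def match_strength(a, b):
    """Mode one: compare two identities, return a float in [0.0, 1.0]."""
    name_score = SequenceMatcher(
        None, a["name"].lower(), b["name"].lower()
    ).ratio()
    born = a.get("born")
    # Dates only count when both sides actually have one.
    date_score = 1.0 if born is not None and born == b.get("born") else 0.0
    return 0.8 * name_score + 0.2 * date_score


def best_matches(candidate, population, limit=5):
    """Mode two (one possible reading): rank the whole population."""
    scored = [
        (match_strength(candidate, other), other["id"]) for other in population
    ]
    scored.sort(reverse=True)
    return scored[:limit]
```

Returning a ranked list of `(strength, id)` pairs is one way to make mode two concrete. A single best match or a threshold cut-off would be equally plausible.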
+- workflow manager (Tom)
+Every action the application can perform is part of the work flow. The names of these actions along with names
+of their requisites are organized into a work flow table. The work flow engine does not know how to do real
+work, but it does know the names of the functions which do the real work. A new feature (aka function, task)
+is added to the application by adding its name to the work flow and creating a function of the same name in
+the application. Likewise, requisites are determined by boolean functions, and every requisite must have a
+matching function known to the work flow engine. The work flow enforces role-based behavior by testing the
+requisites. The workflow engine exists, but needs to be ported from Perl to PHP, and the work flow data should
+be stored in the SQL database.
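The name-based dispatch described above can be sketched as follows. This is not the existing Perl engine: the task names, requisite names, and the `ctx` dictionary are hypothetical, and the work flow table is a plain dict here rather than a SQL table.

```python
# Work flow table: task name -> names of requisites that must all be true.
WORKFLOW = {
    "view_record": [],
    "edit_record": ["is_authenticated", "has_editor_role"],
}


# Requisites are boolean functions.
def is_authenticated(ctx):
    return ctx.get("user") is not None


def has_editor_role(ctx):
    return "editor" in ctx.get("roles", ())


# Tasks do the real work; the engine only knows their names.
def view_record(ctx):
    return f"viewing {ctx['record']}"


def edit_record(ctx):
    return f"editing {ctx['record']}"


REQUISITES = {f.__name__: f for f in (is_authenticated, has_editor_role)}
TASKS = {f.__name__: f for f in (view_record, edit_record)}


def run(task_name, ctx):
    """Check every requisite by name, then call the task of the same name."""
    if task_name not in WORKFLOW:
        raise KeyError(f"unknown task: {task_name}")
    for requisite in WORKFLOW[task_name]:
        if not REQUISITES[requisite](ctx):
            raise PermissionError(f"requisite failed: {requisite}")
    return TASKS[task_name](ctx)
```

Adding a feature then means adding one row to `WORKFLOW` and one function with the same name, which mirrors the description above.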
+- Support for work history and task staging.
+Editing consists of several stages of work that may be performed by different people and/or different
+roles. We need database tables to support saving of work state data. Create a prototype table schema so we can
+think about this problem and create a functional spec.
+For an edit we need the CPF id, user id, timestamp, bitfield or work flow tags, and optional user notes. For
+search we need: user id, search string, and timestamp.
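As a starting point for the prototype schema, the fields listed above could be laid out as two tables. This sketch uses SQLite through Python's `sqlite3` module only so that it runs anywhere. The table and column names are assumptions, and the work flow tags are stored as a comma-separated string rather than a bitfield.

```python
import sqlite3

# Hypothetical prototype schema for work history and search history.
SCHEMA = """
CREATE TABLE edit_history (
    cpf_id        INTEGER NOT NULL,
    user_id       INTEGER NOT NULL,
    stamp         TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    workflow_tags TEXT    NOT NULL,  -- comma-separated tags
    notes         TEXT               -- optional user notes
);
CREATE TABLE search_history (
    user_id       INTEGER NOT NULL,
    search_string TEXT    NOT NULL,
    stamp         TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_history_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def log_edit(conn, cpf_id, user_id, tags, notes=None):
    """Record one stage of editing work on a CPF record."""
    conn.execute(
        "INSERT INTO edit_history (cpf_id, user_id, workflow_tags, notes) "
        "VALUES (?, ?, ?, ?)",
        (cpf_id, user_id, ",".join(sorted(tags)), notes),
    )


def log_search(conn, user_id, search_string):
    conn.execute(
        "INSERT INTO search_history (user_id, search_string) VALUES (?, ?)",
        (user_id, search_string),
    )


def edits_by_user(conn, user_id):
    """Edits in insertion order, e.g. for a "records I have edited" report."""
    return conn.execute(
        "SELECT cpf_id, workflow_tags, notes FROM edit_history "
        "WHERE user_id = ? ORDER BY rowid",
        (user_id,),
    ).fetchall()
```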
+- SQL schema (Robbie, Tom)
+All data is stored in a SQL database. Details are given elsewhere.
+- Controlled vocabulary subsystem or API [Tag system](#controlled-vocabularies-and-tag-system)
+We need controlled vocabulary for several data fields. This system handles all aspects of all controlled vocabularies.
+- CPF to SQL parser (Robbie)
+The input for the application is CPF files. These files need to be parsed into data fields and input into the
+SQL database. This application exists, but needs some additional functionality.
+- Name serialization tool, selectable pre-configured formats
+Outputting name strings based on name data fields in the database is a tricky problem. There are several
+output formats. The name serialization deals with this issue.
+- Name string parser
+Names in CPF files are currently strings. The CPF <part> element has been imported into the SQL database as a
+string, but data needs require individual name components. Parsing names is a tricky problem, but several
+parsers exist. We need to integrate one or more parsers, and perhaps tweak those parsers to handle the SNAC names.
+- Date parser
+We have several date parsers, but none are fully comprehensive. We can use the existing parsers, but they need
+to be integrated into a single, comprehensive parser.
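One way to combine several parsers is a simple chain: try each in order and take the first result. The three small parsers below are placeholders written for this sketch, not the project's existing date parsers, and the `(start, end)` return shape is an assumption.

```python
import re
from datetime import date


def parse_iso_day(text):
    """Parse YYYY-MM-DD; return (start, end) or None."""
    m = re.fullmatch(r"([0-9]{4})-([0-9]{2})-([0-9]{2})", text)
    if not m:
        return None
    try:
        day = date(*map(int, m.groups()))
    except ValueError:  # e.g. month 13
        return None
    return (day.isoformat(), day.isoformat())


def parse_year_range(text):
    """Parse YYYY-YYYY; return (start, end) or None."""
    m = re.fullmatch(r"([0-9]{4})\s*-\s*([0-9]{4})", text)
    return (m.group(1), m.group(2)) if m else None


def parse_year(text):
    """Parse a bare four-digit year; return (start, end) or None."""
    return (text, text) if re.fullmatch(r"[0-9]{4}", text) else None


# Order matters: more specific parsers come first.
PARSERS = (parse_iso_day, parse_year_range, parse_year)


def parse_date(text):
    """Return (parser name, (start, end)) from the first parser that succeeds."""
    text = text.strip()
    for parser in PARSERS:
        result = parser(text)
        if result is not None:
            return parser.__name__, result
    return None
```

Returning the parser name alongside the result makes it easier to see in QA which parser handled which input.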
+- CPF record edit, edit each field
+Record editing on the server is handled by a collection of functions. The specifications for this may evolve
+in parallel to the code. We know that each field needs to be changed, but the details of work flow and data
+validation have not been determined. Work flow and validation are both likely to change as the SNAC policies
+evolve. There are UI requirements for editing.
+- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
+Record splitting requires a set of functions and UI requirements documented elsewhere.
+- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
+Record merge requires a set of functions and UI requirements documented elsewhere.
+- Object architecture, coding style, class template (architect Robbie)
+We will have a specific architecture of the web application, and of the classes and objects involved.
+- UI widgets, mostly off the shelf, some custom written. We need to have UI edit/chooser widget for search and
+  select of large numbers of options, such as the Joseph Henry cpfRelations that contain some 22K
+  entries. Also need to list all fields which might have large numbers of values. In fact, part of the meta
+  data for every field is "number of possible entries/repeat values" or whatever that's called. From a
+  software architecture perspective, the answer is 0, 1, infinite.
+One important aspect of the project is long-term viability and preservation. We should be able to export all
+data and metadata in standard formats. Part of the API should cover export facilities so that over time we can
+easily add new export features to support emerging standards.
+The ability to export all the data for preservation purposes also gives us the ability to offer bulk data
+downloads to researchers and collaborating peer institutions.
--- a/Requirements/README.md
+++ b/Requirements/README.md
+# Requirements Documents
+These documents describe the functionality desired of the system.  These should be high-level requirements, geared toward the policy side, of the form "The system should do X."
--- a/Requirements/Software Development Process.md
+++ b/Requirements/Software Development Process.md
+# Software Development Process
+Development on the SNAC web application should use agile development practices, with the shortest reasonable sprint length.  See [scrum documentation](http://scrummethodology.com/scrum-sprint/) for more detailed information about agile development methods.  Test-driven development should also be employed to automate testing and interconnect testing with the development process.
+The git version control system should be used as the repository for code in the application.  It allows distributed editing with highly-configurable branching of development, a "blame" system that allows viewing which developer added a specific line of code, and is cross-platform.  It is also supported by [gitlab](http://gitlab.iath.virginia.edu), which should be used for internal development timelines, milestones, bug- and issue-tracking, and project management.  Final versions of the repositories may then be pushed to the public-facing [github](https://github.com/snac-cooperative) repositories.
+## General Discussion Notes
+Choices for programming languages, operating system, databases, version
+control, and various related tools and practices are based on extensive
+experience of the developer community, and a complex set of requirements
+for the coding process. Current best practices are agile development
+using practices that allow programmers wide leeway for implementation
+while still keeping the processes manageable.
+Test-driven development ideally means automated testing, with careful
+attention to regression testing. It takes some extra time up front to
+write the tests. Each test is small, and corresponds to small sections
+of code where both code and test can be quickly created. In this way,
+the software is kept in a working state with only brief downtimes during
+feature creation or bug fixes. Large programs are made up of
+intentionally small functions each of which is tested by a small
+automated test.
+Regression testing refers to verifying that old bugs do not reappear.
+Every bug fix has a corresponding test, even if the function in question
+did not originally have a test for the bug. Each new bug needs a new
+test. Bugs frequently reappear, especially in complex sections of code.
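A small function paired with a small automated test might look like the sketch below, which uses Python's standard `unittest` module. The function `collapse_spaces` and the tab-handling bug it guards against are invented for illustration.

```python
import unittest


def collapse_spaces(text):
    """Collapse runs of whitespace to a single space and trim both ends."""
    return " ".join(text.split())


class CollapseSpacesTest(unittest.TestCase):
    def test_double_space(self):
        self.assertEqual(collapse_spaces("Henry,  Joseph"), "Henry, Joseph")

    def test_regression_tab_between_parts(self):
        # Hypothetical old bug: a tab between name parts was left in place.
        # The test stays in the suite so the bug cannot quietly return.
        self.assertEqual(collapse_spaces("Henry,\tJoseph "), "Henry, Joseph")
```

Saved as a module, a suite like this runs with `python -m unittest`, so it can be executed on every check-in.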
+Source code version control is vital to both development process, and to
+the release process. During development, frequent small changes are
+checked-in to the version control, along with a meaningful comment. The
+history of the code can be tracked. This occasionally helps to
+understand how bugs come into existence. In the Git system, the history
+command is “blame”, a bit of programmer dark humor where the history is
+used to know who to blame for a bug (or any undesirable feature).
+Moving code into Quality Assurance (QA) and then into the production
+environment are both integral with source code management. Many version
+control systems allow tagging a release with a name. The collected
+source code files are marked as a named (virtual) collection, and can be
+used to update a QA area. Human testing and review happens in QA. After
+QA we have release. Depending on the nature of the system release can be
+quite complex with many parties needing to be notified, and coordination
+across groups of developers, sysadmin, managers, support staff, and
+customers. Agile development tends towards small, seamless releases on a
+frequent (weekly or monthly) basis where communication is primarily via
+update of electronic documentation. The process needs to assure that
+fixes and new features are documented. The system must have tools to see
+the current version of the system with its change log, as well as
+comparing that to previous releases. All of these are integrated with
+change management.
+Bug reporting and feature requests fall (broadly speaking) into the
+category of change management. Typically a small group of senior
+developers and stakeholders review the bug/feature tracking system to
+assign priorities, clarify, and investigate. There are good
+off-the-shelf systems for tracking bugs and feature requests, so we have
+several choices. This process happens almost as frequently as the
+features/bug fix coding work of the developers. That means on-going,
+more or less continuous review of fix/features requests every few days,
+depending on how independent the developers are. Agile applies to
+everyone on the project. Ideal change management is not onerous. As
+tasks are completed, someone (usually a developer) updates the feature status with "in
+progress", "completed" and so on. There might be additional status
+updates from QA and release, but SNAC probably isn't large enough to
+justify anything too complex.
+#### QA and Related Tests for Test-driven Development
+The data extraction pipelines manage massive amounts of data, and
+visually checking descriptions for bugs would be inefficient if not
+infeasible. The MARC extraction process is verified by just over 100
+quality assurance descriptions. The output produced from each
+description is checked for some specific value that confirms that the
+code is working correctly and historical bugs have not reappeared. The
+EAD extraction has a set of QA files, but the output verification is not
+yet automated. A variety of file counts and measures of various sorts
+are performed to verify that descriptions have all been processed. All
+CPF output is validated against the Relax NG schema. Processing log
+files are checked for a variety of error messages. Settings used for
+each run are recorded in documentation maintained with the output files.
+The source code is stored in a Subversion repository.
+Our disaster recovery processes must be carefully documented.
--- a/Requirements/User Documentation.md
+++ b/Requirements/User Documentation.md
+# User Documentation
+Every aspect of the system requires documentation. Most visible to the public is the user interface for
+discovery. Maintenance will be complicated, and our processes are somewhat novel, so this will need to be
+extensive, well illustrated with screenshots, and carefully tested.
+Documentation intended for developers might be somewhat sparse by comparison, but will be critical to the
+on-going software development process. All the databases, operating system, httpd and other servers need
+complete documentation of installation, configuration, deployment, starting, stopping, and emergency
+procedures.
--- a/Requirements/User Interface.md
+++ b/Requirements/User Interface.md
+# User Interface Requirements
+## Web Application
+Some aspects of the web app aren't yet clear, so there are details to be worked out, and some large-ish
+concepts to clarify. I'm guessing we will agree on most things, and one of us or the other will just concede
+on stuff where we don't agree.
+Requirements:
+- expose an http accessible API that is viable for `wget` or `curl`, browser `<form>`, and Ajax calls.
+- Supported input format depends on the complexity of the requested operation.
+- Public functions require no authentication. Everything else must include authentication data.
+- Sandbox functionality for training and testing, which doesn't modify actual SNAC data
+### Web application output via template
+A well known, easy, powerful method of creating presentation output is to use a template module. Templating
+separates business logic from presentation logic, thus following an MVC model. Our business logic is our work
+flow and related function calls. Presentation is our UI, and the work flow engine has no idea that a UI exists,
+let alone how to create it. Curiously, the presentation logic knows how to create the presentation rendering,
+but has no idea what it does or what it interacts with. This is another example of strong separation of
+concerns.
+A simple hello world text template with a single variable world = "world" would be:
+```
+Hello [% world %]!
+```
+Or a simple HTML version:
+```
+<html><body>Hello [% world %]!</body></html>
+```
+That example is based on the Template Toolkit http://www.template-toolkit.org/ for which there is a Perl
+module, and a Python module. Template modules are fairly common, so I'm almost certain we will have several to
+choose from in PHP.
+Choosing our own select software modules, including a template module, is better than being locked into a
+large, cumbersome web framework. In general, web frameworks have issues:
+- difficult to work with
+- no useful functionality that isn't more easily found in another software module
+- they often break MVC
+- generally make debugging nearly impossible
+We can do much better by selecting a few modules to create a lightweight quasi-framework that is perfectly matched to our
+needs.
+Once the internal API completes its work, we will have output data. Output data is passed to a rendering
+layer that relies on the template module. The only code that knows anything about rendering is the rendering
+layer. To all the non-rendering code, there is only "output data" which does conform to a standard structure
+(almost certainly an output data object). The rendering layer takes the output object, and the requested format
+of the output (text, html, pdf, xml, etc.) to create the output. Happily, "rendering" is generally a single
+function call. We create a template object, call its "render" method with two arguments:
+1. template file name,
+2. the output data object.
+Default behavior is to write the output to stdout, but the render method can also
+return the output in a variable so we can create an http download.
+Templates are human created static files containing placeholders. The template engine fills in the placeholders with
+values from relevant parts of the output data. Clearly, the output data object and the template must share an
+object/property naming convention. The template engine functionality has single value fields, looping over
+input lists, and if statement branching based on input. But that's pretty much it. No work is done in the
+template that is not directly concerned with filling in placeholders, not even formatting (in the sense of
+rounding numbers, capitalizing strings, or adding html tags). Templates are valid documents of the output
+type, except in rare cases. The attached template is well-formed XML.
+The web app needs a file download output option as well as output to stdout.
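To make the render step concrete, here is a toy placeholder filler for the `[% name %]` syntax shown earlier. It is not Template Toolkit and has no loops or if statements. It takes the template text directly rather than a file name, and the `capture` flag is a hypothetical name for the "return the output in a variable" option.

```python
import re
import sys

# Matches placeholders such as [% world %].
PLACEHOLDER = re.compile(r"\[%\s*([A-Za-z_][A-Za-z0-9_]*)\s*%\]")


def render(template_text, data, capture=False):
    """Fill placeholders from data; write to stdout or return the text."""

    def fill(match):
        # A missing key raises KeyError, so template/data mismatches are loud.
        return str(data[match.group(1)])

    output = PLACEHOLDER.sub(fill, template_text)
    if capture:
        return output  # caller can turn this into an HTTP download
    sys.stdout.write(output)
    return None
```

For example, `render("Hello [% world %]!", {"world": "world"}, capture=True)` returns `Hello world!`. A real template module would also handle escaping for the output type, which this sketch does not.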
+### Watching records
+Users may "watch" an identity constellation. If a constellation is being watched, and that constellation is part of a description (merged or
+single) then the watch will apply to the results of human edits, regardless of which part of the description
+was modified. It is possible for someone to wish to track a biogHist, but that biogHist could be completely
+removed in lieu of an improved and updated description. We will not track individual elements in CPF.
+The watcher should have the ability to disable their watch. After each edit, all
+watchers will get a notification. The watch does not apply to any single field, but to the entire description, and therefore also to future descriptions which result from merging.
+When an identity constellation is split, the watch propagates to both resulting records. The user will be informed of the change, and then may choose to disable one of the watches.
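The propagation rules for watches can be sketched with an in-memory registry. The class and method names are hypothetical, and dropping the old constellation ids after a merge or split is an assumption made here to keep the example short.

```python
class WatchRegistry:
    """Tracks which users watch which identity constellations."""

    def __init__(self):
        self._watchers = {}  # constellation id -> set of user ids

    def watch(self, user_id, constellation_id):
        self._watchers.setdefault(constellation_id, set()).add(user_id)

    def unwatch(self, user_id, constellation_id):
        self._watchers.get(constellation_id, set()).discard(user_id)

    def watchers(self, constellation_id):
        return set(self._watchers.get(constellation_id, ()))

    def merge(self, source_ids, merged_id):
        """The merged description inherits every watcher of its sources."""
        combined = set()
        for source_id in source_ids:
            combined |= self._watchers.pop(source_id, set())
        self._watchers[merged_id] = combined

    def split(self, source_id, new_ids):
        """Each record produced by a split inherits all watchers."""
        original = self._watchers.pop(source_id, set())
        for new_id in new_ids:
            self._watchers[new_id] = set(original)

    def notify_edit(self, constellation_id):
        """Return the users to notify after an edit, in a stable order."""
        return sorted(self.watchers(constellation_id))
```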
+### Ability to Open/Close the Site during Maintenance
+If the web application has a "closed for maintenance" feature, this feature would be available to web admins,
+even though it is the Linux sysadmins who will do the maintenance. A common major failure of web applications
+is the assumption that the product is always up.  This creates havoc when the site simply fails to load due to
+an outage, planned or otherwise. With a little work we should be able to have an orderly "site is closed" web
+page and status message for planned outages. We might be able to failover to some kind of system status
+message. This is a low priority feature since downtime is probably only a few hours per year.  At the same
+time, if it isn't too difficult to implement, it sets our project apart from the majority who either ignore
+the problem, or let their help desk folks spend an hour apologizing to customers.
+When the product is closed, web admins should be able to login (assuming login is possible).
+comment: Do we want an architecture where the login is essentially a separate product so that we can have a
+"lobby" and other front end features that continue to work even when the backend is down for maintenance?
+Most sites simply return a server error (such as 500 or 503) or a "not found" page (404) when the site is down for whatever
+reason. We can avoid this a couple of ways. The simplest is to use some Apache server features and a few
+simple scripts so that users see a nice message when the site is down for maintenance. This very simple
+approach requires little or no change to our software architecture. The more elegant approach is to use one of
+several system architectures that keep a small system front end always running.
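The "small front end always running" idea can be sketched as a WSGI wrapper that answers with a friendly 503 page while the site is closed and still lets the login path through. This is an illustration in Python, not the planned Apache or PHP setup. `MaintenanceGate`, the `is_closed` callable, and the `/login` path are hypothetical.

```python
CLOSED_PAGE = b"<html><body><h1>Closed for maintenance</h1></body></html>"


class MaintenanceGate:
    """WSGI middleware that shows a status page while the site is closed."""

    def __init__(self, app, is_closed, open_paths=("/login",)):
        self.app = app
        self.is_closed = is_closed  # callable returning True when closed
        self.open_paths = tuple(open_paths)  # still reachable when closed

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if self.is_closed() and not path.startswith(self.open_paths):
            start_response(
                "503 Service Unavailable",
                [
                    ("Content-Type", "text/html; charset=utf-8"),
                    ("Content-Length", str(len(CLOSED_PAGE))),
                    ("Retry-After", "3600"),
                ],
            )
            return [CLOSED_PAGE]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"SNAC is open"]
```

In practice `is_closed` could check for a flag file that a web admin creates and removes, for example `lambda: os.path.exists(flag_path)`, so opening and closing the site needs no restart.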
--- a/Requirements/User Management.md
+++ b/Requirements/User Management.md
+# User Management
+Authentication is validating user logins to the system. Authorization is the related aspect of controlling
+which parts of the system users may access (or even which parts they may know exist).
+We can use OpenID for authentication, but we will need a user profile for SNAC roles and authorization. There
+are examples of PHP code to implement OpenID at stackexchange:
+http://stackoverflow.com/questions/4459509/how-to-use-open-id-as-login-system
+OpenID seems to be changing constantly, and the sites using it change frequently. Google has (apparently) deprecated
+OpenID 2.0 in favor of OpenID Connect. Facebook is using something else, but apparently FB still works with
+OpenID. Stackexchange supports several authentication schemes. If they can do it, so can we. Or we can support
+one scheme for starters and add others as necessary. The SE code is not open source, so we can't see how much
+work it was to support the various OpenID partners.
+Authorization involves controlling what users can do once they are in the system. That function is partly
+addressed by OAuth or OpenID, which can share the user profile. However, SNAC has specific requirements,
+especially our roles, and those will not be found in other systems. There is nothing we must have from
+those user profiles. We might want their social networking profile, but social networking is not a core function of
+SNAC.
+By default users can't do anything that isn't exposed to the non-authenticated public users. Privileges are
+added and users are given roles (aka groups) from which they inherit privileges. The authorization system is
+involved in every transaction with the server to the extent that every request to the server is checked for
+authorization before being passed to the code doing the real work.
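The request-time check described above can be sketched as follows. This is a minimal illustration in Python, not SNAC code (the project's own code would be PHP); the role names, privilege names, `User` class, and `handle_request` function are all hypothetical, and the real privilege set is still TBD.

```python
from dataclasses import dataclass, field

# Hypothetical role -> privilege mapping; the real SNAC roles and privileges are TBD.
ROLE_PRIVILEGES = {
    "researcher": {"search", "view_record"},
    "maintenance": {"edit_record"},
    "web_admin": {"manage_accounts", "run_reports"},
}

# Privileges exposed to non-authenticated public users (assumed equal to "researcher").
PUBLIC_PRIVILEGES = ROLE_PRIVILEGES["researcher"]


@dataclass
class User:
    user_id: int
    roles: set = field(default_factory=set)


def privileges_for(user):
    """Union of the privileges of every role the user holds, plus the public defaults."""
    privs = set(PUBLIC_PRIVILEGES)
    if user is not None:
        for role in user.roles:
            privs |= ROLE_PRIVILEGES.get(role, set())
    return privs


def handle_request(user, required_privilege, handler):
    """Check authorization first; only then call the code doing the real work."""
    if required_privilege not in privileges_for(user):
        raise PermissionError(f"missing privilege: {required_privilege}")
    return handler()
```

Because the check wraps every handler, the code doing the real work never needs to know about roles; adding a role only changes the mapping.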
+The Linux model of three privilege types "user", "group", and "other" works well for authorization permissions
+and we should use this model. "User" is an authenticated user. "Group" is a set of users, and a user may
+belong to several groups. In SNAC and the non-Linux world "group" is known as "role", so SNAC will call them
+"roles". "Other" corresponds in SNAC to public, non-authenticated users, although we don't really have a
+separate "other": the "researcher" role applies to public users.
+Users can have several roles, and will have all the privileges of all their roles. Role membership is managed
+by an administrative UI (part of the dashboard) and related API code. User information such as name, phone
+number, and even password can also change. User ID values cannot be changed, and a user ID is never reused,
+even after account deletion.
+We expect to create additional roles as necessary for application functions.
+Roles include a large number of "is institution member" roles. These should be roles like any other, but we may
+want to flag these role records to make them easy to manage and easy to display in the UI. Any user can have
+zero or more roles that define their institutional affiliation. This primarily affects reporting and admin. In
+the case of reports, membership in an institution constrains the reporting. When setting up a report, users
+may only choose from institutions of which they are members. Some reports may auto-detect the user's
+membership.
+By and large when we refer to "accounts" we mean web accounts managed by the Manager/Web admin. The general
+public can use the discovery interface without an account, but saving search history and other
+session-related discovery tools require an account. It is technically possible to have a single session
+dashboard. Although that has not been mentioned as a requirement and is probably a low priority, it might be
+almost trivial to implement.
+Every account will be in the "Researcher" role, which has the same privileges as the general public plus a
+TBD set of basic privileges including search history and certain researcher reports.
+| User type                  | Role                | Description                                                           |
+|----------------------------|---------------------|-----------------------------------------------------------------------|
+| Sysadmin                   | Server admin        | Maintain server, backups, etc.                                        |
+| Database Administrator     | DBA                 | Schema maintenance, data dumps, etc.                                  |
+| Software engineer          | Developer           | Coding, testing, QA, release management, data loading, etc.           |
+| Manager                    | Web admin           | Web accounts: create, manage, assign roles, run reports               |
+| Peer vetting               | Vetting             | Approve moderators, reviewers, content experts                        |
+| Moderator                  | Moderator           | Approve maintenance changes, posting those changes                    |
+| Reviewer/editor            | Maintenance         | Maintainer privileges, interacts with moderators                      |
+| Content expert             | Maintenance         | Domain expert, may have zero institutional roles                      |
+| Documentary editor         | Maintenance         | Distinguished by?                                                     |
+| Maintenance                | Maintenance         | Distinguished by?                                                     |
+| Researcher                 | Researcher          | Use the discovery interface and history dashboard                     |
+| Archival description donor | Block upload        | Bulk uploads of CPF or finding aids                                   |
+| Name authority manager     | Name authority      | Donates name authority data perhaps via bulk upload                   |
+| Institutional admins       | Institutional admin | Institutional role admin dashboard, institutional reports             |
+| Public                     | Researcher          | No account, researcher role, no dashboard or single session dashboard |
+Remember: institutional affiliation roles aren't in the table above. There will be many of those roles, and
+users may have zero, one, or several institutional roles that define which institutions that user is a member
+of.
+It is possible for an institutional admin to be a member of more than one institution. Institutional admins
+can:
+- view membership lists of their institution(s)
+- add or remove their institutional role for users.
+Roles which require one or more institutional roles (affiliation):
+- Block upload
+- Name authority
+- Institutional admin
+Roles which may have zero or more institutional roles:
+- Web admin
+- Vetting
+- Moderator
+- Maintenance (likely to have one or more)
+- Researcher
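The two lists above amount to a simple rule that could be checked when roles are assigned. The sketch below is a hedged illustration in Python, not project code; the role names come from the lists, while the function name `missing_affiliation` and the data layout are assumptions.

```python
# Roles that require at least one institutional affiliation (from the list above).
REQUIRES_INSTITUTION = {"Block upload", "Name authority", "Institutional admin"}


def missing_affiliation(roles, institutions):
    """Return, sorted, the roles that need an institutional affiliation the user lacks.

    roles: set of role names held by the user
    institutions: set of institutional roles (affiliations) held by the user
    """
    if institutions:
        return []
    return sorted(role for role in roles if role in REQUIRES_INSTITUTION)
```

An empty result means the role assignment is consistent; a non-empty result could be reported back to the web admin who is editing the account.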
+There are several dashboard sections:
+- Standard researcher history
+- Standard user account management (password, email, etc.)
+- Web admin account creation, deletion, role assignments
+- Vetting admin (if we have vetting)
+- Available reports.
--- a/Requirements/Workflow Engine.md
+++ b/Requirements/Workflow Engine.md
--- a/Specifications/Coding Specifications.md
+++ b/Specifications/Coding Specifications.md
+# Coding Style
+All code generated by the SNAC project will be written in one of the following languages.
+* PHP 7 (preferred)
+* PHP 5
+* Java
+* XSLT
+## Coding Style Specifications
+Source code must match the following style guidelines:
+* 4-space indentation using literal spaces (no tab characters)
+* Maximum line-length of 100 characters
+* Variable and class names use standard camel case, with descriptive names
+    * Class names start with upper-case letters
+    * Variable and field names start with lower-case letters
+    * No underscores allowed in variable names
+## Internal Documentation of Code
+All code will be internally documented using [Javadoc](http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html) style documentation, which has been ported to PHP as [phpdoc](http://www.phpdoc.org/docs/latest/guides/docblocks.html) and XSLT as [XSLTdoc](http://www.pnp-software.com/XSLTdoc/). Tools to generate documentation from the code are also available for [Java](http://www.oracle.com/technetwork/java/javase/documentation/index-jsp-135444.html), [PHP](http://www.phpdoc.org/), and [XSLT](http://www.pnp-software.com/XSLTdoc/).
+* All files, regardless of language, must have Javadoc-style documentation with author attribution, a description of the file, and the short text of the code license, as shown below (in PHP):
+    ```php
+    <?php
+    /**
+     * File Description Headline
+     *
+     * Paragraphs describing the file
+     * 
+     * License:
+     * ....
+     *
+     * @author Robbie Hott
+     * @license http://opensource.org/licenses/BSD-3-Clause BSD 3-Clause
+     * @copyright 2015 the Rector and Visitors of the University of Virginia, and the Regents of the University of California
+     */
+    ?>
+    ```
+* All classes, fields, methods, and function definitions must include documentation, as shown below:
+    ```php
+    <?php
+    /**
+     * Name Reconciliation Engine Main Class
+     *
+     * This class provides the meat of the reconciliation engine. To run the
+     * reconciliation engine, create an instance of this class and call the
+     * reconcile method.
+     *
+     * @author Robbie Hott
+     */
+    class ReconciliationEngine {
+        /**
+         * Main reconciliation function
+         *
+         * This function does the reconciliation and returns the top identity from
+         * the engine.  Other top identities and their corresponding score vectors
+         * may be obtained by other functions within this class.  
+         *
+         * @param identity $identity The identity to be searched. This identity 
+         * must be in the proper form 
+         * @return identity The top identity by the reconciliation
+         * engine
+         */
+        public function reconcile($identity) {
+            return $identity;
+        }
+    }
+    ?>
+    ```
+## Licensing in GitHub/GitLab
+Each code repository must contain the full BSD 3-Clause license below. It must be saved in the document root as a text file titled `LICENSE`.
+```
+Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
+the Regents of the University of California 
+All rights reserved.
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+1. Redistributions of source code must retain the above copyright notice, this
+list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright notice,
+this list of conditions and the following disclaimer in the documentation
+and/or other materials provided with the distribution.
+3. Neither the name of the copyright holder nor the names of its contributors
+may be used to endorse or promote products derived from this software without
+specific prior written permission.
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+```
--- a/Specifications/Identity Reconciliation.md
+++ b/Specifications/Identity Reconciliation.md
--- a/Specifications/LICENSE.md
+++ b/Specifications/LICENSE.md
+Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
+the Regents of the University of California 
+All rights reserved.
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+1. Redistributions of source code must retain the above copyright notice, this
+list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright notice,
+this list of conditions and the following disclaimer in the documentation
+and/or other materials provided with the distribution.
+3. Neither the name of the copyright holder nor the names of its contributors
+may be used to endorse or promote products derived from this software without
+specific prior written permission.
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--- a/Specifications/Name Parser.md
+++ b/Specifications/Name Parser.md
--- a/Specifications/Originals/SNAC Server Architecture.odg
+++ b/Specifications/Originals/SNAC Server Architecture.odg
--- a/Specifications/Originals/SNAC Server Architecture.pdf
+++ b/Specifications/Originals/SNAC Server Architecture.pdf
--- a/Specifications/Originals/SNAC Server Architecture.svg
+++ b/Specifications/Originals/SNAC Server Architecture.svg
--- a/Specifications/README.md
+++ b/Specifications/README.md
+# Formal Specification Documents
+These documents describe the specifications of the system.  They include specific decisions for each component and the system as a whole, in order to meet the requirements listed in the [Requirements](/Requirements) section.
\ No newline at end of file
--- a/Specifications/Server Architecture.md
+++ b/Specifications/Server Architecture.md
+# SNAC Server Architecture
+The system will be architected as a LAMP-style system (with PostgreSQL in place of MySQL), with the following components:
+* Linux: CentOS 7
+* Apache: Apache 2 web server
+* PHP: PHP 7
+* PostgreSQL: Postgres
+Each component of the architecture will run on this platform. Any sub-component must either provide its own HTTP server on an available port, such as Elastic Search, or utilize the main Apache web server running a virtual host.
+The following diagrams describe the architecture of internal components:
+* ![Overall Server Architecture](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/Originals/SNAC%20Server%20Architecture.svg)
--- a/Specifications/Workflow Engine.md
+++ b/Specifications/Workflow Engine.md
--- a/Elastic-Search-Install-Notes.md
+++ b/Elastic-Search-Install-Notes.md
--- a/Elastic-Search-Query-Notes.md
+++ b/Elastic-Search-Query-Notes.md
--- a/Third Party Documentation/README.md
+++ b/Third Party Documentation/README.md
+# Third-Party Documentation
+This directory contains documentation and links to off-the-shelf components used by the SNAC project.
--- a/Todo-and-Notes.md
+++ b/Todo-and-Notes.md
- check into Apache httpd and http/2 as well as supporting Opportunistic encryption:
-    * [ArsTechnica: new firefox version says might as well to encrypting all web traffic](http://arstechnica.com/security/2015/04/new-firefox-version-says-might-as-well-to-encrypting-all-web-traffic/)
--- a/Unsorted/Additions.md
+++ b/Unsorted/Additions.md
+#### Brian's API docs need to be merged in or otherwise referred to:
+[https://gist.github.com/tingletech/4a3fc5f59e5af3054286](https://gist.github.com/tingletech/4a3fc5f59e5af3054286)
+#### Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
+Discuss. What is "as it is configured now"? Consider implementing linked data standard for relationship links
+instead of having to download an entire document of links (as it is configured now.)
+Discuss. This seems to be the controlled vocabulary issue. Sort by common subject headings across all of SNAC - right now SNAC has
+subject headings that have been applied locally without common practice
+across the entire corpus.
+We probably need to build our own holdings authority.
+We need to write code to get accurate holdings info from WorldCat records. All the other repositories will
+have to be handled on a case-by-case basis. Sort by holdings location. Sort by identity's activity location. Sort
+and visualize a person through time (show dates for events in a person or organization's lifetime). Sort and
+visualize an agency or organization as it changes over time.
+Continue to develop and refine context widget.
+Sort collection links. Add weighting to understand which collections have more material directly related to
+identity. (How is this best handled programmatically or as an input by contributors- maybe both?).
+Increase exposure of SNAC to general public by leveraging partnerships.  Suggested agreement with Wikipedia to
+display Wikipedia content in SNAC biographical area and work with Wikipedia to allow for links to SNAC at the
+bottom of all applicable identities. This would serve to escalate and drive traffic to SNAC.
--- a/Unsorted/Database Schema.md
+++ b/Unsorted/Database Schema.md
+#### Expanded Database Schema
+The database schema has been rewritten to capture all the data in CPF files, as well as meet the various data requirements.
+Each field within CPF may (will?) need provenance meta data. Likewise many fields in the database may need
+data for provenance. This has not been done, and the developers need policy on provenance, as well as
+examples. There seems to be little or no mention of provenance in Rachael's UI requirements.
+The new schema has full versions of all records for all time. If not implemented, this is planned. The version
+table records each table name, record id, id of the modifying user, and timestamp. No changes were made to
+existing tables, although existing tables may have gotten a field to distinguish old from current
+records. The implementation may change.
+Every record has a unique id. The watch system is a query run on some schedule (daily, hourly, ?) that checks
+to see if a watched record has changed. CPF record has links to a “watch” table so users can watch each
+record, and can watch for certain types of changes. Need UI for the watch system. Need an API for the watch
+system.
+Need a user table, a group (role) table, and probably a group permission table so that permissions are hard-coded
+with groups. We also want to allow several permissions per group. Need UI for user, group, and
+group-permission management.
+We have created a generalized workflow system (as opposed to an ad-hoc linked set of reports). There is a work
+flow state table which needs to be moved into the database.
+Need fields to deal with delete/embargo. This may be best implemented via a trigger or perhaps a view. By
+making what appear to be simple SELECTs through a view, the view can exclude deleted records. We must think
+about how using a view (or trigger) will affect UPDATE and INSERT. Ideally the view is transparent. Is there
+some clever way we can restrict access to the original table only via the view?
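The delete-via-view idea can be sketched as below. This is an illustrative Python example using the standard-library `sqlite3` module as a stand-in for Postgres; the table name `cpf_raw`, the view name `cpf`, and the `is_deleted` column are all hypothetical and not part of the existing schema.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cpf_raw (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        is_deleted INTEGER NOT NULL DEFAULT 0   -- delete-via-mark flag
    );
    -- Readers query the view, so marked records never appear in a simple SELECT.
    CREATE VIEW cpf AS SELECT id, name FROM cpf_raw WHERE is_deleted = 0;
""")

conn.executemany(
    "INSERT INTO cpf_raw (name) VALUES (?)",
    [("Henry, Joseph",), ("Example, Person",)],
)

# "Delete" record 2 by marking it rather than removing the row.
conn.execute("UPDATE cpf_raw SET is_deleted = 1 WHERE id = 2")

visible = conn.execute("SELECT id, name FROM cpf ORDER BY id").fetchall()
print(visible)
```

On the access question, one option worth checking against the PostgreSQL documentation is to grant application roles privileges on the view only and withhold them on the base table. How INSERT and UPDATE behave through the view differs between database systems, so that part would need testing on Postgres itself.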
+Need record lock on some types of records. This lock needs to be honored by several modules, so like “delete”,
+lock might best be implemented via a view and we \*only\* access the table in question via the view.
+If there are different levels of review for different elements in the record, then we need extra granularity
+in the workflow or the edited record info to know the type of record edited apropos of workflow variations.
+If there are different reviewers for different parts of the record, then workflow data (and workflow
+configuration) needs to be able to notify multiple people, and would have to get multiple reviewer approvals
+before moving to the next phase of the workflow.
+Institutional affiliation is probably common enough to want a field in the user table, as opposed to creating
+a group for each institution. The group is perhaps more generalized and could behave identically (or almost
+identical) to a field (with controlled vocabulary) in the user table.
+Make sure we can write a query (report) to count numbers of records based on type of edit, institution of the
+editor, and number of holdings.
+If we want to be able to quickly count some CPF element such as outgoing links from CPF to a given
+institution, then we should put those CPF values into the SQL database, as meta data for the CPF record.
+What is: How many referral links to EAC records that they created?
+Be able to count record views, record downloads. Institutional dashboard reports need the ability to group-by
+user, or even filter to a specific user.
+Reporting needs to help managers verify performance metrics. This assumes that all changes have a
+date/timestamp. Once workflow and process decisions are set, performance requirements for users such as
+load/performance (how many updates and changes to records can be handled at once), search response time, edit
+time (outside of review workflow), and update times need to be set.
+Effort reporting to allow SNAC and participants to communicate to others the actual level of effort
+involved. This sounds like a report with time span and numbers of records handled in various ways. SNAC might
+use this when going from pilot into production so that everyone knows what effort will be required for X
+number of records/actions (of whatever action type).
+Time/activity reporting could allow us to assess viability, utility, and efficiency of maintenance system
+processes.
+Similar reports might be generated to evaluate the discovery interface.  Something akin to how much time was
+required to access a certain number of records. Rachael said: Assess viability of access functionality:
+performance time, available features, and ease of use.
+We could try to report on the amount of training necessary before a new user was able to work independently in
+each of various areas (content input, review, etc.).
--- a/Unsorted/README.md
+++ b/Unsorted/README.md
+# Unsorted Documents
+This is a temporary-holding facility for documents as they are parsed and placed into the appropriate sections of the documentation.
--- a/SQL-Schema-Tech-Requirements.md
+++ b/SQL-Schema-Tech-Requirements.md
--- a/Unsorted/Workflow.md
+++ b/Unsorted/Workflow.md
+Internal flow:
+1. Validate the inputs.
+1. Somehow slice and dice the CGI params of the REST call into an abstracted request we can pass to the
+internal API. I suppose that the external and internal APIs are very similar, but we almost certainly need
+some level of symbolic reference aka abstraction. Each REST call has its requisite data. Some data is as
+simple as a record id, and some will be fairly interesting json data structures.
+1. The web app API does the tasks specified by the REST request and the work flow engine's directions.
+   1. Every http request must go through the work flow engine so that the work flow is validated and managed.
+   1. Every web app has a work flow, but people mostly just cobble that together with a bunch of implied
+      functionality using conditionals and side-effect-full function calls. In our code, the internal API is
+      100% work flow agnostic.
+   1. I can explain this in more detail, but it makes a huge improvement in the structure of the application.
+1. Create the output data object if it wasn't created by the functions doing the work.
+1. Pass the output data to a rendering function (or module) to be rendered into the appropriate output format:
+html, text, xml, etc. and sent to stdout, or returned as an http file download. JSON probably doesn't need to
+be rendered since JSON is "data" and not "presentation".
+The work flow engine relies on functions that read application data and return booleans so that the
+work flow engine can detect the application's relevant state. I guess that sounds confusing because the work
+flow engine has state, and the application has state. Those two types of state are vastly different and only
+related to each other in that the work flow engine can detect the application's state. The internal API of the
+web app has no idea that the work flow engine even exists. And the work flow engine knows what work needs to
+be done, but has no idea how it will be done. This is a very lovely separation of concerns.
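The separation described above might look roughly like the following. This is a hedged Python sketch, not the existing workflow engine (which the project would write in PHP); the predicate functions, the table layout, and the task names are all invented for illustration.

```python
# Boolean functions that read application data so the engine can detect application state.
def is_logged_in(app):
    return app.get("user") is not None


def has_draft(app):
    return bool(app.get("draft"))


# Work flow table: rows are data. Each row is (predicate, expected value, task name).
# The first row whose predicate returns the expected boolean selects the task.
WORKFLOW = [
    (is_logged_in, False, "show_login"),
    (has_draft, True, "show_editor"),
    (has_draft, False, "show_dashboard"),
]

# Internal API: does the work and knows nothing about the work flow engine.
TASKS = {
    "show_login": lambda app: "login page",
    "show_editor": lambda app: f"editing {app['draft']}",
    "show_dashboard": lambda app: "dashboard",
}


def next_task(app):
    """The engine knows what work needs doing, but not how it will be done."""
    for predicate, expected, task in WORKFLOW:
        if predicate(app) is expected:
            return task
    raise LookupError("no work flow row matched the application state")


def handle(app):
    """Every request goes through the engine, which picks the task to run."""
    return TASKS[next_task(app)](app)
```

Changing the order of operations means editing rows in `WORKFLOW`, not the task functions, which is the point of keeping the internal API work flow agnostic.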
--- a/tat_requirements/requirements.md
+++ b/tat_requirements/requirements.md
--- a/images/atom-screenshot.png
+++ b/images/atom-screenshot.png
--- a/plan.md
+++ b/plan.md
+# Overall Plan
+### Table of Contents
+* [Big questions](#big-questions)
+* [Documents we need to create](#documents-we-need-to-create)
+* [Governance and Policies, etc.](#governance-and-policies-etc)
+* [Overview and order of work](#overview-and-order-of-work)
+* [Non-component notes to be worked into requirements](#non-component-notes-to-be-worked-into-requirements)
+* [System Design](#system-design)
+  * [Developed Components](#developed-components)
+  * [Off-the-shelf Components](#off-the-shelf-components)
+  * [Controlled vocabularies and tag system](#controlled-vocabularies-and-tag-system)
+### Big questions
+- (solved) how is gitlab backed up?
+  - Shayne backs up the whole gitlab VM.
+- We need a complete understanding of our requirements for controlled vocabulary, ontology, and/or tagging as
+  well as how this relates to search facets. This also impacts our future ability to make assertions about the
+  data, and is somewhat related to semantic net. See [Tag system](#controlled-vocabularies-and-tag-system).
+### Documents we need to create
+- Operations and Procedure Manual
+- Formal Requirements Document
+- Formal Specification Document
+- Research Agenda
+- User Story Backlog
+- Design Documents (UI/UX/Graphic Design)
+  - ideally someone writes a (possibly brief) style guide
+  - a set of .psd or other images is not a style guide
+### Governance and Policies, etc.
+- Data curation, preservation, graceful retirement
+- Data expulsion vs. embargo
+- Duplicates, backups, restore, related policy and technical issues
+- Broad pieces that are missing or underdeveloped [Laura]
+- Refresh relationship with OCLC [John, Daniel]
+### Overview and order of work
+1. List requirements of the overall application. (done)
+2. Organize requirements by component, clean up, flesh out, and vet. (requires team meetings)
+3. Create formal specifications for each component based on the official, clean, requirements document. Prototypes will help ensure the formal spec is written correctly.
+4. Define a timeline for development and prototyping based on the formal specifications document.
+5. Create tests for test-driven development based on the formal specification.  This includes creating and mining ground-truth data.
+6. Develop software based on formal specification that passes the given tests.
+### Non-component notes to be worked into requirements
+- CPF record edit, edit each field
+- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
+- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
+### System Design
+#### Developed Components
+- Data validation engine
+  - **API:** Custom JSON (needs formal spec)
+  - The data validation engine applies a written system of rules to the incoming data. The rules must be written in a human-readable form, such that non-technical individuals are able to write and understand the rules. A rule-writing guide must be supplied to give hints and help for writing rules properly. The engine will be pluggable and written as an MVC application. The model will read the user-written rules (stored in a Postgres database or flat-file system, depending on the model) and apply them to any input given on the view. The initial view will be a JSON API, which accepts individual fields and returns the input with either validation suggestions or valid flags.
+  - rule based system abstracted out of the code
+  - rules are data
+  - change the rules, not the actual application code
+  - rules for broad classes of data type, or granular rules for individual fields
+  - probably use this to untaint data as well (remove things that are potential security problems)
+  - send all data through this API
+  - every rule includes a message describing what went wrong and suggesting fixes
+  - rules potentially editable by non-programmers
+  - rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the
+    policy documentation.
+- Identity Reconciliation (aka IR) (architect Robbie)
+  - **API:** Custom JSON (needs formal spec)
+  - needs docs wrangled
+- workflow manager (architect Tom)
+  - **API:** Custom JSON? (needs formal spec)
+  - exists, needs tests, needs requirements
+    * **We need to stop now, write requirements, then apply those requirements to the existing system to ensure we meet the requirements**
+  - needs to be integrated into an index.php script that also checks authentication
+  - can the workflow also support the login.php authentication? (Yes).
+- PostgreSQL Storage: schema definition (Robbie, Tom)
+  - **API:** SQL
+  - exists, needs tests, needs requirements
+    * **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
+  - should we re-architect so that tables become normal tables, the views go away, and versioned records are moved to shadow tables?
+  - add features for delete-via-mark (as opposed to actual delete)
+  - add features to support embargo
+  - *maybe, discuss* change vocabulary.sql from insert to copy. It would be smaller and faster, although in reality as soon as
+    it is in the database, the text file will never be touched again.
+- discuss; Can/should we create a tag system to deal with ad-hoc requirements later in the project? [Tag system](#controlled-vocabularies-and-tag-system)
+- CPF to SQL parser (Robbie)
+  - **API:** EAC-CPF XML input, JSON output? (needs formal spec)
+  - exists, needs tests, needs requirements
+    * **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
+- NameEntity serialization tool, selectable pre-configured formats
+- NameEntity string parser
+  - **API:** subroutine? JSON?
+    - Can we find a grammar-based parser for PHP? Should we use a standalone parser?
+    - Can we expose this as a JSON API such that it's given a name-string and returns an identity object of that identity's information?  Possibly a score as well as to how well we thought we could parse it?
+- Name parser (only a portion of the NameEntity string)
+  - **API:** subroutine? JSON?
+  - Can this use the same parser engine as the name string parser?
+- Date parser
+  - **API:** subroutine? JSON?
+  - Can this use the same parser engine as the name string parser?
+  - **This should be distinct, or may be a subroutine of the nameEntry-string parser**
+  - Can we expose this as a JSON API such that given a set of dates, it returns a date object that lists out the individual dates?  Then, it could be called from the name-string parser to parse out dates.
+- Editing User interface
+  - **API:** HTML front-end, makes calls to internal JSON API
+  - Must have ajax-backed interaction for displaying and searching large lists, such as cpfRelations for an identity.
+    - We need a UI edit/chooser widget for searching and selecting among large numbers of options, such as the Joseph
+      Henry cpfRelations that contain some 22K entries. We also need to list all fields which might have large numbers
+      of values. In fact, part of the metadata for every field is "number of possible entries/repeat values" or
+      whatever that's called. From a software architecture perspective, the answer is 0, 1, or infinite.
+- History Research Tool (redefined)
+  - **API:** HTML front-end, makes calls to internal JSON API
+  - Needs to be reworked to support the Postgres backend
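The list above asks whether the date parser could be exposed as a JSON API that the name-string parser calls. A minimal sketch of that shape follows, in Python only for brevity (the project itself plans on PHP). The function name `parse_dates`, the response keys, and the name-string format are all assumptions made for illustration. It only extracts four-digit years and is not a real date grammar.

```python
import json
import re


def parse_dates(text):
    """Pull four-digit years out of a string and return them as JSON.

    Hypothetical response shape: {"input": <original string>, "dates": [<year>, ...]}.
    """
    years = [int(y) for y in re.findall(r"\b(\d{4})\b", text)]
    return json.dumps({"input": text, "dates": years})


# A name-string parser could hand its trailing date portion (or the whole
# string, as here) to the date parser and embed the decoded result.
response = parse_dates("Henry, Joseph, 1797-1878")
date_object = json.loads(response)
```

Keeping the date parser behind its own call like this leaves it distinct from the name-string parser while still letting that parser reuse it.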
+#### Off-the-shelf Components
+- gitlab for developer documentation, code version management, internal issue tracking and milestone keeping
+- github public code repository
+- test framework, need to choose one
+- authentication
+  - session management, especially as applies to authentication tokens, cookies and something which prevents
+    XSS (cross-site scripting) attacks
+- JavaScript UI component tools, jQuery; what others?
+  - Suggestions: Bootstrap, AngularJS, jQuery UI
+- reports, probably Jasper
+- PHP, Postgres, Linux, Apache httpd, etc.
+- language modules ?
+#### Controlled vocabularies and tag system
+Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
+become a flat (non-hierarchical) controlled vocabulary.
+The difference is that tags are moderated more weakly and new tags (types) are created more readily. The tag table
+would consist of a tag term and an ID value. Tag systems lack data types, and generally have no policy, or less
+restrictive policies, about creating new tags.
+Below are some subject examples. It is unclear whether each line is a single topic, or whether "--" is used to join granular
+topics into a topic list. Likewise it is unclear whether this list relies on some explicit hierarchy.
+```
+American literature--19th century--Periodicals
+American literature--20th century--Periodicals
+Periodicals
+Periodicals--19th century
+World politics--Periodicals
+World politics--Pictorial works
+World politics--Societies, etc.
+World politics--Study and teaching
+```
+**RH: I agree, this is super tricky.  We need to distinguish between types of controlled vocab, so that we don't mix Occupation and Subject.  A tagging system might be very nice for at least subjects.**
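As a rough illustration of the tag table described above (a term plus a persistent ID), here is a minimal Python sketch. It assumes one reading of the subject examples, namely that "--" joins granular terms, and it keeps one vocabulary per type so that Subject and Occupation terms never mix. The names `Vocabulary` and `split_heading` are hypothetical, and a real implementation would live in the Postgres schema rather than in memory.

```python
class Vocabulary:
    """A flat controlled vocabulary: each term maps to one persistent ID."""

    def __init__(self, vocab_type):
        self.vocab_type = vocab_type  # hypothetical type label, e.g. "subject"
        self._ids = {}                # term -> ID

    def add(self, term):
        """Return the ID for a term, assigning the next free ID if it is new."""
        term = term.strip()
        if term not in self._ids:
            self._ids[term] = len(self._ids) + 1
        return self._ids[term]


def split_heading(heading):
    """Split a heading on "--" into granular terms (one possible reading)."""
    return [part.strip() for part in heading.split("--") if part.strip()]


# Separate vocabularies keep Subject and Occupation terms from mixing.
subjects = Vocabulary("subject")
occupations = Vocabulary("occupation")

headings = [
    "American literature--19th century--Periodicals",
    "Periodicals",
    "World politics--Societies, etc.",
]
tagged = {h: [subjects.add(t) for t in split_heading(h)] for h in headings}
```

Under this reading, "Periodicals" gets the same ID whether it appears alone or as part of a longer heading. If each full line turns out to be a single topic instead, `split_heading` would simply be skipped.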
--- a/tat_requirements/images/image02.png
+++ b/tat_requirements/images/image02.png
--- a/tat_requirements/images/image03.svg
+++ b/tat_requirements/images/image03.svg
--- a/tat_requirements/images/snac-web-app.png
+++ b/tat_requirements/images/snac-web-app.png
--- a/tat_requirements/images/snac-web-app.svg
+++ b/tat_requirements/images/snac-web-app.svg
--- a/tat_requirements/list_of_all_reports.md
+++ b/tat_requirements/list_of_all_reports.md
-#### All reports
- what records have I edited
- how many records has my institution edited
- how many records has my institution contributed
- list of number of records contributed by institution
--- a/tat_requirements/outline.md
+++ b/tat_requirements/outline.md
-plan.md
--------
-plan.md Big questions
-plan.md Overview and order of work
-plan.md Code we write
-plan.md Controlled vocabularies and tag system 
-plan.md Code we use off the shelf
-co-op_background.md
-----
-Authors
-Organization of documenatation
-Introduction to SNAC
-Evaluation of Existing Technical Architecture
-Overview
-Current State of the System
-Processing Pipeline
-Extraction
-Match/Merge
-Discovery/Dissemination
-Prototype research tool
-Gap analysis
-Data maintenance
-Pilot phase architecture
-Current State Conclusion
-introduction.md
--------
-TAT Functional Requirements
-Introduction to Planned Functionality
-Software development, processes, and project management
-QA and Related Tests for Test-driven Development
-Documentation
-Required new features
-Web application overview
-Web application output via template
-Data background
-What is "normal form" and what informs the database schema design?
-Edit architecture requirements
-Expanded CPF schema requirements
-Expanded Database Schema
-Merge and watch
-Brian’s API docs need to be merged in or otherwise referred to:
-Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
-requirements.md
----
-List of requirements
-Requirements from Rachael's spreadsheet
-List of Application Programmer Interfaces (APIs)
-Work flow engine
-Maintenance Functionality
-Functionality for Discovery
-User interface for Discovery
-Functionality for Splitting
-User interface for Splitting
-Functionality for Merging
-User interface for Merging
-Functionality for Editing
-User interface for Editing
-Admin Client for Maintenance System
-User Management
-Web Application Administration
-Reports
-System Administration
-Community Contributions
-Ability to Open/Close the Site during Maintenance
-Sandbox for Training, perhaps as a clone of the QA system?
-ArchiveSpace Feature Planning via Brad
-Staffing Model (Brian's draft suggestions)
--- a/tat_requirements/plan.md
+++ b/tat_requirements/plan.md
-#### Big questions
- (solved) how is gitlab backed up?
-  - Shayne backs up the whole gitlab VM.
- We need a complete understanding of our requirements for controlled vocabulary, ontology, and/or tagging as
-  well as how this relates to search facets. This also impacts our future ability to make assertions about the
-  data, and is somewhat related to semantic net. See [Tag system](#controlled-vocabularies-and-tag-system).
-#### Documents we need to create
- Operations and Procedure Manual
- Research Agenda
- User Story Backlog
- Design Documents (UI/UX/Graphic Design)
-      - ideally someone writes a (possibly brief) style guide
-      - a set of .psd or other images is not a style guide
-#### Governance and Policies, etc.
- Data curation, preservation, graceful retirement
- Data expulsion vs. embargo
- Duplicates, backups, restore, related policy and technical issues
- Broad pieces that are missing or underdeveloped [Laura]
- Refresh relationship with OCLC [John, Daniel]
-#### Overview and order of work
-1. create tech documents, filling in as much prose as possible
-   - currenly on-going
-1. create prototype software to test tech requirements, iterate updating requirements and prototype
-   - Work flow engine is working and has both a command-line and web interface
-   - We have a SQL database schema
-1. create tests for test driven development, and validate prototype
-1. refactor or rewrite prototype to match requirements
-1. create version 1 of software
-#### Code we write
- Data validation API
-  - rule based system abstracted out of the code
-  - rules are data
-  - change the rules, not the actual application code 
-  - rules for broad classes of data type, or granular rules for individual fields
-  - probably used this to untaint data as well (remove things that are potential security problems)
-  - send all data through this API
-  - every rule includes a message describing what when wrong and suggesting fixes
-  - rules potentially editable by non-programmers
-  - rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the
-    policy documentation.
- Identitiy Reconciliation (aka IR) (architect Robbie)
- - needs docs wrangled
- workflow manager (architect Tom)
-  - exists, needs tests, needs requirements
-  - needs to be integrated into an index.php script that also checks authentication
-  - can the workflow also support the login.php authentication? (Yes).
- SQL schema (Robbie, Tom)
-  - exists, needs tests, needs requirements
-  - should we re-architect tables become normal tables, the views go away, and versioned records are moved to shadow tables.
-  - add features for delete-via-mark (as opposed to actual delete)
-  - add features to support embargo
-  - *maybe, discuss* change vocabulary.sql from insert to copy. It would be smaller and faster, although in reality as soon as
-    it is in the database, the text file will never be touched again.
- discuss; Can/should we create a tag system to deal with ad-hoc requirements later in the project? [Tag system](#controlled-vocabularies-and-tag-system)
- CPF to SQL parser (Robbie)
-  - exists, needs tests, needs requirements
- Name serialization tool, selectable pre-configured formats
- Name string parser
-    - Can we find a grammar-based parser for PHP? Should we use a standalone parser?
- Date parser
-  - Can this use the same parser engine as the name string parser?
- CPF record edit, edit each field
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
- coding style, class template (architect Robbie)
- We need to have UI edit/chooser widget for search and select of large numbers of options, such as the Joseph
-  Henry cpfRelations that contain some 22K entries. Also need to list all fields which might have large numbers
-  of values. In fact, part of the meta data for every field is "number of possible entries/reapeat values" or
-  whatever that's called. From a software architecture perspective, the answer is 0, 1, infinite.
-#### Controlled vocabularies and tag system 
-Tags are simply terms. When inplemented as fixed terms with persistent IDs and some modify/add policy, tags
-become a flat (non-hierarchal) controlled vocabulary.
-The difference being a weaker moderation of tags and more readiness to create new tags (types). The tag table
-would consist of tag term and an ID value. Tag systems lack data type, and generally have no policy or less
-restrictive policies about creating new tags
-Below are some subject examples. It is unclear if these are each topics, or if "--" is used to join granular
-topics into a topic list. Likewise it is unclear if this list relies on some explicit hierarchy. 
-```
-American literature--19th century--Periodicals
-American literature--20th century--Periodicals
-Periodicals
-Periodicals--19th century
-World politics--Periodicals
-World politics--Pictorial works
-World politics--Societies, etc.
-World politics--Study and teaching
-```
-#### Code we use off the shelf
- gitlab for docs, code version management, issue tracking(?)
- github public code repository?
- test framework, need to choose one
- authentication
-  - session management, especially as applies to authentication tokens, cookies and something which prevents
-    XSS (cross-site scripting) attacks
- JavaScript UI component tools, JQuery; what others?
- reports, probably Jasper
- PHP, Postgres, Linux, Apache httpd, etc.
- language modules
--- a/tat_requirements/readme.md
+++ b/tat_requirements/readme.md
-These documents are organized in the following order:
-[plan.md](plan.md) Big overview.
-[outline.md](outline.md) An outline of sections in the documents
-[co-op_background.md](co-op_background.md) Broad expectations for the co-op software.
-[introduction.md](introduction.md) Requirements part one
-[requirements.md](requirements.md) Requirements part two, includes tech requirements from Rachael's spreadsheets
--- a/tat_requirements/snac-web-app-wrapper.md
+++ b/tat_requirements/snac-web-app-wrapper.md
-![SNAC web app API data flow](images/image02.png)
--- a/tat_requirements/snac-web-app.odg
+++ b/tat_requirements/snac-web-app.odg
--- a/tat_requirements/snac-web-app.pdf
+++ b/tat_requirements/snac-web-app.pdf
--- a/tat_requirements/snac-web-app.svg
+++ b/tat_requirements/snac-web-app.svg