Commit e1991959 by Robbie Hott

Refactored and organized Plan and Overview

parent 51dc11bb
@@ -7,7 +7,7 @@ For our purposes the simplest controlled vocabulary is a flat (non-hierarchical) l
An ontology is a hierarchy of controlled vocabulary terms that explicitly encodes both relatedness and category.
Using the example of subject (aka "topical subject"), either technology allows us to make assertions about the
data and relations between identities.
- Both technologies can be used simultaneously to describe identities; however, doing double data entry would
be irksome.
@@ -47,7 +47,7 @@ data and relations between identities.
It might also be sensible to design the terms with multilingual vocabulary terms. By multilingual I mean:
multiple terms for each unique ID where each term is specific to a specific language, and all terms with the
same ID share a definition.
Each term has a unique id, and a definition (implied or explicit). This is a simple dictionary. Explicit
definition would improve the vocabulary, but takes more work.
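The multilingual shape described above can be sketched as data: several language-specific labels share one persistent ID, and every label with that ID shares a single definition. This is an illustrative Python sketch, not the planned schema (the production code is planned in PHP); the field names and sample entries are assumptions.

```python
# Illustrative sketch: a multilingual controlled vocabulary where
# language-specific terms share one persistent ID and one definition.
from dataclasses import dataclass

@dataclass(frozen=True)
class VocabTerm:
    term_id: int   # persistent unique ID, shared across languages
    language: str  # e.g. "en", "fr"
    label: str     # language-specific term

# One definition per ID, shared by every label carrying that ID.
definitions = {101: "Printed publications issued at regular intervals."}

terms = [
    VocabTerm(101, "en", "Periodicals"),
    VocabTerm(101, "fr", "Périodiques"),
]

def labels_for(term_id: int) -> list[str]:
    """All language variants of a single vocabulary term."""
    return [t.label for t in terms if t.term_id == term_id]

print(labels_for(101))
```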
@@ -147,3 +147,33 @@ performed based on ID number, not text string.
Interestingly, we might be able to apply Markov matrices to identities marked up via ontology, with the same
sort of relatedness building that occurs with a flat vocabulary list.
# Alternative Strategies
## Controlled vocabularies and tag system
Can/should we create a tag system to deal with ad-hoc requirements later in the project?
Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
become a flat (non-hierarchical) controlled vocabulary.
The difference is weaker moderation of tags and more readiness to create new tags (types). The tag table
would consist of a tag term and an ID value. Tag systems lack data types, and generally have no policy, or less
restrictive policies, about creating new tags.
Below are some subject examples. It is unclear if these are each topics, or if "--" is used to join granular
topics into a topic list. Likewise it is unclear if this list relies on some explicit hierarchy.
```
American literature--19th century--Periodicals
American literature--20th century--Periodicals
Periodicals
Periodicals--19th century
World politics--Periodicals
World politics--Pictorial works
World politics--Societies, etc.
World politics--Study and teaching
```
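If "--" is indeed a join of granular topics (which, as noted, is not certain), each heading can be decomposed into an ordered list of facets. A minimal sketch, assuming that reading:

```python
# Illustrative sketch: split an LCSH-style heading on "--" into its
# ordered facets. Assumes "--" joins granular topics, which the text
# above flags as an open question.
def split_heading(heading: str) -> list[str]:
    return [part.strip() for part in heading.split("--")]

print(split_heading("American literature--19th century--Periodicals"))
# ['American literature', '19th century', 'Periodicals']
```

A heading with no "--" (e.g. `Periodicals`) simply comes back as a single-facet list, so the same code handles both cases.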
# Overall Plan
### Table of Contents
* [Documents to Create](#documents-to-create)
* [Governance and Policy Discussion](#governance-and-policy-discussion)
* [Overview of Approach](#overview-of-approach-iterative)
* [CPF Requirements](#cpf-requirements)
* [System Design](#system-design)
* [Developed Components](#developed-components)
* [Off-the-shelf Components](#off-the-shelf-components)
### Documents to Create
- Operations and Procedure Manual
- Formal Requirements Documents
- Formal Specification Documents
- Research Agenda
- User Story Backstory
- Design Documents (UI/UX/Graphic Design)
- style guides
- layouts
### Governance and Policy Discussion
- Data curation, preservation, graceful retirement
- Data expulsion vs. embargo
- Duplicates, backups, restore, related policy and technical issues
- Broad pieces that are missing or underdeveloped [Laura]
- Refresh relationship with OCLC [John, Daniel]
### Overview of Approach (Iterative)
1. List requirements of the overall application.
2. Organize requirements by component, clean up, flesh out, and vet.
3. Create formal specifications for each component based on the official, clean, requirements document. Prototypes will help ensure the formal spec is written correctly.
4. Define a timeline for development and prototyping based on the formal specifications document.
5. Create tests for test-driven development based on the formal specification. This includes creating and mining ground-truth data.
6. Develop software based on formal specification that passes the given tests.
### CPF Requirements
- CPF record edit, edit each field
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
### System Design
#### Developed Components
- [Data validation engine](/Specifications/Data Validation.md)
- **API:** Internal programming API (PHP)
- Hooks should be made in the [Server API](/Specifications/Server API) and REST API
- [Identity Reconciliation](/Specifications/Identity Reconciliation.md)
- **API:** Internal programming API (PHP)
- Hooks should be made in the [Server API](/Specifications/Server API) and REST API
- [Server Workflow Engine](/Specifications/Workflow Engine.md)
- **API:** Understands [Server API](/Specifications/Server API) and calls internal programming APIs of other components
- PERL prototype exists
- Needs formal description and specification documentation
- [PostgreSQL Storage](/Specifications/Schema SQL.md)
- **API:** SQL, with custom internal programming API (PHP)
- Supports versioning and mark-on-delete
- Needs features to support embargo
- CPF Parser
- **API:** EAC-CPF XML input, Constellation output
- Output may be serialized into SQL or JSON [Constellation](/Specifications/Server API/Constellation.md)
- [List of captured tags](/Specifications/Captured EAC-CPF Tags.md)
- CPF Serializer
- **API:** Constellation input, EAC-CPF XML output
- NameEntity Serializer
- selectable pre-configured formats
- NameEntity Parser
- **API:** Internal programming API (PHP)
- Discussion
- Can we find a grammar-based parser for PHP? Should we use a standalone parser?
- Can we expose this as a JSON API such that it's given a name-string and returns an identity object of that identity's information? Possibly a score as well as to how well we thought we could parse it?
- Name parser (only a portion of the NameEntity string)
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- Date parser
- **API:** Internal programming API (PHP)
- Discussion
- Can we expose this as a JSON API such that given a set of dates, it returns a date object that lists out the individual dates? Then, it could be called from the name-string parser to parse out dates.
- [Editing User Interface](/Specifications/Editing User Interface.md)
- **API:** HTML front-end, makes calls to [Server API](/Specifications/Server API)
- Must have ajax-backed interaction for displaying and searching large lists, such as relations for a constellation.
- We need to have UI edit/chooser widget for search and select of large numbers of options, such as the Joseph Henry cpfRelations that contain some 22K entries. Also need to list all fields which might have large numbers of values.
- Need a way of entering metadata about each piece of information
- History Research Tool
- **API:** HTML front-end, makes calls to [Server API](/Specifications/Server API)
- May be reworked to support the Postgres backend (later)
- Retain functionality of current HRT
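The date-parser component above floats a JSON API: given a date string, return an object listing the individual dates it contains. The following is a hypothetical Python sketch of that idea, not a committed spec; the function name, response shape, and crude year-extraction logic are all assumptions (the production components are planned in PHP).

```python
# Hypothetical sketch of the date-parser JSON API discussed above:
# accept a date string, return JSON listing the individual dates found.
import json
import re

def parse_dates(text: str) -> str:
    # Very rough: pull out 4-digit years from forms like "1876-1934".
    years = re.findall(r"\b\d{4}\b", text)
    result = {"input": text, "dates": years, "count": len(years)}
    return json.dumps(result)

print(parse_dates("1876-1934"))
# {"input": "1876-1934", "dates": ["1876", "1934"], "count": 2}
```

A caller such as the name-string parser could invoke this on any date substring it finds, as the discussion suggests.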
#### Off-the-shelf Components
The following OTS components will be evaluated and used in development
- GitLab
- developer documentation,
- code version management,
- internal issue tracking, and
- milestone keeping
- GitHub
- public code repository
- PHPUnit unit test framework
- Authentication
- Consider OAuth2 (Google, GitLab, etc) for session management, especially as applies to authentication tokens, cookies and something which prevents XSS (cross-site scripting) attacks
- JavaScript Libraries
- JQuery
- UI Components, such as Bootstrap
- Graphing libraries, such as D3JS
- Reporting Tools
- Investigate Jasper
- LAMP Stack (Linux, Apache 2, Postgres, PHP)
# SNAC Documentation
This repository contains all the documentation for the SNAC Web Application and related frameworks, engines, and pieces. Specifically:
* The currently-being-revised Technical Requirements are found in the [Requirements Directory](Requirements).
* Formal Specifications for those requirements are in the [Specifications Directory](Specifications).
* [Help](Help) on using Gitlab, Git, and the Markdown text format
* [Documentation](Third Party Documentation) on third-party software and applications being used
* [Historical Documentation](Historical Documentation) on previous iterations of SNAC
* Technical [Discussions](Discussion) related to the SNAC project
* [Notes](Notes) from the technical team.
* Database schema diagrams auto-generated by SchemaSpy http://shannonvm.village.virginia.edu/~twl8n/schema_spy_output/
## Table of Contents
The best place to start is the big, overall [plan](plan.md) document, which describes the process forward with defining requirements and specifications.
* [Technology Infrastructure Architecture Overview](Overview.md)
* [Overall Tech Plan Document](Plan.md) (draft)
* Requirements and Specifications
* [Requirements](Requirements): The currently-being-revised Technical Requirements
* [Specifications](Specifications): Formal Specifications for the Technical Requirements
* Discussion and Documentation
* [Database Schema Diagrams](http://shannonvm.village.virginia.edu/~twl8n/schema_spy_output/)
* [Third Party Documentation](Third Party Documentation): Documentation on third-party software and applications being used
* [Historical Documentation](Historical Documentation): Documentation on previous iterations of SNAC
* [Technical Discussions](Discussion): Technical discussions related to the SNAC project
* [Notes](Notes): Notes from the technical team.
* [Help](Help): Help on using Gitlab, Git, and the Markdown text format
## Notes on this Repository
This repository is stored in Gitlab, a work-alike clone of the Github web site, but installed locally on a SNAC server. Gitlab is a
version control system with a suite of project management tools.
@@ -47,4 +53,4 @@ This work is published from:
<span property="vcard:Country" datatype="dct:ISO3166"
content="US" about="[_:publisher]">
United States</span>.
</p>
@@ -89,7 +89,7 @@ Each code repository must contain the full BSD 3-Clause license below. It must
```
Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
the Regents of the University of California.
All rights reserved.
Redistribution and use in source and binary forms, with or without
......
# Data Validation Engine
## Original (Draft) Discussion
- The data validation engine applies a written system of rules to the incoming data. The rules must be written in a human-readable form, such that non-technical individuals are able to write and understand the rules. A rule-writing guide must be supplied to give hints and help for writing rules properly. The engine will be pluggable and written as an MVC application. The model will read the user-written rules (stored in a postgres database or flat-file system, depending on the model) and apply them to any input given on the view. The initial view will be a JSON API, which accepts individual fields and returns the input with either validation suggestions or valid flags.
- rule based system abstracted out of the code
- rules are data
- change the rules, not the actual application code
- rules for broad classes of data type, or granular rules for individual fields
- probably used this to untaint data as well (remove things that are potential security problems)
- send all data through this API
- every rule includes a message describing what went wrong and suggesting fixes
- rules potentially editable by non-programmers
- rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the policy documentation.
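The "rules are data" idea above can be sketched compactly: each rule is a record carrying a test and a human-readable message, applied to incoming fields outside the application code, so changing policy means editing rules rather than code. This Python sketch is illustrative only; the field names, rules, and messages are hypothetical, and the real engine is planned as a PHP MVC application.

```python
# Illustrative sketch of "rules are data": rules live in a data structure,
# not in application code. Field names and rules here are hypothetical.
rules = [
    {"field": "ark",
     "test": lambda v: v.startswith("ark:/"),
     "message": "ARK identifiers must begin with 'ark:/'; check the prefix."},
    {"field": "entityType",
     "test": lambda v: v in {"person", "corporateBody", "family"},
     "message": "entityType must be person, corporateBody, or family."},
]

def validate(record: dict) -> list[str]:
    """Return a message for every rule the record fails; empty means valid."""
    return [r["message"] for r in rules
            if r["field"] in record and not r["test"](record[r["field"]])]

print(validate({"ark": "http://example.org", "entityType": "person"}))
# ["ARK identifiers must begin with 'ark:/'; check the prefix."]
```

Because each failed test returns its own message, the engine naturally produces the suggestion-style responses the JSON API described above would emit.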
Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
the Regents of the University of California
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Overall Plan
### Table of Contents
* [Big questions](#big-questions)
* [Documents we need to create](#documents-we-need-to-create)
* [Governance and Policies, etc.](#governance-and-policies-etc)
* [Overview and order of work](#overview-and-order-of-work)
* [Non-component notes to be worked into requirements](#non-component-notes-to-be-worked-into-requirements)
* [System Design](#system-design)
* [Developed Components](#developed-components)
* [Off-the-shelf Components](#off-the-shelf-components)
* [Controlled vocabularies and tag system](#controlled-vocabularies-and-tag-system)
### Big questions
- (solved) how is gitlab backed up?
- Shayne backs up the whole gitlab VM.
- We need a complete understanding of our requirements for controlled vocabulary, ontology, and/or tagging as
well as how this relates to search facets. This also impacts our future ability to make assertions about the
data, and is somewhat related to semantic net. See [Tag system](#controlled-vocabularies-and-tag-system).
### Documents we need to create
- Operations and Procedure Manual
- Formal Requirements Document
- Formal Specification Document
- Research Agenda
- User Story Backlog
- Design Documents (UI/UX/Graphic Design)
- ideally someone writes a (possibly brief) style guide
- a set of .psd or other images is not a style guide
### Governance and Policies, etc.
- Data curation, preservation, graceful retirement
- Data expulsion vs. embargo
- Duplicates, backups, restore, related policy and technical issues
- Broad pieces that are missing or underdeveloped [Laura]
- Refresh relationship with OCLC [John, Daniel]
### Overview and order of work
1. List requirements of the overall application. (done)
2. Organize requirements by component, clean up, flesh out, and vet. (requires team meetings)
3. Create formal specifications for each component based on the official, clean, requirements document. Prototypes will help ensure the formal spec is written correctly.
4. Define a timeline for development and prototyping based on the formal specifications document.
5. Create tests for test-driven development based on the formal specification. This includes creating and mining ground-truth data.
6. Develop software based on formal specification that passes the given tests.
### Non-component notes to be worked into requirements
- CPF record edit, edit each field
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
### System Design
#### Developed Components
- Data validation engine
- **API:** Custom JSON (needs formal spec)
- The data validation engine applies a written system of rules to the incoming data. The rules must be written in a human-readable form, such that non-technical individuals are able to write and understand the rules. A rule-writing guide must be supplied to give hints and help for writing rules properly. The engine will be pluggable and written as an MVC application. The model will read the user-written rules (stored in a postgres database or flat-file system, depending on the model) and apply them to any input given on the view. The initial view will be a JSON API, which accepts individual fields and returns the input with either validation suggestions or valid flags.
- rule based system abstracted out of the code
- rules are data
- change the rules, not the actual application code
- rules for broad classes of data type, or granular rules for individual fields
- probably used this to untaint data as well (remove things that are potential security problems)
- send all data through this API
- every rule includes a message describing what went wrong and suggesting fixes
- rules potentially editable by non-programmers
- rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the
policy documentation.
- Identity Reconciliation (aka IR) (architect Robbie)
- **API:** Custom JSON (needs formal spec)
- needs docs wrangled
- workflow manager (architect Tom)
- **API:** Custom JSON? (needs formal spec)
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements to the existant system to ensure we meet the requirements**
- needs to be integrated into an index.php script that also checks authentication
- can the workflow also support the login.php authentication? (Yes).
- PostgreSQL Storage: schema definition (Robbie, Tom)
- **API:** SQL
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
- should we re-architect so that tables become normal tables, the views go away, and versioned records are moved to shadow tables?
- add features for delete-via-mark (as opposed to actual delete)
- add features to support embargo
- *maybe, discuss* change vocabulary.sql from insert to copy. It would be smaller and faster, although in reality as soon as
it is in the database, the text file will never be touched again.
- discuss; Can/should we create a tag system to deal with ad-hoc requirements later in the project? [Tag system](#controlled-vocabularies-and-tag-system)
- CPF to SQL parser (Robbie)
- **API:** EAC-CPF XML input, JSON output? (needs formal spec)
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
- NameEntity serialization tool, selectable pre-configured formats
- NameEntity string parser
- **API:** subroutine? JSON?
- Can we find a grammar-based parser for PHP? Should we use a standalone parser?
- Can we expose this as a JSON API such that it's given a name-string and returns an identity object of that identity's information? Possibly a score as well as to how well we thought we could parse it?
- Name parser (only a portion of the NameEntity string)
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- Date parser
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- **This should be distinct, or may be a subroutine of the nameEntry-string parser**
- Can we expose this as a JSON API such that given a set of dates, it returns a date object that lists out the individual dates? Then, it could be called from the name-string parser to parse out dates.
- Editing User interface
- **API:** HTML front-end, makes calls to internal JSON API
- Must have ajax-backed interaction for displaying and searching large lists, such as cpfRelations for an identity.
- We need to have UI edit/chooser widget for search and select of large numbers of options, such as the Joseph
Henry cpfRelations that contain some 22K entries. Also need to list all fields which might have large numbers
of values. In fact, part of the metadata for every field is "number of possible entries/repeat values" or
whatever that's called. From a software architecture perspective, the answer is 0, 1, infinite.
- History Research Tool (redefined)
- **API:** HTML front-end, makes calls to internal JSON API
- Needs to be reworked to support the Postgres backend
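The name-string parser discussion above asks whether the parser could be exposed as a JSON API that takes a name string and returns an identity object plus a score for how well the parse went. A hypothetical Python sketch of that interface follows; the field names, the comma-split heuristic, and the scoring are illustrative assumptions, not a spec (and the real component is planned in PHP).

```python
# Hypothetical sketch of the name-string parser JSON API floated above:
# given a name string, return an identity object plus a parse-confidence
# score. The split-on-comma heuristic and score values are assumptions.
import json

def parse_name(name: str) -> dict:
    parts = [p.strip() for p in name.split(",")]
    result = {
        "original": name,
        "surname": parts[0] if parts else None,
        "forename": parts[1] if len(parts) > 1 else None,
    }
    # Crude confidence: a clean "Surname, Forename" form parses best.
    result["score"] = 0.9 if len(parts) == 2 else 0.4
    return result

print(json.dumps(parse_name("Henry, Joseph")))
```

A grammar-based parser (as the discussion asks about) would replace the comma heuristic, but the JSON-object-plus-score interface could stay the same.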
#### Off-the-shelf Components
- gitlab for developer documentation, code version management, internal issue tracking and milestone keeping
- github public code repository
- test framework, need to choose one
- authentication
- session management, especially as applies to authentication tokens, cookies and something which prevents
XSS (cross-site scripting) attacks
- JavaScript UI component tools, JQuery; what others?
- Suggestions: bootstrap, angular JS, JQueryUI
- reports, probably Jasper
- PHP, Postgres, Linux, Apache httpd, etc.
- language modules ?
#### Controlled vocabularies and tag system
Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
become a flat (non-hierarchical) controlled vocabulary.
The difference is weaker moderation of tags and more readiness to create new tags (types). The tag table
would consist of a tag term and an ID value. Tag systems lack data types, and generally have no policy, or less
restrictive policies, about creating new tags.
Below are some subject examples. It is unclear if these are each topics, or if "--" is used to join granular
topics into a topic list. Likewise it is unclear if this list relies on some explicit hierarchy.
```
American literature--19th century--Periodicals
American literature--20th century--Periodicals
Periodicals
Periodicals--19th century
World politics--Periodicals
World politics--Pictorial works
World politics--Societies, etc.
World politics--Study and teaching
```
**RH: I agree, this is super tricky. We need to distinguish between types of controlled vocab, so that we don't mix Occupation and Subject. A tagging system might be very nice for at least subjects.**