Commit e1991959 by Robbie Hott

Refactored and organized Plan and Overview

parent 51dc11bb
@@ -7,7 +7,7 @@ For our purposes the simplest controlled vocabulary is a flat (non-hierarchical) l
An ontology is a hierarchy of controlled vocabulary terms that explicitly encodes both relatedness and category.
Using the example of subject (aka "topical subject"), either technology allows us to make assertions about the
data and relations between identities.
- Both technologies can be used simultaneously to describe identities; however, doing double data entry would
be irksome.
@@ -47,7 +47,7 @@ data and relations between identities.
It might also be sensible to design the terms with multilingual vocabulary terms. By multilingual I mean:
multiple terms for each unique ID where each term is specific to a specific language, and all terms with the
same ID share a definition.
Each term has a unique id, and a definition (implied or explicit). This is a simple dictionary. Explicit
definition would improve the vocabulary, but takes more work.
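The multilingual shape described above can be sketched as data: several language-specific labels share one persistent ID, and every label with that ID shares a single definition. This is an illustrative Python sketch, not the planned schema (the production code is planned in PHP); the field names and sample entries are assumptions.

```python
# Illustrative sketch: a multilingual controlled vocabulary where
# language-specific terms share one persistent ID and one definition.
from dataclasses import dataclass

@dataclass(frozen=True)
class VocabTerm:
    term_id: int   # persistent unique ID, shared across languages
    language: str  # e.g. "en", "fr"
    label: str     # language-specific term

# One definition per ID, shared by every label carrying that ID.
definitions = {101: "Printed publications issued at regular intervals."}

terms = [
    VocabTerm(101, "en", "Periodicals"),
    VocabTerm(101, "fr", "Périodiques"),
]

def labels_for(term_id: int) -> list[str]:
    """All language variants of a single vocabulary term."""
    return [t.label for t in terms if t.term_id == term_id]

print(labels_for(101))
```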
@@ -147,3 +147,33 @@ performed based on ID number, not text string.
Interestingly, we might be able to apply Markov matrices to identities marked up via ontology, with the same
sort of relatedness building that occurs with a flat vocabulary list.
# Alternative Strategies
## Controlled vocabularies and tag system
Can/should we create a tag system to deal with ad-hoc requirements later in the project?
Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
become a flat (non-hierarchical) controlled vocabulary.
The difference is weaker moderation of tags and more readiness to create new tags (types). The tag table
would consist of a tag term and an ID value. Tag systems lack data types, and generally have no policy, or less
restrictive policies, about creating new tags.
Below are some subject examples. It is unclear if these are each topics, or if "--" is used to join granular
topics into a topic list. Likewise it is unclear if this list relies on some explicit hierarchy.
```
American literature--19th century--Periodicals
American literature--20th century--Periodicals
Periodicals
Periodicals--19th century
World politics--Periodicals
World politics--Pictorial works
World politics--Societies, etc.
World politics--Study and teaching
```
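If "--" is indeed a join of granular topics (which, as noted, is not certain), each heading can be decomposed into an ordered list of facets. A minimal sketch, assuming that reading:

```python
# Illustrative sketch: split an LCSH-style heading on "--" into its
# ordered facets. Assumes "--" joins granular topics, which the text
# above flags as an open question.
def split_heading(heading: str) -> list[str]:
    return [part.strip() for part in heading.split("--")]

print(split_heading("American literature--19th century--Periodicals"))
# ['American literature', '19th century', 'Periodicals']
```

A heading with no "--" (e.g. `Periodicals`) simply comes back as a single-facet list, so the same code handles both cases.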
# Overall Plan
### Table of Contents
* [Documents to Create](#documents-to-create)
* [Governance and Policy Discussion](#governance-and-policy-discussion)
* [Overview of Approach](#overview-of-approach-iterative)
* [CPF Requirements](#cpf-requirements)
* [System Design](#system-design)
* [Developed Components](#developed-components)
* [Off-the-shelf Components](#off-the-shelf-components)
### Documents to Create
- Operations and Procedure Manual
- Formal Requirements Documents
- Formal Specification Documents
- Research Agenda
- User Story Backstory
- Design Documents (UI/UX/Graphic Design)
- style guides
- layouts
### Governance and Policy Discussion
- Data curation, preservation, graceful retirement
- Data expulsion vs. embargo
- Duplicates, backups, restore, related policy and technical issues
- Broad pieces that are missing or underdeveloped [Laura]
- Refresh relationship with OCLC [John, Daniel]
### Overview of Approach (Iterative)
1. List requirements of the overall application.
2. Organize requirements by component, clean up, flesh out, and vet.
3. Create formal specifications for each component based on the official, clean, requirements document. Prototypes will help ensure the formal spec is written correctly.
4. Define a timeline for development and prototyping based on the formal specifications document.
5. Create tests for test-driven development based on the formal specification. This includes creating and mining ground-truth data.
6. Develop software based on formal specification that passes the given tests.
### CPF Requirements
- CPF record edit, edit each field
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
### System Design
#### Developed Components
- [Data validation engine](/Specifications/Data Validation.md)
- **API:** Internal programming API (PHP)
- Hooks should be made in the [Server API](/Specifications/Server API) and REST API
- [Identity Reconciliation](/Specifications/Identity Reconciliation.md)
- **API:** Internal programming API (PHP)
- Hooks should be made in the [Server API](/Specifications/Server API) and REST API
- [Server Workflow Engine](/Specifications/Workflow Engine.md)
- **API:** Understands [Server API](/Specifications/Server API) and calls internal programming APIs of other components
- PERL prototype exists
- Needs formal description and specification documentation
- [PostgreSQL Storage](/Specifications/Schema SQL.md)
- **API:** SQL, with custom internal programming API (PHP)
- Supports versioning and mark-on-delete
- Needs features to support embargo
- CPF Parser
- **API:** EAC-CPF XML input, Constellation output
- Output may be serialized into SQL or JSON [Constellation](/Specifications/Server API/Constellation.md)
- [List of captured tags](/Specifications/Captured EAC-CPF Tags.md)
- CPF Serializer
- **API:** Constellation input, EAC-CPF XML output
- NameEntity Serializer
- selectable pre-configured formats
- NameEntity Parser
- **API:** Internal programming API (PHP)
- Discussion
- Can we find a grammar-based parser for PHP? Should we use a standalone parser?
- Can we expose this as a JSON API such that it's given a name-string and returns an identity object of that identity's information? Possibly a score as well as to how well we thought we could parse it?
- Name parser (only a portion of the NameEntity string)
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- Date parser
- **API:** Internal programming API (PHP)
- Discussion
- Can we expose this as a JSON API such that given a set of dates, it returns a date object that lists out the individual dates? Then, it could be called from the name-string parser to parse out dates.
- [Editing User Interface](/Specifications/Editing User Interface.md)
- **API:** HTML front-end, makes calls to [Server API](/Specifications/Server API)
- Must have ajax-backed interaction for displaying and searching large lists, such as relations for a constellation.
- We need to have UI edit/chooser widget for search and select of large numbers of options, such as the Joseph Henry cpfRelations that contain some 22K entries. Also need to list all fields which might have large numbers of values.
- Need a way of entering metadata about each piece of information
- History Research Tool
- **API:** HTML front-end, makes calls to [Server API](/Specifications/Server API)
- May be reworked to support the Postgres backend (later)
- Retain functionality of current HRT
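The date-parser component above floats a JSON API: given a date string, return an object listing the individual dates it contains. The following is a hypothetical Python sketch of that idea, not a committed spec; the function name, response shape, and crude year-extraction logic are all assumptions (the production components are planned in PHP).

```python
# Hypothetical sketch of the date-parser JSON API discussed above:
# accept a date string, return JSON listing the individual dates found.
import json
import re

def parse_dates(text: str) -> str:
    # Very rough: pull out 4-digit years from forms like "1876-1934".
    years = re.findall(r"\b\d{4}\b", text)
    result = {"input": text, "dates": years, "count": len(years)}
    return json.dumps(result)

print(parse_dates("1876-1934"))
# {"input": "1876-1934", "dates": ["1876", "1934"], "count": 2}
```

A caller such as the name-string parser could invoke this on any date substring it finds, as the discussion suggests.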
#### Off-the-shelf Components
The following OTS components will be evaluated and used in development
- GitLab
- developer documentation,
- code version management,
- internal issue tracking, and
- milestone keeping
- GitHub
- public code repository
- PHPUnit unit test framework
- Authentication
- Consider OAuth2 (Google, GitLab, etc) for session management, especially as applies to authentication tokens, cookies and something which prevents XSS (cross-site scripting) attacks
- JavaScript Libraries
- JQuery
- UI Components, such as Bootstrap
- Graphing libraries, such as D3JS
- Reporting Tools
- Investigate Jasper
- LAMP Stack (Linux, Apache 2, Postgres, PHP)
# SNAC Documentation
This repository contains all the documentation for the SNAC Web Application and related frameworks, engines, and pieces. Specifically:
* The currently-being-revised Technical Requirements are found in the [Requirements Directory](Requirements).
* Formal Specifications for those requirements are in the [Specifications Directory](Specifications).
* [Help](Help) on using Gitlab, Git, and the Markdown text format
* [Documentation](Third Party Documentation) on third-party software and applications being used
* [Historical Documentation](Historical Documentation) on previous iterations of SNAC
* Technical [Discussions](Discussion) related to the SNAC project
* [Notes](Notes) from the technical team.
* Database schema diagrams auto-generated by SchemaSpy http://shannonvm.village.virginia.edu/~twl8n/schema_spy_output/
## Table of Contents
The best place to start is the big, overall [plan](plan.md) document, which describes the process forward with defining requirements and specifications.
* [Technology Infrastructure Architecture Overview](Overview.md)
* [Overall Tech Plan Document](Plan.md) (draft)
* Requirements and Specifications
* [Requirements](Requirements): The currently-being-revised Technical Requirements
* [Specifications](Specifications): Formal Specifications for the Technical Requirements
* Discussion and Documentation
* [Database Schema Diagrams](http://shannonvm.village.virginia.edu/~twl8n/schema_spy_output/)
* [Third Party Documentation](Third Party Documentation): Documentation on third-party software and applications being used
* [Historical Documentation](Historical Documentation): Documentation on previous iterations of SNAC
* [Technical Discussions](Discussion): Technical discussions related to the SNAC project
* [Notes](Notes): Notes from the technical team.
* [Help](Help): Help on using Gitlab, Git, and the Markdown text format
## Notes on this Repository
This repository is stored in Gitlab, a work-alike clone of the Github web site, but installed locally on a SNAC server. Gitlab is a
version control system with a suite of project management tools.
@@ -47,4 +53,4 @@ This work is published from:
<span property="vcard:Country" datatype="dct:ISO3166"
content="US" about="[_:publisher]">
United States</span>.
</p>
@@ -89,7 +89,7 @@ Each code repository must contain the full BSD 3-Clause license below. It must
```
Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
the Regents of the University of California.
All rights reserved.
Redistribution and use in source and binary forms, with or without
......
# Data Validation Engine
## Original (Draft) Discussion
- The data validation engine applies a written system of rules to the incoming data. The rules must be written in a human-readable form, such that non-technical individuals are able to write and understand the rules. A rule-writing guide must be supplied to give hints and help for writing rules properly. The engine will be pluggable and written as an MVC application. The model will read the user-written rules (stored in a postgres database or flat-file system, depending on the model) and apply them to any input given on the view. The initial view will be a JSON API, which accepts individual fields and returns the input with either validation suggestions or valid flags.
- rule based system abstracted out of the code
- rules are data
- change the rules, not the actual application code
- rules for broad classes of data type, or granular rules for individual fields
- probably used this to untaint data as well (remove things that are potential security problems)
- send all data through this API
- every rule includes a message describing what went wrong and suggesting fixes
- rules potentially editable by non-programmers
- rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the policy documentation.
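The "rules are data" idea above can be sketched compactly: each rule is a record carrying a test and a human-readable message, applied to incoming fields outside the application code, so changing policy means editing rules rather than code. This Python sketch is illustrative only; the field names, rules, and messages are hypothetical, and the real engine is planned as a PHP MVC application.

```python
# Illustrative sketch of "rules are data": rules live in a data structure,
# not in application code. Field names and rules here are hypothetical.
rules = [
    {"field": "ark",
     "test": lambda v: v.startswith("ark:/"),
     "message": "ARK identifiers must begin with 'ark:/'; check the prefix."},
    {"field": "entityType",
     "test": lambda v: v in {"person", "corporateBody", "family"},
     "message": "entityType must be person, corporateBody, or family."},
]

def validate(record: dict) -> list[str]:
    """Return a message for every rule the record fails; empty means valid."""
    return [r["message"] for r in rules
            if r["field"] in record and not r["test"](record[r["field"]])]

print(validate({"ark": "http://example.org", "entityType": "person"}))
# ["ARK identifiers must begin with 'ark:/'; check the prefix."]
```

Because each failed test returns its own message, the engine naturally produces the suggestion-style responses the JSON API described above would emit.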
Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
the Regents of the University of California
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Overall Plan
### Table of Contents
* [Big questions](#big-questions)
* [Documents we need to create](#documents-we-need-to-create)
* [Governance and Policies, etc.](#governance-and-policies-etc)
* [Overview and order of work](#overview-and-order-of-work)
* [Non-component notes to be worked into requirements](#non-component-notes-to-be-worked-into-requirements)
* [System Design](#system-design)
* [Developed Components](#developed-components)
* [Off-the-shelf Components](#off-the-shelf-components)
* [Controlled vocabularies and tag system](#controlled-vocabularies-and-tag-system)
### Big questions
- (solved) how is gitlab backed up?
- Shayne backs up the whole gitlab VM.
- We need a complete understanding of our requirements for controlled vocabulary, ontology, and/or tagging as
well as how this relates to search facets. This also impacts our future ability to make assertions about the
data, and is somewhat related to semantic net. See [Tag system](#controlled-vocabularies-and-tag-system).
### Documents we need to create
- Operations and Procedure Manual
- Formal Requirements Document
- Formal Specification Document
- Research Agenda
- User Story Backlog
- Design Documents (UI/UX/Graphic Design)
- ideally someone writes a (possibly brief) style guide
- a set of .psd or other images is not a style guide
### Governance and Policies, etc.
- Data curation, preservation, graceful retirement
- Data expulsion vs. embargo
- Duplicates, backups, restore, related policy and technical issues
- Broad pieces that are missing or underdeveloped [Laura]
- Refresh relationship with OCLC [John, Daniel]
### Overview and order of work
1. List requirements of the overall application. (done)
2. Organize requirements by component, clean up, flesh out, and vet. (requires team meetings)
3. Create formal specifications for each component based on the official, clean, requirements document. Prototypes will help ensure the formal spec is written correctly.
4. Define a timeline for development and prototyping based on the formal specifications document.
5. Create tests for test-driven development based on the formal specification. This includes creating and mining ground-truth data.
6. Develop software based on formal specification that passes the given tests.
### Non-component notes to be worked into requirements
- CPF record edit, edit each field
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
### System Design
#### Developed Components
- Data validation engine
- **API:** Custom JSON (needs formal spec)
- The data validation engine applies a written system of rules to the incoming data. The rules must be written in a human-readable form, such that non-technical individuals are able to write and understand the rules. A rule-writing guide must be supplied to give hints and help for writing rules properly. The engine will be pluggable and written as an MVC application. The model will read the user-written rules (stored in a postgres database or flat-file system, depending on the model) and apply them to any input given on the view. The initial view will be a JSON API, which accepts individual fields and returns the input with either validation suggestions or valid flags.
- rule based system abstracted out of the code
- rules are data
- change the rules, not the actual application code
- rules for broad classes of data type, or granular rules for individual fields
- probably used this to untaint data as well (remove things that are potential security problems)
- send all data through this API
- every rule includes a message describing what went wrong and suggesting fixes
- rules potentially editable by non-programmers
- rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the
policy documentation.
- Identity Reconciliation (aka IR) (architect Robbie)
- **API:** Custom JSON (needs formal spec)
- needs docs wrangled
- workflow manager (architect Tom)
- **API:** Custom JSON? (needs formal spec)
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements to the existant system to ensure we meet the requirements**
- needs to be integrated into an index.php script that also checks authentication
- can the workflow also support the login.php authentication? (Yes).
- PostgreSQL Storage: schema definition (Robbie, Tom)
- **API:** SQL
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
- should we re-architect so that tables become normal tables, the views go away, and versioned records are moved to shadow tables?
- add features for delete-via-mark (as opposed to actual delete)
- add features to support embargo
- *maybe, discuss* change vocabulary.sql from insert to copy. It would be smaller and faster, although in reality as soon as
it is in the database, the text file will never be touched again.
- discuss; Can/should we create a tag system to deal with ad-hoc requirements later in the project? [Tag system](#controlled-vocabularies-and-tag-system)
- CPF to SQL parser (Robbie)
- **API:** EAC-CPF XML input, JSON output? (needs formal spec)
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
- NameEntity serialization tool, selectable pre-configured formats
- NameEntity string parser
- **API:** subroutine? JSON?
- Can we find a grammar-based parser for PHP? Should we use a standalone parser?
- Can we expose this as a JSON API such that it's given a name-string and returns an identity object of that identity's information? Possibly a score as well as to how well we thought we could parse it?
- Name parser (only a portion of the NameEntity string)
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- Date parser
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- **This should be distinct, or may be a subroutine of the nameEntry-string parser**
- Can we expose this as a JSON API such that given a set of dates, it returns a date object that lists out the individual dates? Then, it could be called from the name-string parser to parse out dates.
- Editing User interface
- **API:** HTML front-end, makes calls to internal JSON API
- Must have ajax-backed interaction for displaying and searching large lists, such as cpfRelations for an identity.
- We need to have UI edit/chooser widget for search and select of large numbers of options, such as the Joseph
Henry cpfRelations that contain some 22K entries. Also need to list all fields which might have large numbers
of values. In fact, part of the metadata for every field is "number of possible entries/repeat values" or
whatever that's called. From a software architecture perspective, the answer is 0, 1, infinite.
- History Research Tool (redefined)
- **API:** HTML front-end, makes calls to internal JSON API
- Needs to be reworked to support the Postgres backend
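The name-string parser discussion above asks whether the parser could be exposed as a JSON API that takes a name string and returns an identity object plus a score for how well the parse went. A hypothetical Python sketch of that interface follows; the field names, the comma-split heuristic, and the scoring are illustrative assumptions, not a spec (and the real component is planned in PHP).

```python
# Hypothetical sketch of the name-string parser JSON API floated above:
# given a name string, return an identity object plus a parse-confidence
# score. The split-on-comma heuristic and score values are assumptions.
import json

def parse_name(name: str) -> dict:
    parts = [p.strip() for p in name.split(",")]
    result = {
        "original": name,
        "surname": parts[0] if parts else None,
        "forename": parts[1] if len(parts) > 1 else None,
    }
    # Crude confidence: a clean "Surname, Forename" form parses best.
    result["score"] = 0.9 if len(parts) == 2 else 0.4
    return result

print(json.dumps(parse_name("Henry, Joseph")))
```

A grammar-based parser (as the discussion asks about) would replace the comma heuristic, but the JSON-object-plus-score interface could stay the same.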
#### Off-the-shelf Components
- gitlab for developer documentation, code version management, internal issue tracking and milestone keeping
- github public code repository
- test framework, need to choose one
- authentication
- session management, especially as applies to authentication tokens, cookies and something which prevents
XSS (cross-site scripting) attacks
- JavaScript UI component tools, JQuery; what others?
- Suggestions: bootstrap, angular JS, JQueryUI
- reports, probably Jasper
- PHP, Postgres, Linux, Apache httpd, etc.
- language modules ?
#### Controlled vocabularies and tag system
Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
become a flat (non-hierarchical) controlled vocabulary.
The difference is weaker moderation of tags and more readiness to create new tags (types). The tag table
would consist of a tag term and an ID value. Tag systems lack data types, and generally have no policy, or less
restrictive policies, about creating new tags.
Below are some subject examples. It is unclear if these are each topics, or if "--" is used to join granular
topics into a topic list. Likewise it is unclear if this list relies on some explicit hierarchy.
```
American literature--19th century--Periodicals
American literature--20th century--Periodicals
Periodicals
Periodicals--19th century
World politics--Periodicals
World politics--Pictorial works
World politics--Societies, etc.
World politics--Study and teaching
```
**RH: I agree, this is super tricky. We need to distinguish between types of controlled vocab, so that we don't mix Occupation and Subject. A tagging system might be very nice for at least subjects.**