Commit e1991959 by Robbie Hott

Refactored and organized Plan and Overview

parent 51dc11bb
......@@ -7,7 +7,7 @@ For our purposes the simplest controlled vocabulary is a flat (non-hierarchal) l
An ontology is a hierarchy of controlled vocabulary terms that explicitly encodes both relatedness and category.
Using the example of subject (aka "topical subject"), either technology allows us to make assertions about the
data and relations between identities.
- Both technologies can be used simultaneously to describe identities; however, doing double data entry would
be irksome.
......@@ -47,7 +47,7 @@ data and relations between identities.
It might also be sensible to design the vocabulary with multilingual terms. By multilingual I mean:
multiple terms for each unique ID, where each term is specific to a single language and all terms with the
same ID share a definition.
Each term has a unique id, and a definition (implied or explicit). This is a simple dictionary. Explicit
definition would improve the vocabulary, but takes more work.
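As an illustration of that structure (the field names here are hypothetical, not a settled schema), a multilingual vocabulary might look like this in PHP: several language-specific terms sharing one concept ID and one definition.

```
<?php
// Hypothetical sketch: one concept ID and one shared definition, with
// several language-specific terms. Field names are illustrative only.
$vocabulary = [
    700 => [
        'definition' => 'Creative works written in the United States',
        'terms'      => [
            'en' => 'American literature',
            'fr' => 'Littérature américaine',
            'de' => 'Amerikanische Literatur',
        ],
    ],
];

// Look up the term for a concept in a given language, falling back to
// English when no translation exists yet.
function vocabTerm(array $vocabulary, $conceptId, $lang)
{
    if (!isset($vocabulary[$conceptId])) {
        return null;
    }
    $terms = $vocabulary[$conceptId]['terms'];
    return isset($terms[$lang]) ? $terms[$lang] : $terms['en'];
}

echo vocabTerm($vocabulary, 700, 'fr'), "\n"; // Littérature américaine
```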
......@@ -147,3 +147,33 @@ performed based on ID number, not text string.
Interestingly, we might be able to apply Markov matrices to identities marked up via ontology, with the same
sort of relatedness building that occurs with a flat vocabulary list.
# Alternative Strategies
## Controlled vocabularies and tag system
Can/should we create a tag system to deal with ad-hoc requirements later in the project? [Tag system](#controlled-vocabularies-and-tag-system)
Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
become a flat (non-hierarchical) controlled vocabulary.
The difference is weaker moderation of tags and a greater readiness to create new tags (types). The tag table
would consist of a tag term and an ID value. Tag systems lack data types, and generally have no policy, or less
restrictive policies, about creating new tags.
Below are some subject examples. It is unclear if these are each topics, or if "--" is used to join granular
topics into a topic list. Likewise it is unclear if this list relies on some explicit hierarchy.
```
American literature--19th century--Periodicals
American literature--20th century--Periodicals
Periodicals
Periodicals--19th century
World politics--Periodicals
World politics--Pictorial works
World politics--Societies, etc.
World politics--Study and teaching
```
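If "--" does turn out to join granular topics (an assumption, not something settled here), headings like those above could be decomposed into their component topics before being loaded as vocabulary terms or tags. A minimal PHP sketch:

```
<?php
// Hypothetical: treat "--" as a delimiter joining granular topics into a
// pre-coordinated heading, and split each heading into its components.
$headings = [
    'American literature--19th century--Periodicals',
    'World politics--Study and teaching',
    'Periodicals',
];

$granularTopics = [];
foreach ($headings as $heading) {
    foreach (explode('--', $heading) as $topic) {
        $granularTopics[trim($topic)] = true; // de-duplicate
    }
}

print_r(array_keys($granularTopics));
// e.g. American literature, 19th century, Periodicals, World politics, ...
```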
# Technology Infrastructure Architecture Overview, SNAC Cooperative
### Introduction
The long-term technological objective for the Cooperative is a [platform](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/SNAC%20Cooperative%20Interaction.pdf) that will support a continuously expanding, curated corpus of reliable biographical descriptions of people linked to and providing contextual understanding of the historical records that are the primary evidence for understanding their lives and work. Building and curating a reliable social-document corpus will require a nuanced combination of computer processing and human identity verification and editing. During the pilot phase of the Cooperative, the R&D infrastructure is being thoroughly transformed to a maintenance platform. From a technical perspective, this means transitioning from a multistep human-mediated batch process to an integrated transaction-based platform. The infrastructure under development will automate the flow of data into and out of the different processing steps by interconnecting the processing components, with events taking place in one component triggering related events in another. For example, the addition of a new descriptive record will lead to automatic updating of a graph database and the indexed data in the History Research Tool. This coordinated architecture will support both the batch ingest of data and human editing of the data to verify identities, and will refine and augment the descriptions over time.
For more diagrams and documents, visit the [documentation repository](http://gitlab.iath.virginia.edu/snac/Public-Documentation/tree/master).
### Technology Architecture Overview
We employ a LAMP stack (with PostgreSQL) for efficiency of coding; flexibility enabled by a very large number of available software modules; ease of maintenance; and clarity of software architecture. The result will be a lightweight and easy-to-administer software stack.
The [high-level architecture](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/SNAC%20Server%20Architecture.pdf) uses a scalable and distributable client-server model. Two clients will be created and hosted by the Cooperative to interact with the back-end server: a graphical web user interface (HTML) and a RESTful API (JSON). The WebUI client will support the Cooperative’s editing user-interface. The Rest API client will allow ArchivesSpace and other approved clients to mechanically interact with the server; it will provide features such as viewing and editing descriptive records, as well as batch processing of data.
The server-side architecture will consist of a number of modules addressing different primary functions. The storage medium of the server will be a PostgreSQL Data Maintenance Store (DMS). The DMS will contain all of the descriptive and maintenance data for each EAC-CPF data file.
Other major server-side components are the Identity Reconciliation Engine, the Data Validation Engine, the Workflow Controller, and a Neo4J Graph Database. The Workflow Controller will coordinate communication among the server-side components through internal programming APIs, and a Server API (JSON) will facilitate communication with the WebUI and Rest API clients.
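The shape of the Server API (JSON) is not specified in this overview; purely as a hypothetical sketch, a client request and a minimal controller dispatch might look like the following (the command name and fields are placeholders, not the actual API):

```
<?php
// Hypothetical JSON envelope a WebUI or Rest API client might send to the
// Server API; the "command" and field names are placeholders.
$request = json_encode([
    'command'   => 'read_constellation',
    'arkID'     => 'http://n2t.net/ark:/99166/example',
    'authToken' => 'opaque-session-token',
]);

// Minimal server-side dispatch sketch: decode, route to a component.
$call = json_decode($request, true);
switch ($call['command']) {
    case 'read_constellation':
        // A real controller would call the DMS connector here.
        $response = ['result' => 'success', 'constellation' => ['ark' => $call['arkID']]];
        break;
    default:
        $response = ['result' => 'error', 'error' => 'unknown command'];
}

echo json_encode($response, JSON_PRETTY_PRINT), "\n";
```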
### Data Maintenance Store
A PostgreSQL Data Maintenance Store (DMS) will represent the storage foundation of the SNAC technology platform. The DMS will store "identity constellations" that will represent all of the data contained in the EAC-CPF instances, with each instance represented by an [Identity Constellation (IC)](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Identity%20Constellation.pdf). Additional control data will be stored with each IC to facilitate transaction tracking and management, and fine-grained version control.
In the R&D processing workflow, EAC-CPF instances were placed in a read-only directory as the primary data store. A small number of select components (name strings) of each EAC-CPF XML-encoded instance were loaded into a PostgreSQL database only for matching purposes. In order to support dynamic manual editing of the EAC-CPF instances, the entirety of each EAC-CPF instance will be parsed into PostgreSQL tables as Identity Constellations[^1]. Each IC will retain all of the EAC-CPF data, as well as additional control data that will facilitate transaction tracking and version control. The DMS will also store editor authorization privileges, editor work histories (e.g., edit status on individual identity constellations), and local controlled vocabularies (e.g., occupations, functions, subjects, and geographic names). The DMS will store workflow management data and aid the server in report generation.
Identity Constellation Diagrams:
* [Identity Constellation Overview](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Identity%20Constellation.pdf)
* [Identity Constellation Relations](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Constellation%20Relations.pdf)
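The fine-grained version control mentioned above is not specified here; one common pattern, assumed only for illustration, is to insert a new row for every change rather than updating in place. A minimal PDO sketch with hypothetical table and column names:

```
<?php
// Sketch only: a hypothetical versioned table where every change to a
// constellation component inserts a new row with a higher version number;
// nothing is updated or deleted in place (a delete would set a flag).
$pdo = new PDO('pgsql:host=localhost;dbname=snac_example', 'snac', 'secret');

function saveNameEntry(PDO $pdo, $icId, $nameString)
{
    // Next version number for this constellation (table/columns are hypothetical).
    $stmt = $pdo->prepare(
        'SELECT coalesce(max(version), 0) + 1 FROM name_entry WHERE ic_id = :ic'
    );
    $stmt->execute([':ic' => $icId]);
    $version = (int) $stmt->fetchColumn();

    // Insert a new row rather than updating the previous one.
    $stmt = $pdo->prepare(
        'INSERT INTO name_entry (ic_id, version, name_string, is_deleted)
         VALUES (:ic, :version, :name, false)'
    );
    $stmt->execute([':ic' => $icId, ':version' => $version, ':name' => $nameString]);
}

saveNameEntry($pdo, 42, 'Henry, Joseph, 1797-1878');
```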
### Identity Reconciliation
A major focus of the SNAC R&D has been on identity reconciliation. A fundamental human activity in the development of knowledge involves identifying unique "real world" entities (for example, a particular person or a specific book) and recording facts about the observed entity that, when taken together, uniquely distinguish that entity from all others. Establishing the identity of a person, for example, involves examining available evidence, including the existing knowledge base, and recording facts associated with him or her. For a person, the facts would include names used by and for them, dates and places of birth and death, occupations, and so on. Establishing identities is an ongoing, cumulative activity that both leverages existing established identities and establishes new identities. Identity reconciliation is the process by which an encountered identity is compared against established identities; if it is not found, it is contributed to the established base of identities, and if it is found, any new data associated with the encountered identity is merged into the existing identity description.
With the emergence of Linked Open Data (LOD) and the opportunity it presents to interconnect distributed sets of information, new names for entities are introduced, namely the URIs used to provide globally unique identifiers to entities. In order to exploit the opportunity presented by LOD, it is necessary to include these URIs in the reconciliation process. SNAC assigns its own identifiers (ARKs) because doing so is essential to effectively managing the identities throughout processing and maintenance. Even if this were not essential for managing the workflow, the majority of the identities in SNAC will not be found in other sources such as VIAF, and thus the SNAC identifiers and associated data that establish the identity are likely to be unique, at least in the near term[^2]. For those identities that do overlap with VIAF, SNAC processing takes advantage of the VIAF reconciliation process to associate VIAF’s identifier as well as identifiers for Wikipedia and WorldCat Identity.
While the R&D matching was based on the name string alone, Cooperative [Identity Reconciliation](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Identity%20Reconciliation%20Engine.pdf) will be based on ICs, that is, the name string and additional information (evidence) that sufficiently establishes the uniqueness of an identity. The determination of match scoring will be based on comparing identity constellations and identifying which properties within each constellation (name, life dates, place of birth, place of death, relations to other identities, etc.) match or closely match, and each match test will result in an assigned score. A major factor in reliable matching, for computers or humans, is the available evidence for each identity. Sparse evidence in compared identities will decrease the probability of making a reliable match or non-match. Conversely, dense evidence supports both reliable matches and non-matches. Based on the scoring, two reconciliation outcomes will be presented: reliable matches and possible matches. Reliable non-matches and match scores that fall below the threshold of reliable and possible will not be flagged. Possible matches will be employed to suggest comparisons that are not reliably matches or non-matches but have sufficient similarity to suggest further human investigation and possible resolution. The Identity Reconciliation module will primarily employ the DMS and ElasticSearch. Ground-truth data, human-reviewed and verified matches and non-matches, will be used in testing and refining the matching algorithms in order to optimize the scoring.
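The exact scoring rules will be refined against the ground-truth data; the sketch below shows only the general shape (property-by-property comparison, weighted scores, two thresholds). The weights, thresholds, and similarity measure are illustrative, not the actual algorithm:

```
<?php
// Hypothetical sketch of constellation match scoring: each property
// comparison contributes a weighted score; two thresholds separate
// reliable matches from possible matches. Weights and thresholds are
// illustrative only.
function scoreMatch(array $a, array $b)
{
    $score = 0.0;
    if (isset($a['name'], $b['name'])) {
        // Crude name similarity in [0, 1]; a real engine would use
        // something far more robust (e.g. ElasticSearch scoring).
        similar_text(strtolower($a['name']), strtolower($b['name']), $pct);
        $score += 0.5 * ($pct / 100);
    }
    if (isset($a['birthDate'], $b['birthDate'])) {
        $score += ($a['birthDate'] === $b['birthDate']) ? 0.3 : -0.3;
    }
    if (isset($a['deathDate'], $b['deathDate'])) {
        $score += ($a['deathDate'] === $b['deathDate']) ? 0.2 : -0.2;
    }
    return $score;
}

function classifyMatch($score)
{
    if ($score >= 0.8) { return 'reliable match'; }
    if ($score >= 0.4) { return 'possible match (needs human review)'; }
    return 'not flagged';
}

$a = ['name' => 'Henry, Joseph, 1797-1878', 'birthDate' => '1797', 'deathDate' => '1878'];
$b = ['name' => 'Henry, Joseph',            'birthDate' => '1797'];
echo classifyMatch(scoreMatch($a, $b)), "\n";
```

Note how sparse evidence in the second constellation keeps the pair in the "possible match" band rather than the "reliable match" band, mirroring the point about evidence density above.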
The [Identity Reconciliation Engine](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/Identity%20Reconciliation%20Engine.pdf) will be used both for batch ingest and to assist human editors. When EAC-CPF instances are extracted and assembled from existing archival descriptions (EAD-encoded finding aids, MARC21, or existing non-standard archival authority records) and ingested into the DMS, the Identity Reconciliation module will be invoked to identify reliable matches and possible matches. The results of the evaluation will be available to editors through the Editing User Interface to assist them in verifying identities. When editors create new identity descriptions or revise existing descriptions, the Identity Reconciliation module will be invoked to provide the editors with feedback on likely and potential matches that might otherwise be overlooked when employing human-only authority control techniques.
### Editing User Interface
Developing the Editing User Interface (EUI) is a primary objective of the two-year pilot. Pilot members will be engaged in providing feedback on the iterative development of the EUI in order to ensure that all editorial tasks are supported, that the order in which such tasks are performed is supported by the workflow, and that fundamental transactions (revising, merging, and splitting identity descriptions) are optimally supported. These activities and the findings that result from them will inform the development of the maintenance platform. When the underlying data maintenance platform is in place, development of a prototype EUI will commence, informed by the activities described above. As the EUI becomes functional, the pilot participants will transition to iteratively testing and using it to perform editing tasks to ensure that the essential functions are supported and that the tasks are logical and efficient. Those functions of the EUI that overlap with the History Research Tool (HRT) will employ a common interface. The bulk of the EUI will be based on HTML, CSS, JavaScript, and WebUI server-side PHP.
Sample interaction diagrams:
* [Edit and save](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/User%20Interaction/SNAC%20Edit%20Flow%20Diagram.pdf)
* [Edit with permission error](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/User%20Interaction/SNAC%20Edit%20Flow%20Diagram%202.pdf)
* [Multiple simultaneous edits](http://gitlab.iath.virginia.edu/snac/Public-Documentation/raw/master/Diagrams/User%20Interaction/SNAC%20Edit%20Flow%20Diagram%203.pdf)
### Graph Data Store – Visualizations and Exposure of RDF/LOD
Neo4J (a graph database) will be used to store a subset of each identity constellation from the DMS. The data in Neo4J will support several services: serving graph data to drive the social-document network graphs in the HRT, and providing LOD through a SPARQL endpoint and RDF exports for third-party consumption. LOD will be exposed in a variety of forms: EAC-CPF XML, RDF/XML, JSON-LD, and others. The data in the Neo4J database will be coordinated with the data in the DMS through the Workflow Controller, keeping the data in Neo4J current as ICs are added, removed, or revised.
There is currently no ontology for archival description, and thus the classes and properties used in exposing graph data expressed in RDF are based on classes and attributes selected from existing, well-known and widely used ontologies and vocabularies: Friend of a Friend, OWL, SKOS, Europeana Data Model (EDM), RDA Group 2 Element Vocabulary, Schema.org, and Dublin Core elements and terms[^3]. In the longer term, the International Council on Archives' Expert Group on Archival Description is developing an ontology, Records in Contexts (RiC), for archival entities and their description[^4]. The SNAC Cooperative will transition to the ICA RiC semantics when it becomes available.
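Purely as an illustration, a constellation might be serialized to Turtle with FOAF and SKOS terms roughly as follows; the property choices are examples, not the Cooperative's actual mapping:

```
<?php
// Hypothetical sketch: serialize a few constellation fields to Turtle
// using FOAF and SKOS terms. The property choices are illustrative; the
// real mapping (and the eventual RiC ontology) is decided elsewhere.
function toTurtle(array $ic)
{
    $subject = '<' . $ic['ark'] . '>';
    $lines = [
        '@prefix foaf: <http://xmlns.com/foaf/0.1/> .',
        '@prefix skos: <http://www.w3.org/2004/02/skos/core#> .',
        '',
        $subject . ' a foaf:Person ;',
        '    skos:prefLabel "' . addslashes($ic['preferredName']) . '" ;',
        '    foaf:name "' . addslashes($ic['preferredName']) . '" .',
    ];
    return implode("\n", $lines) . "\n";
}

echo toTurtle([
    'ark'           => 'http://n2t.net/ark:/99166/example',
    'preferredName' => 'Henry, Joseph, 1797-1878',
]);
```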
### Workflow Controller
Integration of the server components will be based on a Workflow Controller (WC). The WC interacts with clients via the Server-side JSON API and invokes the required functions through calls to the component subsystems: Identity Reconciliation, Data Validation, Authorization, DMS via a database connector, and Neo4J.
The Rest API client will make server functions (WC actions) available to appropriate third parties, giving them access to server-provided services. A simple example might be a dedicated MARC21-to-EAC converter where a MARC21 record is uploaded, data extracted and transformed into EAC-CPF, and returned in a single transaction. Another example is saving an identity record where the data is written to PostgreSQL, EAC-CPF is exported to and indexed by XTF, and the Neo4j database is updated. The three steps are sequenced by the thin middleware component.
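A hypothetical sketch of how the controller might sequence the save example above; each step is a placeholder for the real component call, and only the sequencing and failure handling are illustrated:

```
<?php
// Hypothetical sketch of the save sequence described above. Each step is
// a placeholder for a call into the real component (DMS, XTF indexer,
// Neo4J); only the sequencing and failure handling are shown.
function saveConstellation(array $constellation)
{
    $steps = [
        'write to PostgreSQL DMS' => function ($c) { /* DMS connector call */ return true; },
        'export EAC-CPF to XTF'   => function ($c) { /* serialize + index  */ return true; },
        'update Neo4J graph'      => function ($c) { /* graph update call  */ return true; },
    ];

    foreach ($steps as $label => $step) {
        if (!$step($constellation)) {
            // A real controller would roll back or queue a retry here.
            return ['result' => 'error', 'failedStep' => $label];
        }
    }
    return ['result' => 'success'];
}

print_r(saveConstellation(['ark' => 'http://n2t.net/ark:/99166/example']));
```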
### Notes
[^1]: PostgreSQL is a widely used and supported open source, SQL standards-compliant relational database management system. Using PostgreSQL as the maintenance platform for the authoritative EAC-CPF descriptions will ensure data integrity and provide robust performance for the large quantity of data (current and anticipated) in SNAC.
[^2]: 24.8% of SNAC identities match VIAF identities.
[^3]: Among the RDF vocabularies considered was BIBFRAME, an initiative led by the Library of Congress to replace the MARC21 format using graph technologies, specifically the W3C Resource Description Framework (RDF). BIBFRAME aspires to be "content standard independent," and to accommodate library, museum, and archival description. BIBFRAME development is still in the early stages, and most of the development work centers on accommodating data currently in MARC21. It is unclear at this stage in the development of BIBFRAME whether it will attempt to accommodate the data in EAC-CPF.
[^4]: "Toward an International Conceptual Model for Archival Description: A Preliminary Report from the International Council on Archives’ Experts Group on Archival Description" in The American Archivist (Chicago: SAA), 76/2 Fall/Winter 2013, pp. 566–583. With Gretchen Gueguen, Vitor Manoel Marques da Fonseca, and Claire Sibille-de Grimoüard. Also available here: http://www.ica.org/13851/egad-resources/egad-resources.html.
# Overall Plan
### Table of Contents
* [Documents to Create](#documents-to-create)
* [Governance and Policy Discussion](#governance-and-policy-discussion)
* [Overview of Approach](#overview-of-approach-iterative)
* [CPF Requirements](#cpf-requirements)
* [System Design](#system-design)
* [Developed Components](#developed-components)
* [Off-the-shelf Components](#off-the-shelf-components)
### Documents to Create
- Operations and Procedure Manual
- Formal Requirements Documents
- Formal Specification Documents
- Research Agenda
- User Story Backlog
- Design Documents (UI/UX/Graphic Design)
- style guides
- layouts
### Governance and Policy Discussion
- Data curation, preservation, graceful retirement
- Data expulsion vs. embargo
- Duplicates, backups, restore, related policy and technical issues
- Broad pieces that are missing or underdeveloped [Laura]
- Refresh relationship with OCLC [John, Daniel]
### Overview of Approach (Iterative)
1. List requirements of the overall application.
2. Organize requirements by component, clean up, flesh out, and vet.
3. Create formal specifications for each component based on the official, clean, requirements document. Prototypes will help ensure the formal spec is written correctly.
4. Define a timeline for development and prototyping based on the formal specifications document.
5. Create tests for test-driven development based on the formal specification. This includes creating and mining ground-truth data.
6. Develop software based on formal specification that passes the given tests.
### CPF Requirements
- CPF record edit, edit each field
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
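As a purely illustrative sketch of the merge requirement above (combine fields, deprecate old ARKs, mint a new ARK), with placeholder function names rather than existing SNAC code:

```
<?php
// Hypothetical sketch of a CPF merge: combine the fields of two
// constellations, record the old ARKs as deprecated, and attach a newly
// minted ARK. mintArk() is a placeholder for the real ARK minter.
function mintArk()
{
    return 'http://n2t.net/ark:/99166/' . uniqid('example-');
}

function mergeConstellations(array $a, array $b)
{
    return [
        'ark'            => mintArk(),
        'nameEntries'    => array_values(array_unique(array_merge($a['nameEntries'], $b['nameEntries']))),
        'deprecatedArks' => [$a['ark'], $b['ark']],
    ];
}

$merged = mergeConstellations(
    ['ark' => 'http://n2t.net/ark:/99166/x1', 'nameEntries' => ['Henry, Joseph, 1797-1878']],
    ['ark' => 'http://n2t.net/ark:/99166/x2', 'nameEntries' => ['Henry, Joseph', 'Henry, Joseph, 1797-1878']]
);
print_r($merged);
```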
### System Design
#### Developed Components
- [Data validation engine](/Specifications/Data Validation.md)
- **API:** Internal programming API (PHP)
- Hooks should be made in the [Server API](/Specifications/Server API) and REST API
- [Identity Reconciliation](/Specifications/Identity Reconciliation.md)
- **API:** Internal programming API (PHP)
- Hooks should be made in the [Server API](/Specifications/Server API) and REST API
- [Server Workflow Engine](/Specifications/Workflow Engine.md)
- **API:** Understands [Server API](/Specifications/Server API) and calls internal programming APIs of other components
- PERL prototype exists
- Needs formal description and specification documentation
- [PostgreSQL Storage](/Specifications/Schema SQL.md)
- **API:** SQL, with custom internal programming API (PHP)
- Supports versioning and mark-on-delete
- Needs features to support embargo
- CPF Parser
- **API:** EAC-CPF XML input, Constellation output
- Output may be serialized into SQL or JSON [Constellation](/Specifications/Server API/Constellation.md)
- [List of captured tags](/Specifications/Captured EAC-CPF Tags.md)
- CPF Serializer
- **API:** Constellation input, EAC-CPF XML output
- NameEntity Serializer
- selectable pre-configured formats
- NameEntity Parser
- **API:** Internal programming API (PHP)
- Discussion
- Can we find a grammar-based parser for PHP? Should we use a standalone parser?
- Can we expose this as a JSON API such that it's given a name-string and returns an identity object of that identity's information? Possibly a score as well as to how well we thought we could parse it?
- Name parser (only a portion of the NameEntity string)
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- Date parser
- **API:** Internal programming API (PHP)
- Discussion
- Can we expose this as a JSON API such that, given a set of dates, it returns a date object that lists out the individual dates? Then it could be called from the name-string parser to parse out dates. (A sketch of this idea follows the Developed Components list.)
- [Editing User Interface](/Specifications/Editing User Interface.md)
- **API:** HTML front-end, makes calls to [Server API](/Specifications/Server API)
- Must have ajax-backed interaction for displaying and searching large lists, such as relations for a constellation.
- We need a UI edit/chooser widget for searching and selecting from large numbers of options, such as the Joseph Henry cpfRelations, which contain some 22K entries. We also need to list all fields that might have large numbers of values.
- Need a way of entering metadata about each piece of information
- History Research Tool
- **API:** HTML front-end, makes calls to [Server API](/Specifications/Server API)
- May be reworked to support the Postgres backend (later)
- Retain functionality of current HRT
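On the date-parser question raised in the list above, here is a minimal hypothetical sketch of a JSON-returning date parser; the patterns handled are examples of the idea, not a specification:

```
<?php
// Hypothetical sketch of a date parser exposed as a JSON-returning
// function: given a date string such as "1797-1878" or "fl. 1920",
// return the individual dates it contains. Patterns are illustrative only.
function parseDates($input)
{
    $result = ['input' => $input, 'dates' => []];

    if (preg_match('/^(\d{4})\s*-\s*(\d{4})$/', trim($input), $m)) {
        $result['dates'][] = ['type' => 'birth', 'year' => (int) $m[1]];
        $result['dates'][] = ['type' => 'death', 'year' => (int) $m[2]];
    } elseif (preg_match('/^fl\.?\s*(\d{4})$/i', trim($input), $m)) {
        $result['dates'][] = ['type' => 'active', 'year' => (int) $m[1]];
    } else {
        $result['dates'] = null; // could carry a confidence score instead
    }

    return json_encode($result, JSON_PRETTY_PRINT);
}

echo parseDates('1797-1878'), "\n";
echo parseDates('fl. 1920'), "\n";
```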
#### Off-the-shelf Components
The following OTS components will be evaluated and used in development:
- GitLab
- developer documentation,
- code version management,
- internal issue tracking, and
- milestone keeping
- GitHub
- public code repository
- PHPUnit unit test framework (a minimal test sketch follows this list)
- Authentication
- Consider OAuth2 (Google, GitLab, etc.) for session management, especially as it applies to authentication tokens, cookies, and something that prevents XSS (cross-site scripting) attacks
- JavaScript Libraries
- JQuery
- UI Components, such as Bootstrap
- Graphing libraries, such as D3JS
- Reporting Tools
- Investigate Jasper
- LAMP Stack (Linux, Apache 2, Postgres, PHP)
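If PHPUnit is adopted, a test for the hypothetical date parser sketched earlier might look like this (class and method names are illustrative, and PHPUnit is assumed to be installed, e.g. via Composer):

```
<?php
// Minimal PHPUnit sketch for the hypothetical date parser; assumes
// parseDates() from the earlier sketch is loaded. Names are illustrative.
use PHPUnit\Framework\TestCase;

class DateParserTest extends TestCase
{
    public function testParsesBirthAndDeathYears()
    {
        $parsed = json_decode(parseDates('1797-1878'), true);
        $this->assertCount(2, $parsed['dates']);
        $this->assertSame(1797, $parsed['dates'][0]['year']);
        $this->assertSame(1878, $parsed['dates'][1]['year']);
    }

    public function testUnrecognizedInputReturnsNullDates()
    {
        $parsed = json_decode(parseDates('circa the 1800s'), true);
        $this->assertNull($parsed['dates']);
    }
}
```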
# SNAC Documentation
This repository contains all the documentation for the SNAC Web Application and related frameworks, engines, and pieces. Specifically:
* The currently-being-revised Technical Requirements are found in the [Requirements Directory](Requirements).
* Formal Specifications for those requirements are in the [Specifications Directory](Specifications).
* [Help](Help) on using Gitlab, Git, and the Markdown text format
* [Documentation](Third Party Documentation) on third-party software and applications being used
* [Historical Documentation](Historical Documentation) on previous iterations of SNAC
* Technical [Discussions](Discussion) related to the SNAC project
* [Notes](Notes) from the technical team.
* Database schema diagrams auto-generated by SchemaSpy http://shannonvm.village.virginia.edu/~twl8n/schema_spy_output/
## Table of Contents
The best place to start is the big, overall [plan](Plan.md) document, which describes the process for defining requirements and specifications.
* [Technology Infrastructure Architecture Overview](Overview.md)
* [Overall Tech Plan Document](Plan.md) (draft)
* Requirements and Specifications
* [Requirements](Requirements): The currently-being-revised Technical Requirements
* [Specifications](Specifications): Formal Specifications for the Technical Requirements
* Discussion and Documentation
* [Database Schema Diagrams](http://shannonvm.village.virginia.edu/~twl8n/schema_spy_output/)
* [Third Party Documentation](Third Party Documentation): Documentation on third-party software and applications being used
* [Historical Documentation](Historical Documentation): Documentation on previous iterations of SNAC
* [Technical Discussions](Discussion): Technical discussions related to the SNAC project
* [Notes](Notes): Notes from the technical team.
* [Help](Help): Help on using Gitlab, Git, and the Markdown text format
## Notes on this Repository
This repository is stored in Gitlab, a work-alike clone of the Github web site, but installed locally on a SNAC server. Gitlab is a
version control system with a suite of project management tools.
......@@ -47,4 +53,4 @@ This work is published from:
<span property="vcard:Country" datatype="dct:ISO3166"
content="US" about="[_:publisher]">
United States</span>.
</p>
......@@ -89,7 +89,7 @@ Each code repository must contain the full BSD 3-Clause license below. It must
```
Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
the Regents of the University of California.
All rights reserved.
Redistribution and use in source and binary forms, with or without
......
# Data Validation Engine
## Original (Draft) Discussion
- The data validation engine applies a written system of rules to the incoming data. The rules must be written in a human-readable form, such that non-technical individuals are able to write and understand the rules. A rule-writing guide must be supplied to give hints and help for writing rules properly. The engine will be pluggable and written as an MVC application. The model will read the user-written rules (stored in a postgres database or flat-file system, depending on the model) and apply them to any input given on the view. The initial view will be a JSON API, which accepts individual fields and returns the input with either validation suggestions or valid flags. (A minimal sketch of the rules-as-data idea follows this list.)
- rule based system abstracted out of the code
- rules are data
- change the rules, not the actual application code
- rules for broad classes of data type, or granular rules for individual fields
- probably use this to untaint data as well (remove things that are potential security problems)
- send all data through this API
- every rule includes a message describing what went wrong and suggesting fixes
- rules potentially editable by non-programmers
- rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the policy documentation.
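A minimal sketch of the rules-as-data idea described above; the rule structure, field names, and messages are hypothetical:

```
<?php
// Hypothetical sketch of rules-as-data: rules live outside the code
// (here, an array standing in for a database or flat file), and each
// rule carries a human-readable message suggesting a fix.
$rules = [
    'exist_date' => [
        'test'    => function ($value) { return preg_match('/^\d{3,4}$/', $value) === 1; },
        'message' => 'Exist dates should be three- or four-digit years, e.g. 1797.',
    ],
    'ark' => [
        'test'    => function ($value) { return strpos($value, 'ark:/') !== false; },
        'message' => 'ARK identifiers should contain "ark:/".',
    ],
];

// JSON-style view: accept a field name and value, return either a valid
// flag or the rule's suggestion.
function validateField(array $rules, $field, $value)
{
    if (!isset($rules[$field])) {
        return ['field' => $field, 'valid' => null, 'message' => 'No rule defined.'];
    }
    $rule = $rules[$field];
    return call_user_func($rule['test'], $value)
        ? ['field' => $field, 'valid' => true]
        : ['field' => $field, 'valid' => false, 'suggestion' => $rule['message']];
}

echo json_encode(validateField($rules, 'exist_date', '17x97')), "\n";
```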
Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
the Regents of the University of California
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Overall Plan
### Table of Contents
* [Big questions](#big-questions)
* [Documents we need to create](#documents-we-need-to-create)
* [Governance and Policies, etc.](#governance-and-policies-etc)
* [Overview and order of work](#overview-and-order-of-work)
* [Non-component notes to be worked into requirements](#non-component-notes-to-be-worked-into-requirements)
* [System Design](#system-design)
* [Developed Components](#developed-components)
* [Off-the-shelf Components](#off-the-shelf-components)
* [Controlled vocabularies and tag system](#controlled-vocabularies-and-tag-system)
### Big questions
- (solved) how is gitlab backed up?
- Shayne backs up the whole gitlab VM.
- We need a complete understanding of our requirements for controlled vocabulary, ontology, and/or tagging as
well as how this relates to search facets. This also impacts our future ability to make assertions about the
data, and is somewhat related to semantic net. See [Tag system](#controlled-vocabularies-and-tag-system).
### Documents we need to create
- Operations and Procedure Manual
- Formal Requirements Document
- Formal Specification Document
- Research Agenda
- User Story Backlog
- Design Documents (UI/UX/Graphic Design)
- ideally someone writes a (possibly brief) style guide
- a set of .psd or other images is not a style guide
### Governance and Policies, etc.
- Data curation, preservation, graceful retirement
- Data expulsion vs. embargo
- Duplicates, backups, restore, related policy and technical issues
- Broad pieces that are missing or underdeveloped [Laura]
- Refresh relationship with OCLC [John, Daniel]
### Overview and order of work
1. List requirements of the overall application. (done)
2. Organize requirements by component, clean up, flesh out, and vet. (requires team meetings)
3. Create formal specifications for each component based on the official, clean, requirements document. Prototypes will help ensure the formal spec is written correctly.
4. Define a timeline for development and prototyping based on the formal specifications document.
5. Create tests for test-driven development based on the formal specification. This includes creating and mining ground-truth data.
6. Develop software based on formal specification that passes the given tests.
### Non-component notes to be worked into requirements
- CPF record edit, edit each field
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
### System Design
#### Developed Components
- Data validation engine
- **API:** Custom JSON (needs formal spec)
- The data validation engine applies a written system of rules to the incoming data. The rules must be written in a human-readable form, such that non-technical individuals are able to write and understand the rules. A rule-writing guide must be supplied to give hints and help for writing rules properly. The engine will be pluggable and written as an MVC application. The model will read the user-written rules (stored in a postgres database or flat-file system, depending on the model) and apply them to any input given on the view. The initial view will be a JSON API, which accepts individual fields and returns the input with either validation suggestions or valid flags.
- rule based system abstracted out of the code
- rules are data
- change the rules, not the actual application code
- rules for broad classes of data type, or granular rules for individual fields
- probably use this to untaint data as well (remove things that are potential security problems)
- send all data through this API
- every rule includes a message describing what went wrong and suggesting fixes
- rules potentially editable by non-programmers
- rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the
policy documentation.
- Identity Reconciliation (aka IR) (architect Robbie)
- **API:** Custom JSON (needs formal spec)
- needs docs wrangled
- workflow manager (architect Tom)
- **API:** Custom JSON? (needs formal spec)
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements to the existing system to ensure we meet the requirements**
- needs to be integrated into an index.php script that also checks authentication
- can the workflow also support the login.php authentication? (Yes).
- PostgreSQL Storage: schema definition (Robbie, Tom)
- **API:** SQL
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
- should we re-architect so that tables become normal tables, the views go away, and versioned records are moved to shadow tables?
- add features for delete-via-mark (as opposed to actual delete)
- add features to support embargo
- *maybe, discuss* change vocabulary.sql from insert to copy. It would be smaller and faster, although in reality as soon as
it is in the database, the text file will never be touched again.
- discuss: Can/should we create a tag system to deal with ad-hoc requirements later in the project? [Tag system](#controlled-vocabularies-and-tag-system)
- CPF to SQL parser (Robbie)
- **API:** EAC-CPF XML input, JSON output? (needs formal spec)
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
- NameEntity serialization tool, selectable pre-configured formats
- NameEntity string parser
- **API:** subroutine? JSON?
- Can we find a grammar-based parser for PHP? Should we use a standalone parser?
- Can we expose this as a JSON API such that it's given a name-string and returns an identity object of that identity's information? Possibly a score as well as to how well we thought we could parse it?
- Name parser (only a portion of the NameEntity string)
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- Date parser
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- **This should be distinct, or may be a subroutine of the nameEntry-string parser**
- Can we expose this as a JSON API such that given a set of dates, it returns a date object that lists out the individual dates? Then, it could be called from the name-string parser to parse out dates.
- Editing User interface
- **API:** HTML front-end, makes calls to internal JSON API
- Must have ajax-backed interaction for displaying and searching large lists, such as cpfRelations for an identity.
- We need a UI edit/chooser widget for searching and selecting from large numbers of options, such as the Joseph
Henry cpfRelations, which contain some 22K entries. We also need to list all fields that might have large numbers
of values. In fact, part of the metadata for every field is "number of possible entries/repeat values" or
whatever that's called. From a software architecture perspective, the answer is 0, 1, or infinite.
- History Research Tool (redefined)
- **API:** HTML front-end, makes calls to internal JSON API
- Needs to be reworked to support the Postgres backend
#### Off-the-shelf Components
- gitlab for developer documentation, code version management, internal issue tracking and milestone keeping
- github public code repository
- test framework, need to choose one
- authentication
- session management, especially as it applies to authentication tokens, cookies, and something that prevents
XSS (cross-site scripting) attacks
- JavaScript UI component tools, JQuery; what others?
- Suggestions: bootstrap, angular JS, JQueryUI
- reports, probably Jasper
- PHP, Postgres, Linux, Apache httpd, etc.
- language modules ?
#### Controlled vocabularies and tag system
Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
become a flat (non-hierarchical) controlled vocabulary.
The difference is weaker moderation of tags and a greater readiness to create new tags (types). The tag table
would consist of a tag term and an ID value. Tag systems lack data types, and generally have no policy, or less
restrictive policies, about creating new tags.
Below are some subject examples. It is unclear if these are each topics, or if "--" is used to join granular
topics into a topic list. Likewise it is unclear if this list relies on some explicit hierarchy.
```
American literature--19th century--Periodicals
American literature--20th century--Periodicals
Periodicals
Periodicals--19th century
World politics--Periodicals
World politics--Pictorial works
World politics--Societies, etc.
World politics--Study and teaching
```
**RH: I agree, this is super tricky. We need to distinguish between types of controlled vocab, so that we don't mix Occupation and Subject. A tagging system might be very nice for at least subjects.**