# Technical Discussion
This directory contains technical discussions and related notes between developers on the project.
# Discussion on Relational Databases
#### What is "normal form" and what informs the database schema design?
Edgar F. "Ted" Codd created 12 rules (later revised with a 13th) that define what is required of a Relational Database Management
System (RDBMS).
https://en.wikipedia.org/wiki/Edgar_F._Codd
Breaking any of these rules weakens data integrity and the system's ability to manage the data. An RDBMS
is not merely a bucket of data, but an entire ecosystem for the management of data and data-related
activities. Before Codd's work, databases were managed on an ad-hoc basis as collections of files with
links. It was a mess. Data was lost. Only the DBA knew how to find the data, and access methods could be very
different for data in different locations. Accessing data could also be extremely slow. In addition to
assuring the integrity of data and managing it, relational database systems are very fast.
https://en.wikipedia.org/wiki/Codd%27s_12_rules
The "R" in RDBMS is "relational" and Codd invented the relational model of data. Key to relational data
modeling is "normal form".
https://en.wikipedia.org/wiki/Database_normalization
The RDBMS world generally uses third normal form. Lower levels of normalization create additional work for
data operations. Higher forms rarely show any improvements. The key concept of normalization is that a datum
only exists in one place. In the RDBMS world where SQL implements relational algebra, normal form is both
convenient and natural. In other venues such as paper ledgers, data stored in flat files, or in spreadsheets,
normal form can seem awkward.
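As a minimal sketch (using hypothetical tables, not any actual SNAC schema), third normal form moves a repeated fact into its own table and refers to it by key, so each datum exists in exactly one place:
```
-- Denormalized: the authority name is repeated on every row, so a correction
-- must touch many rows and inconsistent spellings can creep in.
create table record_flat (
    record_id      int primary key,
    ark            text,
    authority_name text
);

-- Third normal form: the authority name lives in exactly one place.
create table authority (
    authority_id   int primary key,
    authority_name text
);

create table record (
    record_id    int primary key,
    ark          text,
    authority_id int references authority (authority_id)
);
```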
# Staffing Model (Brian's draft suggestions)
Production of a cooperatively maintained, high-profile web site requires
different types of technical and non-technical work.
Operations Team
- Communications and interactions with end users and content owners,
from marketing to user support, assessment
- Manages help desk
- Support production web application infrastructure, including
monitoring, "on call" for first tier response to system monitors
- batch ingest of new data sources
- signs up and on-boards new pilot members
- Proactive content QA and remediation
- work organized around issue queue / customer relationship management
system
Main Artifact: Ticketing/issue tracker that automatically generates a
ticket for each email to help@example.edu
Development Team
- Create new features that deliver customer value
- Maintain tests for new features
- second tier support of deployed features, developers on call for
their deployed code
- deploy code to test, stage, and production environments
- work organized around sprints
Main Artifact: User story backlog that supports scoring stories by
points.
Research Team
- Conduct experiments with new algorithms and technologies
- interoperation (and participation in the development) of relevant
domain specific standards and practices
Main Artifact: Research Agenda, schemas and specifications (esp. merge
spec)
### Working locally and version control
The Git technology was created to track revisions to many files. Gitlab provides a web site with some ability
http://doc.gitlab.com/ce/ssh/README.html
https://confluence.atlassian.com/display/STASH/Creating+SSH+keys#CreatingSSHkeys-CreatinganSSHkeyonWindows
#### Mac github, requires 10.9, probably github only.
https://mac.github.com/
### Info about Markdown
Markdown is a markup language that is used in Gitlab for documentation text files. Markdown files have a .md extension and can be edited locally or online. However, for best results, we recommend editing files locally and then uploading them. There are good guides to the syntax [here](https://confluence.atlassian.com/stash/using-stash/markdown-syntax-guide) and [here](https://en.wikipedia.org/wiki/Markdown).
### Editing
You can edit markdown files locally, rather than on the website. One full-featured cross-platform Markdown editor is [Atom](http://atom.io). After opening a file, pressing `Ctrl-Shift-M` for Win/Linux and `Cmd-Shift-M` for Mac will open a real-time preview of the markdown file.
![Atom Screenshot](http://gitlab.iath.virginia.edu/snac/Documentation/raw/b39387646432816488537cce327f00e41aa79452/images/atom-screenshot.png "Screenshot of Atom editing Interface")
You can also edit markdown files using a text editing application (such as TextEdit or Notepad). However, be aware that some word processing programs may affect line breaks and formatting, which may change how information is displayed.
You can edit markdown files from the Gitlab web site. From the Gitlab home page, click a project on the right
side. On the project home page, click "Files" in the left navigation bar. Click a .md file. Click the "Edit"
button on the right side. Update the text and when finished, enter a commit message below, and click the
"Commit Changes" button.
### Resources
* [Gitlab markdown reference (local)](http://gitlab.iath.virginia.edu/help/markdown/markdown.md)
* [Github markdown reference](https://help.github.com/articles/markdown-basics/)
* [Github extensions to standard markdown](https://help.github.com/articles/github-flavored-markdown/)
* [Alternate markdown cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)
* [Official Markdown documentation](http://daringfireball.net/projects/markdown/)
# SNAC Help
This directory contains helpful links and documentation on how to use various pieces of the system. Specifically, it contains help files for Git, Gitlab, and the Markdown syntax.
# Historical Documentation
This directory catalogs documentation related to previous iterations of the SNAC project.
CC0 1.0 Universal
Statement of Purpose
The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator and
subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").
Certain owners wish to permanently relinquish those rights to a Work for the
purpose of contributing to a commons of creative, cultural and scientific
works ("Commons") that the public can reliably and without fear of later
claims of infringement build upon, modify, incorporate in other works, reuse
and redistribute as freely as possible in any form whatsoever and for any
purposes, including without limitation commercial purposes. These owners may
contribute to the Commons to promote the ideal of a free culture and the
further production of creative, cultural and scientific works, or to gain
reputation or greater distribution for their Work in part through the use and
efforts of others.
For these and/or other purposes and motivations, and without any expectation
of additional consideration or compensation, the person associating CC0 with a
Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
and publicly distribute the Work under its terms, with knowledge of his or her
Copyright and Related Rights in the Work and the meaning and intended legal
effect of CC0 on those rights.
1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not limited
to, the following:
i. the right to reproduce, adapt, distribute, perform, display, communicate,
and translate a Work;
ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or likeness
depicted in a Work;
iv. rights protecting against unfair competition in regards to a Work,
subject to the limitations in paragraph 4(a), below;
v. rights protecting the extraction, dissemination, use and reuse of data in
a Work;
vi. database rights (such as those arising under Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal
protection of databases, and under any national implementation thereof,
including any amended or successor version of such directive); and
vii. other similar, equivalent or corresponding rights throughout the world
based on applicable law or treaty, and any national implementations thereof.
2. Waiver. To the greatest extent permitted by, but not in contravention of,
applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
and Related Rights and associated claims and causes of action, whether now
known or unknown (including existing as well as future claims and causes of
action), in the Work (i) in all territories worldwide, (ii) for the maximum
duration provided by applicable law or treaty (including future time
extensions), (iii) in any current or future medium and for any number of
copies, and (iv) for any purpose whatsoever, including without limitation
commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
the Waiver for the benefit of each member of the public at large and to the
detriment of Affirmer's heirs and successors, fully intending that such Waiver
shall not be subject to revocation, rescission, cancellation, termination, or
any other legal or equitable action to disrupt the quiet enjoyment of the Work
by the public as contemplated by Affirmer's express Statement of Purpose.
3. Public License Fallback. Should any part of the Waiver for any reason be
judged legally invalid or ineffective under applicable law, then the Waiver
shall be preserved to the maximum extent permitted taking into account
Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
is so judged Affirmer hereby grants to each affected person a royalty-free,
non transferable, non sublicensable, non exclusive, irrevocable and
unconditional license to exercise Affirmer's Copyright and Related Rights in
the Work (i) in all territories worldwide, (ii) for the maximum duration
provided by applicable law or treaty (including future time extensions), (iii)
in any current or future medium and for any number of copies, and (iv) for any
purpose whatsoever, including without limitation commercial, advertising or
promotional purposes (the "License"). The License shall be deemed effective as
of the date CC0 was applied by Affirmer to the Work. Should any part of the
License for any reason be judged legally invalid or ineffective under
applicable law, such partial invalidity or ineffectiveness shall not
invalidate the remainder of the License, and in such case Affirmer hereby
affirms that he or she will not (i) exercise any of his or her remaining
Copyright and Related Rights in the Work or (ii) assert any associated claims
and causes of action with respect to the Work, in either case contrary to
Affirmer's express Statement of Purpose.
4. Limitations and Disclaimers.
a. No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document.
b. Affirmer offers the Work as-is and makes no representations or warranties
of any kind concerning the Work, express, implied, statutory or otherwise,
including without limitation warranties of title, merchantability, fitness
for a particular purpose, non infringement, or the absence of latent or
other defects, accuracy, or the present or absence of errors, whether or not
discoverable, all to the greatest extent permissible under applicable law.
c. Affirmer disclaims responsibility for clearing rights of other persons
that may apply to the Work or any use thereof, including without limitation
any person's Copyright and Related Rights in the Work. Further, Affirmer
disclaims responsibility for obtaining any necessary consents, permissions
or other rights required for any use of the Work.
d. Affirmer understands and acknowledges that Creative Commons is not a
party to this document and has no duty or obligation with respect to this
CC0 or use of the Work.
For more information, please see
<http://creativecommons.org/publicdomain/zero/1.0/>
Notes to merge with other requirements
---
Note for TAT functional requirements: need to have UI widget for search of very long fields, such as the Joseph Henry cpfRelations
that contain some 22K entries. Also need to list all fields which might have large numbers of values. In fact, part of the meta data for
every field is "number of possible entries/repeat values" or whatever that's called. This wiki serves as the documentation of the SNAC technical team as it relates to Postgres and storage of data. Currently, we are documenting:
* the schema and reasons behind the schema,
* methods for handling versioning of EAC-CPF documents, and
* Elasticsearch for Postgres.
* Need a data constraint and business logic API for validating data in the UI. This layer checks user inputs against some set of rules and when there is an issue it informs the user of problems and suggests solutions. Data should be saved regardless. We could allow "bad" data to go into the database (assumes text fields) and flag for later cleaning. That's ugly. Probably better to save inputs in some agnostic repo which could be json, frozen data structures, or name-value pairs, or even portable source format. The problem is that most often data validation is hard coded into UI JavaScript or host code. Validation really should be configurable separate from the display of data, workflow automation, and data storage. While our database needs to do certain rudimentary sanity checks, the database can't be expected to send messages all the way back up to the UI layer. Nor should the database be burdened with validation rules which are certain to be mutable. Ideally, the validation rules would work in the same state machine framework as the workflow automation API, and might even share some code.
* Need a mechanism to lock records. Even mark-as-deleted records won't necessarily solve all our problems, so we might want records that are "live", but not editable for whatever reason. The ability to lock records and having the lock integrated across all the data-aware APIs is a good idea.
* On the topic of data, we will have user/group/other and r/w permissions, and all data-aware APIs should be designed with that in mind.
* "Literate programming" Using MCV and the state machine workflow automation probably meets (and exceeds) Knuth's ideas of Literate programming. Look at his idea and make sure we aren't missing any key concepts.
* QA and testing needs to be several layers, one of which is simple documentation. We should have code that examines data for various qualities, which when done in a comprehensive manner will test the data for all properties described in the requirements. As bugs are discovered and features added, this data testing code would expand. Code should be tested on several levels as well, and the tests plus comments in the tests constitute our full understanding of both data and code.
* Entities (names) have ID values, ARKs and various kinds of persistent ids. Some subsystem needs to know how to gather various ID values, and how to generate URIs and URLs from those id values. All discovered ids need to be attached to the cpf identity in a table related_id. We will need to track the authority that issued the id, so we need an authority table. Perhaps it is best (as previously discussed) to create a CPF record for each authority, and use the CPF persistent ID as the authority identifier in the related_id table.
```
create table related_id (
    ri_id        serial primary key,
    id_value     text,
    uri          text,
    url          text,
    authority_id int -- fk to cpf.id?
);
```
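For example (a sketch only; the cpf table and its preferred_name column are hypothetical placeholders suggested by the "fk to cpf.id?" comment above), listing every discovered identifier together with the issuing authority might look like:
```
-- List identifier values with the issuing authority's name, assuming each
-- authority has its own CPF record as discussed above.
select r.id_value,
       r.uri,
       a.preferred_name as authority
from related_id r
left join cpf a on a.id = r.authority_id;
```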
* Allow config of CPF output formats via web interface. For example, in the CPF generator, we can offer some format and config options such as name formats in `<part>` and/or `<relationEntry>`
- include 4 digit fromDate-toDate for person
- include dates for corporateBody
- use "fl." for active dates
- use "active" for active dates
- use explicit "b." and "d."
- only use "b." or "d." for single 4 digit dates
- enclose date in parentheses
- add comma between name and dates (applies only if there is a date)
and so on. In theory that could be done for all kinds of CPF variations. We should have a single checkbox for "use most portable CPF formats", although I suspect the best data exchange format is not XML, but SQLite, SQL INSERT statements, or JSON.
* Does our schema track who edited a record? If not, we should add user_id to the version table.
* We should review the schema and create rules that human beings follow for primary key names, and foreign key names. Two options are table_id and table_id_fk. There may be a better option.
* Option 1 is nice for people doing "join on", but I find "join on" syntax confusing, especially when combined with a where clause
* Option 2 seems more natural, especially when there is a where clause. Having different field names means that field names are more likely to be unique, thus not requiring explicit table.field syntax. Option 2a will be used most of the time. Option 2b shows a three table join where table.field syntax is required.
```
-- 1
select * from foo as a, bar as b where a.table_id=b.table_id;
-- 2a
select * from foo as a, bar as b where table_id=table_id_fk;
-- or 2b
select * from foo as a, bar as b, baz as c where a.table_id=b.table_id_fk and a.table_id=c.table_id_fk;
```
* Identity Reconciliation has a planned "why" feature to report on why a match matches. It would be nice to also report the negative: why didn't this match X? I'm not even sure that is possible, but something to think about.
# Technical Notes
This directory contains technical notes related to the SNAC project by developers.
# Introduction to Documentation
# SNAC Documentation
This repository contains all the documentation for the SNAC Web Application and related frameworks, engines, and pieces. Specifically:
* The currently-being-revised Technical Requirements are found in the [Requirements Directory](Requirements).
* Formal Specifications for those requirements are in the [Specifications Directory](Specifications).
* [Help](Help) on using Gitlab, Git, and the Markdown text format
* [Documentation](Third Party Documentation) on third-party software and applications being used
* [Historical Documentation](Historical Documentation) on previous iterations of SNAC
* Technical [Discussions](Discussion) related to the SNAC project
* [Notes](Notes) from the technical team.
The best place to start is the big, overall [plan](plan.md) document, which describes the process forward with defining requirements and specifications.
This repository is stored in Gitlab, a work-alike clone of the Github web site, but installed locally on a SNAC server. Gitlab is a
version control system with a suite of project management tools.
Ideally we will all create documentation in [markdown format](http://daringfireball.net/projects/markdown/) (.md files). You may create and edit files from
the web interface here on Gitlab, or download files and edit locally. You can also upload any file type using
standard git commands, or use a Git graphical client (see below). Choose a relevant directory for your docs,
or create a new directory as necessary.
Markdown files are simple text files, which makes them easy to edit and universally portable. Markdown has a
limited set of conventions to denote headers, lists, URLs and so on. When uploaded to gitlab or github,
markdown files are rendered into nicely styled HTML. Tools are available to convert markdown into .doc, .pdf,
LaTeX and other formats. For more information on Markdown, see [this guide](Help/Markdown.md).
#### Help Links
* [Git and Gitlab](Help/Git-and-Gitlab.md)
* [Markdown](Help/Markdown.md)
<p xmlns:dct="http://purl.org/dc/terms/" xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#" align="center">
<a rel="license"
href="http://creativecommons.org/publicdomain/zero/1.0/">
<img src="http://i.creativecommons.org/p/zero/1.0/88x31.png" style="border-style: none;" alt="CC0" />
</a>
<br />
To the extent possible under law,
<span resource="[_:publisher]" rel="dct:publisher">
<span property="dct:title">SNAC Cooperative</span></span>
has waived all copyright and related or neighboring rights to
<span property="dct:title">SNAC Documentation</span>.
This work is published from:
<span property="vcard:Country" datatype="dct:ISO3166"
content="US" about="[_:publisher]">
United States</span>.
</p>
# System-Generated Documents
The following documents and data should be generated from the completed system.
## Data Interoperability
Data should be available to be downloaded in the following formats:
* EAC-CPF XML
  * Individual identity constellations should be downloadable as fully-formed EAC-CPF XML documents
* Turtle Triples
  * Subsets of the data, including the entire database, should be exportable as well-formed Turtle triples
* RDF Triples
  * Subsets of the data, including the entire database, should be exportable as well-formed RDF triples
* JSON-LD
  * Subsets of the data, not including the entire database, should be exportable as well-formed JSON-LD
## System Reports
While the web interface is the primary public face of SNAC, many other views of the data and meta data are
necessary, especially for admins and governance. Those "views" are reports and will primarily be generated via
integration of a third-party reporting package such as the Jaspersoft Business Intelligence Suite, which is free,
open source, and includes a full range of tools. An example of the kind of query behind such reports is sketched at the end of this section.
For each user of the system, the following reports should be available for download:
* List of records the user has edited
* Number of records the user has edited
For each holding institution, the following reports should be available for download:
* Number of records the institution has edited
* Number of records the institution has contributed
* List of records the institution has contributed
* List of records the institution has edited
* List of individuals within the institution and the records edited by each person
* List of records the institution has contributed with individuals who contributed to each record
General reporting:
* Number of participating holding institutions
* Number of records edited per hour, day, month, year
* Number of identity constellations available in the database
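As an illustration only (table and column names here are placeholders, not the actual schema), most of the counts above reduce to simple aggregates over the version history:
```
-- Records edited per month by one holding institution, assuming a version
-- table that records cpf_id, user_id, institution_id, and a timestamp.
select date_trunc('month', edit_timestamp) as month,
       count(distinct cpf_id)              as records_edited
from version
where institution_id = 42
group by month
order by month;
```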
# Internal Data Storage
The data should be stored in a SQL database. Every piece of data is in a separate field to the extent that is practical.
Data is organized into fields (columns), records (rows), and tables. Fields related to each other are in the
same table. Every record has a unique, permanent, numerical id often called a "key" or "primary key". For
the SNAC Co-op we have decided that records are never overwritten during update. An update operation creates a new record identical to the old record except for the updated
fields. All old records are available for viewing via a special interface. The old records are invisible to
operations that are intellectually acting on "current" data.
Version history, including past versions of a field and record, users that made changes to that data, institution history, and timestamps must be kept in the internal data storage.
Provenance of each element must be captured as well, including across merges and splits of identity constellations.
The application must avoid storing mixed markup as much as possible. (Brad Westbrook suggests we avoid mixed markup.)
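A minimal sketch of how the append-only versioning described above might be laid out (illustrative names only; the real schema is specified elsewhere):
```
-- Every save inserts a new row; nothing is updated in place. The newest
-- version of a record is the row with the highest version_id for its cpf_id.
create table version (
    version_id serial primary key,
    cpf_id     int not null,                      -- the identity constellation this row belongs to
    user_id    int not null,                      -- who made the change
    edit_time  timestamp not null default now(),
    note       text                               -- optional provenance / change note
);
```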
## Captured actions on data
Prior to human edits, merged records can be algorithmically split by the computer, assuming we write code to
perform such a split. After human edit, a split must be performed by a human. It is a requirement that all
previous versions can be viewed (read-only) during the human-mediated split operation so the human can refer
back to previous information.
After human edits, rollback only applies to human edited versions. There is a fire-break where rollback cannot
cross from human edits back to machine-merged descriptions. The policy group needs to supply policy
requirements for the tech folks to implement.
The broad requirements for the application are: edit data, split records, merge records. Secondary features to
make the system useful include: work flow enforcement, search, reporting (including "watch" features),
administration, authorization (data privileges).
# Licensing and Copyright
The documentation and code generated by the SNAC Cooperative must have license files and text associated with them.
* [Documentation](#documentation)
* [Code](#code)
## Documentation
All documentation must be assigned the Creative Commons Zero (CC0) license. Its text is below:
```
CC0 1.0 Universal
Statement of Purpose
The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator and
subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").
Certain owners wish to permanently relinquish those rights to a Work for the
purpose of contributing to a commons of creative, cultural and scientific
works ("Commons") that the public can reliably and without fear of later
claims of infringement build upon, modify, incorporate in other works, reuse
and redistribute as freely as possible in any form whatsoever and for any
purposes, including without limitation commercial purposes. These owners may
contribute to the Commons to promote the ideal of a free culture and the
further production of creative, cultural and scientific works, or to gain
reputation or greater distribution for their Work in part through the use and
efforts of others.
For these and/or other purposes and motivations, and without any expectation
of additional consideration or compensation, the person associating CC0 with a
Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
and publicly distribute the Work under its terms, with knowledge of his or her
Copyright and Related Rights in the Work and the meaning and intended legal
effect of CC0 on those rights.
1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not limited
to, the following:
i. the right to reproduce, adapt, distribute, perform, display, communicate,
and translate a Work;
ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or likeness
depicted in a Work;
iv. rights protecting against unfair competition in regards to a Work,
subject to the limitations in paragraph 4(a), below;
v. rights protecting the extraction, dissemination, use and reuse of data in
a Work;
vi. database rights (such as those arising under Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal
protection of databases, and under any national implementation thereof,
including any amended or successor version of such directive); and
vii. other similar, equivalent or corresponding rights throughout the world
based on applicable law or treaty, and any national implementations thereof.
2. Waiver. To the greatest extent permitted by, but not in contravention of,
applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
and Related Rights and associated claims and causes of action, whether now
known or unknown (including existing as well as future claims and causes of
action), in the Work (i) in all territories worldwide, (ii) for the maximum
duration provided by applicable law or treaty (including future time
extensions), (iii) in any current or future medium and for any number of
copies, and (iv) for any purpose whatsoever, including without limitation
commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
the Waiver for the benefit of each member of the public at large and to the
detriment of Affirmer's heirs and successors, fully intending that such Waiver
shall not be subject to revocation, rescission, cancellation, termination, or
any other legal or equitable action to disrupt the quiet enjoyment of the Work
by the public as contemplated by Affirmer's express Statement of Purpose.
3. Public License Fallback. Should any part of the Waiver for any reason be
judged legally invalid or ineffective under applicable law, then the Waiver
shall be preserved to the maximum extent permitted taking into account
Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
is so judged Affirmer hereby grants to each affected person a royalty-free,
non transferable, non sublicensable, non exclusive, irrevocable and
unconditional license to exercise Affirmer's Copyright and Related Rights in
the Work (i) in all territories worldwide, (ii) for the maximum duration
provided by applicable law or treaty (including future time extensions), (iii)
in any current or future medium and for any number of copies, and (iv) for any
purpose whatsoever, including without limitation commercial, advertising or
promotional purposes (the "License"). The License shall be deemed effective as
of the date CC0 was applied by Affirmer to the Work. Should any part of the
License for any reason be judged legally invalid or ineffective under
applicable law, such partial invalidity or ineffectiveness shall not
invalidate the remainder of the License, and in such case Affirmer hereby
affirms that he or she will not (i) exercise any of his or her remaining
Copyright and Related Rights in the Work or (ii) assert any associated claims
and causes of action with respect to the Work, in either case contrary to
Affirmer's express Statement of Purpose.
4. Limitations and Disclaimers.
a. No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document.
b. Affirmer offers the Work as-is and makes no representations or warranties
of any kind concerning the Work, express, implied, statutory or otherwise,
including without limitation warranties of title, merchantability, fitness
for a particular purpose, non infringement, or the absence of latent or
other defects, accuracy, or the present or absence of errors, whether or not
discoverable, all to the greatest extent permissible under applicable law.
c. Affirmer disclaims responsibility for clearing rights of other persons
that may apply to the Work or any use thereof, including without limitation
any person's Copyright and Related Rights in the Work. Further, Affirmer
disclaims responsibility for obtaining any necessary consents, permissions
or other rights required for any use of the Work.
d. Affirmer understands and acknowledges that Creative Commons is not a
party to this document and has no duty or obligation with respect to this
CC0 or use of the Work.
For more information, please see
<http://creativecommons.org/publicdomain/zero/1.0/>
```
## Code
All code must be assigned the BSD 3-Clause license, including the copyright header for the Rector and Visitors of the University of Virginia, and
the Regents of the University of California, as printed in the text below:
```
Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
the Regents of the University of California
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```
# Required New Features
The majority of new features will be in two areas: the maintenance
system, and the administration system. None of this code exists. The
maintenance system has a web UI and a server-based back end that
interacts with the same database used by the match-merge. The
maintenance system also requires an authentication system (login) that
allows us to manage the extensive collaborative efforts. The current
processing of data is accomplished only on servers at the command line,
and is handled directly by project programmers. In the new maintenance
system, that will be driven by content experts via a web site, and
therefore must expect the issues of authentication and authorization
inherent in collaborative data manipulation web applications.
The system will require reports. These will cover broad classes of
issues related to managing resources, usage statistics, administration,
maintenance, and some reports for end user researchers.
- Web application (architect: Robbie)
The web application is a wrapper for all the APIs. It can have an API of its own, or not. It handles all http
requests, validating the data, deciding what needs to be done, doing real work, and handing some output back
to the user. Typically the output is HTML, but we are already planning for file downloads, and JSON data as
output from REST API calls.
- Data validation API
Data from the web browser needs sanity checking and untainting before being handed to the rest of the
application. Initially the data validation API can consist of nothing more than untainting input from the
browser. We can add various checks and tests. We need to decide if the validation API can reject data, and if
it can, then it needs to interact with the work flow engine, the actual work flow, and whatever messaging
system we use to display messages to end users.
- Identity Reconciliation (aka IR) (architect: Robbie)
This API uses many aspects of identity, testing each against a target population of other identities. The
final answer is a floating point number giving a match strength. IR has two modes of operation. Mode one
compares two identities and returns a match strength. Mode two compares a single identity against the entire
database returning match strength. Mode two is somewhat unclear.
- workflow manager (Tom)
Every action the application can perform is part of the work flow. The names of these actions along with names
of their requisites are organized into a work flow table. The work flow engine does not know how to do real
work, but it does know the names of the functions which do the real work. A new feature (aka function, task)
is added to the application by adding its name to the work flow and creating a function of the same name in
the application. Likewise, requisites are determined by boolean functions, and every requisite must have a
matching function known to the work flow engine. The work flow enforces role-based behavior by testing the
requisites. The workflow engine exists, but needs to be ported from Perl to PHP, and the work flow data should
be stored in the SQL database (a rough sketch of such a table follows).
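A rough sketch of what that work flow table could look like in SQL (placeholder names, not the ported engine's actual schema):
```
-- Each row names an action the application can perform and the boolean
-- requisite function that must return true before the action is allowed.
create table work_flow (
    wf_id          serial primary key,
    action_name    text not null,  -- matches a worker function name in the application
    requisite_name text,           -- matches a boolean test function known to the engine
    next_action    text            -- optional follow-on action
);
```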
- Support for work history and task staging.
Editing consists of several stages of work that may be performed by different people and/or different
roles. We need database tables to support saving of work state data. Create a prototype table schema so we can
think about this problem and create a functional spec.
For an edit we need the CPF id, user id, timedate stamp, bitfield or work flow tags, optional user notes. For
search we need: user id, search string, timedate stamp.
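A prototype along the lines described above might start as two small tables (column names are only suggestions for discussion):
```
-- One row per edit in progress: which constellation, who, when, where in the
-- work flow they stopped, plus optional free-text notes.
create table edit_state (
    edit_id   serial primary key,
    cpf_id    int not null,
    user_id   int not null,
    edit_time timestamp not null default now(),
    wf_tags   bigint,      -- bitfield or work flow tags
    user_note text
);

-- Saved searches, so a user can resume where they left off.
create table search_history (
    search_id   serial primary key,
    user_id     int not null,
    search_text text not null,
    search_time timestamp not null default now()
);
```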
- SQL schema (Robbie, Tom)
All data is stored in a SQL database. Details are given elsewhere.
- Controlled vocabulary subsystem or API [Tag system](#controlled-vocabularies-and-tag-system)
We need controlled vocabulary for several data fields. This system handles all aspects of all controlled vocabularies.
- CPF to SQL parser (Robbie)
The input for the application is CPF files. These files need to be parsed into data fields and input into the
SQL database. This application exists, but needs some additional functionality.
- Name serialization tool, selectable pre-configured formats
Outputting name strings based on name data fields in the database is a tricky problem. There are several
output formats. The name serialization deals with this issue.
- Name string parser
Names in CPF files are currently strings. The CPF `<part>` element has been imported into the SQL database as a
string, but data needs require individual name components. Parsing names is a tricky problem, but several
parsers exist. We need to integrate one or more parsers, and perhaps tweak those parsers to handle the SNAC names.
- Date parser
We have several date parsers, but none are fully comprehensive. We can use the existing parsers, but they need
to be integrated into a single, comprehensive parser.
- CPF record edit, edit each field
Record editing on the server is handled by a collection of functions. The specifications for this may evolve
in parallel to the code. We know that each field needs to be changed, but the details of work flow and data
validation have not been determined. Work flow and validation are both likely to change as the SNAC policies
evolve. There are UI requirements for editing.
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
Record splitting requires a set of functions and UI requirements documented elsewhere.
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
Record merge requires a set of functions and UI requirements documented elsewhere.
- Object architecture, coding style, class template (architect Robbie)
We will have a specific architecture of the web application, and of the classes and objects involved.
- UI widgets, mostly off the shelf, some custom written. We need to have UI edit/chooser widget for search and
select of large numbers of options, such as the Joseph Henry cpfRelations that contain some 22K
entries. Also need to list all fields which might have large numbers of values. In fact, part of the meta
data for every field is "number of possible entries/repeat values" or whatever that's called. From a
software architecture perspective, the answer is 0, 1, infinite.
One important aspect of the project is long-term viability and preservation. We should be able to export all
data and metadata in standard formats. Part of the API should cover export facilities so that over time we can
easily add new export features to support emerging standards.
The ability to export all the data for preservation purposes also gives us the ability to offer bulk data
downloads to researchers and collaborating peer institutions.
# Requirements Documents
These documents describe the functionality desired of the system. These should be high-level requirements, geared toward the policy side, of the form "The system should do X."
# Software Development Process
Development on the SNAC web application should use agile development practices, with the shortest reasonable sprint size. See [scrum documentation](http://scrummethodology.com/scrum-sprint/) for more detailed information about agile development methods. Test-driven development should also be employed to automate testing and interconnect testing with the development process.
The git version control system should be used as the repository for code in the application. It allows distributed editing with highly-configurable branching of development, a "blame" system that allows viewing which developer added a specific line of code, and is cross-platform. It is also supported by [gitlab](http://gitlab.iath.virginia.edu), which should be used for internal development timelines, milestones, bug- and issue-tracking, and project management. Final versions of the repositories may then be pushed to the public-facing [github](https://github.com/snac-cooperative) repositories.
## General Discussion Notes
Choices for programming languages, operating system, databases, version
control, and various related tools and practices are based on extensive
experience of the developer community, and a complex set of requirements
for the coding process. Current best practices are agile development
using practices that allow programmers wide leeway for implementation
while still keeping the processes manageable.
Test-driven development ideally means automated testing, with careful
attention to regression testing. It takes some extra time up front to
write the tests. Each test is small, and corresponds to small sections
of code where both code and tests can be quickly created. In this way,
the software is kept in a working state with only brief downtimes during
feature creation or bug fixes. Large programs are made up of
intentionally small functions each of which is tested by a small
automated test.
Regression testing refers to verifying that old bugs do not reappear.
Every bug fix has a corresponding test, even if the function in question
did not originally have a test for the bug. Each new bug needs a new
test. Bugs frequently reappear, especially in complex sections of code.
Source code version control is vital to both the development process and to
the release process. During development, frequent small changes are
checked-in to the version control, along with a meaningful comment. The
history of the code can be tracked. This occasionally helps to
understand how bugs come into existence. In the Git system, the history
command is “blame”, a bit of programmer dark humor where the history is
used to know who to blame for a bug (or any undesirable feature).
Moving code into Quality Assurance (QA) and then into the production
environment is integral to source code management. Many version
control systems allow tagging a release with a name. The collected
source code files are marked as a named (virtual) collection, and can be
used to update a QA area. Human testing and review happen in QA. After
QA we have release. Depending on the nature of the system, a release can be
quite complex, with many parties needing to be notified, and coordination
across groups of developers, sysadmin, managers, support staff, and
customers. Agile development tends towards small, seamless releases on a
frequent (weekly or monthly) basis where communication is primarily via
update of electronic documentation. The process needs to assure that
fixes and new features are documented. The system must have tools to see
the current version of the system with its change log, as well as
comparing that to previous releases. All of these are integrated with
change management.
Bug reporting and feature requests fall (broadly speaking) into the
category of change management. Typically a small group of senior
developers and stakeholders review the bug/feature tracking system to
assign priorities, clarify, and investigate. There are good
off-the-shelf systems for tracking bugs and feature requests, so we have
several choices. This process happens almost as frequently as the
features/bug fix coding work of the developers. That means on-going,
more or less continuous review of fix/features requests every few days,
depending on how independent the developers are. Agile applies to
everyone on the project. Ideal change management is not onerous. As
tasks are completed, someone (developers) updates feature status with "in
progress", "completed", and so on. There might be additional status
updates from QA and release, but SNAC probably isn't large enough to
justify anything too complex.
#### QA and Related Tests for Test-driven Development
The data extraction pipelines manage massive amounts of data, and
visually checking descriptions for bugs would be inefficient if not
infeasible. The MARC extraction process is verified by just over 100
quality assurance descriptions. The output produced from each
description is checked for some specific value that confirms that the
code is working correctly and historical bugs have not reappeared. The
EAD extraction has a set of QA files, but the output verification is not
yet automated. A variety of file counts and measures of various sorts
are performed to verify that descriptions have all been processed. All
CPF output is validated against the Relax NG schema. Processing log
files are checked for a variety of error messages. Settings used for
each run are recorded in documentation maintained with the output files.
The source code is stored in a Subversion repository.
Our disaster recovery processes must be carefully documented.
# User Documentation
Every aspect of the system requires documentation. Most visible to the public is the user interface for
discovery. Maintenance will be complicated, and our processes are somewhat novel, so this will need to be
extensive, well illustrated with screenshots, and carefully tested.
Documentation intended for developers might be somewhat sparse by comparison, but will be critical to the
on-going software development process. All the databases, operating system, httpd and other servers need
complete documentation of installation, configuration, deployment, starting, stopping, and emergency
procedures.
# User Interface Requirements
## Web Application
Some aspects of the web app aren't yet clear, so there are details to be worked out, and some large-ish
concepts to clarify. I'm guessing we will agree on most things, and one of us or the other will just concede
on stuff where we don't agree.
Requirements:
- expose an http accessible API that is viable for `wget` or `curl`, browser `<form>`, and Ajax calls.
- Supported input format depends on the complexity of the requested operation.
- Public functions require no authentication. Everything else must include authentication data.
- Sandbox functionality for training and testing, which doesn't modify actual SNAC data
### Web application output via template
A well known, easy, powerful method of creating presentation output is to use a template module. Templating
separates business logic from presentation logic, thus following an MVC model. Our business logic is our work
flow and related function calls. Presentation is our UI, and the work flow engine has no idea that a UI exists,
let alone how to create it. Curiously, the presentation logic knows how to create the presentation rendering,
but has no idea what it does or what it interacts with. This is another example of strong separation of
concerns.
A simple hello world text template with a single variable world = "world" would be:
```
Hello [% world %]!
```
Or a simple HTML version:
```
<html><body>Hello [% world %]!</body></html>
```
That example is based on the Template Toolkit http://www.template-toolkit.org/ for which there is a Perl
module, and a Python module. Template modules are fairly common, so I'm almost certain we will have several to
choose from in PHP.
Choosing our own select software modules, including a template module, is better than being locked into a
large, cumbersome web framework. In general, web frameworks have issues:
- difficult to work with
- no useful functionality that isn't more easily found in another software module
- they often break MVC
- generally make debugging nearly impossible
We can do much better by selecting a few modules to create a lightweight quasi-framework that is perfectly matched to our
needs.
Once the internal API completes its work, we will have output data. Output data is passed to a rendering
layer that relies on the template module. The only code that knows anything about rendering is the rendering
layer. To all the non-rendering code, there is only "output data" which does conform to a standard structure
(almost certainly an output data object). The rendering layer takes the output object, and the requested format
of the output (text, html, pdf, xml, etc.) to create the output. Happily, "rendering" is generally a single
function call. We create a template object and call its "render" method with two arguments:
1. template file name,
2. the output data object.
Default behavior is to write the output to stdout, but the render method can also
return the output in a variable so we can create an http download.
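As a concrete sketch of that call, here is roughly what the rendering layer could look like in PHP, assuming we settle on Twig (one of several PHP template modules); the template file name and the output data below are purely illustrative:
```php
<?php
// Minimal rendering-layer sketch, assuming Twig is the template module we
// choose. Any comparable engine would look much the same. File names and
// data below are illustrative only.
require 'vendor/autoload.php';

$loader = new \Twig\Loader\FilesystemLoader('templates');
$twig   = new \Twig\Environment($loader);

// The output data handed to the rendering layer by the internal API; the
// renderer knows its structure, not where it came from.
$outputData = ['world' => 'world'];

// "Rendering" is a single call: template file name plus the output data.
// render() returns a string, so we can echo it or send it as a download.
echo $twig->render('hello.html', $outputData);
```
In Twig's syntax the placeholder would be written `Hello {{ world }}!` rather than `[% world %]`, but the idea is identical.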
Templates are human-created static files containing placeholders. The template engine fills in the placeholders with
values from relevant parts of the output data. Clearly, the output data object and the template must share an
object/property naming convention. The template engine supports single-value fields, looping over
input lists, and if-statement branching based on input. But that's pretty much it. No work is done in the
template that is not directly concerned with filling in placeholders, not even formatting (in the sense of
rounding numbers, capitalizing strings, or adding html tags). Templates are valid documents of the output
type, except in rare cases. The attached template is well-formed XML.
The web app needs a file download output option as well as output to stdout.
### Watching records
Users may "watch" an identity constellation. If a constellation is being watched, and that constellation is part of an description (merged or
single) then the watch will apply to the results of human edits, regardless of which part of the description
was modified. It is possible for someone to wish to track a biogHist, but that biogHist could be completely
removed and replaced by an improved and updated description. We will not track individual elements in CPF.
The watcher should have the ability to disable their watch. After each edit, all
watchers will get a notification. The watch does not apply to any single field, but to the entire description, and therefore also to future descriptions which result from merging.
When an identity constellation is split, the watch propagates to both resulting records. The user will be informed of the change, and may then choose to disable either of the resulting watches.
### Ability to Open/Close the Site during Maintenance
If the web application has a "closed for maintenance" feature, this feature would be available to web admins,
even though it is the Linux sysadmins who will do the maintenance. A common major failure of web applications
is the assumption that the product is always up. This creates havoc when the site simply fails to load due to
an outage, planned or otherwise. With a little work we should be able to have an orderly "site is closed" web
page and status message for planned outages. We might be able to failover to some kind of system status
message. This is a low priority feature since downtime is probably only a few hours per year. At the same
time, if it isn't too difficult to implement, it sets our project apart from the majority who either ignore
the problem, or let their help desk folks spend an hour apologizing to customers.
When the product is closed, web admins should be able to log in (assuming login is possible).
comment: Do we want an architecture where the login is essentially a separate product so that we can have a
"lobby" and other front end features that continue to work even when the backend is down for maintenance?
Most sites simply return a server error (500 or 503) or a generic "site not available" page when the site is down for whatever
reason. We can avoid this in a couple of ways. The simplest is to use some Apache server features and a few
simple scripts so that users see a nice message when the site is down for maintenance. This very simple
approach requires little or no change to our software architecture. The more elegant approach is to use one of
several system architectures that keep a small system front end always running.
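For the simple approach, a front controller could check a maintenance flag before doing any real work. A rough PHP sketch (the flag file path, message, and web-admin bypass are all illustrative, not decided):
```php
<?php
// Rough sketch of the "few simple scripts" approach. The flag file path,
// the message, and the web-admin bypass are illustrative placeholders.
session_start();
$maintenanceFlag = '/var/www/snac/MAINTENANCE';

if (file_exists($maintenanceFlag) && empty($_SESSION['isWebAdmin'])) {
    header('HTTP/1.1 503 Service Unavailable');
    header('Retry-After: 3600'); // hint to clients and crawlers
    echo '<html><body><h1>SNAC is closed for scheduled maintenance.</h1><p>'
       . htmlspecialchars(file_get_contents($maintenanceFlag))
       . '</p></body></html>';
    exit;
}
// ...normal request handling continues here...
```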
# User Management
Authentication is validating user logins to the system. Authorization is the related aspect of controlling
which parts of the system users may access (or even which parts they may know exist).
We can use OpenID for authentication, but we will need a user profile for SNAC roles and authorization. There
are examples of PHP code to implement OpenID at stackexchange:
http://stackoverflow.com/questions/4459509/how-to-use-open-id-as-login-system
OpenID seems to be constantly changing, and sites using it change frequently. Google has (apparently) deprecated
OpenID 2.0 in favor of OpenID Connect. Facebook is using something else, but apparently FB still works with
OpenID. Stackexchange supports several authentication schemes. If they can do it, so can we. Or we can support
one scheme for starters and add others as necessary. The SE code is not open source, so we can't see how much
work it was to support the various OpenID partners.
Authorization involves controlling what users can do once they are in the system. That function is partly
addressed by OAuth or OpenID through the shared user profile. However, SNAC has specific requirements,
especially our roles, and those will not be found in other systems. There is nothing we must have from
user profiles. We might want their social networking profile, but social networking is not a core function of
SNAC.
By default users can't do anything that isn't exposed to the non-authenticated public users. Privileges are
added and users are given roles (aka groups) from which they inherit privileges. The authorization system is
involved in every transaction with the server to the extent that every request to the server is checked for
authorization before being passed to the code doing the real work.
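A minimal sketch of that per-request check in PHP (the role names, privileges, and map below are invented for illustration; the real lists live in the user and role tables):
```php
<?php
// Hypothetical privilege check run before any request reaches the internal API.
function hasPrivilege(array $userRoles, string $privilege, array $rolePrivileges): bool {
    foreach ($userRoles as $role) {
        if (in_array($privilege, $rolePrivileges[$role] ?? [], true)) {
            return true;
        }
    }
    return false;
}

// Illustrative role-to-privilege map and request guard.
$rolePrivileges = [
    'Researcher'  => ['search', 'view', 'searchHistory'],
    'Maintenance' => ['search', 'view', 'edit'],
];
$currentUser        = ['roles' => ['Researcher']];
$requestedPrivilege = 'edit';

if (!hasPrivilege($currentUser['roles'], $requestedPrivilege, $rolePrivileges)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Not authorized');
}
// ...only now is the request passed to the code doing the real work...
```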
The Linux model of three privilege types "user", "group", and "other" works well for authorization permissions
and we should use this model. "User" is an authenticated user. "Group" is a set of users, and a user may
belong to several groups. In SNAC and the non-Linux world "group" is known as "role", so SNAC will call them
"roles". "Other" privileges correspond to SNAC's public, non-authenticated users, although we don't really have
an "other" category; instead, the "researcher" role applies to public users.
Users can have several roles, and will have all the privileges of all their roles. Role membership is managed
by an administrative UI (part of the dashboard) and related API code. User information such as name, phone
number, and even password can also change. User ID values cannot be changed, and a user ID is never reused,
even after account deletion.
We expect to create additional roles as necessary for application functions.
Roles include a large number of "is institution member" roles. These should be roles like any other, but we may
want to flag these role records to make them easy to manage and easy to display in the UI. Any user can have
zero or more roles that define their institutional affiliation. This primarily affects reporting and admin. In
the case of reports, membership in an institution constrains the reporting. When setting up a report, users
may only choose from institutions of which they are members. Some reports may auto-detect the user's
membership.
By and large when we refer to "accounts" we mean web accounts managed by the Manager/Web admin. The general
public can use the discovery interface without an account, but saving search history, and other
session related discovery tools requires an account. It is technically possible to have a single session
dashboard. Although that has not been mentioned as a requirement and is probably a low priority, it might be
almost trivial to implement.
Every account will be in the "Researcher" role, which has the same privileges as the general public plus a
TBD set of basic privileges including search history and certain researcher reports.
| User type | Role | Description |
|----------------------------|---------------------|-----------------------------------------------------------------------|
| Sysadmin | Server admin | Maintain server, backups, etc. |
| Database Administrator | DBA | Schema maintenance, data dumps, etc. |
| Software engineer | Developer | Coding, testing, QA, release management, data loading, etc. |
| Manager | Web admin | Web accounts: create, manage, assign roles, run reports |
| Peer vetting | Vetting | Approve moderators, reviewers, content experts |
| Moderator | Moderator | Approve maintenance changes, posting those changes |
| Reviewer/editor | Maintenance | Maintainer privileges, interacts with moderators |
| Content expert | Maintenance | Domain expert, may have zero institutional roles |
| Documentary editor | Maintenance | Distinguished by? |
| Maintenance | Maintenance | Distinguished by? |
| Researcher | Researcher | Use the discovery interface and history dashboard |
| Archival description donor | Block upload | Bulk uploads of CPF or finding aids |
| Name authority manager | Name authority | Donates name authority data perhaps via bulk upload |
| Institutional admins       | Institutional admin | Institutional role admin dashboard, institutional reports             |
| Public | Researcher | No account, researcher role, no dashboard or single session dashboard |
Remember: institutional affiliation roles aren't in the table above. There will be many of those roles, and
users may have zero, one, or several institutional roles that define which institutions that user is a member
of.
It is possible for an institutional admin to be a member of more than one institution. Institutional Admins
have abilities:
- view membership lists of their institution(s)
- add or remove their institutional role for users.
Roles which require one or more institutional roles (affiliation):
- Block upload
- Name authority
- Institutional admin
Roles which may have zero or more institutional roles:
- Web admin
- Vetting
- Moderator
- Maintenance (likely to have one or more)
- Researcher
There are several dashboard sections:
- Standard researcher history
- Standard user account management (password, email, etc.)
- Web admin account creation, deletion, role assignments
- Vetting admin (if we have vetting)
- Available reports.
# Coding Style
All code generated by the SNAC project will be written in one of the following languages.
* PHP 7 (preferred)
* PHP 5
* Java
* XSLT
## Coding Style Specifications
Source code must match the following style guidelines:
* 4-space indentation using literal spaces (no tab characters)
* Maximum line-length of 100 characters
* Variables and Class names follow standard camel casing syntax, with descriptive names
* Class names start with upper-case letters
* Variable and field names start with lower-case letters
* No underscores allowed in variable names
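A tiny illustration of these rules (the class, method, and variable names are invented for the example):
```php
<?php
// Upper-case class name, camelCase members, no underscores,
// 4-space indentation with literal spaces.
class NameIndexBuilder {
    private $maxEntries = 100;

    public function buildIndexFor($identityList) {
        $sortedNames = [];
        foreach ($identityList as $identity) {
            $sortedNames[] = $identity;
        }
        return array_slice($sortedNames, 0, $this->maxEntries);
    }
}
```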
## Internal Documentation of Code
All code will be internally-documented using [Javadoc](http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html) style documentation, which has been ported to PHP as [phpdoc](http://www.phpdoc.org/docs/latest/guides/docblocks.html) and XSLT as [XSLTdoc](http://www.pnp-software.com/XSLTdoc/). Tools to generate documentation from the code are also available for [Java](http://www.oracle.com/technetwork/java/javase/documentation/index-jsp-135444.html), [PHP](http://www.phpdoc.org/), and [XSLT](http://www.pnp-software.com/XSLTdoc/).
* All files, regardless of language, must have javadoc-style documentation with author attribution, definition of the file, and short-text of the code license, as defined below (in PHP):
```php
<?php
/**
 * File Description Headline
 *
 * Paragraphs describing the file
 *
 * License:
 * ....
 *
 * @author Robbie Hott
 * @license http://opensource.org/licenses/BSD-3-Clause BSD 3-Clause
 * @copyright 2015 the Rector and Visitors of the University of Virginia, and the Regents of the University of California
 */
?>
```
* All classes, fields, methods, and function definitions must include documentation, as shown below:
```php
<?php
/**
 * Name Reconciliation Engine Main Class
 *
 * This class provides the meat of the reconciliation engine. To run the
 * reconciliation engine, create an instance of this class and call the
 * reconcile method.
 *
 * @author Robbie Hott
 */
class ReconciliationEngine {

    /**
     * Main reconciliation function
     *
     * This function does the reconciliation and returns the top identity from
     * the engine. Other top identities and their corresponding score vectors
     * may be obtained by other functions within this class.
     *
     * @param identity $identity The identity to be searched. This identity
     *                           must be in the proper form
     * @return identity The top identity by the reconciliation engine
     */
    public function reconcile($identity) {
        return $identity;
    }
}
?>
```
## Licensing in Github/Gitlab
Each code repository must contain the full BSD 3-Clause license below. It must be saved in the document root as a text file titled `LICENSE`.
```
Copyright (c) 2015, the Rector and Visitors of the University of Virginia, and
the Regents of the University of California
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```
# Formal Specification Documents
These documents describe the specifications of the system. They include specific decisions for each component and the system as a whole, in order to meet the requirements listed in the [Requirements](/Requirements) section.
# SNAC Server Architecture
The system will be architected as a LAMP system, with the following components:
* Linux: CentOS 7
* Apache: Apache 2 web server
* PHP: PHP 7
* PostgreSQL: Postgres
Each component of the architecture will run on this platform. Any sub-component must either provide its own HTTP server on an available port, as Elasticsearch does, or utilize the main Apache web server via a virtual host.
The following diagrams describe the architecture of internal components:
* ![Overall Server Architecture](http://gitlab.iath.virginia.edu/snac/Documentation/raw/master/Specifications/Originals/SNAC%20Server%20Architecture.svg)
# Third-Party Documentation
This directory contains documentation and links to off-the-shelf components used by the SNAC project.
- check into Apache httpd and http/2 as well as supporting Opportunistic encryption:
* [ArsTechnica: new firefox version says might as well to encrypting all web traffic](http://arstechnica.com/security/2015/04/new-firefox-version-says-might-as-well-to-encrypting-all-web-traffic/)
#### Brian's API docs need to be merged in or otherwise referred to:
[https://gist.github.com/tingletech/4a3fc5f59e5af3054286](https://gist.github.com/tingletech/4a3fc5f59e5af3054286)
#### Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
Discuss. What is "as it is configured now"? Consider implementing linked data standard for relationship links
instead of having to download an entire document of links (as it is configured now.)
Discuss. This seems to be the controlled vocabulary issue. Sort by common subject headings across all of SNAC - right now SNAC has
subject headings that have been applied locally without common practice
across the entire corpus.
We probably need to build our own holdings authority.
We need to write code to get accurate holdings info from WorldCat records. All the other repositories will
have to be handled on a case-by-case basis. Sort by holdings location. Sort by identity's activity location. Sort
and visualize a person through time (show dates for events in a person or organization's lifetime). Sort and
visualize an agency or organization as it changes over time.
Continue to develop and refine context widget.
Sort collection links. Add weighting to understand which collections have more material directly related to
identity. (How is this best handled programmatically or as an input by contributors- maybe both?).
Increase exposure of SNAC to general public by leveraging partnerships. Suggested agreement with Wikipedia to
display Wikipedia content in SNAC biographical area and work with Wikipedia to allow for links to SNAC at the
bottom of all applicable identities. This would serve to escalate and drive traffic to SNAC.
#### Expanded Database Schema
The database schema has been rewritten to capture all the data in CPF files, as well as meet the various data requirements.
Each field within CPF may (will?) need provenance meta data. Likewise many fields in the database may need
data for provenance. This has not been done, and the developers need policy on provenance, as well as
examples. There seems to be little or no mention of provenance in Rachael's UI requirements.
The new schema has full versions of all records for all time. If not implemented, this is planned. The version
table records the table name, record id, the id of the user who made the change, and a timestamp. No changes were made to
existing tables, although existing tables may have gotten a field to distinguish old from current
records. The implementation may change.
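As a sketch of the idea (the table and column names below are hypothetical, since the schema is still settling), every modification would write one row to the version table:
```php
<?php
// Hypothetical example: record a version row for each modification.
// Table and column names are placeholders, not the final schema.
$db = pg_connect('dbname=snac');
pg_query_params(
    $db,
    'INSERT INTO version (table_name, record_id, user_id, modified)
     VALUES ($1, $2, $3, now())',
    ['name_entry', 1234, 42]
);
```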
Every record has a unique id. The watch system is a query run on some schedule (daily, hourly, ?) that checks
to see if a watched record has changed. Each CPF record links to a “watch” table so users can watch each
record, and can watch for certain types of changes. Need UI for the watch system. Need an API for the watch
system.
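A hypothetical sketch of that scheduled check, joining the watch table to the version history to find changes since the watcher was last notified (all names are placeholders):
```php
<?php
// Hypothetical scheduled watch check; table and column names are placeholders.
$db = pg_connect('dbname=snac');
$result = pg_query(
    $db,
    'SELECT w.user_id, v.table_name, v.record_id, v.modified
       FROM watch w
       JOIN version v ON v.record_id = w.record_id
      WHERE v.modified > w.last_notified'
);
while ($row = pg_fetch_assoc($result)) {
    // queue a notification for $row['user_id'] about $row['record_id']
}
```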
Need a user table, a group (role) table, and probably a group-permission table linking permissions to
groups. We also want to allow several permissions per group. Need UI for user, group, and
group-permission management.
We have created a generalized workflow system (as opposed to an ad-hoc linked set of reports). There is a work
flow state table which needs to be moved into the database.
Need fields to deal with delete/embargo. This may be best implemented via a trigger or perhaps a view. By
making what appear to be simple SELECTs through a view, the view can exclude deleted records. We must think
about how using a view (or trigger) will affect UPDATE and INSERT. Ideally the view is transparent. Is there
some clever way we can restrict access to the original table only via the view?
Need record lock on some types of records. This lock needs to be honored by several modules, so like “delete”,
lock might best be implemented via a view and we \*only\* access the table in question via the view.
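A sketch of the view idea under those assumptions (table, view, and column names are illustrative): the application only ever reads through the view, which silently excludes deleted or otherwise hidden rows.
```php
<?php
// Illustrative only: hide marked-deleted rows behind a view and read from
// the view instead of the base table.
$db = pg_connect('dbname=snac');
pg_query($db,
    'CREATE VIEW name_entry_live AS
        SELECT * FROM name_entry
         WHERE NOT is_deleted'
);
// Application code then SELECTs only from the view:
$res = pg_query($db, 'SELECT * FROM name_entry_live WHERE record_id = 1234');
```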
If there are different levels of review for different elements in the record, then we need extra granularity
in the workflow or the edited record info to know the type of record edited apropos of workflow variations.
If there are different reviewers for different parts of the record, then workflow data (and workflow
configuration) needs to be able to notify multiple people, and would have to get multiple reviewer approvals
before moving to the next phase of the workflow.
Institutional affiliation is probably common enough to want a field in the user table, as opposed to creating
a group for each institution. The group is perhaps more generalized and could behave identically (or almost
identically) to a field (with controlled vocabulary) in the user table.
Make sure we can write a query (report) to count numbers of records based on the type of edit, the institution of the
editor, and number of holdings.
If we want to be able to quickly count some CPF element such as outgoing links from CPF to a given
institution, then we should put those CPF values into the SQL database, as meta data for the CPF record.
What is: How many referral links to EAC records that they created?
Be able to count record views, record downloads. Institutional dashboard reports need the ability to group-by
user, or even filter to a specific user.
Reporting needs to help managers verify performance metrics. This assumes that all changes have a
date/timestamp. Once workflow and process decisions are set, performance requirements for users such as
load/performance (how many updates and changes to records can be handled at once), search response time, edit
time (outside of review workflow), and update times need to be set.
Effort reporting to allow SNAC and participants to communicate to others the actual level of effort
involved. This sounds like a report with time span and numbers of records handled in various ways. SNAC might
use this when going from pilot into production so that everyone knows what effort will be required for X
number of records/actions (of whatever action type).
Time/activity reporting could allow us to assess viability, utility, and efficiency of maintenance system
processes.
Similar reports might be generated to evaluate the discovery interface. Something akin to how much time was
required to access a certain number of records. Rachael said: Assess viability of access functionality:
performance time, available features, and ease of use.
We could try to report on the amount of training necessary before a new user was able to work independently in
each of various areas (content input, review, etc.)
# Unsorted Documents
This is a temporary-holding facility for documents as they are parsed and placed into the appropriate sections of the documentation.
Internal flow:
1. validate the inputs.
1. Somehow slice and dice the CGI params of the REST call into an abstracted request we can pass to the
internal API. I suppose that the external and internal APIs are very similar, but we almost certainly need
some level of symbolic reference aka abstraction. Each REST call has its requisite data. Some data is as
simple as a record id, and some will be fairly interesting json data structures.
1. The web app API does the tasks specified by the REST request and the work flow engine's directions.
1. Every http request must go through the work flow engine so that the work flow is validated and managed.
1. Every web app has a work flow, but people mostly just cobble that together with a bunch of implied
functionality using conditionals and side-effect-full function calls. In our code, the internal API is
100% work flow agnostic.
1. I can explain this in more detail, but it makes a huge improvement in the structure of the application.
1. Create the output data object if it wasn't created by the functions doing the work.
1. Pass the output data to a rendering function (or module) to be rendered into the appropriate output format:
html, text, xml, etc. and sent to stdout, or returned as an http file download. JSON probably doesn't need to
be rendered since JSON is "data" and not "presentation".
The work flow engine relies on functions that read application data and return booleans so that the
work flow engine can detect the application's relevant state. I guess that sounds confusing because the work
flow engine has state, and the application has state. Those two types of state are vastly different and only
related to each other in that the work flow engine can detect the application's state. The internal API of the
web app has no idea that the work flow engine even exists. And the work flow engine knows what work needs to
be done, but has no idea how it will be done. This is a very lovely separation of concerns.
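A minimal sketch of those boolean state-detector functions (the names and fields below are invented; the point is that the work flow engine only ever sees true/false answers about application state):
```php
<?php
// Hypothetical state detectors the work flow engine could call.
function recordIsLocked(array $record): bool {
    return !empty($record['lockedBy']);
}

function userMayEdit(array $user): bool {
    return in_array('Maintenance', $user['roles'], true);
}

// The engine asks questions like these to decide the next step; it never
// calls the internal API directly.
$record = ['lockedBy' => null];
$user   = ['roles' => ['Maintenance']];
if (!recordIsLocked($record) && userMayEdit($user)) {
    // the engine directs the web app to offer the "edit" action
}
```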
# Overall Plan
### Table of Contents
* [Big questions](#big-questions)
* [Documents we need to create](#documents-we-need-to-create)
* [Governance and Policies, etc.](#governance-and-policies-etc)
* [Overview and order of work](#overview-and-order-of-work)
* [Non-component notes to be worked into requirements](#non-component-notes-to-be-worked-into-requirements)
* [System Design](#system-design)
* [Developed Components](#developed-components)
* [Off-the-shelf Components](#off-the-shelf-components)
* [Controlled vocabularies and tag system](#controlled-vocabularies-and-tag-system)
### Big questions
- (solved) how is gitlab backed up?
- Shayne backs up the whole gitlab VM.
- We need a complete understanding of our requirements for controlled vocabulary, ontology, and/or tagging as
well as how this relates to search facets. This also impacts our future ability to make assertions about the
data, and is somewhat related to semantic net. See [Tag system](#controlled-vocabularies-and-tag-system).
### Documents we need to create
- Operations and Procedure Manual
- Formal Requirements Document
- Formal Specification Document
- Research Agenda
- User Story Backlog
- Design Documents (UI/UX/Graphic Design)
- ideally someone writes a (possibly brief) style guide
- a set of .psd or other images is not a style guide
### Governance and Policies, etc.
- Data curation, preservation, graceful retirement
- Data expulsion vs. embargo
- Duplicates, backups, restore, related policy and technical issues
- Broad pieces that are missing or underdeveloped [Laura]
- Refresh relationship with OCLC [John, Daniel]
### Overview and order of work
1. List requirements of the overall application. (done)
2. Organize requirements by component, clean up, flesh out, and vet. (requires team meetings)
3. Create formal specifications for each component based on the official, clean, requirements document. Prototypes will help ensure the formal spec is written correctly.
4. Define a timeline for development and prototyping based on the formal specifications document.
5. Create tests for test-driven development based on the formal specification. This includes creating and mining ground-truth data.
6. Develop software based on formal specification that passes the given tests.
### Non-component notes to be worked into requirements
- CPF record edit, edit each field
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
### System Design
#### Developed Components
- Data validation engine
- **API:** Custom JSON (needs formal spec)
- The data validation engine applies a written system of rules to the incoming data. The rules must be written in a human-readable form, such that non-technical individuals are able to write and understand the rules. A rule-writing guide must be supplied to give hints and help for writing rules properly. The engine will be pluggable and written as an MVC application. The model will read the user-written rules (stored in a postgres database or flat-file system, depending on the model) and apply them to any input given on the view. The initial view will be a JSON API, which accepts individual fields and returns the input with either validation suggestions or valid flags. (A minimal rules-as-data sketch appears after this component list.)
- rule based system abstracted out of the code
- rules are data
- change the rules, not the actual application code
- rules for broad classes of data type, or granular rules for individual fields
- probably also use this to untaint data (remove things that are potential security problems)
- send all data through this API
- every rule includes a message describing what went wrong and suggesting fixes
- rules potentially editable by non-programmers
- rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the
policy documentation.
- Identity Reconciliation (aka IR) (architect Robbie)
- **API:** Custom JSON (needs formal spec)
- needs docs wrangled
- workflow manager (architect Tom)
- **API:** Custom JSON? (needs formal spec)
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements to the existing system to ensure we meet the requirements**
- needs to be integrated into an index.php script that also checks authentication
- can the workflow also support the login.php authentication? (Yes).
- PostgreSQL Storage: schema definition (Robbie, Tom)
- **API:** SQL
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
- should we re-architect so that tables become normal tables, the views go away, and versioned records are moved to shadow tables?
- add features for delete-via-mark (as opposed to actual delete)
- add features to support embargo
- *maybe, discuss* change vocabulary.sql from insert to copy. It would be smaller and faster, although in reality as soon as
it is in the database, the text file will never be touched again.
- discuss; Can/should we create a tag system to deal with ad-hoc requirements later in the project? [Tag system](#controlled-vocabularies-and-tag-system)
- CPF to SQL parser (Robbie)
- **API:** EAC-CPF XML input, JSON output? (needs formal spec)
- exists, needs tests, needs requirements
* **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**
- NameEntity serialization tool, selectable pre-configured formats
- NameEntity string parser
- **API:** subroutine? JSON?
- Can we find a grammar-based parser for PHP? Should we use a standalone parser?
- Can we expose this as a JSON API such that it's given a name-string and returns an identity object of that identity's information? Possibly a score as well as to how well we thought we could parse it?
- Name parser (only a portion of the NameEntity string)
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- Date parser
- **API:** subroutine? JSON?
- Can this use the same parser engine as the name string parser?
- **This should be distinct, or may be a subroutine of the nameEntry-string parser**
- Can we expose this as a JSON API such that given a set of dates, it returns a date object that lists out the individual dates? Then, it could be called from the name-string parser to parse out dates.
- Editing User interface
- **API:** HTML front-end, makes calls to internal JSON API
- Must have ajax-backed interaction for displaying and searching large lists, such as cpfRelations for an identity.
- We need to have a UI edit/chooser widget for search and select of large numbers of options, such as the Joseph
Henry cpfRelations that contain some 22K entries. Also need to list all fields which might have large numbers
of values. In fact, part of the meta data for every field is "number of possible entries/repeat values" or
whatever that's called. From a software architecture perspective, the answer is 0, 1, infinite.
- History Research Tool (redefined)
- **API:** HTML front-end, makes calls to internal JSON API
- Needs to be reworked to support the Postgres backend
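As promised in the data validation engine item above, here is a minimal rules-as-data sketch. The rule, field name, and messages are invented; in the real engine the rules would be read from Postgres or a flat file rather than hard-coded.
```php
<?php
// Hypothetical rule set: each rule is data (a test plus a human-readable
// message), not application code.
$rules = [
    'existDates' => [
        'test'    => function ($value) { return preg_match('/^\d{4}(-\d{4})?$/', $value) === 1; },
        'message' => 'Dates should look like 1820 or 1820-1885.',
    ],
];

// Apply whatever rule exists for a field; unknown fields pass through.
function validateField(string $field, $value, array $rules): array {
    if (!isset($rules[$field]) || call_user_func($rules[$field]['test'], $value)) {
        return ['valid' => true];
    }
    return ['valid' => false, 'message' => $rules[$field]['message']];
}

print_r(validateField('existDates', '1820-85', $rules)); // invalid, with suggestion
```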
#### Off-the-shelf Components
- gitlab for developer documentation, code version management, internal issue tracking and milestone keeping
- github public code repository
- test framework, need to choose one
- authentication
- session management, especially as applies to authentication tokens, cookies and something which prevents
XSS (cross-site scripting) attacks
- JavaScript UI component tools, JQuery; what others?
- Suggestions: bootstrap, angular JS, JQueryUI
- reports, probably Jasper
- PHP, Postgres, Linux, Apache httpd, etc.
- language modules ?
#### Controlled vocabularies and tag system
Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
become a flat (non-hierarchical) controlled vocabulary.
The difference is weaker moderation of tags and more readiness to create new tags (types). The tag table
would consist of a tag term and an ID value. Tag systems lack data types, and generally have no policy, or less
restrictive policies, about creating new tags.
Below are some subject examples. It is unclear if these are each topics, or if "--" is used to join granular
topics into a topic list. Likewise it is unclear if this list relies on some explicit hierarchy.
```
American literature--19th century--Periodicals
American literature--20th century--Periodicals
Periodicals
Periodicals--19th century
World politics--Periodicals
World politics--Pictorial works
World politics--Societies, etc.
World politics--Study and teaching
```
**RH: I agree, this is super tricky. We need to distinguish between types of controlled vocab, so that we don't mix Occupation and Subject. A tagging system might be very nice for at least subjects.**
#### All reports
- what records have I edited
- how many records has my institution edited
- how many records has my institution contributed
- list of number of records contributed by institution
plan.md
--------
plan.md Big questions
plan.md Overview and order of work
plan.md Code we write
plan.md Controlled vocabularies and tag system
plan.md Code we use off the shelf
co-op_background.md
-----
Authors
Organization of documentation
Introduction to SNAC
Evaluation of Existing Technical Architecture
Overview
Current State of the System
Processing Pipeline
Extraction
Match/Merge
Discovery/Dissemination
Prototype research tool
Gap analysis
Data maintenance
Pilot phase architecture
Current State Conclusion
introduction.md
--------
TAT Functional Requirements
Introduction to Planned Functionality
Software development, processes, and project management
QA and Related Tests for Test-driven Development
Documentation
Required new features
Web application overview
Web application output via template
Data background
What is "normal form" and what informs the database schema design?
Edit architecture requirements
Expanded CPF schema requirements
Expanded Database Schema
Merge and watch
Brian’s API docs need to be merged in or otherwise referred to:
Not sure where to fit these topics into the requirements. Some of them may not be part of technical requirements:
requirements.md
----
List of requirements
Requirements from Rachael's spreadsheet
List of Application Programmer Interfaces (APIs)
Work flow engine
Maintenance Functionality
Functionality for Discovery
User interface for Discovery
Functionality for Splitting
User interface for Splitting
Functionality for Merging
User interface for Merging
Functionality for Editing
User interface for Editing
Admin Client for Maintenance System
User Management
Web Application Administration
Reports
System Administration
Community Contributions
Ability to Open/Close the Site during Maintenance
Sandbox for Training, perhaps as a clone of the QA system?
ArchiveSpace Feature Planning via Brad
Staffing Model (Brian's draft suggestions)
#### Big questions
- (solved) how is gitlab backed up?
- Shayne backs up the whole gitlab VM.
- We need a complete understanding of our requirements for controlled vocabulary, ontology, and/or tagging as
well as how this relates to search facets. This also impacts our future ability to make assertions about the
data, and is somewhat related to semantic net. See [Tag system](#controlled-vocabularies-and-tag-system).
#### Documents we need to create
- Operations and Procedure Manual
- Research Agenda
- User Story Backlog
- Design Documents (UI/UX/Graphic Design)
- ideally someone writes a (possibly brief) style guide
- a set of .psd or other images is not a style guide
#### Governance and Policies, etc.
- Data curation, preservation, graceful retirement
- Data expulsion vs. embargo
- Duplicates, backups, restore, related policy and technical issues
- Broad pieces that are missing or underdeveloped [Laura]
- Refresh relationship with OCLC [John, Daniel]
#### Overview and order of work
1. create tech documents, filling in as much prose as possible
- currently on-going
1. create prototype software to test tech requirements, iterate updating requirements and prototype
- Work flow engine is working and has both a command-line and web interface
- We have a SQL database schema
1. create tests for test driven development, and validate prototype
1. refactor or rewrite prototype to match requirements
1. create version 1 of software
#### Code we write
- Data validation API
- rule based system abstracted out of the code
- rules are data
- change the rules, not the actual application code
- rules for broad classes of data type, or granular rules for individual fields
- probably also use this to untaint data (remove things that are potential security problems)
- send all data through this API
- every rule includes a message describing what went wrong and suggesting fixes
- rules potentially editable by non-programmers
- rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the
policy documentation.
- Identity Reconciliation (aka IR) (architect Robbie)
- needs docs wrangled
- workflow manager (architect Tom)
- exists, needs tests, needs requirements
- needs to be integrated into an index.php script that also checks authentication
- can the workflow also support the login.php authentication? (Yes).
- SQL schema (Robbie, Tom)
- exists, needs tests, needs requirements
- should we re-architect so that tables become normal tables, the views go away, and versioned records are moved to shadow tables?
- add features for delete-via-mark (as opposed to actual delete)
- add features to support embargo
- *maybe, discuss* change vocabulary.sql from insert to copy. It would be smaller and faster, although in reality as soon as
it is in the database, the text file will never be touched again.
- discuss; Can/should we create a tag system to deal with ad-hoc requirements later in the project? [Tag system](#controlled-vocabularies-and-tag-system)
- CPF to SQL parser (Robbie)
- exists, needs tests, needs requirements
- Name serialization tool, selectable pre-configured formats
- Name string parser
- Can we find a grammar-based parser for PHP? Should we use a standalone parser?
- Date parser
- Can this use the same parser engine as the name string parser?
- CPF record edit, edit each field
- CPF record split, split data into separate cpf identities, deprecate old ARK, mint new ARKs
- CPF record merge, combine fields, deprecate old ARKs, mint new ARK
- coding style, class template (architect Robbie)
- We need to have a UI edit/chooser widget for search and select of large numbers of options, such as the Joseph
Henry cpfRelations that contain some 22K entries. Also need to list all fields which might have large numbers
of values. In fact, part of the meta data for every field is "number of possible entries/repeat values" or
whatever that's called. From a software architecture perspective, the answer is 0, 1, infinite.
#### Controlled vocabularies and tag system
Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
become a flat (non-hierarchical) controlled vocabulary.
The difference is weaker moderation of tags and more readiness to create new tags (types). The tag table
would consist of a tag term and an ID value. Tag systems lack data types, and generally have no policy, or less
restrictive policies, about creating new tags.
Below are some subject examples. It is unclear if these are each topics, or if "--" is used to join granular
topics into a topic list. Likewise it is unclear if this list relies on some explicit hierarchy.
```
American literature--19th century--Periodicals
American literature--20th century--Periodicals
Periodicals
Periodicals--19th century
World politics--Periodicals
World politics--Pictorial works
World politics--Societies, etc.
World politics--Study and teaching
```
#### Code we use off the shelf
- gitlab for docs, code version management, issue tracking(?)
- github public code repository?
- test framework, need to choose one
- authentication
- session management, especially as applies to authentication tokens, cookies and something which prevents
XSS (cross-site scripting) attacks
- JavaScript UI component tools, JQuery; what others?
- reports, probably Jasper
- PHP, Postgres, Linux, Apache httpd, etc.
- language modules
These documents are organized in the following order:
[plan.md](plan.md) Big overview.
[outline.md](outline.md) An outline of sections in the documents
[co-op_background.md](co-op_background.md) Broad expectations for the co-op software.
[introduction.md](introduction.md) Requirements part one
[requirements.md](requirements.md) Requirements part two, includes tech requirements from Rachael's spreadsheets
![SNAC web app API data flow](images/image02.png)