# Overall Plan
Tom Laudeman committed

### Table of Contents

* [Big questions](#big-questions)
* [Documents we need to create](#documents-we-need-to-create)
* [Governance and Policies, etc.](#governance-and-policies-etc)
* [Overview and order of work](#overview-and-order-of-work)
* [Non-component notes to be worked into requirements](#non-component-notes-to-be-worked-into-requirements)
* [System Design](#system-design)
  * [Developed Components](#developed-components)
  * [Off-the-shelf Components](#off-the-shelf-components)
  * [Controlled vocabularies and tag system](#controlled-vocabularies-and-tag-system)

### Big questions

- (solved) how is gitlab backed up?

  - Shayne backs up the whole gitlab VM.

- We need a complete understanding of our requirements for controlled vocabulary, ontology, and/or tagging as
  well as how this relates to search facets. This also impacts our future ability to make assertions about the
  data, and is somewhat related to semantic net. See [Tag system](#controlled-vocabularies-and-tag-system).

### Documents we need to create

- Operations and Procedure Manual

- Formal Requirements Document

- Formal Specification Document

- Research Agenda

- User Story Backlog

- Design Documents (UI/UX/Graphic Design)

  - ideally someone writes a (possibly brief) style guide

  - a set of .psd or other images is not a style guide

### Governance and Policies, etc.

- Data curation, preservation, graceful retirement

- Data expulsion vs. embargo

- Duplicates, backups, restore, related policy and technical issues

- Broad pieces that are missing or underdeveloped [Laura]

- Refresh relationship with OCLC [John, Daniel]

### Overview and order of work

1. List requirements of the overall application. (done)

2. Organize requirements by component, clean up, flesh out, and vet. (requires team meetings)

3. Create formal specifications for each component based on the official, clean requirements document. Prototypes will help ensure the formal spec is written correctly.

4. Define a timeline for development and prototyping based on the formal specifications document.

5. Create tests for test-driven development based on the formal specification.  This includes creating and mining ground-truth data.

6. Develop software, based on the formal specification, that passes the given tests.

### Non-component notes to be worked into requirements
- CPF record edit: edit each field

- CPF record split: split data into separate CPF identities, deprecate the old ARK, mint new ARKs

- CPF record merge: combine fields, deprecate the old ARKs, mint a new ARK
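
The split and merge notes above can be sketched in a few lines. This is only an illustration, not a design: `Registry`, `Record`, and `mint_ark` are hypothetical names, the `ark:/99999/fk4` prefix is a placeholder, and "later record wins" on conflicting fields is an assumption that the real merge policy would replace.

```python
import itertools
from dataclasses import dataclass, field

_counter = itertools.count(1)


def mint_ark() -> str:
    """Hypothetical minter; a real system would call an ARK minting service."""
    return f"ark:/99999/fk4{next(_counter):06d}"


@dataclass
class Record:
    ark: str
    fields: dict
    deprecated: bool = False
    replaced_by: list = field(default_factory=list)


class Registry:
    """In-memory stand-in for CPF storage, keyed by ARK."""

    def __init__(self):
        self.records = {}

    def add(self, fields):
        rec = Record(mint_ark(), dict(fields))
        self.records[rec.ark] = rec
        return rec

    def _deprecate(self, rec, new_arks):
        # The old ARK is kept and points at its replacement(s).
        rec.deprecated = True
        rec.replaced_by = list(new_arks)

    def merge(self, arks):
        """Combine fields, deprecate the old ARKs, mint one new ARK."""
        combined = {}
        for ark in arks:
            combined.update(self.records[ark].fields)  # assumption: later wins
        merged = self.add(combined)
        for ark in arks:
            self._deprecate(self.records[ark], [merged.ark])
        return merged

    def split(self, ark, parts):
        """Split one record into several; `parts` is a list of field dicts."""
        new_records = [self.add(p) for p in parts]
        self._deprecate(self.records[ark], [r.ark for r in new_records])
        return new_records
```

Keeping the deprecated record with a `replaced_by` pointer means an old ARK can still be resolved to its successor(s) after a merge or split.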

### System Design

#### Developed Components

- Data validation engine
  - **API:** Custom JSON (needs formal spec)

  - The data validation engine applies a written system of rules to the incoming data.  The rules must be written in a human-readable form, such that non-technical individuals are able to write and understand the rules.  A rule-writing guide must be supplied to give hints and help for writing rules properly.  The engine will be pluggable and written as an MVC application.  The model will read the user-written rules (stored in a postgres database or flat-file system, depending on the model) and apply them to any input given on the view.  The initial view will be a JSON API, which accepts individual fields and returns the input with either validation suggestions or valid flags.

  - rule-based system abstracted out of the code
  - rules are data
  - change the rules, not the actual application code
  - rules for broad classes of data type, or granular rules for individual fields
  - probably use this to untaint data as well (remove things that are potential security problems)
  - send all data through this API
  - every rule includes a message describing what went wrong and suggesting fixes
  - rules potentially editable by non-programmers
  - rules are based on co-op data policies, which implies a data policy document, or the rules **can** be the
    policy documentation.
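
To make "rules are data" concrete, here is a minimal sketch of the idea. The rule format (`field`, `pattern`, `message`), the field names, and the response shape are all assumptions for illustration; the real JSON API still needs its formal spec.

```python
import json
import re

# Rules are plain data, so they could live in a database table or a flat file.
# The "*" field is a hypothetical wildcard for rules covering a broad class.
RULES_JSON = r"""
[
  {"field": "*", "pattern": "^[^<>]*$",
   "message": "Angle brackets are not allowed; remove any markup."},
  {"field": "date", "pattern": "^\\d{4}(-\\d{2}){0,2}$",
   "message": "Use a date such as 1797, 1797-12 or 1797-12-17."}
]
"""


def validate(field, value, rules):
    """Apply every rule for this field (or the wildcard) and collect messages."""
    messages = [
        rule["message"]
        for rule in rules
        if rule["field"] in ("*", field)
        and not re.search(rule["pattern"], value)
    ]
    # Return the input with either a valid flag or validation suggestions.
    return {
        "field": field,
        "input": value,
        "valid": not messages,
        "suggestions": messages,
    }


rules = json.loads(RULES_JSON)
print(json.dumps(validate("date", "1797-12-17", rules)))
print(json.dumps(validate("date", "17 Dec 1797", rules)))
```

Changing behavior here means editing `RULES_JSON`, not `validate`, which is the property the notes above ask for.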

- Identity Reconciliation (aka IR) (architect Robbie)
  - **API:** Custom JSON (needs formal spec)

  - needs docs wrangled

- Workflow manager (architect Tom)
  - **API:** Custom JSON? (needs formal spec)

  - exists, needs tests, needs requirements
    * **We need to stop now, write requirements, then apply those requirements to the existing system to ensure we meet the requirements**

  - needs to be integrated into an index.php script that also checks authentication

  - can the workflow also support the login.php authentication? (Yes).

- PostgreSQL Storage: schema definition (Robbie, Tom)
  - **API:** SQL

  - exists, needs tests, needs requirements
    * **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**

  - should we re-architect so that tables become normal tables, the views go away, and versioned records are moved to shadow tables?

  - add features for delete-via-mark (as opposed to actual delete)

  - add features to support embargo

  - *maybe, discuss* change vocabulary.sql from INSERT to COPY. It would be smaller and faster, although in reality as soon as
    it is in the database, the text file will never be touched again.

- Discuss: can/should we create a tag system to deal with ad-hoc requirements later in the project? See [Tag system](#controlled-vocabularies-and-tag-system).

- CPF to SQL parser (Robbie)
  - **API:** EAC-CPF XML input, JSON output? (needs formal spec)

  - exists, needs tests, needs requirements
    * **We need to stop now, write requirements, then apply those requirements moving forward to ensure we meet the requirements**

- NameEntity serialization tool, selectable pre-configured formats

- NameEntity string parser

  - **API:** subroutine? JSON?

  - Can we find a grammar-based parser for PHP? Should we use a standalone parser?

  - Can we expose this as a JSON API that is given a name string and returns an identity object with that identity's information, possibly with a score for how well we think we parsed it?

- Name parser (only a portion of the NameEntity string)
  - **API:** subroutine? JSON?
  - Can this use the same parser engine as the name string parser?

- Date parser
  - **API:** subroutine? JSON?

  - Can this use the same parser engine as the name string parser?
  - **This should be distinct, or may be a subroutine of the nameEntry-string parser**
  - Can we expose this as a JSON API such that, given a set of dates, it returns a date object that lists out the individual dates?  Then it could be called from the name-string parser to parse out dates.
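
As a rough sketch of how the date parser could sit underneath the name-string parser: the function names, the "Surname, Forename, dates" input shape, the four-digit-year-only date handling, and the scoring formula below are all assumptions, not a proposed grammar.

```python
import re


def parse_dates(text):
    """Return a date object listing the individual dates found in `text`.

    Assumption: only four-digit years are recognized in this sketch.
    """
    return {"dates": re.findall(r"\b(\d{4})\b", text)}


def parse_name_string(name_string):
    """Split 'Surname, Forename, dates' and delegate dates to parse_dates()."""
    parts = [p.strip() for p in name_string.split(",")]
    # The first comma-separated piece containing a year is treated as dates.
    date_part = next((p for p in parts if re.search(r"\d{4}", p)), "")
    name_parts = [p for p in parts if p and p != date_part]
    result = {
        "surname": name_parts[0] if name_parts else None,
        "forename": name_parts[1] if len(name_parts) > 1 else None,
        "dates": parse_dates(date_part)["dates"],
    }
    # Crude confidence score: fraction of the expected pieces that were found.
    found = [result["surname"], result["forename"], result["dates"]]
    result["score"] = round(sum(bool(x) for x in found) / len(found), 2)
    return result


print(parse_name_string("Henry, Joseph, 1797-1878"))
```

Both functions return plain dicts, so either could be exposed as a JSON endpoint or called as a subroutine, which keeps the "subroutine? JSON?" question open.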

- Editing User interface
  - **API:** HTML front-end, makes calls to internal JSON API
  - Must have ajax-backed interaction for displaying and searching large lists, such as cpfRelations for an identity.
    - We need to have a UI edit/chooser widget for search and select of large numbers of options, such as the Joseph
      Henry cpfRelations that contain some 22K entries. We also need to list all fields which might have large numbers
      of values. In fact, part of the metadata for every field is "number of possible entries/repeat values" or
      whatever that's called. From a software architecture perspective, the answer is 0, 1, infinite.

- History Research Tool (redefined)
  - **API:** HTML front-end, makes calls to internal JSON API
  - Needs to be reworked to support the Postgres backend
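
For the PostgreSQL storage notes above (delete-via-mark and embargo), the idea can be sketched as two flag columns that every public query filters on. This is an assumption-laden illustration: it uses Python's built-in `sqlite3` as a stand-in for Postgres, and the `name` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.executescript("""
CREATE TABLE name (
    id          INTEGER PRIMARY KEY,
    value       TEXT NOT NULL,
    is_deleted  INTEGER NOT NULL DEFAULT 0,  -- delete-via-mark flag
    embargoed   INTEGER NOT NULL DEFAULT 0   -- hidden from public queries
);
""")
conn.executemany(
    "INSERT INTO name (value, embargoed) VALUES (?, ?)",
    [("Public Entry", 0), ("Embargoed Entry", 1), ("Obsolete Entry", 0)],
)


def mark_deleted(conn, row_id):
    """Mark a row as deleted instead of removing it."""
    conn.execute("UPDATE name SET is_deleted = 1 WHERE id = ?", (row_id,))


def public_names(conn):
    """Rows visible to the public: not marked deleted and not embargoed."""
    rows = conn.execute(
        "SELECT value FROM name "
        "WHERE is_deleted = 0 AND embargoed = 0 ORDER BY id"
    )
    return [row[0] for row in rows]


mark_deleted(conn, 3)
print(public_names(conn))
```

The marked and embargoed rows stay in the table, so they remain available for restore, audit, or lifting an embargo later.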

#### Off-the-shelf Components

- gitlab for developer documentation, code version management, internal issue tracking and milestone keeping

- github public code repository

- test framework, need to choose one

- authentication
  - session management, especially as it applies to authentication tokens and cookies, plus something that prevents
    XSS (cross-site scripting) attacks

- JavaScript UI component tools, jQuery; what others?
  - Suggestions: Bootstrap, AngularJS, jQuery UI

- reports, probably Jasper

- PHP, Postgres, Linux, Apache httpd, etc.

- language modules ?

#### Controlled vocabularies and tag system

Tags are simply terms. When implemented as fixed terms with persistent IDs and some modify/add policy, tags
become a flat (non-hierarchical) controlled vocabulary.

The difference is that tags are moderated more weakly and new tags (types) are created more readily. The tag table
would consist of a tag term and an ID value. Tag systems lack data type, and generally have no policy or less
restrictive policies about creating new tags.
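
A minimal sketch of that tag table, assuming an in-memory store: `TagTable`, the `allow_new` switch, and the lowercase/trim normalization are illustrative choices, not decisions. Turning `allow_new` off is one way to picture the same table behaving as a flat controlled vocabulary.

```python
import itertools


class TagTable:
    """A flat tag table: each tag is just a term with a persistent ID."""

    def __init__(self, allow_new=True):
        # allow_new=False approximates a controlled vocabulary, where new
        # terms must go through some add/modify policy instead.
        self.allow_new = allow_new
        self._ids = itertools.count(1)
        self._by_term = {}

    def get_or_create(self, term):
        """Return the ID for `term`, minting one if policy allows it."""
        key = term.strip().lower()  # assumption: case-insensitive terms
        if key in self._by_term:
            return self._by_term[key]
        if not self.allow_new:
            raise KeyError(f"{term!r} is not in the controlled vocabulary")
        self._by_term[key] = next(self._ids)
        return self._by_term[key]


tags = TagTable()
first = tags.get_or_create("Periodicals")
again = tags.get_or_create(" periodicals")
print(first == again)
```

Note there is no data type and no hierarchy here; separating Occupation from Subject would need either one table per vocabulary or an extra type column.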

Below are some subject examples. It is unclear whether each of these is a single topic, or whether "--" is used to
join granular topics into a topic list. Likewise, it is unclear whether this list relies on some explicit hierarchy.

```
American literature--19th century--Periodicals
American literature--20th century--Periodicals
Periodicals
Periodicals--19th century
World politics--Periodicals
World politics--Pictorial works
World politics--Societies, etc.
World politics--Study and teaching
```
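
One way to explore the "--" question is to split the headings above and see which granular terms fall out. This sketch assumes "--" is purely a separator, which is exactly the open question, so treat the result as a probe rather than an answer.

```python
SUBJECTS = [
    "American literature--19th century--Periodicals",
    "American literature--20th century--Periodicals",
    "Periodicals",
    "Periodicals--19th century",
    "World politics--Periodicals",
    "World politics--Pictorial works",
    "World politics--Societies, etc.",
    "World politics--Study and teaching",
]


def split_subject(heading):
    """Split a heading on '--' into its granular terms."""
    return [part.strip() for part in heading.split("--")]


def distinct_terms(headings):
    """Collect the distinct granular terms across all headings."""
    terms = set()
    for heading in headings:
        terms.update(split_subject(heading))
    return sorted(terms)


for term in distinct_terms(SUBJECTS):
    print(term)
```

If the split terms turn out to be meaningful on their own, they are candidates for flat tags; if position matters (for example "Periodicals" as a head term versus a subdivision), that points toward an explicit hierarchy.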

**RH: I agree, this is super tricky.  We need to distinguish between types of controlled vocab, so that we don't mix Occupation and Subject.  A tagging system might be very nice for at least subjects.**