Commit 7a2c70bc by Tom Laudeman

moving files from wiki to docs as files

parent 8e6e8192
We have had many problems due to inconsistent data import practices and formats. In addition to guidelines, we can create portable and online software to help people detect issues in their data. We already have many such scripts at a level of maturity suitable for software developers, and it is possible to raise that maturity to something we can offer to the general public.
Some of the issues encountered (a sketch of automated checks for several of them follows the list):
- data not encoded as utf8 files, or non-utf8 characters in the files. If there is a requirement to process non-utf8 files, those files will need to have the encoding explicitly set somewhere. It is impossible to always guess the encoding of a file.
- illegal character encodings
- accidental inclusion of substituted characters, commonly smart quotes, en dashes, and em dashes. Fixing these requires running the data through cleaning software that we may always need, because people write text in Microsoft products and then copy/paste it. The cleanup is easy, but ideally the bad characters wouldn't be in the data in the first place.
- XML that is not well-formed.
- lack of persistent id values. Every file should have a unique id. Every online finding aid needs a unique, persistent URI that is also a persistent URL and which loads in a web browser, and can be retrieved via wget or curl.
- URLs should resolve to static HTML without using JavaScript to dynamically build pages.
- ideally, any data-centric URL should have variant formats for humans (HTML, PDF) and computers (XML, JSON, etc.)
- inclusion of URIs within data files. It creates issues to have to build a URI algorithmically based on an ID value found somewhere in a data file. The URI should have a standard (conventional) location in the data, and the URI should occur as a URL. Ideally, the ID value is also present simply as an ID. It should not be required to perform sub-string or regular expression operations to extract ID values.
- missing URLs should return 404 and no other status. Automated validation of URLs is problematic when a missing page returns 200 (or any non-404 status) along with a human-readable HTML message about the status of the page.
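Several of these checks can be automated with standard command-line tools. A minimal sketch (the file name and URL below are placeholders):
```
# Verify the file is valid UTF-8 (a non-zero exit status means bad or mixed encoding).
iconv -f UTF-8 -t UTF-8 file.xml > /dev/null

# Verify the XML is well formed.
xmllint --noout file.xml

# Verify a persistent URL resolves, and that a missing page really returns 404.
curl -s -o /dev/null -w "%{http_code}\n" http://example.org/findingaid/123
```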
# Elastic Search Notes and Installation
Installing Elastic Search (ES) for SNAC use requires a few stages, as outlined below. By default, Elastic Search exposes its REST API on port 9200 for all communications.
## Background
When you need full-text search for a SQL database, Elastic Search is fast and feature-rich. We use it with PostgreSQL on Linux, and (will) integrate ES with Apache httpd in a LAMP web application.
## Elastic Search
The download site for Elastic Search is [https://www.elastic.co/downloads/elasticsearch](https://www.elastic.co/downloads/elasticsearch). Once it has been downloaded and unzipped, the server can be started by executing:
```
bin/elasticsearch
```
The server can be tested by issuing the following command:
```
curl -X GET http://localhost:9200/
```
## Elastic Search River Plugin
The River plugin allows Elastic Search to connect to a database using any standard JDBC connector. The full source code and documentation are available at [https://github.com/jprante/elasticsearch-river-jdbc](https://github.com/jprante/elasticsearch-river-jdbc). To install the River plugin, execute the following in the Elastic Search directory:
```
./bin/plugin --install jdbc \
    --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.5.0.0/elasticsearch-river-jdbc-1.5.0.0.zip
```
Since we are using PostgreSQL, we need to install the appropriate JDBC connector. Postgres makes this available at [https://jdbc.postgresql.org/download.html](https://jdbc.postgresql.org/download.html). Download the jar file to `./plugins/jdbc/` in the Elastic Search directory.
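For example (the driver version below is only illustrative; use whichever jar matches your PostgreSQL and JDK versions):
```
wget -P plugins/jdbc https://jdbc.postgresql.org/download/postgresql-9.4-1201.jdbc41.jar
```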
Once the River plugin and PostgreSQL JDBC connector have been installed, restart Elastic Search.
## Instructions to Index Postgres
A helpful tutorial on linking Postgres to Elastic Search with the River plugin can be found at [http://studiofrenetic.com/blog/a-river-flowing-from-postgresql-to-elasticsearch/](http://studiofrenetic.com/blog/a-river-flowing-from-postgresql-to-elasticsearch/).
We can add an index to our database using the Postgres JDBC driver by issuing the following command:
```
curl -XPUT "localhost:9200/_river/index_type/_meta" -d ' {
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:postgresql://localhost:5432/eaccpf",
"user" : "snac",
"password" : "snacsnac",
"sql" : "full SQL statement to index",
"index" : "index_name",
"type" : "index_type",
"strategy" : "oneshot"
}
}'
```
The `index_type` is what will normally be considered the index name. The `index_name` is the super type. To access the full index later on, a user would query `localhost:9200/index_name/index_type`. For our uses, we have defined `index_name` as `snac` and each index to be the type associated with it. Therefore, searching all of the snac data can be handled by:
```
curl -X GET "localhost:9200/snac/_search?pretty&q=search term"
```
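Beyond simple query-string searches, the full query DSL can be sent as a JSON body. For example, a match query against the `original` field of the `original_name` type defined in the next section (the search term is arbitrary):
```
curl -XGET "localhost:9200/snac/original_name/_search?pretty" -d '{
    "query" : {
        "match" : { "original" : "Washington" }
    }
}'
```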
### SNAC Indices
For reference, the two indices already created were defined as follows:
```
curl -XPUT "localhost:9200/_river/original_name/_meta" -d ' {
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:postgresql://localhost:5432/eaccpf",
"user" : "snac",
"password" : "snacsnac",
"sql" : "select id,cpf_id,original from name;",
"index" : "snac",
"type" : "original_name",
"strategy" : "oneshot"
}
}'
curl -XPUT "localhost:9200/_river/vocabulary/_meta" -d ' {
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:postgresql://localhost:5432/eaccpf",
"user" : "snac",
"password" : "snacsnac",
"sql" : "select id,type,value from vocabulary;",
"index" : "snac",
"type" : "vocabulary",
"strategy" : "oneshot"
}
}'
```
# Other References
Tutorial: [http://okfnlabs.org/blog/2013/07/01/elasticsearch-query-tutorial.html](http://okfnlabs.org/blog/2013/07/01/elasticsearch-query-tutorial.html)
Elastic Search is quite powerful, and there are a few nice search features we can take advantage of:
1. Wildcard and regex queries work: http://www.elastic.co/guide/en/elasticsearch/guide/current/_wildcard_and_regexp_queries.html
2. Prefix searching for fast autocomplete: http://www.elastic.co/guide/en/elasticsearch/guide/current/_query_time_search_as_you_type.html
**Note:** the full listing of query types can be found [here](http://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html). A prefix query example is shown below.
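As a rough sketch, a prefix query for simple autocomplete against the name index might look like this (the field and term are illustrative):
```
curl -XGET "localhost:9200/snac/original_name/_search?pretty" -d '{
    "query" : {
        "prefix" : { "original" : "wash" }
    }
}'
```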
Indexing might help us, too, if the default inverted indices are not enough. The inverted index allows for regex, wildcard, and even edit-distance searching. However, other methods of indexing exist:
1. N-grams: http://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html (see the index settings sketch below)
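A minimal sketch of an index configured with a trigram analyzer (the index name and gram sizes are assumptions, not anything we have deployed):
```
curl -XPUT "localhost:9200/snac_ngram_test" -d '{
    "settings" : {
        "analysis" : {
            "filter" : {
                "trigram_filter" : { "type" : "ngram", "min_gram" : 3, "max_gram" : 3 }
            },
            "analyzer" : {
                "trigrams" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "trigram_filter" ]
                }
            }
        }
    }
}'
```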
Elastic Search also has a great scoring mechanism, as seen [here](http://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html) and [here](http://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html).
## General Notes
* Lots of institutions are plugging into SNAC
## NARA Website Platform (for information access for the pilot group and SNAC team)
* Want a cross between a website and a facebook page
* Jive platform (internally)
* They want to expand it into an external collaboration effort (for the external research community)
* Blogs, other collaboration tools
* [Website](https://www.jivesoftware.com/)
* Other Suggestions
* [Ning](http://www.ning.com)
## SNAC Coop Launch
* Possible Dates
* Sept 28-30
* Sept 30-Oct 2
* Oct 14-16
* Oct 21-23
* Pilot Policies and Procedures
* Learn as we go, ask questions in meetings
* Really wrap our heads around the participants
* Best Practices Team: Daniel, Amanda, Jerry
## [ArchivesSpace](http://archivesspace.github.io/archivesspace/doc) Presentation
* Uses MySQL
* 4 access points
* Public UI (8080?)
* Staff UI (editing)
* API interface (8089)
* Solr (8090) search interface
* JSON Model
* 4 Types of Agents (They all inherit the Agent data model)
* Corporate Entities
* Persons
* Families
* Software
* Other Links
* [Technical Architecture](http://archivesspace.github.io/archivesspace/doc/file.ARCHITECTURE.html)
* [API Reference](http://archivesspace.github.io/api/)
* [Test Interface](http://test.archivesspace.org:8080) User: admin, Password: admin
* Other Notes
* Background Jobs creation: do bulk imports in the background
* Look at New Person Agent (agents/agent_person/new) for their creation UI. It's not very user friendly.
* BiogHists
* They have a wrap-and-tag editor. (Type '<' to get the options for tags -- inline options are limited)
* Allow subnotes of type chronList, etc.
* Sounds like a front-end for what we're planning to do.
## Match/Merge Presentation (Ray, Yiming)
* Postgres, Cheshire, Python 2.x
* SQLAlchemy ORM, libxml (xml parser), setuptools packaging
* Using SQLAlchemy, can switch SQL backends easily (ish)
* Matching
* ULAN is not being used.
* VIAF loaded into Cheshire
* Record groups tie all the matching records together
* They are not merged at this point. If they match, they are linked together by pointers within Postgres
* There is also a "maybe same as" table that shows possible matches
* EAC is the input, loaded into Postgres, then a 3-stage process:
* Connect exactly matching records (if available)
* Matching an exact normalized name (identically)
* Note: what happens if there is a match where neither record has dates?
* Use Cheshire to find possible match through VIAF (if the exact match above failed)
* Merge
* When loading an EAC record, it just stores the filename (not the entire data)
* Matching in detail
* Names are the most important attribute
* Finding candidates using names
* Exact name with normalization (just in postgres? or also VIAF/cheshire?)
* Name components as keywords (Cheshire only)
* Ngram (overlapping 3-char sequences) matches (Cheshire only)
* Validate returned candidate records
* Either accept the candidate as a match, flag it as a maybe match, or reject it
* Generating Candidates
* Look up in cheshire
* Look up in postgres (exact name only)
* Try to parse Western names
* name order, suffixes, epithets, middle names
* British Library records were the most egregious for this (Lance, Eveline Anne, Nee La Belinaye; Wife of T Lance, the younger,...)
* Using the Python library nameparser (note: I don't think it would work well with name identity strings)
* Validation of candidates
* Edit distance + string length
* Smith vs Smythe
* Zabolotskii, Nikolai Alekseevich vs Zabolotskii, Nikolai Aleksevich
* For Western names: name part to name part comparison, initial to name comparison
* Different strengths of match for each component
* Mismatches (some differences) in the last name are likely false positives and are ignored (according to Yiming)
* Matches where the last name is a good match but there are mistakes/differences in the first names are flagged as maybe matches (e.g., initials vs. full names)
* Dates of activity
* Within some tolerance depending on the era
* Flourished vs birth/death
* State of the database
* 6 million input records
* ~3 million unique groups
* Spirit Index
* Ghost-written records
* Franklin, Benjamin, 1706-1790 (Spirit)
* Special exclusion
* Merging
* A record group is a marker that records point to when they are believed to be the same entity
* Record Groups can be invalidated (valid flag)
* As a way to break apart groups (there is a tombstone record left, so that the improperly merged record could still be generated)
* keep a linked list of record groups that have been invalidated
* ... exact matches and authority
* Create output
* Record assembly
* In case of source changes, no reprocessing needed
* The output script detects whether a record was changed by an update
* Lots of legacy code from v1
* Cancelled records (split, merged, record prior history)
* Creating tombstones if these things happen
* Final merged records are created just in time (dumped out at the end; no merged record is stored in the database, it is always created from the sources)
* Combining
* They say they are not combining biogHists (but this is happening: see Richard Nixon)
* Brian suggested that if the humans edit the combined biogHist, then that's what is shown, but the others are kept around hidden in a tab to view the originals
* There is a focus on the XML documents (EAC-CPF) as the canonical record; not a current database that can handle multiple versions of various pieces of the records.
* Discussion
* Two dimensions to editing a record (EAC-CPF XML), according to Brian
* EAC-CPFs that are JIT merged; how do you edit that?
* Put the edits upstream and rerun the merge
* Ray is still talking about looking at the XML files and merging based on diffs, for example
* Any new batches that will be coming in, they would just get added to the list (and it would run through the same merge process, maybe flagged as human or something)
* So, batch uploads would be processed just like a human editing the record
* Brian's proposal
* Refactor the match and merge processes
* Match is useful in a lot more contexts, like ArchivesSpace hitting against it
* Merge process should use the same API to update the database that the front-end uses (back-end and front-end use the same api)
* Daniel likes this and agrees
* Ray didn't use a database with all the information because they wanted to keep all the pieces (pre-merge) separate
* We need to address multiple edits (or re-edits)
* If an editor does a lot of work, and then a batch process (or another editor) overwrites or changes that work, the original editor might get offended
* What is the policy that a person/process can edit the record
* Right "now," that policy is that they can log in.
* **The proposal doesn't allow for automatic merging**
* Identity Reconciliation will just flag possible matches, but will not try to automatically merge records.
* A person would have to see that there are X incoming records that match and need to be included (merged by hand)
Note for TAT functional requirements: need to have UI widget for search of very long fields, such as the Joseph Henry cpfRelations
that contain some 22K entries. Also need to list all fields which might have large numbers of values. In fact, part of the metadata for
every field is "number of possible entries/repeat values" or whatever that's called.

This wiki serves as the documentation of the SNAC technical team as it relates to Postgres and storage of data. Currently, we are documenting:
* the schema and reasons behind the schema,
* methods for handling versioning of eac-cpf documents, and
* elastic search for postgres.
* Need a data constraint and business logic API for validating data in the UI. This layer checks user inputs against some set of rules and when there is an issue it informs the user of problems and suggests solutions. Data should be saved regardless. We could allow "bad" data to go into the database (assumes text fields) and flag for later cleaning. That's ugly. Probably better to save inputs in some agnostic repo which could be json, frozen data structures, or name-value pairs, or even portable source format. The problem is that most often data validation is hard coded into UI JavaScript or host code. Validation really should be configurable separate from the display of data, workflow automation, and data storage. While our database needs to do certain rudimentary sanity checks, the database can't be expected to send messages all the way back up to the UI layer. Nor should the database be burdened with validation rules which are certain to be mutable. Ideally, the validation rules would work in the same state machine framework as the workflow automation API, and might even share some code.
* Need a mechanism to lock records. Even mark-as-deleted records won't necessarily solve all our problems, so we might want records that are "live", but not editable for whatever reason. The ability to lock records and having the lock integrated across all the data-aware APIs is a good idea.
* On the topic of data, we will have user/group/other and r/w permissions, and all data-aware APIs should be designed with that in mind.
* "Literate programming" Using MCV and the state machine workflow automation probably meets (and exceeds) Knuth's ideas of Literate programming. Look at his idea and make sure we aren't missing any key concepts.
* QA and testing needs to be several layers, one of which is simple documentation. We should have code that examines data for various qualities, which when done in a comprehensive manner will test the data for all properties described in the requirements. As bugs are discovered and features added, this data testing code would expand. Code should be tested on several levels as well, and the tests plus comments in the tests constitute our full understanding of both data and code.
* Entities (names) have ID values, ARKs and various kinds of persistent ids. Some subsystem needs to know how to gather various ID values, and how to generate URIs and URLs from those id values. All discovered ids need to be attached to the cpf identity in a table related_id. We will need to track the authority that issued the id, so we need an authority table. Perhaps it is best (as previously discussed) to create a CPF record for each authority, and use the CPF persistent ID as the authority identifier in the related_id table.
```
create table related_id (
    ri_id         serial primary key,
    id_value      text,
    uri           text,
    url           text,
    authority_id  int  -- fk to cpf.id?
);
```
* Allow config of CPF output formats via the web interface. For example, in the CPF generator, we can offer some format and config options such as name formats in `<part>` and/or `<relationEntry>`:
- include 4 digit fromDate-toDate for person
- include dates for corporateBody
- use "fl." for active dates
- use "active" for active dates
- use explicit "b." and "d."
- only use "b." or "d." for single 4 digit dates
- enclose date in parentheses
- add comma between name and dates (applies only if there is a date)
and so on. In theory that could be done for all kinds of CPF variations. We should have a single checkbox for "use most portable CPF formats", although I suspect the best data exchange format is not XML, but SQLite, SQL INSERT statements, or JSON. A hypothetical sketch of such a configuration follows.
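Purely as a hypothetical illustration (none of these option names exist anywhere yet), such choices could be captured as a small JSON config that the CPF generator reads:
```
{
    "cpf_output_options" : {
        "person_dates_4digit" : true,
        "corporatebody_dates" : true,
        "active_date_prefix" : "active",
        "explicit_birth_death_prefix" : false,
        "single_date_birth_death_only" : true,
        "dates_in_parentheses" : true,
        "comma_between_name_and_dates" : true,
        "use_most_portable_cpf_formats" : false
    }
}
```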
* Does our schema track who edited a record? If not, we should add user_id to the version table.
* We should review the schema and create rules that human beings follow for primary key names, and foreign key names. Two options are table_id and table_id_fk. There may be a better option
* Option 1 is nice for people doing "join on", but I find "join on" syntax confusing, especially when combined with a where clause
* Option 2 seems more natural, especially when there is a where clause. Having different field names means that field names are more likely to be unique, thus not requiring explicit table.field syntax. Option 2a will be used most of the time. Option 2b shows a three table join where table.field syntax is required.
```
-- 1
select * from foo as a, bar as b where a.table_id=b.table_id;
-- 2a
select * from foo as a, bar as b where table_id=table_id_fk;
-- or 2b
select * from foo as a, bar as b, baz as c where a.table_id=b.table_id_fk and a.table_id=c.table_id_fk;
```
* Identity Reconciliation has a planned "why" feature to report on why a match matches. It would be nice to also report the negative: why didn't this match X? I'm not even sure that is possible, but something to think about.
# CPF SQL schema
## Version implementation
1. each table contains version info.
    1. never update
    2. always insert and insert into version_history
    3. ideally, we don't need to remember to update existing "current" record (trigger?)
2. each table contains `cpf_id` foreign key
3. Current record is where `version=max(version)` or `current<>0`
    1. use trigger as necessary to avoid lots of update on insert
    2. use a view as necessary
    3. maybe simply use where clause and brute force the first implementation
4. table version_history has all version info for all users for all tables
    1. version_history has its own sequence because it is a special table
```PLpgSQL
create table version_history (
    id       int primary key default nextval('version_history_seq'),
    user_id  int,       -- fk to user.id ("user" is a reserved word in Postgres)
    date     timestamp  -- now()
);

create table foo_internal (
    id       int primary key default nextval('unique_id_seq'),
    cpf_id   int,   -- fk to cpf.id
    data     text,
    version  int,   -- fk to version_history.id, sequence is unique foreign key
    current  bool   -- or int; depending on implementation this field is optional
);

-- Optional view to simplify queries by not requiring every where clause to use version=max(version)
create view foo as select * from foo_internal where current;
-- or, if a valid flag is used instead of current:
-- create view foo as select * from foo_internal where valid = true;

-- A view foo_current that returns only the most current record for each cpf_id
create view foo_current as
    select a.* from foo_internal as a
    where a.version = (select max(b.version) from foo_internal as b where b.cpf_id = a.cpf_id);

select * from foo_current where id=1234;
```
Once all the tables are renamed to table_internal and one or more views are created, any direct use of a table_internal name tends to make the code un-portable. We also need to make sure the views perform as well as the longer, explicit queries.
```PLpgSQL
-- Yikes: using foo_internal directly means this query breaks if the internal table
-- has a schema or name change.
select a.* from foo_internal as a
where a.id=1234
  and a.version = (select max(b.version) from foo_internal as b where b.cpf_id = a.cpf_id);

-- Normally, we would use the views, not table foo_internal:
select * from cpf_current as cpf, foo_current as foo where cpf.id = foo.cpf_id;
```
*Caveat*: the actual implementation needed to change the setup of `foo_internal`, since in some cases multiple entries can exist for the same `cpf_id` (i.e. name entries, date entries, sources, documents, etc.). So, there are two options:
* `(id, version)` primary key in each table, where we get the latest version for each id and join on `cpf_id` for the latest entries for that cpf, or
* `(cpf_id, version)` foreign key, where a join must also match the current version of the cpf record to get the latest entries for that cpf (note the independence from this table's `id` field).
In commit 533fc082fb3c68f7c2bf8edbbace2571c8f963bc, I've chosen the first version of this scheme (sketched below).
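A minimal sketch of that first option, using a hypothetical `name_entry` table (not the exact production schema):
```PLpgSQL
create table name_entry (
    id       int default nextval('unique_id_seq'),
    version  int,   -- fk to version_history.id
    cpf_id   int,   -- fk to cpf.id
    original text,
    primary key (id, version)
);

-- Latest version of each name entry belonging to cpf record 1234:
select a.*
from name_entry as a
where a.cpf_id = 1234
  and a.version = (select max(b.version) from name_entry as b where b.id = a.id);
```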
## Splitting of merged records
1. split may result in N new records
2. each field (especially biogHist) might be split, but only via select/cut/copy/paste. Editing of content is not supported.
    1. Edit during split is a later (if ever) feature.
3. In a 2-way split, where the originals are A and B,
    1. we need to show the user the original record A (original fields) and the most current version of each field
    2. ditto original B and the current version of each field
    3. each field has one checkbox: keep
4. Some web UI is created for splitting, including field subsplit. The UI details will be worked out by prototyping.
    1. splitting a single field involves select, choose, and a resulting highlight.
    2. We allow multi-select, multi-choose, and accumulated select/choose sections.
5. The Nth phase shows original N plus the current, probably with gray highlighting for previous records. All fields/text are selectable since content is often in several records.
6. In the database, we create a new cpf record for each split record, and invalidate the merged record.
7. What happens in table merge_history when a merge is split?
## Schema changes to support merge and split
1. add `valid` field to table cpf
    1. change valid to 0/false when the cpf record is merged
2. add table merge_history; all of these operations result in multiple records being created
    1. Merge: from=old (singleton) and to=new (merged)
    2. Split from merge: from=old (merged) and to=new (singleton)
    3. Split (original singleton): from=old and to=new (multiple singletons)
3. When splitting a merge, we need an additional record to link the new split with its parent (pre-merge) record. Only 1 generation back.
    1. Does this lineage record need a special field in table split_merge_history? (No.)
```PLpgSQL
create table split_merge_history (
    from_id  int,       -- fk to cpf.id
    to_id    int,       -- fk to cpf.id
    date     timestamp
);
```
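A hypothetical sequence of rows, assuming cpf ids 10 and 11 were merged into 20, and that merge is later split into new singletons 30 and 31:
```PLpgSQL
-- Merge: from=old (singletons) and to=new (merged)
insert into split_merge_history (from_id, to_id, date) values (10, 20, now());
insert into split_merge_history (from_id, to_id, date) values (11, 20, now());
-- Split from merge: from=old (merged) and to=new (singletons)
insert into split_merge_history (from_id, to_id, date) values (20, 30, now());
insert into split_merge_history (from_id, to_id, date) values (20, 31, now());
```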
- Check into Apache httpd and HTTP/2, as well as support for opportunistic encryption:
* [ArsTechnica: new firefox version says might as well to encrypting all web traffic](http://arstechnica.com/security/2015/04/new-firefox-version-says-might-as-well-to-encrypting-all-web-traffic/)