Commit 7a2c70bc by Tom Laudeman

moving files from wiki to docs as files

parent 8e6e8192
We have had many problems due to inconsistent data import practices and formats. In addition to guidelines, we can create portable and online software to help people detect issues in their data. We already have many of these scripts at a level of maturity suitable for software developers. It is possible to raise that maturity to something we can offer to the general public.
Some of the issues encountered:
- data not encoded as UTF-8, or non-UTF-8 characters mixed into otherwise UTF-8 files. If there is a requirement to process non-UTF-8 files, those files will need to have the encoding explicitly declared somewhere; it is impossible to always guess the encoding of a file.
- illegal character encodings
- accidental inclusion of substituted characters, commonly smart quotes, en-dash, and em-dash. Fixing these requires running the data through cleaning software that we may always need, because people write text in Microsoft products and then copy/paste it. The cleanup is easy, but ideally the bad characters wouldn't be in the data in the first place.
- XML that is not well-formed.
- lack of persistent id values. Every file should have a unique id. Every online finding aid needs a unique, persistent URI that is also a persistent URL and which loads in a web browser, and can be retrieved via wget or curl.
- URLs should resolve to static HTML without using JavaScript to dynamically build pages.
- ideally, any data-centric URL should have variant formats for humans (HTML, PDF) and computers (XML, JSON, etc.)
- inclusion of URIs within data files. It creates issues to have to build a URI algorithmically based on an ID value found somewhere in a data file. The URI should have a standard (conventional) location in the data, and the URI should occur as a URL. Ideally, the ID value is also present simply as an ID. It should not be required to perform sub-string or regular expression operations to extract ID values.
- URLs that are missing should return 404, and no other value. It is problematic for automated validation of URLs when a missing page returns a 200 (or any non-404 value) along with a human readable HTML message about the status of the page.
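As a sketch of what such checking software might look like (this is not one of the existing scripts; the file name and URL in `__main__` are placeholders), a short Python program could test for UTF-8 validity, substituted characters, and URL status codes:
```
#!/usr/bin/python3
# A sketch only: check a file for UTF-8 validity and substituted characters,
# and check that a URL answers with a sensible status code.
import urllib.error
import urllib.request

SUBSTITUTED = {'\u2018', '\u2019', '\u201c', '\u201d', '\u2013', '\u2014'}  # smart quotes, en dash, em dash

def check_encoding(path):
    """Report non-UTF-8 content and substituted characters."""
    raw = open(path, 'rb').read()
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        print('%s: not valid UTF-8 (%s)' % (path, exc))
        return
    bad = sorted(set(text) & SUBSTITUTED)
    if bad:
        print('%s: substituted characters present: %s' % (path, ' '.join(bad)))

def check_url(url):
    """A missing page should return 404, not a 200 with an HTML error message."""
    try:
        print('%s: HTTP %s' % (url, urllib.request.urlopen(url, timeout=10).getcode()))
    except urllib.error.HTTPError as exc:
        print('%s: HTTP %s' % (url, exc.code))

if __name__ == '__main__':
    check_encoding('example_finding_aid.xml')           # placeholder file
    check_url('http://example.org/findingaid/12345')    # placeholder URI/URL
```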
# Elastic Search Notes and Installation
Installing Elastic Search (ES) for SNAC use requires a few stages, as outlined below. By default, Elastic Search exposes its REST API on port 9200 for all communications.
## Background
When you need full text search over a SQL database, Elastic Search is fast and feature-rich. We use it with PostgreSQL on Linux, and (will) integrate ES with Apache httpd in a LAMP web application.
## Elastic Search
The download site for Elastic Search is [https://www.elastic.co/downloads/elasticsearch](https://www.elastic.co/downloads/elasticsearch). Once downloaded and unzipped, the server can be started by executing
```
bin/elasticsearch
```
The server can be tested by issuing the following command
```
curl -X GET http://localhost:9200/
```
## Elastic Search River Plugin
The River plugin allows Elastic Search to connect to a database using any standard JDBC connector. The full source code and documentation is available at [https://github.com/jprante/elasticsearch-river-jdbc](https://github.com/jprante/elasticsearch-river-jdbc). To install the River plugin, execute the following in the elastic search directory:
```
./bin/plugin --install jdbc --url \
  http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.5.0.0/elasticsearch-river-jdbc-1.5.0.0.zip
```
Since we are using PostgreSQL, we need to install the appropriate JDBC connector, which Postgres makes available at [https://jdbc.postgresql.org/download.html](https://jdbc.postgresql.org/download.html). Download the jar file to `./plugins/jdbc/` in the Elastic Search directory.
Once the River plugin and PostgreSQL JDBC connector have been installed, restart elastic search.
## Instructions to Index Postgres
To link postgres using the river plugin, a great tutorial can be found at [http://studiofrenetic.com/blog/a-river-flowing-from-postgresql-to-elasticsearch/](http://studiofrenetic.com/blog/a-river-flowing-from-postgresql-to-elasticsearch/).
We can add an index to our database using the Postgres JDBC driver by issuing the following command:
```
curl -XPUT "localhost:9200/_river/index_type/_meta" -d ' {
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:postgresql://localhost:5432/eaccpf",
"user" : "snac",
"password" : "snacsnac",
"sql" : "full SQL statement to index",
"index" : "index_name",
"type" : "index_type",
"strategy" : "oneshot"
}
}'
```
The `index_type` is what will normally be considered the index name. The `index_name` is the super type. To access the full index later on, a user would query `localhost:9200/index_name/index_type`. For our uses, we have defined `index_name` as `snac` and each index to be the type associated with it. Therefore, searching all of the snac data can be handled by:
```
curl -X GET "localhost:9200/snac/_search?pretty&q=search term"
```
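For application code, the same search can be issued over the REST API with nothing but the standard library. A minimal sketch, assuming Elastic Search on localhost:9200 and the `snac` index described in the next section:
```
#!/usr/bin/python3
# A sketch: run the same query from application code using only the standard library.
import json
import urllib.parse
import urllib.request

def search_snac(term):
    query = urllib.parse.urlencode({'q': term, 'pretty': 'true'})
    with urllib.request.urlopen('http://localhost:9200/snac/_search?' + query) as response:
        return json.loads(response.read().decode('utf-8'))

if __name__ == '__main__':
    results = search_snac('washington')      # example search term
    print(results['hits']['total'])          # number of matching documents
```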
### SNAC Indices
For reference, the two indices already created were defined as follows:
```
curl -XPUT "localhost:9200/_river/original_name/_meta" -d ' {
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:postgresql://localhost:5432/eaccpf",
"user" : "snac",
"password" : "snacsnac",
"sql" : "select id,cpf_id,original from name;",
"index" : "snac",
"type" : "original_name",
"strategy" : "oneshot"
}
}'
curl -XPUT "localhost:9200/_river/vocabulary/_meta" -d ' {
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:postgresql://localhost:5432/eaccpf",
"user" : "snac",
"password" : "snacsnac",
"sql" : "select id,type,value from vocabulary;",
"index" : "snac",
"type" : "vocabulary",
"strategy" : "oneshot"
}
}'
```
# Other References
Tutorial: [http://okfnlabs.org/blog/2013/07/01/elasticsearch-query-tutorial.html](http://okfnlabs.org/blog/2013/07/01/elasticsearch-query-tutorial.html)
Elastic Search is quite powerful, and there are a few nice search features we can take advantage of:
1. Wildcard and regex queries work: http://www.elastic.co/guide/en/elasticsearch/guide/current/_wildcard_and_regexp_queries.html
2. Prefix searching for fast autocomplete: http://www.elastic.co/guide/en/elasticsearch/guide/current/_query_time_search_as_you_type.html
**Note:** the full listing of query types can be found [here](http://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html).
Indexing might help us, too, if the default inverted indices are not enough. The inverted index allows for regex, wildcard, and even edit-distance searching. However, other methods of indexing exist:
1. N-grams: http://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
Also, it has a great scoring mechanism, as seen [here](http://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html) and [here](http://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html)
## General Notes
* Lots of institutions are plugging into SNAC
* **NARA Website Platform** (for information access for the pilot group and SNAC team)
* Want a cross between a website and a facebook page
* Jive platform (internally)
* They are wanting to expand it to an external collaboration effort (for the external research community)
* Blogs, other collaboration tools
* [Website](https://www.jivesoftware.com/)
* Other Suggestions
* [Ning](http://www.ning.com)
## SNAC Coop Launch
* Possible Dates
* Sept 28-30
* Sept 30-Oct 2
* Oct 14-16
* Oct 21-23
* Pilot Policies and Procedures
* Learn as we go, ask questions in meetings
* Really put our heads around the participants
* Best Practices Team: Daniel, Amanda, Jerry
## [ArchivesSpace](http://archivesspace.github.io/archivesspace/doc) Presentation
* Uses MySQL
* 4 Accesses
* Public UI (8080?)
* Staff UI (editing)
* API interface (8089)
* Solr (8090) search interface
* JSON Model
* 4 Types of Agents (They all inherit the Agent data model)
* Corporate Entities
* Persons
* Families
* Software
* Other Links
* [Technical Architecture](http://archivesspace.github.io/archivesspace/doc/file.ARCHITECTURE.html)
* [API Reference](http://archivesspace.github.io/api/)
* [Test Interface](http://test.archivesspace.org:8080) User: admin, Password: admin
* Other Notes
* Background Jobs creation: do bulk imports in the background
* Look at New Person Agent (agents/agent_person/new) for their creation UI. It's not very user friendly.
* BiogHists
* They have a wrap-and-tag editor. (Type '<' to get the options for tags -- inline options are limited)
* Allow subnotes of type chronList, etc.
* Sounds like a front-end for what we're planning to do.
## Match/Merge Presentation (Ray, Yiming)
* Postgres, Cheshire, Python 2.x
* SQLAlchemy ORM, libxml (xml parser), setuptools packaging
* Using SQLAlchemy, can switch SQL backends easily (ish)
* Matching
* ULAN is not being used.
* VIAF loaded into Cheshire
* Record groups that tie all the matching records together
* They are not merged at this point. If they match, they are linked together by pointers within Postgres
* There is also a maybe-same-as table that shows possible matches
* EAC is input, loaded into Postgres, then 3 stage process
* Connect exactly matching records (if available)
* Matching an exact normalized name (identically)
* Note: What happens if there is a match that both don't have dates?
* Use Cheshire to find possible match through VIAF (if the exact match above failed)
* Merge
* When loading an EAC record, it just stores the filename (not the entire data)
* Matching in detail
* Names are the most important attribute
* Finding candidates using names
* Exact name with normalization (just in postgres? or also VIAF/cheshire?)
* Name components as keywords (Cheshire only)
* Ngram (overlapping 3-char sequences) matches (Cheshire only)
* Validate returned candidate records
* Either accept candidate as a match, maybe a match or reject candidate
* Generating Candidates
* Look up in cheshire
* Look up in postgres (exact name only)
* Try to parse Western names
* name order, suffixes, epithets, middle names
* British library records were the most egregious for this (Lance, Eveline Anne, Nee La Belinaye; Wife of T Lance, the younger,...)
* Using Python Library: nameparser (note: I don't think it would work well with name identity strings)
* Validation of candidates
* Edit distance + string length
* Smith vs Smythe
* Zabolotskii, Nikolai Alekseevich vs Zabolotskii, Nikolai Aleksevich
* For Western names: name part to name part comparison, initial to name comparison
* Different strengths of match for each component
* Mistakes (differences) in matches of the last name are likely false positives and are ignored (according to Yiming)
* Matches where the last name is a good match but there are mistakes/differences in the first names are flagged as maybe matches (e.g., initials vs. full names)
* Dates of activity
* Within some tolerance depending on the era
* Flourished vs birth/death
* State of the database
* 6 million input
* ~3 million unique groups
* Spirit Index
* Ghost-written records
* Franklin, Benjamin, 1706-1790 (Spirit)
* Special exclusion
* Merging
* Record Group is marker for records to point to if they think they are the same entity
* Record Groups can be invalidated (valid flag)
* As a way to break apart groups (there is a tombstone record left, so that the improperly merged record could still be generated)
* keep a linked list of record groups that have been invalidated
* ... exact matches and authority
* Create output
* Record assembly
* In case of source changes, no reprocessing needed
* Output script detects if record changed by an update
* Lots of legacy code from v1
* Cancelled records (split, merged, record prior history)
* Creating tombstones if these things happen
* Final merged records are only just in time created (dumped out at the end; no merged record stored in the database, but always created from the sources)
* Combining
* They say they are not combining biogHists (but this is happening: see Richard Nixon)
* Brian suggested that if the humans edit the combined biogHist, then that's what is shown, but the others are kept around hidden in a tab to view the originals
* There is a focus on the XML documents (EAC-CPF) as the canonical record; not a current database that can handle multiple versions of various pieces of the records.
* Discussion
* Two dimensions to editing a record (EAC-CPF XML), according to Brian
* EAC-CPFs that are JIT merged; how do you edit that?
* Put the edits upstream and rerun the merge
* Ray is still talking about looking at the XML files and merging based on diffs, for example
* Any new batches that will be coming in, they would just get added to the list (and it would run through the same merge process, maybe flagged as human or something)
* So, batch uploads would be processed just like a human editing the record
* Brian's proposal
* Refactor the match and merge processes
* Match is useful in a lot more contexts, like ArchivesSpace hitting against it
* Merge process should use the same API to update the database that the front-end uses (back-end and front-end use the same api)
* Daniel likes this and agrees
* Ray didn't use a database with all the information because they wanted to keep all the pieces (pre-merge) separate
* We need to address multiple edits (or re-edits)
* If an editor does a lot of work, and then a batch process (or another editor) overwrites or changes that work, the original editor might get offended
* What is the policy that a person/process can edit the record
* Right "now," that policy is that they can log in.
* **The proposal doesn't allow for automatic merging**
* Identity Reconciliation will just flag possible matches, but will not try to automatically merge records.
* A person would have to see that there are X incoming records that match, and they needed to be included (by hand merge)
# A comparison of Perl 5 and Python 3
Tom's fundamental conclusion (still being debated) is that Perl 5 wins based on stability, size of CPAN, and
useful everyday programming features. Python loses due to application-breaking changes in even minor version
updates **(needs citation, since research shows only the major version has caused breaking changes)**, fewer modules, and a lack of language features (syntactic sugar especially) to support common idioms.
Perl6 is a wildcard in this comparison. If Perl6 ships in 2015 as scheduled (when exactly? December?) it
appears that it will exceed Python in terms of language features, and it will have the CPAN advantage as
well. It also retains the good features of the original Perl 5.
## Python 3
Python pros:
- many powerful language features
- arguably more modern than perl 5
- good library of modules
- py-lint lint checking helps analyze common issues with code
- works with Apache httpd native (via shebang)
- used by smart programmers, hipster approved
- out-of-the box utf8 support
- native tuple data type (need example, if Perl has this, probably requires a module)
- printf() is trivially defined (related to string interpolation and .format() discussion below)
```
#!/usr/bin/python
import os
import re
import sys

def printf(format, *args):
    sys.stdout.write(format % args)

if __name__ == '__main__':
    var = 1234
    hash = {'foo' : 5678}
    print "var:" , var, '\n'
    print "hash:", hash['foo'], "\n"
    printf("var: %4.4s hash: %10.1s\n", var, hash['foo'])

> ./printf.py
var: 1234
hash: 5678
var: 1234 hash: 5
```
Python cons:
- no increment via ++; use += instead. http://stackoverflow.com/questions/2632677/python-integer-incrementing-with
- no syntax checker, but there is py-lint
- lack of syntax for some common idioms. Assignment success test in while() is not valid syntax:
`while ($foo = funct()) <code block>` **Are you sure?**
**Tom: what is wrong with the syntax below? Checking the implicit boolean value of an assignment is common in some languages, I think. Perl does it, and I thought PHP and others do too.**
```
#!/usr/bin/python
import os
import re

def read_state_data(file, cols = []):
    print 'file: ', file
    fp = open(file, "r")
    while (temp=fp.readline()):
        print 't: ', temp

if __name__ == '__main__':
    print 'Running'
    read_state_data('states_v2.dat', ['order', 'edge', 'choice', 'test', 'func', 'next'] )

> ./while_demo.py
  File "./while_demo.py", line 18
    while (temp = fp.readline()):
                ^
SyntaxError: invalid syntax
```
- see printf() in the pros section above
- lack of string interpolation **Python has this, [doc](https://docs.python.org/3/library/string.html#format-string-syntax)**
**Tom: strings are not interpolated. Some alternative must be used. See the Perl (one-liner) interpolation below, and the Python alternatives sketched at the end of this discussion.**
```
#!/usr/bin/python
import os
import re

if __name__ == '__main__':
    var = 1234
    hash = {'foo' : 5678}
    print "var:" , var, '\n'
    print "hash:", hash['foo'], "\n"
    print "non-interpolated var: var hash: hash['foo']\n"

> ./interpolation_demo.py
var: 1234
hash: 5678
non-interpolated var: var hash: hash['foo']
```
**Tom: Format functions are not string interpolation. Perl does this: "value of var: $var\nvalue of hash: $hash{foo}\nDone.\n"**
```
> perl -e ' $var=1234; $hash{foo}=5678; print "value of var: $var\nvalue of hash: $hash{foo}\nDone.\n";'
value of var: 1234
value of hash: 5678
Done.
```
**Tom: Python apparently can't do-what-i-mean (dwim) and type cast ints to strings in order to print them? That's
ugly. Surely there is a Pythonic idiom to get around this.**
```
#!/usr/bin/python
import os
import re

if __name__ == '__main__':
    var = 1234
    hash = {'foo' : 5678}
    print "var:" + var + '\n'
    print "hash:" + hash['foo'] + "\n"
    print "interpolated var: var hash: hash['foo']\n"

> ./interpolation_demo.py
Traceback (most recent call last):
  File "./interpolation_demo.py", line 7, in <module>
    print "var:" + var + '\n'
TypeError: cannot concatenate 'str' and 'int' objects
```
The answer is to use comma to concatenate, or format(): http://stackoverflow.com/questions/11559062/concatenating-string-and-integer-in-python
**Tom: This is a bug, not a feature. As the OP notes, "Even Java does not need explicit casting to String to do
this sort of concatenation." I wouldn't call it a show stopper, but it is extra typing to do something that
happens all the time. Similar and perhaps more standard are printf() and sprintf(). Must check if Python has
printf/sprintf.**
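For reference, a short sketch of the standard Python stand-ins for interpolation and implicit casting. These are not interpolation in the Perl sense, just the idioms Python 3 offers:
```
#!/usr/bin/python3
# Python 3 stand-ins for interpolation and implicit casting: the % operator,
# str.format(), str() for + concatenation, and f-strings (Python 3.6+).
var = 1234
hash = {'foo': 5678}

print('var: %s hash: %s' % (var, hash['foo']))         # printf-style formatting
print('var: {} hash: {}'.format(var, hash['foo']))     # str.format()
print('var: ' + str(var))                              # explicit cast for + concatenation
print(f"var: {var} hash: {hash['foo']}")               # f-string (Python 3.6 and later)
```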
- lack of regular expression binding operator (=~) means that Python is limited to a functional use of regular
expressions which is awkward compared to Perl [Python's regex](https://docs.python.org/3/howto/regex.html)
**Tom: Can Python do a regex substitution, consuming the string and matching one or more groups? I can't find group() as related to sub(), but only to match(). Substitution while consuming all or part of the original string is a common idiom, as seen below in a Perl example. I've been reading Python docs for 30 minutes and can't figure out how to do grouping with sub().** (A Python sketch follows the Perl examples below.)
```
#!/usr/bin/perl
$var = 'abcd';
while ($var =~ s/^(.)//g)
{
    print "trimmed off: $1\n";
}
> ./regex.pl
trimmed off: a
trimmed off: b
trimmed off: c
trimmed off: d
```
Or as a oneliner:
```
> perl -e '$var='abcd'; while($var =~ s/^(.)//g) { print "trimmed off: $1\n"; }'
trimmed off: a
trimmed off: b
trimmed off: c
trimmed off: d
```
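A sketch of one way to get the same behavior in Python: `re.match()` captures the group, and `re.sub()` with `count=1` consumes a character from the front of the string. This is not the only idiom, just a working equivalent:
```
#!/usr/bin/python3
# Python equivalent of the Perl trim loop above.
import re

var = 'abcd'
while var:
    m = re.match(r'(.)', var)                  # group(1) is the captured character
    print('trimmed off: %s' % m.group(1))
    var = re.sub(r'^.', '', var, count=1)      # consume one character from the front
```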
- lack of syntactic sugar for dictionaries in that the key must be quoted. **Robbie: Python handles dicts/arrays very nicely**
**Tom: Perl is happy with $hash{key} and $hash{'key'} and only requires the quotes when the key literal is ambiguous. All the non-Perl languages I know always require quotes around literal keys. Typing the Perl sigil ($, @, %) is extra typing, but many programmers do more typing than that in variable naming: str_var, int_var, etc., just to keep track of variable types, which are obscure without the sigil.**
- whitespace is significant; yes, Pythonistas consider it a feature, but:
- anywhere except Python source code, significant whitespace leads to bugs (Rhetorically: why doesn't it lead to bugs in Python?)
- the thousandth time you have to hit the tab key multiple times to get a new line properly indented, you
may start to wish for curly brackets. **Tom: must try Emacs indent-region, which may partially overcome this.**
- whitespace not allowed, or simply not Pythonic style? From py-lint: **I believe this is py-lint enforcing a particular coding style, as this is valid python**
```
C: 44, 0: No space allowed before bracket
read_state_data('states_v2.dat', ['order', 'edge', 'choice', 'test', 'func', 'next'] )
```
- smaller number of modules than in CPAN
- functions have to be defined in the file above their call (or presumably pre-declared), which is awkward and
reminiscent of the ancient history of C. **Do you count included in the file, but defined elsewhere, such as PHP, C++, and Java?**
- Python oneliners are somewhat more verbose than Perl.
- RHEL doesn't have Python 3 in the standard package (yum) repository.
http://stackoverflow.com/questions/8087184/installing-python3-on-rhel
```
sudo yum install http://dl.iuscommunity.org/pub/ius/stable/CentOS/6/x86_64/ius-release-1.0-11.ius.centos6.noarch.rpm
sudo yum search python3
# List all into a file instead of searching, allows less and grep later:
sudo yum list all > yum_list.txt
sudo yum install python33
```
Misconceptions of Python
- Python 3 broke compatibility with Python 2, for obvious reasons (there were serious issues in Python 2). They have been rectified in Python 3, and new minor versions include new features that do not break old ones.
**Tom: Ok. So v2 broke with minor versions, which explains why there are 3 or 4 sub-versions of v2 by default with MacOS.**
## Perl 5
Perl pros:
- syntactic sugar and DWIM make everyday tasks pleasant, code is expressive but not verbose.
```
# regex binding operator
$var = "The quick brown";
if ($var =~ m/quick/)
{
    print "found quick\n";
}

# literal hash keys: optional quoting
my %hash;
$hash{literal} = 1;
$hash{bar} = 2;
$hash{'literal with space needs quoting'} = 3;

# variable casting is implicit (and an example of string interpolation).
# $var is a number, but when tested against a regex, Perl simply casts to a string
# because that is clearly what the programmer wanted.
$var = 1234;
if ($var =~ m/(23)/)
{
    print "matched string \"$1\" in number: $var\n";
}
```
- huge library (149,563 modules) in CPAN
- string interpolation, which is lacking in (nearly?) all other languages. This feature is a direct result
of Perl's use of sigils on variable names (and optionally on functions).
- the best regex syntax
- foreach loop iterator change in place
- stable, updates to perl won't break applications, active development fixes bugs and even adds the occasional new feature
- runs under Apache httpd native, via shebang
- available Apache mod_perl for improved performance with Apache httpd (not that we will ever need this).
- good CGI support, and CGI scripts also run unchanged at the command line, and don't break pipes or other
shell features. This is very useful for debugging CGI.
- excellent database modules in DBI and DBD
- lexical scoping (and dynamic scoping, apparently), allows local declarations for side-effect free code
- perl -cw syntax checking, as well as Perl::Critic, B::Lint, and several others (once again CPAN saves the
day with a huge number of modules)
- Moose is a postmodern object system for Perl 5 that takes the tedium out of writing object-oriented Perl. It
borrows all the best features from Perl 6, CLOS (Lisp), Smalltalk, Java, BETA, OCaml, Ruby and more, while
still keeping true to its Perl 5 roots. Moose is 100% production ready and in heavy use in a number of
systems and growing every day. Get Moose from CPAN. Also see: http://moose.iinteractive.com/en/
Perl cons:
- arguably a clumsy OOP implementation (overcome by Moose?)
- lots of bad things programmers can do, but should avoid
- hipsters view perl 5 as something only old folks do; a dinosaur
- subroutine parameter passing only via a flat list is awkward (of course the list can include references, and due to
Perl DWIM, a hash is a list, which allows for named parameters). Pass by reference avoids the list flattening issues.
- subroutine returns are also a flat list; the same issues/constraints/features apply as with subroutine parameters.
- references can be awkward, and often the simplest solution is to look up a common idiom.
- um...
Note for TAT functional requirements: need to have UI widget for search of very long fields, such as the Joseph Henry cpfRelations
that contain some 22K entries. Also need to list all fields which might have large numbers of values. In fact, part of the metadata for
every field is "number of possible entries/repeat values" or whatever that's called.

This wiki serves as the documentation of the SNAC technical team as relates to Postgres and storage of data. Currently, we are documenting:
* the schema and reasons behind the schema,
* methods for handling versioning of eac-cpf documents, and
* elastic search for postgres.
* Need a data constraint and business logic API for validating data in the UI. This layer checks user inputs against some set of rules and when there is an issue it informs the user of problems and suggests solutions. Data should be saved regardless. We could allow "bad" data to go into the database (assumes text fields) and flag for later cleaning. That's ugly. Probably better to save inputs in some agnostic repo which could be json, frozen data structures, or name-value pairs, or even portable source format. The problem is that most often data validation is hard coded into UI JavaScript or host code. Validation really should be configurable separate from the display of data, workflow automation, and data storage. While our database needs to do certain rudimentary sanity checks, the database can't be expected to send messages all the way back up to the UI layer. Nor should the database be burdened with validation rules which are certain to be mutable. Ideally, the validation rules would work in the same state machine framework as the workflow automation API, and might even share some code.
* Need a mechanism to lock records. Even mark-as-deleted records won't necessarily solve all our problems, so we might want records that are "live", but not editable for whatever reason. The ability to lock records and having the lock integrated across all the data-aware APIs is a good idea.
* On the topic of data, we will have user/group/other and r/w permissions, and all data-aware APIs should be designed with that in mind.
* "Literate programming" Using MCV and the state machine workflow automation probably meets (and exceeds) Knuth's ideas of Literate programming. Look at his idea and make sure we aren't missing any key concepts.
* QA and testing needs to be several layers, one of which is simple documentation. We should have code that examines data for various qualities, which when done in a comprehensive manner will test the data for all properties described in the requirements. As bugs are discovered and features added, this data testing code would expand. Code should be tested on several levels as well, and the tests plus comments in the tests constitute our full understanding of both data and code.
* Entities (names) have ID values, ARKs and various kinds of persistent ids. Some subsystem needs to know how to gather various ID values, and how to generate URIs and URLs from those id values. All discovered ids need to be attached to the cpf identity in a table related_id. We will need to track the authority that issued the id, so we need an authority table. Perhaps it is best (as previously discussed) to create a CPF record for each authority, and use the CPF persistent ID as the authority identifier in the related_id table.
```
create table related_id (
    ri_id serial primary key,
    id_value text,
    uri text,
    url text,
    authority_id int   -- fk to cpf.id?
);
```
* Allow config of CPF output formats via web interface. For example, in the CPF generator, we can offer some format and config options such as name formats in `<part>` and/or `<relationEntry>` (a sketch follows this list):
- include 4 digit fromDate-toDate for person
- include dates for corporateBody
- use "fl." for active dates
- use "active" for active dates
- use explicit "b." and "d."
- only use "b." or "d." for single 4 digit dates
- enclose date in parentheses
- add comma between name and dates (applies only if there is a date)
and so on. In theory that could be done for all kinds of CPF variations. We should have a single checkbox for "use most portable CPF formats", although I suspect the best data exchange format is not XML, but SQLite, SQL INSERT statements, or JSON.
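Purely as an illustration of the idea (the option names here are invented, not an existing CPF generator interface), a configurable name/date formatter might look like:
```
#!/usr/bin/python3
# Illustrative only: a configurable formatter for person name headings with dates.
def format_person(name, from_date=None, to_date=None, active=False,
                  active_style='active', date_parens=False, comma_before_dates=True):
    dates = ''
    if from_date or to_date:
        if active:
            prefix = 'fl. ' if active_style == 'fl.' else 'active '
            dates = prefix + '-'.join(d for d in (from_date, to_date) if d)
        elif from_date and to_date:
            dates = '%s-%s' % (from_date, to_date)
        elif from_date:
            dates = 'b. %s' % from_date
        else:
            dates = 'd. %s' % to_date
        if date_parens:
            dates = '(%s)' % dates
    if not dates:
        return name
    separator = ', ' if comma_before_dates else ' '
    return name + separator + dates

# e.g. format_person('Washington, George', '1732', '1799') -> 'Washington, George, 1732-1799'
# e.g. format_person('Smith, Jane', '1850', active=True, active_style='fl.') -> 'Smith, Jane, fl. 1850'
```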
* Does our schema track who edited a record? If not, we should add user_id to the version table.
* We should review the schema and create rules that human beings follow for primary key names and foreign key names. Two options are table_id and table_id_fk. There may be a better option.
* Option 1 is nice for people doing "join on", but I find "join on" syntax confusing, especially when combined with a where clause
* Option 2 seems more natural, especially when there is a where clause. Having different field names means that field names are more likely to be unique, thus not requiring explicit table.field syntax. Option 2a will be used most of the time. Option 2b shows a three table join where table.field syntax is required.
```
-- 1
select * from foo as a, bar as b where a.table_id=b.table_id;
-- 2a
select * from foo as a, bar as b where table_id=table_id_fk;
-- or 2b
select * from foo as a, bar as b, baz as c where a.table_id=b.table_id_fk and a.table_id=c.table_id_fk;
```
* Identity Reconciliation has a planned "why" feature to report on why a match matches. It would be nice to also report the negative: why didn't this match X? I'm not even sure that is possible, but something to think about.
SNAC2 Algorithms to Match
---------------------------
The process to match persons (`match_persons`)
1. For each unprocessed record:
1. Try to find an exact string match of the name in already matched records and Cheshire's VIAF index (`match_person_exact`).
2. If no record group and no VIAF ID found, search for name in VIAF index using ngrams (`match_person_ngram_viaf`).
* If match quality is null, then create new record group not matching VIAF (no VIAF record found).
* If match quality is above 0 and less than the threshold, then create a new group, flagging it as a maybe VIAF match (VIAF record found, but might not be correct).
* If match quality is above 0 and above the threshold, then create a new group with this VIAF ID (close enough VIAF record found).
* If match quality is -1, create a new group for this record (NO VIAF ID candidate).
3. If no record group, but VIAF ID found, create a new record group linking to this VIAF ID.
4. If a record group with VIAF ID was found, then match into this record group.
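A rough sketch of the person-matching control flow described above; the callables and threshold value are illustrative stand-ins for the snac2 implementations (`match_person_exact`, `match_person_ngram_viaf`), not the actual code:
```
# Sketch only: one unprocessed person record is matched into a group.
ACCEPT_THRESHOLD = 0.9    # invented value for the sketch

def match_person(record, exact_match, ngram_viaf_match, groups):
    """Return (group, flags) for one unprocessed person record."""
    group, viaf_id = exact_match(record)                      # 1: exact string match
    if group is not None:
        return group, {}                                      # 4: join the existing group
    if viaf_id is not None:
        return groups.create(record, viaf_id=viaf_id), {}     # 3: new group linked to VIAF
    quality, candidate = ngram_viaf_match(record)             # 2: fall back to ngram search
    if quality is not None and quality >= ACCEPT_THRESHOLD:
        return groups.create(record, viaf_id=candidate), {}   # close enough VIAF record
    if quality is not None and 0 < quality < ACCEPT_THRESHOLD:
        return groups.create(record), {'maybe_viaf': candidate}
    return groups.create(record), {}                          # quality None or -1: no candidate
```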
The process to match Corporations (`match_corporate`)
1. For each unprocessed record:
1. Try to find an exact string match in VIAF index (`match_exact`).
* If no record group and no VIAF ID found, try Keyword viaf name search (`match_corp_keyword_viaf`).
* Try keyword match (`viaf_match_keyword`). If match quality is greater than 0, then look up group in Postgres by VIAF ID from the keyword match and return that group. Else, returns nothing and match quality = -1.
* If match quality is above accept threshold, then use the record group returned by the keyword search.
* If match quality is above fuzzy match threshold, create new group for this record, noting this VIAF ID as a maybe match.
* If no record group but a VIAF ID found, create a new group for this record.
2. Add this record to the group found/created.
The process to match Families (`match_families`)
1. For each unprocessed record:
1. Look up the record group by searching postgres for the normalized name of the family.
2. If the record has no group found, create one.
SNAC2 Algorithms to Merge
--------------------------
Merging Records
1. For each record group (matched set of records): *I believe this includes the MaybeSame records, which are "maybe" links to VIAF identities. The "maybe" VIAF IDs are stored in a separate Postgres table from the record groups.*
* Create the merged record with the group name (the first element's normalized name).
* Store the merged record (in Postgres).
* Get canonical ID.
* Either ARK temporary, ARK permanent (`-r` parameter), or no id (`-n` parameter).
* Store the merged record (now with its canonical ID) in Postgres.
Assembling Records
1. For each merged record (in Postgres):
* Create the combined record as follows.
* Combine each CPF record's name, type, sources, exist dates, occupations, local descriptions, functions, resource relations, and biog hists.
* For each CPF record's relations, if a merged record containing the other side of the relation is found and it has a canonical ID, add the `xlink:href` link and remove the `descriptiveNote` elements from the relation. The merged relations are kept.
* Write all merged data into a string in EAC-CPF XML form as the combined record.
* If the merged record has an ARK id and the combined record can be parsed by Python's XML parser:
* Grab the canonical ARK id, stripping `ark:/` and replace `/` with `-`.
* Write the combined record to file at `merged_directory/ARKID.xml` in utf-8.
* Note the merged record in Postgres as processed.
* If the merged record has no ARK id or can't be parsed by Python's XML parser, an exception is thrown and assembling stops.
Reassign IDs from File
1. Read all canonical IDs from the file given after `-f` argument into an array.
2. For each merged record with no canonical id assigned, assign a canonical id from the array and write back to Postgres.
Uses of Cheshire
====================
### Summary
Cheshire, via CheshirePy, is used in only a few places in the snac2 match and postprocess code. Specifically, in the following ways:
1. Exact string searching: a normalized name is queried into Cheshire for VIAF records with exact string matches in the mainHeadings->data->text tags. Note, only the first result is used. This index is fast.
2. Ngram searching: a normalized name is queried into Cheshire for VIAF records with ngrams matches in the mainHeadings->data->text tags. Note, only the first 10 results are used. *Based on ngrams Cheshire queries for Geonames, there appears to be an ordering issue on results. The top 10 might not have the intended match.*
3. Keyword searching: a normalized name is queried into Cheshire for VIAF records with keyword matchings in the mainHeadings->data->text tags. Note, only the first 10 results are used.
4. Exact VIAF ID searching: requesting a VIAF record with a given ID (in the viafID tag).
5. Keyword ID Number searching: querying an id number for VIAF records with that id number in the sources->source tags.
*Note: the normalized names mentioned above come from the first name entry in the EAC record, and are cleaned with the following code (when being inserted into the Postgres database). This normalization returns lower case with no punctuation and with accents replaced with non-accented characters. (We need to update the comments in code for `str_remove_punc`, and should not need to replace the accents because of unicode character matching).*
```
return compress_spaces(strip_accents(str_remove_punc(str_clean_and_lowercase(s), replace_with=" ")))
```
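An illustrative stdlib-only equivalent of that normalization (not the snac2 helpers themselves): lowercase, strip punctuation, replace accented characters with their base characters, and compress runs of whitespace.
```
#!/usr/bin/python3
# Sketch of the normalization described above, using only the standard library.
import re
import unicodedata

def normalize_name(s):
    s = s.lower()
    # Decompose accented characters and drop the combining marks.
    s = ''.join(c for c in unicodedata.normalize('NFKD', s)
                if not unicodedata.combining(c))
    s = re.sub(r'[^\w\s]', ' ', s)      # punctuation -> space
    return re.sub(r'\s+', ' ', s).strip()

# e.g. normalize_name('Brontë, Charlotte,  1816-1855') -> 'bronte charlotte 1816 1855'
```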
### Matching (match.py)
1. Exact string match VIAF search of a record into a group: (`match_exact`)
* A candidate group is found by calling `group.get_by_name(record.name_norm)`, where the latter is the normalized name of the record to match into the current groups.
* This searches the groups in Postgres and looks for a group with this normalized name.
* If a group is found AND group has a VIAF record associated, check the dates and match (if equal).
* Else, look up the `record.name_norm` as an exact match into **Cheshire**'s VIAF exact name index (exact string matches only).
* Found in VIAF's mainHeadings->data->text tags.
* If record dates match to viaf record dates (within a 10-year margin for birth/death dates), then this VIAF entry matches.
* If a record group has this VIAF ID, then match into that group.
* If not, then create a new group.
* If dates don't match, return no match found.
* If VIAF doesn't have dates, return no match found.
* If no VIAF entry, return no match found.
* <span style="color:blue">*We could probably clean up the if/else logic here: the variable `match` is set to `true`, and never changed. When something fails to match, the literal `false` is returned. Also, there is a return statement at the end of the viaf_records section that is only reachable by passing the first two inner if statements.*</span>
2. Exact string match VIAF search of a person record to a group: (`match_person_exact`, line 231)
* If we're not forcing a rematch,
* Look up `record.norm_name` in PersonGroup models, to see if it exists in the current set of matches.
* If found, check for the dates for each record in the group.
* Find largest score between person record dates and each record in the group's dates. These scores do not have to come from the same record. (1pt active dates match, 3pts birth/death exact match, 2pts birth/death date within 10 years, 0pts otherwise. *This only checks whether the years match*)
* If the normalized name is longer than 10 chars and either of the scores in date is greater than the fail threshold, then this is a match (return this match).
* Look up `record.name_norm` as an exact match into **Cheshire**'s VIAF exact name index (exact string matches only). Search includes whether or not looking for spirit (if `(spirit)` was in the normalized record string).
* Found in VIAF's mainHeadings->data->text tags.
* If record dates match to viaf record dates (within a 10-year margin for birth/death dates), then this VIAF entry matches.
* If a record group has this VIAF ID, then match into that group.
* If not, then create a new group.
* If dates don't match, return no match found.
* If VIAF doesn't have dates, return no match found.
* If no VIAF entry, return no match found.
3. Find VIAF person by ngrams: (`match_person_ngram_viaf`, line 283)
* Calls through to `viaf_match_ngram`, line 385 to get match and match quality.
* Get only top 10 results from **Cheshire**'s ngram search.
* Found in VIAF's mainHeadings->data->text tags.
* For each of the 10 top results:
* If existence dates don't match (failure), return no match found.
* If existence dates match exactly (requires only that birth and death years match exactly), return this candidate VIAF record as match.
* If dates aren't exact but don't fail (partial/fuzzy match: active dates match exactly or birth and death years are within 10-year margin).
* If (person) compute:
* Use name parser (python [nameparser](https://pypi.python.org/pypi/nameparser) package) to separate string into parts (match and query).
* If there is only one name each, check for exact string match and accept if so.
* If there are multiple parts and the last names and first names exact string match, then
* If middle names exist and they exact string match, accept.
* If middle names exist and don't match, reject.
* If no middle names exist, accept.
* If one of the two middle names is an initial and the first characters match, accept at 0.99 (not 1.0) as a fuzzy match.
* Else, compute the Jaro Winkler distance of "First Last" and return the threshold value if the JW distance is above the threshold (or JW distance if below).
* If there are multiple parts, but only first initials exist instead of full first name, and the last names, first initials, middle names (no initials) all match, then accept.
* Otherwise, do Jaro Winkler distance between the strings, return 0.00001 below the accept threshold if the JW distance is above the threshold (or JW distance if below).
* If (corporation) compute Jaro Winkler Distance and accept/return if above threshold.
* If (other) compute simple relative length and accept/return if above threshold.
* If match quality is greater than 0:
* If record group found with that VIAF ID, then match into that group.
* If not, then create a new group.
* If match quality less than or equal to 0, return no match found.
4. Find VIAF record by keywords: (`viaf_match_keyword`, line 514)
* Clean up normalized name for record of which to find match.
* Query **Cheshire**'s VIAF name index for normalized name.
* Found in VIAF's mainHeadings->data->text tags.
* Returns 10 candidates by default.
* For each candidate, compute [Jaro-Winkler Distance](http://en.wikipedia.org/wiki/Jaro–Winkler_distance) between candidate and normalized name. If distance is above fuzzy match threshold, return that candidate. (*This stops at the first match above the threshold*).
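For reference, a standalone sketch of the Jaro-Winkler calculation linked above, only to show what the score measures; snac2 presumably relies on its own (or a library) implementation:
```
#!/usr/bin/python3
# Sketch of Jaro and Jaro-Winkler string similarity.

def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1            # matching window
    match1 = [False] * len1
    match2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):                        # count transposed matches
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Boost the Jaro score for a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1.0 - j)

# e.g. jaro_winkler('smith', 'smythe') comes out to about 0.86
```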
### Postprocessing (postprocess.py)
1. Look up VIAF records by VIAF ID, which happens in two places.
* To reload VIAF record for a record group (matched group), query **Cheshire**'s `viafid` index for the group's `viaf_id`.
* If found, it replaces the record stored with the group with the new version from Cheshire's result.
* If not found, the ID numbers of each source from the existing viaf_record stored with the group are queried in **Cheshire**'s `idnumber` index to find matching VIAF records.
* Depending on the results found, this record group may be merged into another group that has a matching VIAF record found in this manner.
# CPF SQL schema
## Version implementation
1. each table contains version info.
1. never update
2. always insert and insert into version_history
3. ideally, we don't need to remember to update existing "current" record (trigger?)
2. each table contains `cpf_id` foreign key
3. Current record is where `version=max(version)` or `current<>0`
1. use trigger as necessary to avoid lots of update on insert
2. use a view as necessary
3. maybe simply use where clause and brute force the first implementation
4. table version_history has all version info for all users for all tables
1. version_history has its own sequence because it is a special table
```PLpgSQL
create table version_history (
    id int primary key default nextval('version_history_seq'),
    user_id int,    -- fk to user.id
    date timestamp  -- now()
);

create table foo_internal (
    id int primary key default nextval('unique_id_seq'),
    cpf_id int,     -- fk to cpf.id
    data text,
    version int,    -- fk to version_history.id; the sequence is a unique foreign key
    current int     -- or bool; depending on implementation this field is optional
);
-- Optional view to simplify queries by not requiring every where clause to use version=max(version)
create view foo as select * from foo_internal where current;
-- or, if using a valid flag instead of current:
-- create view foo as select * from foo_internal where valid = true;
-- a view foo_current that is the most current record
create view foo_current as select * from foo_internal where version=max(version);
select * from foo_current where id=1234;
```
Having renamed all the tables to table_internal and created one or more views, any use of the table_internal name tends to make the code un-portable. We need to make sure the views perform as well as the longer, explicit queries.
```PLpgSQL
-- Yikes, using foo_internal means this query breaks if the internal tables have a schema or name
-- change.
select * from foo_internal where id=1234 and version=max(version);
-- Normally, we would use the foo view, not table foo_internal
select * from cpf_current as cpf, foo_current as foo where cpf.id=foo.cpf_id;
```
*Caveat* The actual implementation needed to change the setup of `foo_internal`, since in some cases there can be multiple entries for the same `cpf_id` (i.e. name entries, date entries, sources, documents, etc.). So, there are two options:
* `(id, version)` primary key in each table, where we get the latest version for each id and join on `cpf_id` for the latest entries for that cpf, or
* `(cpf_id, version)` foreign key, where a join must also match the current version of the cpf record to get the latest entries for that cpf (note the independence from this table's `id` field).
In commit 533fc082fb3c68f7c2bf8edbbace2571c8f963bc, I've chosen the first version of this scheme.
## Splitting of merged records
1. split may result in N new records
2. each field (especially biogHist) might be split, but only via select/cut/copy/paste. Editing of content is not supported.
1. Edit during split is a later (if ever) feature.
3. In a 2 way split, where originals are A and B,
1. we need to show the user the original record A (original fields) and the most current version of each field
2. ditto original B and current version of each field
3. each field one checkbox: keep
4. Some web UI is created for splitting, including field subsplit. The UI details will be worked out by prototyping.
1. splitting a single field involves select, choose, and a resulting highlight.
2. We allow multi-select, multi-choose, and accumulate select/choose sections.
5. Nth phase shows original N plus the current, probably with gray-highlight for previous records. All fields/text are selectable since content is often in several records.
6. In the database, we create a new cpf record for each split record, and invalidate the merged record.
7. what happens in table merge_history when a merge is split?
## Schema changes to support merge and split
1. add `valid` field to table cpf
1. change valid to 0/false when the cpf record is merged
2. add table merge_history, all result in multiple records being created
1. Merge: from=old (singleton) and to=new (merged)
2. Split from merge: from=old (merged) and to=new (singleton)
3. Split (original singleton): from=old and to=new (multiple singletons)
3. When splitting merge, need an additional record to link new split with its parent (pre-merge) record. Only 1 generation back.
1. Does this lineage record need a special field in table split_merge_history? (No.)
```PLpgSQL
create table split_merge_history (
    from_id int,    -- fk cpf.id
    to_id int,      -- fk cpf.id
    date timestamp
);
```
- check into Apache httpd and http/2 as well as supporting Opportunistic encryption:
* [ArsTechnica: new firefox version says might as well to encrypting all web traffic](http://arstechnica.com/security/2015/04/new-firefox-version-says-might-as-well-to-encrypting-all-web-traffic/)