* Either accept a candidate as a match, flag it as a maybe-match, or reject it
* Generating Candidates
    * Look up in Cheshire
    * Look up in Postgres (exact name match only)
    * Try to parse Western names
        * Name order, suffixes, epithets, middle names
        * British Library records were the most egregious for this ("Lance, Eveline Anne, Nee La Belinaye; Wife of T Lance, the younger,...")
        * Using the Python library nameparser (note: I don't think it would work well with name identity strings)
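A minimal sketch of what the nameparser step might look like; the example headings are taken from the notes above, and the attributes (first, middle, last, suffix) are nameparser's own:

```python
# Sketch: parsing catalog-style name headings with nameparser.
from nameparser import HumanName

headings = [
    "Smith, John A., Jr.",
    "Zabolotskii, Nikolai Alekseevich",
]

for heading in headings:
    name = HumanName(heading)
    print(name.last, "|", name.first, "|", name.middle, "|", name.suffix)

# As noted above, full identity strings with epithets
# ("...; Wife of T Lance, the younger") tend to confuse the parser,
# so they would likely need to be split on ";" or pre-cleaned first.
```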
* Validation of candidates
    * Edit distance + string length
        * Smith vs. Smythe
        * Zabolotskii, Nikolai Alekseevich vs. Zabolotskii, Nikolai Aleksevich
    * For Western names: name-part-to-name-part comparison and initial-to-name comparison (see the sketch after this list)
        * Each component carries a different strength of match
        * Candidates with differences in the last name are likely false positives and are ignored (according to Yiming)
        * Candidates where the last name is a good match but the first names differ are flagged as maybe-matches (e.g., initials vs. full names)
    * Dates of activity
        * Compared within some tolerance, depending on the era
        * Flourished dates vs. birth/death dates
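A minimal sketch of these validation rules, assuming a length-normalized edit-distance similarity; the thresholds and function names are hypothetical, and the era-dependent date tolerance is left out:

```python
# Sketch: length-aware similarity plus the last-name/first-name rules above.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]. Because it is length-normalized, one edit in a
    long name (Alekseevich/Aleksevich) costs less than in a short one
    (Smith/Smythe)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(last_a: str, last_b: str, first_a: str, first_b: str) -> str:
    # Differences in the last name are likely false positives: reject.
    if similarity(last_a, last_b) < 0.9:
        return "reject"
    if not first_a or not first_b:
        return "maybe"
    # An initial vs. a full name is only a weak first-name match: maybe.
    if len(first_a) <= 2 or len(first_b) <= 2:
        return "maybe" if first_a[0].lower() == first_b[0].lower() else "reject"
    return "match" if similarity(first_a, first_b) >= 0.9 else "maybe"

print(classify("Smith", "Smythe", "John", "John"))  # reject
print(classify("Zabolotskii", "Zabolotskii",
               "Nikolai Alekseevich", "Nikolai Aleksevich"))  # match
print(classify("Lance", "Lance", "E.", "Eveline"))  # maybe
```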
* State of the database
    * 6 million input records
    * ~3 million unique groups
    * Spirit Index
        * Ghost-written records (records attributed to a person's spirit)
        * e.g., Franklin, Benjamin, 1706-1790 (Spirit)
        * These get a special exclusion
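A minimal sketch of that exclusion, assuming the "(Spirit)" qualifier appears verbatim at the end of the name heading:

```python
# Sketch: detect spirit records so they can be excluded from matching.
import re

SPIRIT_RE = re.compile(r"\(spirit\)\s*$", re.IGNORECASE)

def is_spirit_record(heading: str) -> bool:
    return bool(SPIRIT_RE.search(heading))

print(is_spirit_record("Franklin, Benjamin, 1706-1790 (Spirit)"))  # True
print(is_spirit_record("Franklin, Benjamin, 1706-1790"))           # False
```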
* Merging
    * A Record Group is a marker that records point to when they are believed to be the same entity
    * Record Groups can be invalidated (via a valid flag)
        * This is how groups are broken apart (a tombstone record is left behind, so the improperly merged record can still be regenerated)
        * Keep a linked list of record groups that have been invalidated (see the sketch after this list)
    * ... exact matches and authority
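A minimal sketch of that structure; the field and function names are hypothetical, not the project's actual schema:

```python
# Sketch: Record Groups with a valid flag and a chain of invalidated
# predecessors, so a tombstone stays reachable after a split.
from dataclasses import dataclass, field

@dataclass
class RecordGroup:
    group_id: int
    valid: bool = True
    # Backward pointer forming the linked list of invalidated groups.
    previous_group: "RecordGroup | None" = None
    member_record_ids: list[int] = field(default_factory=list)

def split_group(old: RecordGroup, *new_groups: RecordGroup) -> None:
    """Break a group apart: invalidate it and chain the new groups back to
    it, so the improperly merged record can still be regenerated."""
    old.valid = False
    for g in new_groups:
        g.previous_group = old
```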
* Create output
    * Record assembly
        * If a source changes, no reprocessing is needed
        * The output script detects whether a record was changed by an update
        * Lots of legacy code from v1
    * Cancelled records (splits, merges, prior record history)
        * Tombstones are created when these things happen
    * Final merged records are created just in time (dumped out at the end; no merged record is stored in the database, it is always assembled from the sources; see the sketch after this list)
    * Combining
        * They say they are not combining biogHists (but this is happening: see Richard Nixon)
        * Brian suggested that if a human edits the combined biogHist, then that edit is what is shown, while the originals are kept hidden in a tab for viewing
        * The focus is on the XML documents (EAC-CPF) as the canonical record, not on a database that can handle multiple versions of the various pieces of the records
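A minimal sketch of just-in-time assembly with change detection, assuming a content hash is enough to tell whether an update actually changed a source record; the function names are hypothetical:

```python
# Sketch: detect changed sources by hash and assemble merged output at
# dump time; the merged record itself is never stored.
import hashlib

def content_hash(eac_cpf_xml: str) -> str:
    return hashlib.sha256(eac_cpf_xml.encode("utf-8")).hexdigest()

def source_changed(old_xml: str, new_xml: str) -> bool:
    """Lets the output script skip records an update did not actually change."""
    return content_hash(old_xml) != content_hash(new_xml)

def assemble_merged_record(source_xml_docs: list[str]) -> str:
    """Build the merged record from its sources just in time."""
    combined = "\n".join(source_xml_docs)  # stand-in for real EAC-CPF merging
    return f"<merged>\n{combined}\n</merged>"
```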
* Discussion
    * There are two dimensions to editing a record (the EAC-CPF XML), according to Brian
        * EAC-CPF records that are JIT-merged: how do you edit those?
            * Put the edits upstream and rerun the merge
    * Ray is still talking about looking at the XML files and merging based on diffs, for example
    * Any new incoming batches would just get added to the list (and run through the same merge process, perhaps flagged as human or something)
        * So batch uploads would be processed just like a human editing the record
    * Brian's proposal
        * Refactor the match and merge processes
            * Match is useful in a lot more contexts, e.g., ArchivesSpace hitting against it
            * The merge process should use the same API to update the database that the front end uses (back end and front end share one API; see the sketch at the end of these notes)
        * Daniel likes this and agrees
    * Ray didn't use a database with all the information because they wanted to keep all the (pre-merge) pieces separate
    * We need to address multiple edits (or re-edits)
        * If an editor does a lot of work and then a batch process (or another editor) overwrites or changes that work, the original editor might get offended
        * What is the policy for which person or process can edit a record?
            * Right "now," that policy is that they can log in.
    * **The proposal doesn't allow for automatic merging**
        * Identity Reconciliation will just flag possible matches, but will not try to automatically merge records.
        * A person would have to see that there are X incoming records that match and include them by hand-merging.
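A minimal sketch of Brian's proposal as described above: matching exposed as its own service, and one shared update API used by the front end, the merge process, and external clients such as ArchivesSpace. All routes, payloads, and helper functions here are hypothetical stand-ins, not SNAC's real API:

```python
# Sketch: match as a standalone service; merges and the front end share
# one update path.
from flask import Flask, jsonify, request

app = Flask(__name__)

def find_candidates(query: dict) -> list[dict]:
    # Stand-in for the real Cheshire/Postgres candidate lookup.
    return [{"record_id": 1, "score": 0.8, "name": query.get("name", "")}]

def apply_update(record_id: str, fields: dict) -> None:
    # Stand-in for the real database update.
    print(f"updating {record_id}: {fields}")

@app.route("/match", methods=["POST"])
def match():
    """Flag possible matches only; never merge automatically."""
    return jsonify({"maybe_matches": find_candidates(request.get_json())})

@app.route("/records/<record_id>", methods=["PUT"])
def update_record(record_id: str):
    """One update path shared by the front end, the merge process, and
    external clients such as ArchivesSpace."""
    apply_update(record_id, request.get_json())
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(debug=True)
```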