Commit 6aae8d37 by Tom Laudeman

Merging tom into master, getting IR as well

parents 728ec78d 8d4dc47a
@@ -2,3 +2,21 @@
These documents describe the functionality desired of the system. These should be high-level requirements, geared toward the policy side, of the form "The system should do X."
## Table of Contents (evolving)
* [Overview of data](Requirements/Data Overview.md)
* [Serialized CPF output](Requirements/EAC-CPF Output.md)
* [List of and Types of documents generated](Requirements/Generated Documents.md)
* [Identity Reconciliation engine](Requirements/Identity Reconciliation.md)
* [Broad overview of storage implementation](Requirements/Internal Data Storage.md)
* [Licenses for code and documentation](Requirements/Licensing.md)
* [Prose from the Mellon Proposal](Requirements/Mellon Proposal.md)
* [What the name parser needs to do](Requirements/Name Parser.md)
* [Overview of necessary new software sub-systems](Requirements/New Features.md)
* [Guidelines for development](Requirements/Software Development Process.md)
* [Guidelines for user documentation](Requirements/User Documentation.md)
* [Broad notes about user interface philosophy](Requirements/User Interface.md)
* [Authentication, Authorization, roles](Requirements/User Management.md)
* [Application behavior encapsulated in a workflow manager](Requirements/Workflow Engine.md)
# Identity Reconciliation Engine
![Engine Diagram](http://gitlab.iath.virginia.edu/snac/Documentation/raw/ir/Specifications/Originals/IR_Engine.svg)
The identity reconciliation engine will operate on SNAC Identity Constellations, as defined in [this specification](/Specifications/Server API/Constellation.md) and depicted in [this figure](/Specifications/Originals/IC_Overview.pdf). The engine will take a partial constellation as input (called the _query constellation_) and find appropriate candidate matches from the SNAC database (called _candidate constellations_).
The engine is architected as an independent multi-stage process, with a coalescing weighting function to produce the final results. Specifically, the engine will consist of multiple independent stages, which may run in parallel, to produce independent lists of resulting candidate constellations, together with a numeric score for the stage, from the query constellation. The coalescer will then combine each set of resulting candidate constellations into one list of candidates by combining the scores for each constellation across the executed stages.
Within the sorted list of resulting candidate constellations, each candidate will report its score from every stage. The coalescer's weighting function may be modified, either by hand or programmatically, to enhance the resulting set of candidate constellations. A human must verify and accept that a top candidate describes the same entity as the query constellation before any merging may take place. **Therefore, this engine does not and cannot guarantee 100% accuracy in its matching.**
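The coalescing step can be sketched as follows. This is only an illustration, not the engine's design: the function name `coalesce`, the weighted-sum formula, the "higher is better" convention, and the sample stage names and scores are all assumptions. Stages whose scores are "lower is better" would need to be inverted or normalized before being combined this way.

```python
from collections import defaultdict

def coalesce(stage_results, weights):
    """Combine per-stage candidate lists into one ranked list.

    stage_results: {stage_name: {constellation_id: stage_score}}
    weights:       {stage_name: weight}; a missing weight defaults to 1.0
    Assumption: the combined score is a weighted sum and higher is better.
    """
    combined = defaultdict(float)   # constellation_id -> combined score
    per_stage = defaultdict(dict)   # constellation_id -> {stage_name: score}
    for stage, candidates in stage_results.items():
        w = weights.get(stage, 1.0)
        for cid, score in candidates.items():
            combined[cid] += w * score
            per_stage[cid][stage] = score
    # Best combined score first; ties broken by id for a stable order.
    order = sorted(combined, key=lambda cid: (-combined[cid], cid))
    return [(cid, combined[cid], per_stage[cid]) for cid in order]

# Hypothetical output of two stages that ran independently.
stage_results = {
    "name_entry": {101: 8.0, 102: 5.0},
    "exist_dates": {101: 1.0, 103: 1.0},
}
weights = {"name_entry": 1.0, "exist_dates": 2.0}
ranked = coalesce(stage_results, weights)
```

Each entry in `ranked` carries the per-stage scores alongside the combined score, so a human reviewer can see why a candidate ranked where it did.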
## Implementation Notes
* The identity reconciliation engine may be used as a search engine within SNAC by limiting the stages performed and providing a query constellation that contains only the parts of interest. For example, to search for a name, a query constellation with that name may be passed to the reconciliation engine. The process would be the same to search for an entity that is associated with a given place, occupation, or exist dates.
* Based on the highly parallelizable nature of the independent stages followed by the coalescer, the identity reconciliation engine may be implemented (or enhanced) using the [Map-Reduce Framework](http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf) on a large cluster.
# Stage Designs
The following stages are proposed for the Identity Reconciliation Engine:
* **Candidate Constellation producing stages** *These stages produce lists of candidate constellations.*
* Elastic Search Name Entry (Heading) : *Search the entire name entry from the query constellation against all constructed name entry strings from SNAC identity constellations. Return the top N constellations based on Elastic Search's search algorithm with Elastic Search native scores as the stage score for each candidate constellation.*
* Elastic Search Name (Name-only) : *Search the name portion of the name entry from the query constellation against all constructed name-only strings from SNAC identity constellations. Return the top N constellations based on Elastic Search's search algorithm with Elastic Search native scores as the stage score for each candidate constellation.*
* Elastic Search Surname : *Search the surname component from the query constellation against all surnames from SNAC identity constellations. Return the top N constellations based on Elastic Search's search algorithm with Elastic Search native scores as the stage score for each candidate constellation.*
* Elastic Search Forenames : *Search the forenames component from the query constellation against all forenames from SNAC identity constellations. Return the top N constellations based on Elastic Search's search algorithm with Elastic Search native scores as the stage score for each candidate constellation.*
* Exist Dates : *Search the exist dates from the query constellation against all identity constellations in SNAC. Return all constellations that contain exactly the same exist dates as the query constellation.*
* Fuzzy Exist Dates : *Search the exist dates from the query constellation against all identity constellations in SNAC. Return all constellations that contain exist dates within a range of X from those in the query. The stage score is defined as the distance from the query's dates; lower is better.*
* Occupation : *Search the list of occupations from the query constellation against all identity constellations in SNAC. Return all constellations that match all occupations in the query constellation. (The candidates must have the entire list of occupations from the query as a subset of their occupation list.)*
* Place : *Search the list of places from the query constellation against all identity constellations in SNAC. Return all constellations that match all places in the query constellation. (The candidates' list of places must be a superset of the query constellation's places.)*
* **Candidate Constellation list modifying stages** *These stages take lists of candidate constellations and modify or replace the scores for the input results.*
* Name Entry Length : *Compute the difference in length between the candidate constellation's constructed name entry string and the query constellation's constructed name entry string. Replace the original stage score with the log of the difference. Lower scores are better.*
* SNAC Degree Sort : *Replace the original stage score with the number of constellation relations (constellation out-degree) in the candidate constellation. Constellations that are more connected with other SNAC identity constellations will get better scores.*
* SNAC Resource Count Sort : *Replace the original stage score with the number of resource relations (resource out-degree) in the candidate constellation. Constellations with higher resource relations will get better scores.*
* **Multi-Stage** *This stage allows the execution engine to run multiple stages in sequence, feeding the results of one stage as the input to the next. It results in one final list of candidate constellations from all independent stages it ran.*
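The three kinds of stage listed above can be sketched in a few lines. Everything here is an assumption made for illustration: the dictionary shape of a constellation, the function names, the sample records, and the use of `log(difference + 1)` so that an exact length match (difference 0) still has a defined score.

```python
import math

def occupation_stage(query, constellations):
    """Producing stage: keep candidates whose occupation list is a
    superset of the query's occupations. Score is a constant 1.0."""
    wanted = set(query["occupations"])
    return [(c, 1.0) for c in constellations if wanted <= set(c["occupations"])]

def name_entry_length_stage(query, scored):
    """Modifying stage: replace each score with the log of the difference
    in name entry length. Lower is better."""
    qlen = len(query["name_entry"])
    rescored = []
    for candidate, _old_score in scored:
        diff = abs(len(candidate["name_entry"]) - qlen)
        rescored.append((candidate, math.log(diff + 1)))  # +1 is an assumption
    return sorted(rescored, key=lambda pair: pair[1])

def multi_stage(query, constellations, producer, modifiers):
    """Multi-stage: run a producing stage, then feed its results through
    each modifying stage in sequence."""
    scored = producer(query, constellations)
    for modify in modifiers:
        scored = modify(query, scored)
    return scored

# Made-up sample data, not taken from SNAC.
constellations = [
    {"id": 1, "name_entry": "Washington, George, 1732-1799",
     "occupations": ["President", "Surveyor"]},
    {"id": 2, "name_entry": "Washington, George", "occupations": ["Surveyor"]},
    {"id": 3, "name_entry": "Washington, Martha", "occupations": ["First Lady"]},
]
query = {"name_entry": "Washington, George", "occupations": ["Surveyor"]}
result = multi_stage(query, constellations, occupation_stage,
                     [name_entry_length_stage])
```

Because every stage takes the query plus a list and returns a scored list, stages can be run independently or chained without knowing about each other.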
### Database schema and SQL queries
### How versioning works.
```
create type icstatus as enum ('published', 'needs review', 'rejected', 'being edited', 'bulk ingest');
create table version_history (
id int default nextval('version_history_id_seq'),
main_id int default nextval('id_seq'), -- main constellation id, when inserting a new identity, allow this to default
is_locked boolean default false, -- boolean, true is locked by version_history.user_id
user_id int, -- fk to appuser.id
role_id int, -- fk to role.id, defaults to users primary role, but can be any role the user has
timestamp timestamp default now(), -- now()
status icstatus, -- enum icstatus note: an enum is a data type like int or text
is_current boolean, -- most current published, optional field to enhance performance
note text, -- checkin message
primary key (id, main_id)
);
```
The version_history table is the central table of a CPF constellation (aka record). The
version_history.main_id is the constellation id. By convention, all SNAC tables have a field id, which is the
record id. Field version_history.id is known by the alias "version" in all locations outside table
version_history. All first-order data tables have fields id, version, and main_id. It may help to understand
some of the following specification by knowing that the unique record key for nearly all tables is
(id,version). For all tables except nrd, the constellation id is main_id. Table nrd holds the data fields that
are 1:1 with a constellation and is special: nrd.id is both the record id and the constellation id.
There are some basic traits of versioning:
1) No record is ever deleted. Old versions of every record are left in the database.
2) The current published version is noted in table version_history, so we can select the published version of
every record from every table.
3) An update to part of a constellation might update a single table. That update will have a new version
number. This saves data copying and redundancy.
4) Due to (3), actual SQL to select a given record is based on selecting that record's version <= the current
version in table version_history.
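The traits above imply a write path that only ever inserts. The following is a minimal sketch of that idea using Python's built-in sqlite3 module as a stand-in for the real database. The reduced column set, the helper name `save_name`, and the sample values are assumptions for illustration, and sqlite's autoincrement replaces the sequences used in the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table version_history (
    id integer primary key autoincrement,  -- known as "version" elsewhere
    main_id int not null,                  -- constellation id
    status text,
    note text
);
create table name (
    id int, version int, main_id int,
    original text,
    is_deleted boolean default 0,
    primary key (id, version)              -- unique record key is (id, version)
);
""")

def save_name(conn, main_id, name_id, original, note, is_deleted=False):
    """Record a change as a new version. Old rows are never updated or deleted."""
    cur = conn.execute(
        "insert into version_history (main_id, status, note) values (?, ?, ?)",
        (main_id, "being edited", note))
    version = cur.lastrowid                # the new version number
    conn.execute(
        "insert into name (id, version, main_id, original, is_deleted) "
        "values (?, ?, ?, ?, ?)",
        (name_id, version, main_id, original, is_deleted))
    return version

v1 = save_name(conn, 7, 1, "Smith, Jon", "initial ingest")
v2 = save_name(conn, 7, 1, "Smith, John", "fix spelling")
rows = conn.execute(
    "select version, original from name where id = 1 order by version").fetchall()
```

After the second call both rows for name 1 are still present, one per version. Only the name table gained a row for the edit, which is the saving described in trait (3).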
The following SQL illustrates some of the relationships between version_history and a typical table such as
"name" for a simplified scenario involving a fictional constellation with id 7.
First we want the maximum, that is the most recent, version number for constellation id 7.
```
select max(id) as version, main_id
from
version_history
where
main_id=7
group by main_id;
```
That query returns a single record and tells us the "current version" for constellation 7. Imagine there have
been many changes to this constellation, each change only affecting some parts. Every time the database was
updated, the version number incremented. We can go back to any version, but when editing we always use the
most recent version of each record in each table. That is: <= max(version).
In the case of published, we use a similar query, but constrain by is_current or by status='published'. The
exact implementation has not been settled, and there are pluses and minuses. Nonetheless, this query
illustrates the concept for selecting the most recent version number of the most recently published
constellation id 7:
```
select max(id) as version, main_id
from
version_history
where
main_id=7 and
is_current
group by main_id;
```
Note that max(id) is aliased as 'version'. To everything outside table version_history, version_history.id is a
version number. Because we select main_id alongside the aggregate max(id), we must "group by" main_id. Grouping
by id as well would put every version in its own group and return one row per version instead of the single
maximum.
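The two lookups can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for the real database, with a reduced version_history table and made-up rows, all of which are assumptions for illustration. The queries group by main_id alone so that exactly one row comes back per constellation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table version_history (
    id integer primary key,   -- known as "version" elsewhere
    main_id int,              -- constellation id
    status text,
    is_current boolean
);
insert into version_history values
    (1, 7, 'published',    0),
    (2, 7, 'published',    1),   -- most recent published version of 7
    (3, 7, 'being edited', 0),   -- newest version of 7, not yet published
    (4, 8, 'published',    1);   -- a different constellation
""")

LATEST = """
select max(id) as version, main_id
from version_history
where main_id = ?
group by main_id
"""

PUBLISHED = """
select max(id) as version, main_id
from version_history
where main_id = ? and is_current
group by main_id
"""

latest = conn.execute(LATEST, (7,)).fetchall()
published = conn.execute(PUBLISHED, (7,)).fetchall()
```

With this data the editing view of constellation 7 is at version 3, while the published view stays at version 2 until a newer version is marked current.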
Remember that we're calling SQL from php, and php caches values from queries. This means that we can call
smaller SQL queries, capture returned values, then pass those values to other relatively small queries. The
non-php option would be larger stored procedures. It turns out that php+sql results in better performance, and
the more granular queries are individually easier to read. In some of the following queries you will see php
placeholders $1, $2, $3, etc. These are substituted safely from php to avoid SQL injection issues.
Selecting records from data tables requires a subselect. It looks a bit daunting, but is very fast, and is
simple enough that we can treat it as a convention or idiom.
```
select
aa.is_deleted,aa.id,aa.version, aa.main_id, aa.original, aa.preference_score
from name as aa,
(select id,max(version) as version from name where version<=$1 and main_id=$2 group by id) as bb
where
aa.id = bb.id and
aa.version = bb.version and
not aa.is_deleted
order by preference_score,id
```
We join our table (name) to a subquery on the same table. Note that we alias the original table as aa, and the
subquery as bb. Conceptually we're doing this:
```
select aa.* from aa,bb where ...
```
This is a relational join that works just like any join between any two tables. It just so happens that table
bb is a subquery.
The subquery is aliased as bb:
```
(select id,max(version) as version from name where version<=$1 and main_id=$2 group by id) as bb
```
The sole purpose of the subquery is to get a record id and its matching max() version. Joining these back to the
original table constrains the returned records to only the most recent version of each individual name that is
<= the supplied version. For normal edits, we look up the current version via the first query above.
The where clause of the larger select should now start to look simpler, now that we understand the subquery as
simply another related table. With in-line comments the where clause is:
```
where
aa.id = bb.id and -- join and constrain id
aa.version = bb.version and -- join and constrain version
not aa.is_deleted -- only select non-deleted records
order by preference_score,id -- order the results
```
While there's some subtle behavior here, the where clause is a typical two table join.
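To see the idiom behave across versions, here is a small sketch using Python's built-in sqlite3 module as a stand-in for the real database. The reduced name table, the sample rows, and the helper `names_at` are assumptions for illustration. The `?` placeholders play the role of $1 and $2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table name (
    id int, version int, main_id int,
    original text, preference_score real,
    is_deleted boolean default 0,
    primary key (id, version)
);
insert into name values
    (1, 1, 7, 'Smith, Jon',  1.0, 0),   -- name 1 created at version 1
    (2, 1, 7, 'Smyth, John', 2.0, 0),   -- name 2 created at version 1
    (1, 2, 7, 'Smith, John', 1.0, 0),   -- name 1 corrected at version 2
    (2, 3, 7, 'Smyth, John', 2.0, 1);   -- name 2 marked deleted at version 3
""")

NAMES_AT_VERSION = """
select aa.id, aa.version, aa.original
from name as aa,
    (select id, max(version) as version from name
     where version <= ? and main_id = ? group by id) as bb
where
    aa.id = bb.id and
    aa.version = bb.version and
    not aa.is_deleted
order by aa.preference_score, aa.id
"""

def names_at(conn, version, main_id):
    """Return the names of a constellation as they stood at a given version."""
    return conn.execute(NAMES_AT_VERSION, (version, main_id)).fetchall()

at_v1 = names_at(conn, 1, 7)
at_v2 = names_at(conn, 2, 7)
at_v3 = names_at(conn, 3, 7)
```

At version 2 the corrected spelling of name 1 appears next to the untouched version 1 row of name 2. At version 3 the most recent row for name 2 is flagged deleted, so it drops out of the result even though all of its rows remain in the table.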