Commit a20a2182 by Tom Laudeman

merging branch tom into master

parents 12aaedb3 088b8233
#### Name and alternate name
There is no consensus on the canonical name. There probably can be no single, always preferred name. What is
preferred depends on context, and will vary for different purposes. In the data we can capture a reasonable
amount of context, but only the users know what is preferred. Computationally, we should treat all names as
alternates. We can offer names in one (or more) of several agreed-upon formats, leaving the choice up to the
user.
The distinction between name and alternate name has no effect on identity matching, because the match is done
on all alternates and uses all available data from the identity constellation.
#### Name components
Given the variety of components in names, it is not possible to create a canonical, single set of components
for many names and their alternates. Even the concept of "preferred" name is debatable.
It is (mostly) possible to parse out the components for each name and each alternate name. Thus we have as many
sets of name components as we have names for a single identity constellation.
Suggestion: we should not become dogmatic about component labels. We should avoid debating "family name" versus
"surname", even though "family name" is perhaps more culturally relevant; likewise "givenname" versus
"forename". The goal is to be culturally agnostic in the name_component vocabulary, and to capture cultural
practice in some other place or table, such as table name_format.
If we want language specific name components then we need to add field "language" to name_component. Because
components may be extracted from several name strings, we probably will not join table name to
table name_component. It would be a many-to-one join, and given that a name component relates to cpf, and not
to one or more name strings, a join between table name and table name_component is not logical.
Do we need information about how each name component was derived? If so we are probably better off saving a
log of the name parser than trying to use the database as part of the historical record of name
parsing. One aspect of database-centric component derivation would be a join table to handle the
many-to-many relation between table name and table name component. Complete info about derivation also
requires the name parsing version number and any configuration at the time the parsing was done.
Field nc_label (table name_component) must come from a controlled vocabulary in order for dynamic
formatting to work.
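As a rough illustration of why dynamic formatting depends on a controlled vocabulary, the sketch below formats one name from labeled components. It is not the SNAC implementation: the label set is borrowed from the component lists in this document, and `NC_LABELS`, `NAME_FORMATS`, the two format names, and `format_name()` are hypothetical stand-ins for tables name_component and name_format.

```python
# Hypothetical controlled vocabulary for nc_label (labels taken from the
# component lists in this document).
NC_LABELS = {"surname", "middle", "forename", "prefix", "suffix",
             "epithet", "title", "pretitle", "numeration", "additional"}

# Hypothetical name_format entries: each format is an ordered list of
# (label, separator emitted before the component when it is not first).
NAME_FORMATS = {
    "indirect": [("surname", ""), ("forename", ", "), ("middle", " "),
                 ("suffix", ", ")],
    "direct": [("prefix", ""), ("forename", " "), ("middle", " "),
               ("surname", " "), ("suffix", ", ")],
}


def format_name(components, format_key):
    """Format a list of (nc_label, value) pairs using the named format."""
    for label, _value in components:
        if label not in NC_LABELS:
            # An uncontrolled label cannot be placed by any format.
            raise ValueError(f"unknown nc_label: {label!r}")
    parts = []
    for label, separator in NAME_FORMATS[format_key]:
        for nc_label, value in components:
            if nc_label == label:
                parts.append((separator if parts else "") + value)
    return "".join(parts)


components = [("forename", "Thomas"), ("surname", "Jefferson"), ("suffix", "Jr.")]
print(format_name(components, "indirect"))
print(format_name(components, "direct"))
```

The formatter never needs to know what a label means culturally; it only needs the label to be one that a format can refer to, which is why free-text labels would break it.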
We should not allow our technology to be defined by the "minimal existing implementation" of name. We will
cripple SNAC if we only meet the minimal definition of name. Additionally, SNAC has needs beyond those of
our archives and authority stakeholders.
#### Minimal list of component labels
The least flexible system (of the 3 or 4 we have reviewed) for name components is probably MARC. Even in MARC
the $c is repeatable, allowing for a large number of (unlabeled) components. This system is probably too
restrictive, although it allows us to capture middle name when possible. However, lack of flexible labels
makes MARC a weak standard for names.
surname, forename, additions, numeration, expansion
#### Larger list of components
This list comes from a combination of MARC, Unimarc, ISNI, ArchivesSpace, and British Library. We might be wise
to include others as well (VIAF, BnF, AnF, Archives Hub UK).
surname, middle, forename, prefix, suffix, epithet, title, pretitle, numeration, additional
Unfortunately, current name format guidelines are ambiguous. The most common problem is that middle name could
be a second forename or the second of several additional name components.
There is general agreement on "name" and "non-name" parts, although no guides explicitly talk about this. Name
parts are surname, forename, and middle name. Many other non-name parts are often found in names. Both the
name and non-name parts have ambiguous rules within most systems, and the various systems and cultures have
incomplete agreement.
The database is improved by labeling the components where possible, but our algorithms and user
interface can (fairly) easily remain agnostic about component labels while processing and displaying the
components as well as formatting the components into names.
In the past, failure to create names by (re-) formatting components has led to inconsistent names. While it is
not possible to 100% parse or format names, it is also true that humans who did the data entry were not 100%
accurate in their formats. The computer can be more consistent, and probably nearly as accurate as the human
editors (especially where the editors cannot agree or where they have ambiguous rules). While it is
technically feasible to format names from components, it is also feasible to keep and (carefully) display the
names as originally entered.
#### Overview
ISNI: prefix (NR), surname (NR), forename (additional forenames) (R-ish), middle name (second and subsequent
forenames) (R-ish), suffix (NR)
Unimarc: $a Surname (NR), $b Given name remainder (NR), $c Additions (R), $d Roman numerals (NR), $g Expansion (NR)
MARC: $a name (NR), $b numeration (NR), $c titles and other words (R), $q fuller form (NR)
#### Detailed fields by authority
(R) are repeatable fields. (NR) are non-repeating fields.
#### ISNI
From: ISNI fields of tab delimited format for data submission, A. MacEwan et al.
http://dx.doi.org/10.1080/01639374.2012.730601
http://www.tandfonline.com/doi/abs/10.1080/01639374.2012.730601?journalCode=wccq20
ISNI components:
prefix, surname, forename and additional forenames, middle name second and subsequent forenames, suffix
- prefix: e.g. Sir
- surname: all parts of surname in the form commonly used; for alt form use alt name
- forename: one or more forenames or initials
- middle: second and subsequent forenames
- suffix: e.g. Esq
- ISNI's input format supports a single alternate name in indirect format
#### Unimarc
http://www.ifla.org/files/assets/uca/unimarc_updates/BIBLIOGRAPHIC/u-b_700_update.pdf
Unimarc components:
$a Surname (NR), $b Given name remainder (NR), $c Additions (R), $d Roman numerals (NR), $g Expansion (NR)
#### LoC MARC 1xx
http://www.loc.gov/marc/authority/ad100.html
MARC and Unimarc are surname oriented: "that part of the name by which the name is entered in ordered
lists", although they allow the main entry to be a forename (distinguished by the indicators attribute). In
both, all non-surname parts of the given name are thrown together into forename. Both have a numeration field,
and both allow many additional components. MARC $c acknowledges that there are many types of other words
associated with names.
MARC components:
$a name (NR), $b numeration (NR), $c titles and other words (R), $q fuller form (NR)
(First indicator determines direct or indirect format of $a.)
```
100 - MAIN ENTRY--PERSONAL NAME (NR)
Indicators
First - Type of personal name entry element
0 - Forename
1 - Surname
3 - Family name
$a - Personal name (NR)
$b - Numeration (NR) [roman numerals]
$c - Titles and other words associated with a name (R) [Jr., King of Sweden, Meister, pseud., Sir, (Anglo-Norman poet), Esq., II]
$d - Dates associated with a name (NR)
$q - Fuller form of name (NR)
```
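To make the subfield list concrete, here is a minimal sketch that splits a MARC 100 field, written in the common `$a ... $b ...` text notation, into labeled components. The text notation as input, the `parse_marc_100()` helper, and the sample string are assumptions for illustration; the sample is not copied from an actual authority record, and real MARC data would normally be read with a MARC library rather than a regular expression.

```python
import re

# Subfield meanings as listed for MARC field 100 above.
MARC_100_SUBFIELDS = {
    "a": "name",
    "b": "numeration",
    "c": "titles and other words",
    "d": "dates",
    "q": "fuller form",
}
REPEATABLE = {"c"}  # only $c is marked (R) in the list above


def parse_marc_100(text):
    """Split '$a ... $b ...' text into (label, value) pairs, in field order."""
    components = []
    seen = set()
    for code, value in re.findall(r"\$(\w)\s*([^$]*)", text):
        label = MARC_100_SUBFIELDS.get(code)
        if label is None:
            continue  # subfield not covered by this sketch
        if code in seen and code not in REPEATABLE:
            raise ValueError(f"${code} is not repeatable")
        seen.add(code)
        # Drop the trailing punctuation that separates subfields.
        components.append((label, value.strip().rstrip(",").strip()))
    return components


for label, value in parse_marc_100("$a Gustaf $b V, $c King of Sweden, $d 1858-1950"):
    print(f"{label}: {value}")
```

Note that every $c value comes back with the same generic label, which is the weakness discussed above: the format records that other words exist, but not what kind of component each one is.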
#### ArchivesSpace
http://sandbox.archivesspace.org/
ArchivesSpace components:
prefix, title, "primary part of name (required)", rest of name, suffix, fuller form.
#### Introduction
The workflow engine is a lightweight request routing tool. In our application it encapsulates our business
processes at a high level. Architecturally, it lives inside web middleware. Its function in the middleware
is to handle calling the proper high level functions. We have two workflow engines, because we separate web
UI based workflow from fundamental business (policy) issues, rather than conflating the two problems.
Small web applications that will always be small (a maximum of 5 web pages) often use "page controllers" where
each page handles its own logic, and connections between pages are implicit in the links. Larger sites use a
"front controller" which is a single point of control.
The workflow engine handles the application decision making logic in the front controller. Business
process decisions are handled in the server-side front controller, and we have a separate workflow engine
limited to the browser and UI. Web HTTP requests go to the browser controller, where they are normalized for the
server controller. REST calls are also normalized and sent to the same server controller. Thus interactions
with the server internals always follow consistent business and policy workflow.
It is important to remember that nearly all aspects of the current application design involve lightweight
solutions to typical problems. Rather than a comprehensive framework, we have chosen to use a select set of
off the shelf software modules to construct a framework suitable to our needs.
#### Requirements
The workflow engine encapsulates only decision making. It assumes other code deeper in the application will do
the real work. The decisions are written down in a 4 column state table. Workflow is testable by stepping
through the state table manually. Workflow is also testable via computational methods that will validate that
the states will reach an exit, and that all states are reachable.
The 4 columns are: starting state, boolean transition test, transition function to run, next state. There are
3 pseudo-functions: jump, return, wait. The jump will push the current state onto an internal stack and jump
to a new state. The return pops the stack and returns to that state where it immediately transitions to the
next state. The wait might be called exit since it causes workflow to stop.
Workflow always begins with a default starting state. From a starting state, the boolean transition test is
run. If true, the transition will occur. If false, the next row with the same starting state name has its
boolean transition test run. When a test is true and a transition function exists, the function is run
(eval'd). The workflow then transitions to the next state, and the process repeats until the wait function
is reached.
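The loop described above can be sketched in a few lines of Python. This is a thought aid under stated assumptions, not the engine's actual code: tests and functions are looked up by name in plain dicts instead of being eval'd, a test of `None` counts as true, and after `return` the engine simply re-evaluates the rows of the popped state (so that state's tests are expected to give a different answer the second time). The `save_record` function and all state names are hypothetical.

```python
def run_workflow(table, tests, functions, context, start="start", max_steps=1000):
    """Step through a 4-column state table until a 'wait' row fires.

    table: list of (state, test_name, function_name, next_state) rows.
    tests, functions: dicts mapping names to callables that take the context.
    Returns the list of states visited, in order.
    """
    stack = []  # states saved by the 'jump' pseudo-function
    state = start
    visited = [state]
    for _ in range(max_steps):
        # Find the first row for this state whose test is true.
        for row_state, test_name, function_name, next_state in table:
            if row_state != state:
                continue
            if test_name is None or tests[test_name](context):
                break
        else:
            raise RuntimeError(f"no transition test is true in state {state!r}")
        if function_name == "wait":
            return visited
        if function_name == "jump":
            stack.append(state)  # remember where to come back to
            state = next_state
        elif function_name == "return":
            state = stack.pop()  # resume in the state that jumped
        else:
            if function_name is not None:
                functions[function_name](context)
            state = next_state
        visited.append(state)
    raise RuntimeError("workflow did not reach wait")


# Hypothetical table: unlock a locked record in a sub-workflow, then save it.
table = [
    ("start", "is_locked", "jump", "unlock"),
    ("start", None, "save_record", "done"),
    ("unlock", None, "unlock_record", "unlocked"),
    ("unlocked", None, "return", None),
    ("done", None, "wait", None),
]
tests = {"is_locked": lambda ctx: ctx["locked"]}
functions = {
    "unlock_record": lambda ctx: ctx.update(locked=False),
    "save_record": lambda ctx: ctx.update(saved=True),
}
context = {"locked": True, "saved": False}
visited = run_workflow(table, tests, functions, context)
print(visited, context)
```

The `max_steps` guard is there because, under this reading of `return`, a table whose tests never change would loop forever between the jump and the return.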
To accommodate multiple boolean transition tests, there can be multiple rows with the same starting state
name. These are tested in the order they occur in the state table. If none of the transition tests is true,
the machine halts with an error. This possibility is revealed during testing. By convention, an empty
transition test counts as true, thus any starting state may (and probably should) have a default catch-all
row. In keeping with business rules this answers the workflow question "What happens at this step if
everything goes wrong?"
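The computational checks mentioned in this section (every state reachable, every state able to reach an exit, every state having a catch-all) can be approximated with a simple graph walk over the table. The sketch below is one possible approach, not the project's test suite. It assumes rows are `(state, test, function, next_state)` tuples, that a missing test is written as `None`, and that a `return` row may resume in any state that has a `jump` row. It ignores the tests themselves, so it treats every row as one that might fire.

```python
from collections import defaultdict, deque


def _walk(edges, seeds):
    """Breadth-first search: every node reachable from the seed nodes."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def check_table(table, start="start"):
    """Return (unreachable, stuck): states never reached, states with no path to wait."""
    edges = defaultdict(set)
    exits = set()
    states = {row[0] for row in table}
    jumpers = {row[0] for row in table if row[2] == "jump"}
    for state, _test, function_name, next_state in table:
        if function_name == "wait":
            exits.add(state)
        elif function_name == "return":
            edges[state] |= jumpers  # may resume in any jumping state
        else:
            edges[state].add(next_state)
            states.add(next_state)
    reachable = _walk(edges, {start})
    reverse = defaultdict(set)
    for src, dsts in edges.items():
        for dst in dsts:
            reverse[dst].add(src)
    can_exit = _walk(reverse, exits)
    return states - reachable, states - can_exit


def states_without_catch_all(table):
    """States that have no row with an empty (always true) test."""
    has_default = {row[0] for row in table if row[1] is None}
    return {row[0] for row in table} - has_default


# Hypothetical table used only to exercise the checks.
example = [
    ("start", "is_locked", "jump", "unlock"),
    ("start", None, "save_record", "done"),
    ("unlock", None, "unlock_record", "unlocked"),
    ("unlocked", None, "return", None),
    ("done", None, "wait", None),
]
print(check_table(example), states_without_catch_all(example))
```

Because tests are ignored, a clean result here does not prove the workflow terminates for every input; it only rules out states that are structurally orphaned or dead-ended.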
#### Implementation as thought problem
Implementation can be handled in several ways, and thinking through them may help you understand (extrapolate)
how the system works.
In the first mode, the workflow engine state table's functions are eval'd as literal function calls. For every
function that the state table calls for a given state transition, the function must exist in the system. A
string "unlock_record()" when eval'd will run the function unlock_record. The workflow engine doesn't know
what exactly goes on inside that function, but it does "know" that it will unlock the current record.
Creation of the workflow involves a shared understanding between the programmer writing the workflow, and the
programmer creating the system code.
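For comparison with the eval'd mode, the following sketch resolves the same kind of state table string through a registry of known functions instead of evaluating it. This is an alternative technique shown for illustration, not what the text above describes, and `REGISTRY`, `workflow_function`, and `call_by_name` are hypothetical names. Only `unlock_record` comes from the example in this document, and its body here is invented.

```python
REGISTRY = {}


def workflow_function(func):
    """Register func under its own name so a state table can refer to it."""
    REGISTRY[func.__name__] = func
    return func


@workflow_function
def unlock_record(record):
    # Stand-in body: the real function would do the actual unlocking work.
    record["locked"] = False
    return record


def call_by_name(call_string, *args):
    """Run the function named by a state table string such as 'unlock_record()'."""
    name = call_string.strip()
    if name.endswith("()"):
        name = name[:-2]
    try:
        func = REGISTRY[name]
    except KeyError:
        raise LookupError(f"state table names an unknown function: {name!r}") from None
    return func(*args)


print(call_by_name("unlock_record()", {"locked": True}))
```

Either way the shared understanding is the same: the workflow author only needs the name and the promised effect, and a missing function shows up as an immediate, explicit error.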
digraph States {
# dot -Tsvg constellation_linked.gv -O
# Will create constellation_linked.gv.svg
label = "\n\nIdentity Constellation\nTwo linked identities";
labelloc="t";
fontsize=20;
inputscale=0;
# sep=1;
# splines=true;
overlap=false;
node [pos="4,5!"]; "root1";
node [pos="1,3!"]; ne1;
node [pos="3,3!"]; an1;
node [pos="5,3!"]; ed1;
node [pos="7,3!"]; cr1;
node [pos="10,3!"]; occ1;
node [pos="9,4!"]; rr1;
node [pos="3.3,1.5!"]; "root2";
node [pos="1,2!"]; ne2;
node [pos="3,0!"]; an2;
node [pos="1.5,1!"]; an3;
node [pos="5,0!"]; ed2;
node [pos="7.4,2!"]; cr2;
node [pos="9,0!"]; occ2;
node [pos="8,-1!"]; occ22;
node [pos="10,1.5!"]; rr2;
"ne1","ne2" [label="alt name"];
"an1", "an2", "an3" [label="alt name"];
"ed1", "ed2" [label="exist dates"];
"occ1", "occ2", "occ22" [label="occupation/function"];
"cr1", "cr2" [label="cpf relation"];
"rr1", "rr2" [label="resource relation"];
"root1" [label="identity-A"];
"root2" [label="identity-B"];
root1 -> ne1;
root1 -> an1;
root1 -> ed1;
root1 -> occ1;
root1 -> cr1;
root1-> rr1;
cr1 -> cr2 ;
cr2 -> cr1 ;
root2-> rr2;
cr2 -> root2 [dir="back"];
root2 -> occ2;
root2 -> occ22;
root2 -> ed2;
root2 -> an2;
root2 -> an3;
root2 -> ne2;
}
digraph States {
// neato -n2 -Tsvg identity_constellation.gv -O
//
// Absolute positioning appears to only work with neato, and only if all nodes are pinned,
// but not always. neato -n2 units are points, and inputscale appears to be ignored
// sep=0.2 splines=polyline overlap=false allows the pos values to be followed,
// while getting the lines to go around nodes.
label = "\n\nIdentity Constellation";
labelloc="t";
fontsize=20;
// inputscale=75;
sep=0.08;
splines=polyline;
overlap=false;
"an1" [label="alt name"];
"an2", "an3" [label="alt name"];
"ed1" [label="exist dates"];
"occ1", "occ2" [label="occupation\nor function"];
"cr1", "cr2" [label="cpf relation"];
"rr1" [label="resource relation"];
root1 [pos="350,400!" label="identity root"];
place [pos="200,450!" label="related place"];
an1 [pos="100,320!" ];
an2 [pos="100,250!" ];
an3 [pos="100,240!" ];
ed1 [pos="200,200!"];
biog [pos="160,400!" label="biog hist"] ;
cr1 [pos="500,300!"];
cr2 [pos="600,200!"];
et [pos="300,100!" label="entity type"];
occ1 [pos="350,250!"];
occ2 [pos="450,200!"];
rr1 [pos="550,350!"];
src [pos="550,400!" label="source"];
usedate [pos="105,315!", label="use dates"];
an1 -> usedate;
root1 -> et;
root1 -> src;
root1 -> place;
root1 -> an1;
root1 -> an2;
root1 -> an3;
root1 -> ed1;
root1 -> occ1;
root1 -> occ2;
root1 -> cr1;
root1-> rr1;
root1 -> biog;
cr1 -> cr2 ;
cr2 -> cr1 ;
}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<!-- Generated by graphviz version 2.34.0 (20140101.1016)
-->
<!-- Title: States Pages: 1 -->
<svg width="637pt" height="472pt"
viewBox="0.00 0.00 637.14 472.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 468)">
<title>States</title>
<polygon fill="white" stroke="white" points="-4,4 -4,-468 633.14,-468 633.14,4 -4,4"/>
<text text-anchor="middle" x="314.57" y="-396" font-family="Times,serif" font-size="20.00">Identity Constellation</text>
<!-- an1 -->
<g id="node1" class="node"><title>an1</title>
<ellipse fill="none" stroke="black" cx="41.5963" cy="-269.413" rx="41.6928" ry="18"/>
<text text-anchor="middle" x="41.5963" y="-265.713" font-family="Times,serif" font-size="14.00">alt name</text>
</g>
<!-- usedate -->
<g id="node15" class="node"><title>usedate</title>
<ellipse fill="none" stroke="black" cx="120.194" cy="-225.48" rx="43.5923" ry="18"/>
<text text-anchor="middle" x="120.194" y="-221.78" font-family="Times,serif" font-size="14.00">use dates</text>
</g>
<!-- an1&#45;&gt;usedate -->
<g id="edge1" class="edge"><title>an1&#45;&gt;usedate</title>
<path fill="none" stroke="black" d="M67.3322,-255.027C72.9733,-251.874 79.0394,-248.484 84.9796,-245.163"/>
<polygon fill="black" stroke="black" points="87.1102,-247.982 94.1315,-240.048 83.6949,-241.872 87.1102,-247.982"/>
</g>
<!-- an2 -->
<g id="node2" class="node"><title>an2</title>
<ellipse fill="none" stroke="black" cx="70.388" cy="-182.051" rx="41.6928" ry="18"/>
<text text-anchor="middle" x="70.388" y="-178.351" font-family="Times,serif" font-size="14.00">alt name</text>
</g>
<!-- an3 -->
<g id="node3" class="node"><title>an3</title>
<ellipse fill="none" stroke="black" cx="52.8941" cy="-82.5315" rx="41.6928" ry="18"/>
<text text-anchor="middle" x="52.8941" y="-78.8315" font-family="Times,serif" font-size="14.00">alt name</text>
</g>
<!-- ed1 -->
<g id="node4" class="node"><title>ed1</title>
<ellipse fill="none" stroke="black" cx="176.494" cy="-118" rx="48.9926" ry="18"/>
<text text-anchor="middle" x="176.494" y="-114.3" font-family="Times,serif" font-size="14.00">exist dates</text>
</g>
<!-- occ1 -->
<g id="node5" class="node"><title>occ1</title>
<ellipse fill="none" stroke="black" cx="326.494" cy="-168" rx="55.3091" ry="26.7407"/>
<text text-anchor="middle" x="326.494" y="-171.8" font-family="Times,serif" font-size="14.00">occupation</text>
<text text-anchor="middle" x="326.494" y="-156.8" font-family="Times,serif" font-size="14.00">or function</text>
</g>
<!-- occ2 -->
<g id="node6" class="node"><title>occ2</title>
<ellipse fill="none" stroke="black" cx="426.494" cy="-118" rx="55.3091" ry="26.7407"/>
<text text-anchor="middle" x="426.494" y="-121.8" font-family="Times,serif" font-size="14.00">occupation</text>
<text text-anchor="middle" x="426.494" y="-106.8" font-family="Times,serif" font-size="14.00">or function</text>
</g>
<!-- cr1 -->
<g id="node7" class="node"><title>cr1</title>
<ellipse fill="none" stroke="black" cx="476.494" cy="-218" rx="52.7911" ry="18"/>
<text text-anchor="middle" x="476.494" y="-214.3" font-family="Times,serif" font-size="14.00">cpf relation</text>
</g>
<!-- cr2 -->
<g id="node8" class="node"><title>cr2</title>
<ellipse fill="none" stroke="black" cx="576.494" cy="-118" rx="52.7911" ry="18"/>
<text text-anchor="middle" x="576.494" y="-114.3" font-family="Times,serif" font-size="14.00">cpf relation</text>
</g>
<!-- cr1&#45;&gt;cr2 -->
<g id="edge14" class="edge"><title>cr1&#45;&gt;cr2</title>
<path fill="none" stroke="black" d="M493.913,-200.581C509.973,-184.521 533.976,-160.518 551.972,-142.523"/>
<polygon fill="black" stroke="black" points="554.679,-144.765 559.275,-135.219 549.729,-139.815 554.679,-144.765"/>
</g>
<!-- cr2&#45;&gt;cr1 -->
<g id="edge15" class="edge"><title>cr2&#45;&gt;cr1</title>
<path fill="none" stroke="black" d="M559.076,-135.419C543.016,-151.479 519.012,-175.482 501.017,-193.477"/>
<polygon fill="black" stroke="black" points="498.31,-191.235 493.713,-200.781 503.259,-196.185 498.31,-191.235"/>
</g>
<!-- rr1 -->
<g id="node9" class="node"><title>rr1</title>
<ellipse fill="none" stroke="black" cx="526.494" cy="-268" rx="71.4873" ry="18"/>
<text text-anchor="middle" x="526.494" y="-264.3" font-family="Times,serif" font-size="14.00">resource relation</text>
</g>
<!-- root1 -->
<g id="node10" class="node"><title>root1</title>
<ellipse fill="none" stroke="black" cx="326.494" cy="-318" rx="55.4913" ry="18"/>
<text text-anchor="middle" x="326.494" y="-314.3" font-family="Times,serif" font-size="14.00">identity root</text>
</g>
<!-- root1&#45;&gt;an1 -->
<g id="edge5" class="edge"><title>root1&#45;&gt;an1</title>
<path fill="none" stroke="black" d="M277.519,-309.648C225.166,-300.719 142.657,-286.648 90.379,-277.732"/>
<polygon fill="black" stroke="black" points="90.7905,-274.252 80.3444,-276.021 89.6137,-281.152 90.7905,-274.252"/>
</g>
<!-- root1&#45;&gt;an2 -->
<g id="edge6" class="edge"><title>root1&#45;&gt;an2</title>
<path fill="none" stroke="black" d="M300.146,-301.895C251.614,-272.23 153.009,-211.959 153.009,-211.959 153.009,-211.959 132.706,-204.61 112.23,-197.197"/>
<polygon fill="black" stroke="black" points="113.162,-193.813 102.568,-193.7 110.78,-200.395 113.162,-193.813"/>
</g>
<!-- root1&#45;&gt;an3 -->
<g id="edge7" class="edge"><title>root1&#45;&gt;an3</title>
<path fill="none" stroke="black" d="M306.818,-301.066C258.568,-259.541 134.333,-152.62 79.5364,-105.461"/>
<polygon fill="black" stroke="black" points="81.5287,-102.558 71.6661,-98.6872 76.9625,-107.863 81.5287,-102.558"/>
</g>
<!-- root1&#45;&gt;ed1 -->
<g id="edge8" class="edge"><title>root1&#45;&gt;ed1</title>
<path fill="none" stroke="black" d="M313.183,-300.251C286.541,-264.729 226.578,-184.778 195.689,-143.593"/>
<polygon fill="black" stroke="black" points="198.362,-141.323 189.562,-135.423 192.762,-145.523 198.362,-141.323"/>
</g>
<!-- root1&#45;&gt;occ1 -->
<g id="edge9" class="edge"><title>root1&#45;&gt;occ1</title>
<path fill="none" stroke="black" d="M326.494,-299.906C326.494,-276.556 326.494,-235.353 326.494,-205.194"/>
<polygon fill="black" stroke="black" points="329.994,-205.034 326.494,-195.034 322.994,-205.034 329.994,-205.034"/>
</g>
<!-- root1&#45;&gt;occ2 -->
<g id="edge10" class="edge"><title>root1&#45;&gt;occ2</title>
<path fill="none" stroke="black" d="M335.545,-299.899C352.073,-266.843 387.408,-196.172 408.844,-153.301"/>
<polygon fill="black" stroke="black" points="411.983,-154.848 413.325,-144.338 405.722,-151.718 411.983,-154.848"/>
</g>
<!-- root1&#45;&gt;cr1 -->
<g id="edge11" class="edge"><title>root1&#45;&gt;cr1</title>
<path fill="none" stroke="black" d="M350.928,-301.711C376.098,-284.931 415.489,-258.67 443.433,-240.041"/>
<polygon fill="black" stroke="black" points="445.692,-242.742 452.071,-234.282 441.809,-236.917 445.692,-242.742"/>
</g>
<!-- root1&#45;&gt;rr1 -->
<g id="edge12" class="edge"><title>root1&#45;&gt;rr1</title>
<path fill="none" stroke="black" d="M370.385,-307.027C398.4,-300.024 435.101,-290.848 465.878,-283.154"/>
<polygon fill="black" stroke="black" points="466.791,-286.534 475.644,-280.713 465.094,-279.743 466.791,-286.534"/>
</g>
<!-- place -->
<g id="node11" class="node"><title>place</title>
<ellipse fill="none" stroke="black" cx="176.494" cy="-368" rx="57.3905" ry="18"/>
<text text-anchor="middle" x="176.494" y="-364.3" font-family="Times,serif" font-size="14.00">related place</text>
</g>
<!-- root1&#45;&gt;place -->
<g id="edge4" class="edge"><title>root1&#45;&gt;place</title>
<path fill="none" stroke="black" d="M287.866,-330.876C268.843,-337.217 245.702,-344.931 225.453,-351.681"/>
<polygon fill="black" stroke="black" points="224.155,-348.424 215.775,-354.906 226.369,-355.065 224.155,-348.424"/>
</g>
<!-- biog -->
<g id="node12" class="node"><title>biog</title>
<ellipse fill="none" stroke="black" cx="136.494" cy="-318" rx="42.4939" ry="18"/>
<text text-anchor="middle" x="136.494" y="-314.3" font-family="Times,serif" font-size="14.00">biog hist</text>
</g>
<!-- root1&#45;&gt;biog -->
<g id="edge13" class="edge"><title>root1&#45;&gt;biog</title>
<path fill="none" stroke="black" d="M271.028,-318C245.227,-318 214.692,-318 189.24,-318"/>
<polygon fill="black" stroke="black" points="189.034,-314.5 179.034,-318 189.034,-321.5 189.034,-314.5"/>
</g>
<!-- et -->
<g id="node13" class="node"><title>et</title>
<ellipse fill="none" stroke="black" cx="276.494" cy="-18" rx="49.2915" ry="18"/>
<text text-anchor="middle" x="276.494" y="-14.3" font-family="Times,serif" font-size="14.00">entity type</text>
</g>
<!-- root1&#45;&gt;et -->
<g id="edge2" class="edge"><title>root1&#45;&gt;et</title>
<path fill="none" stroke="black" d="M319.416,-299.896C303.808,-259.976 267.81,-167.909 267.81,-167.909 267.81,-167.909 272.337,-89.7605 274.854,-46.3123"/>
<polygon fill="black" stroke="black" points="278.363,-46.2682 275.447,-36.0825 271.374,-45.8633 278.363,-46.2682"/>
</g>
<!-- src -->
<g id="node14" class="node"><title>src</title>
<ellipse fill="none" stroke="black" cx="526.494" cy="-318" rx="34.394" ry="18"/>
<text text-anchor="middle" x="526.494" y="-314.3" font-family="Times,serif" font-size="14.00">source</text>
</g>
<!-- root1&#45;&gt;src -->
<g id="edge3" class="edge"><title>root1&#45;&gt;src</title>
<path fill="none" stroke="black" d="M381.94,-318C413.48,-318 452.454,-318 481.982,-318"/>
<polygon fill="black" stroke="black" points="482.005,-321.5 492.005,-318 482.005,-314.5 482.005,-321.5"/>
</g>
</g>
</svg>
digraph States {
// neato -n2 -Tsvg identity_constellation_repeats.gv -O
//
// Absolute positioning appears to only work with neato, and only if all nodes are pinned,
// but not always. neato -n2 units are points, and inputscale appears to be ignored
// sep=0.2 splines=polyline overlap=false allows the pos values to be followed,
// while getting the lines to go around nodes.
label = "\n\nIdentity Constellation\n(R) repeatable fields";
labelloc="t";
fontsize=20;
// inputscale=75;
sep=0.05;
// nodesep is a synonym for sep?
// nodesep=0.1;
splines=polyline;
overlap=false;
"an1" [label="name/alt(R)"];
"ed1" [label="exist dates"];
"occ1" [label="occupation\nor function(R)"];
"cr1" [label="identity relation(R)"];
"rr1" [label="resource relation(R)"];
root1 [pos="470,400!" label="identity root"];
place [pos="320,450!" label="related place(R)"];
an1 [pos="270,350!" ];
pref [pos="120,410!" label="preferred"];
usedate [pos="120,350!", label="use dates"];
name_components [pos="140,300!", label="components"];
language [pos="150,250!", label="language"];
script [pos="180,200!", label="script"];
authorized_form [pos="210,140!", label="authorized\nform"];
an1 -> language;
an1 -> script;
an1 -> authorized_form;
an1 -> pref;
name_components -> surname;
name_components -> forename;
name_components -> numeration;
name_components -> prefix;
name_components -> suffix;
surname [pos="0,350!", label="surname(R)"];
forename [pos="0,300!", label="forename(R)"];
numeration [pos="0,250!", label="numeration"];
prefix [pos="0,200!", label="prefix(R)"];
suffix [pos="0,150!", label="suffix(R)"];
ed1 [pos="330,270!"];
biog [pos="280,400!" label="biog hist"] ;
cr1 [pos="730,310!"];
et [pos="340,100!" label="entity type"];
occ1 [pos="550,250!"];
subject [pos="460,180!" label="topical subject(R)"];
rr1 [pos="720,200!"];
src [pos="670,400!" label="source(R)"];
citation [pos="690,450!" label="citation(R)"];
root1 -> subject;
root1 -> citation;
root1 -> et;
root1 -> src;
root1 -> place;
root1 -> an1;
root1 -> ed1;
root1 -> occ1;
root1 -> cr1;
root1 -> rr1;
root1 -> biog;
an1 -> usedate;
an1 -> name_components;
}
##### Constellation diagrams
The diagrams show conceptual table names, which mirror the SQL database schema. An identity
has one record in each table unless the table is noted as repeatable (R).
constellation_linked.gv.svg is an overview showing how two related records are linked via identity
relation. Each record is independent. The only link between records is the identity relation, which has one
record for each side of the relationship. For clarity, the two identity constellations are not
shown in full detail.
identity_constellation_repeats.gv.svg is a detailed view of a single identity constellation with repeatable
records noted.
identity_constellation.gv.svg is an earlier, simplified view of the data.
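
The linking described above can be sketched in code. This is a minimal illustration, not the SNAC schema: the class names, field names, and the relation label `associatedWith` are all hypothetical. It shows single-record fields as scalars, repeatable (R) fields as lists, and a link between two independent constellations stored as one identity relation record on each side.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IdentityRelation:
    """One side of a relationship; each constellation holds its own record."""
    source_id: int      # constellation that owns this record
    target_id: int      # constellation on the other side
    relation_type: str  # hypothetical label, not a SNAC vocabulary term


@dataclass
class Constellation:
    """Hypothetical in-memory view of one identity constellation."""
    id: int
    entity_type: str                   # single record
    exist_dates: Optional[str] = None  # single record
    # Repeatable (R) fields are lists.
    names: List[str] = field(default_factory=list)
    occupations: List[str] = field(default_factory=list)
    identity_relations: List[IdentityRelation] = field(default_factory=list)


def link(a: Constellation, b: Constellation, relation_type: str) -> None:
    """Link two independent constellations with one relation record per side."""
    a.identity_relations.append(IdentityRelation(a.id, b.id, relation_type))
    b.identity_relations.append(IdentityRelation(b.id, a.id, relation_type))


# Usage: two otherwise independent records, linked only via identity relation.
alice = Constellation(id=1, entity_type="person", names=["Example, Alice"])
org = Constellation(id=2, entity_type="corporateBody", names=["Example Society"])
link(alice, org, "associatedWith")
```

Keeping a record on each side means either constellation can be read without loading the other, which matches the statement that each record is independent.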
SNAC Data outline
This is a broad view of various kinds of data in the SNAC web application. At the core, SNAC data was
historically EAC-CPF. Working with the CPF data causes two things to happen. First, CPF data itself becomes
more of a constellation than discrete fields. Second, using and manipulating the data requires many types of
meta data.
SNAC has always had aspects of controlled vocabulary and authority work. Both of those are being formalized
and both add data to the SNAC application.
Most of the data resides in the SQL database. Nearly every item below corresponds to a SQL table. The database
also has additional tables serving various linking and record keeping functions. At this time, we have several
non-SQL data stores: XTF, Neo4j, and an Elasticsearch index.
- EAC-CPF constellation (broadly disambiguated from "identity", "entity", "EAC-CPF", "record", etc).
- Canonical data in SQL tables
- XML output generated as necessary
- Meta data
- Version system
- data current public version
- data current edit version (for records being edited)
- data old public versions
- data old edit versions
- Merge history
- Links to outside resources
- Archives
- Finding aids
- Multilingual strings
- Web UI labels
- Controlled vocabulary strings, including labels and definitions
- Controlled vocabularies
- Multilingual strings
- Have category and hierarchy
- All vocabularies share a base data structure
- Use varies by policy; does this imply a vocabulary workflow?
- Name format system
- Multiple known formats
- Canonical SNAC format?
- Context sensitive to language, script, and user?
- Workflows
- Web UI workflow
- Workflow specific to web domain, pages, buttons, output type, etc.
- Server workflow
- archivist edit
- split/merge
- identity reconciliation suggested merge
- manual merge
- policy based workflows
- technical workflows
- Web admins
- Create and assign institution roles (more powerful than institution admins)
- Institutions
- Institution admins
- Users
- SNAC CPF entries for institutions
- At least one role per institution
- Users
- Dashboard tabs
- Historical Research Tool per-user search history
- Maintenance tool per-user workflow task status
- Notifications, all users
- Account info
- name, email, user id, password
- roles, as many as necessary
- Web session, possibly multiple sessions per user
- REST API session, similar or identical to web sessions
- Removing a role from a user revokes the associated privilege
- Roles
- Created and maintained by admins with role privileges
- Single privilege per role, must be coordinated with workflows and application functions
- At least one role exists per institution
- At least one role per user (HRT user)
- Potentially, roles for ad-hoc groups (sub-institution, department, professional orgs, etc.)
- Need explicit, on-going policy guidance
- Reports
- Read the database
- Availability based on roles
- XTF full text index
- Neo4j graph database
- Elasticsearch full text index
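
The role rules in the outline above (a single privilege per role, at least one role per user, and removing a role revoking the associated privilege) could be modeled roughly as follows. This is a sketch under stated assumptions, not the SNAC implementation: the class names, role names, and privilege strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class Role:
    """A role carries exactly one privilege (hypothetical names throughout)."""
    name: str
    privilege: str


@dataclass
class User:
    user_id: str
    roles: Set[Role] = field(default_factory=set)

    def privileges(self) -> Set[str]:
        """Privileges are derived from roles, never stored separately."""
        return {role.privilege for role in self.roles}

    def can(self, privilege: str) -> bool:
        return privilege in self.privileges()

    def remove_role(self, role: Role) -> None:
        """Remove a role, which revokes its privilege; keep at least one role."""
        if role not in self.roles:
            return
        if len(self.roles) == 1:
            raise ValueError("a user must keep at least one role")
        self.roles.discard(role)


# Usage: removing the editor role revokes the edit privilege.
hrt = Role("hrt_user", "search")
editor = Role("archivist_editor", "edit_constellation")
user = User("u1", {hrt, editor})
had_edit = user.can("edit_constellation")
user.remove_role(editor)
```

Deriving privileges from the role set, instead of storing them on the user, is one way to guarantee that revoking a role revokes the privilege with it.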