expand name component prose, edit the various component lists for clarity

ecb16a8d · Tom Laudeman · d56092fa · ecb16a8d
Commit ecb16a8d authored Oct 21, 2015 by Tom Laudeman

Showing with 51 additions and 12 deletions

name_compoents_alternates.md Discussion/name_compoents_alternates.md +51 -12

--- a/Discussion/name_compoents_alternates.md
+++ b/Discussion/name_compoents_alternates.md
+#### Name and alternate name
+There is no consensus on the canonical name. There probably can be no single, always preferred name. What is
+preferred depends on context, and will vary for different purposes. In the data we can capture a reasonable
+amount of context, but only the users know what is preferred. Computationally, we should treat all names as
+alternates. We can offer names in one (or more) of several agreed-upon formats, leaving the choice up to the
+user.
+The name/alternate distinction has no effect on identity matching because the match is done on all alternates, and uses
+all available data from the identity constellation. 
 #### Name components
-There is only one set of name components for a given cpf identity, thus name_component is related to table
-cpf, not table name. To derive the components, we can parse the preferred name. Or using a more complex
-algo, we can parse all the names in a given language and build a consensus set, or a canonical set of
-components. Initially I thought that a consensus set precludes having a component "japanese family name",
-but now I can't see why. There is no reason for a canonical set of components not to include components
-used only in a subset of preferred name forms.
+Given the variety of components in names, it is not possible to create a canonical, single set of components
+for many names and their alternates. Even the concept of "preferred" name is debatable.
+It is (mostly) possible to parse out the components for each name and each alternate name. Thus we have as many
+sets of name components as we have names for a single identity constellation.
+Suggest: we not become dogmatic about component labels. We should avoid "family name" vs surname even though
+family name is perhaps more culturally relevant. Ditto givenname vs forename. Goal: be culturally agnostic
+in the name_component vocabulary, and capture cultural practice in some other place/table, such as table
+name_format. 
 If we want language specific name components then we need to add field "language" to name_components. Due
 to components being extracted from possibly several name strings, we probably will not join table name to
@@ -18,11 +34,6 @@ parsing. One aspect of database-centric component derivation would be a join tab
 many-to-many relation between table name and table name component. Complete info about derivation also
 requires the name parsing version number and any configuration at the time the parsing was done.
-Suggest: we not become dogmatic about component labels. We should avoid "family name" vs surname even though
-family name is perhaps more culturally relevant. Ditto givenname vs forename. Goal: be culturally agnostic
-in the name_component vocabulary, and capture cultural practice in some other place/table, such as table
-name_format.
 Field nc_label (table name_component) must come from a controlled vocabulary in order for dynamic
 formatting to work.
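A minimal sketch of why nc_label has to come from a controlled vocabulary for dynamic formatting to work. The label set is taken from the "Larger list of components" in this file. Everything else here is an assumption for illustration: the names `NC_LABELS` and `format_name`, the `{label}` template syntax, and the sample name are hypothetical and are not part of the commit or of any existing code.

```python
import re

# Controlled vocabulary, copied from the "Larger list of components" in this file.
NC_LABELS = frozenset({
    "surname", "middle", "forename", "prefix", "suffix",
    "epithet", "title", "pretitle", "numeration", "additional",
})


def format_name(components, template):
    """Fill a template such as '{surname}, {forename}' from labeled components.

    components is a list of (nc_label, value) pairs. Values sharing a label are
    joined with a space. Labels with no value become empty strings, and the
    separators left behind are tidied up.
    """
    values = {}
    for label, value in components:
        if label not in NC_LABELS:
            # A free-text label could never be matched by any template.
            raise ValueError(f"unknown nc_label: {label!r}")
        values.setdefault(label, []).append(value)

    def fill(match):
        label = match.group(1)
        if label not in NC_LABELS:
            raise ValueError(f"unknown nc_label in template: {label!r}")
        return " ".join(values.get(label, []))

    text = re.sub(r"\{(\w+)\}", fill, template)
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    text = re.sub(r"\s+,", ",", text)    # no space before a comma
    text = re.sub(r",{2,}", ",", text)   # collapse commas left by empty parts
    return text.strip(" ,")


parts = [("surname", "Washington"), ("forename", "George")]
print(format_name(parts, "{surname}, {forename}"))           # Washington, George
print(format_name(parts, "{forename} {middle} {surname}"))   # George Washington
```

The same components produce different display forms depending only on the template, which is the kind of user-selectable format described above. A label outside the vocabulary (for example "family name") is rejected at load time rather than silently dropped at format time.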
@@ -32,15 +43,43 @@ our archives and authority stake holders.
 #### Minimal list of components labels
+The least flexible system (of the 3 or 4 we have reviewed) for name components is probably MARC. Even in MARC
+the $c is repeatable, allowing for a large number of (unlabeled) components. This system is probably too
+restrictive, although it allows us to capture middle name when possible. However, lack of flexible labels
+makes MARC a weak standard for names.
 surname, forename, additions, numeration, expansion
 #### Larger list of components
+This list comes from a combination of MARC, Unimarc, ISNI, ArchivesSpace, and British Library. We might be wise
+to include others as well (VIAF, BnF, AnF, Archives Hub UK).
 surname, middle, forename, prefix, suffix, epithet, title, pretitle, numeration, additional
+Unfortunately, current name format guidelines are ambiguous. The most common problem is that middle name could
+be a second forename or the second of several additional name components. 
+There is general agreement on "name" and "non-name" parts, although no guides explicitly talk about this. Name
+parts are surname, forename, and middle name. Many other non-name parts are often found in names. Both the
+name and non-name parts have ambiguous rules within most systems, and the various systems and cultures have
+incomplete agreement. 
+The database is improved by labeling the components where possible, but our algorithms and user
+interface can (fairly) easily remain agnostic about component labels while processing and displaying the
+components as well as formatting the components into names.
+In the past, failure to create names by (re-) formatting components has led to inconsistent names. While it is
+not possible to 100% parse or format names, it is also true that humans who did the data entry were not 100%
+accurate in their formats. The computer can be more consistent, and probably nearly as accurate as the human
+editors (especially where the editors cannot agree or where they have ambiguous rules). While it is
+technically feasible to format names from components, it is also feasible to keep and (carefully) display the
+names as originally entered.
 #### Overview
-ISNI: prefix, surname, forename (additional forenames), middle name (second and subsequent forenames), suffix
+ISNI: prefix (NR), surname (NR), forename (additional forenames) (R-ish), middle name (second and subsequent
+forenames) (R-ish), suffix (NR)
 Unimarc: $a Surname (NR), $b Given name remainder (NR), $c Additions (R), $d Roman numerals (NR), $g Expansion (NR)