
### Introduction


For our purposes, the simplest controlled vocabulary is a flat (non-hierarchical) list of terms.

An ontology is a hierarchy of controlled vocabulary terms that explicitly encodes both relatedness and category.

Using the example of subject (aka "topical subject"), either technology allows us to make assertions about the
data and relations between identities. 

- Both technologies can be used simultaneously to describe identities; however, doing double data entry would
  be irksome.

- Ontologies allow explicit assertions, and stronger assertions

- Both technologies are weak if a subject is missing, that is, the identity was not marked up (to use
  XML-speak; in fact we are using a database and creating relational links)

- More extensive is better, but the extent of ontology or vocabulary is limited by resources

- Ontologies are difficult to create and maintain

- Flat vocabularies are less difficult to create and maintain (perhaps much less difficult)

- Both technologies have an implicit definition for each term; however, two different humans (editors,
  scholars) may understand somewhat different implied definitions

- A single explicit definition can be added to each term (although I haven't seen this; it may only exist in
  fields outside the archival world)

- Vocabularies with explicit definitions (see below) are difficult to create and maintain

- Flat vocabulary can be done with a single database table

- An ontology requires at least 2 database tables, perhaps 3

- Policies need to be developed for creating, updating, and deleting terms in either technology

- Policy complexity is greater for an ontology

- Using computers, an identity may have multiple subjects from either an ontology or a flat vocabulary

- Building either technology can (and almost certainly will) be an ongoing process. We don't have to start
  with a fully mature vocabulary. That said, records edited early in the life of the data will be somewhat
  less well marked up than records marked up later.
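
The table counts mentioned in the list above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table names (`term`, `term_relation`), the column names, and the sample rows are hypothetical and are not taken from the actual CPF schema. A flat vocabulary uses only the `term` table; an ontology adds at least a second table for parent/child relations.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Flat vocabulary: one table is enough (hypothetical column names).
CREATE TABLE term (
    id         INTEGER PRIMARY KEY,
    type       TEXT NOT NULL,   -- e.g. 'topical subject', 'gender', 'function'
    label      TEXT NOT NULL,
    definition TEXT             -- optional explicit definition
);
-- Ontology: a second table records broader/narrower links between terms.
CREATE TABLE term_relation (
    parent_id INTEGER NOT NULL REFERENCES term(id),
    child_id  INTEGER NOT NULL REFERENCES term(id),
    PRIMARY KEY (parent_id, child_id)
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO term (id, type, label) VALUES (1, 'topical subject', 'Clothes')")
conn.execute("INSERT INTO term (id, type, label) VALUES (2, 'topical subject', 'Belt (clothing)')")
conn.execute("INSERT INTO term_relation (parent_id, child_id) VALUES (1, 2)")

# Narrower terms of term 1, found through the relation table.
children = [
    row[0]
    for row in conn.execute(
        "SELECT t.label FROM term_relation r "
        "JOIN term t ON t.id = r.child_id WHERE r.parent_id = ?",
        (1,),
    )
]
```

A third table might hold relation types (for example "related to" versus "narrower than"), which is one way to read "perhaps 3" above.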

It might also be sensible to design the vocabulary to be multilingual. By multilingual I mean:
multiple terms for each unique ID, where each term is specific to one language, and all terms with the
same ID share a definition.

Each term has a unique id, and a definition (implied or explicit). This is a simple dictionary. Explicit
definition would improve the vocabulary, but takes more work.

The flat vocabulary has a single "type" for each term, where type examples are: topical subject, gender,
function, etc. In an ontology, the "type" is handled by the ontology structure, which is explicit, but
discovering the type requires tree-traversal.
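
As a rough sketch of the "simple dictionary" described above, the following shows one possible in-memory shape for a flat, multilingual vocabulary. The class name `Term`, its field names, the ID `101`, and the definition text are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: int                 # unique ID shared by all language variants
    term_type: str               # e.g. "topical subject", "gender", "function"
    definition: str = ""         # explicit definition; empty means implied only
    labels: dict = field(default_factory=dict)  # language code -> label


# Hypothetical vocabulary keyed by term ID.
vocabulary = {
    101: Term(
        term_id=101,
        term_type="topical subject",
        definition="Garments worn on the body.",
        labels={"en": "clothes", "es": "ropa"},
    ),
}


def find_id(vocabulary, label, lang):
    """Return the ID of the term whose label in `lang` equals `label`, else None."""
    for term in vocabulary.values():
        if term.labels.get(lang) == label:
            return term.term_id
    return None
```

With this layout, `find_id(vocabulary, "ropa", "es")` and `find_id(vocabulary, "clothes", "en")` resolve to the same ID, and the type of a term is a direct field lookup rather than a tree traversal.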

### Term domain


Intellectually, each property has a definition. Technically, adding an explicit definition only requires an
additional field. Explicit definitions are (almost?) trivial technically, but are challenging
intellectually. Wikipedia illustrates this issue: ambiguous terms lead to a disambiguation
page, and the Wikipedia "definition" of a term is the article itself.


### Proper entities are not terms

There is no term "detroit", although there is a CPF entity for "Detroit, MI USA", complete with a field
for the corresponding geonames ID. It is technically possible to conflate CPF entities in the user interface
to enable the construction of a topical subject "detroit", although that is intellectually sub-optimal. The data
should be as well-constructed as possible. A search for subject + place is not a search for subject +
subject(placename).

Consider what would happen if (and I'm opposed to this) all CPF entities were imported into the term
table. That would be denormalization and data duplication, and would only end in tears.


### Use Markov models instead of an ontology

Ontologies are difficult to create, and there is disagreement about them, both in structure and content. There
are several to choose from, and the terms they use are somewhat incomplete and confusing. Linking (aka
markup of) each aspect of an identity record's properties to the ontology is an onerous task, and prone to
several types of errors. Linking is often a judgement call.

A technology exists that is easy to implement, powerful, and tractable in real life and works well with flat
vocabularies.

We can create Markov matrices of the terms. Multiplying Markov matrices causes them to converge, which
reveals term relatedness as it exists in the data. The effect is quite powerful and (almost?) obviates the
need for a hand-created ontology. Missing relations (known to exist, but not discovered by the Markov
convergence because no records actually contain the desired relation) are easily rectified by either of two
methods. The first would be to add the correct relations to existing records. The second works by creating
non-public special records containing related terms and making the special records available to the Markov
modeling process. The whole Markov solution is only 2 or 3 pages of code, so we can write it and evaluate its
effectiveness.
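
To make the idea concrete, here is a minimal sketch, not necessarily the construction intended above. It assumes that each record is simply a set of subject terms, that the Markov matrix is the row-normalized term co-occurrence matrix, and that relatedness is read off a power of that matrix. The function names and the sample records are hypothetical.

```python
from itertools import permutations


def build_markov_matrix(records):
    """Build a row-stochastic term co-occurrence matrix from sets of terms."""
    terms = sorted({t for rec in records for t in rec})
    index = {t: i for i, t in enumerate(terms)}
    n = len(terms)
    counts = [[0.0] * n for _ in range(n)]
    for rec in records:
        # Count every ordered pair of distinct terms that share a record.
        for a, b in permutations(rec, 2):
            counts[index[a]][index[b]] += 1.0
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # A term that never co-occurs only "transitions" to itself.
            row = [1.0 if j == i else 0.0 for j in range(n)]
            total = 1.0
        matrix.append([v / total for v in row])
    return terms, matrix


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def related(terms, matrix, term, steps=2):
    """Terms reachable from `term` in exactly `steps` transitions, with weights."""
    m = matrix
    for _ in range(steps - 1):
        m = matmul(m, matrix)
    i = terms.index(term)
    return {t: p for t, p in zip(terms, m[i]) if p > 0 and t != term}


# Hypothetical records: each is the set of subjects on one identity.
records = [
    {"Automobiles", "Painting (fine art)"},
    {"Automobiles", "Engineering"},
    {"Engineering", "Bridges"},
]
terms, matrix = build_markov_matrix(records)
two_step = related(terms, matrix, "Painting (fine art)", steps=2)
```

In this toy data no record contains both "Painting (fine art)" and "Engineering", yet `two_step` links them through "Automobiles". A "special record" of the kind described above would just be one more set appended to `records`.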

See: Everything is miscellaneous.

https://en.wikipedia.org/wiki/Everything_Is_Miscellaneous

http://www.youtube.com/watch?v=WHeta_YZ0oE

http://www.youtube.com/watch?v=x3wOhXsjPYM


### Ontology uses terms, but is a separate problem

The alternative to Markov relation discovery is an ontology that relates terms both by relatedness
and as a hierarchy from broad to narrow. There are existing ontologies with varying levels of detail.

When describing two very different things, some flat terms would be the same. For example, consider the topical
subjects of a publisher and of a painting (a work of art). The publisher creates books about automobiles,
especially cars that have been artistically painted. The work of art is a painting of an automobile. In this
example, both publisher and painting have two subjects, and the subjects are identical.

```
Publisher subject:Automobiles subject:Painting (fine art)
Painting subject:Automobiles subject:Painting (fine art)
```

The underlying terms are the same in both. However, the ontological relationship is quite different
because one is a corporate body, and the other is an art object. It is not the domain of a flat term to
know how it is applied to a database record. Also, the larger context of what is being described changes how
the description is perceived. In any case, the use of a flat vocabulary is sufficient for search and discovery, and
Markov matrices can discover relatedness between records. Hierarchy becomes another type of relatedness. In
the Markov world, both the publisher and the painting have a linkage to "Engineering" because there are
identities in the database with both "Automobiles" and "Engineering" as subjects.

The example above is limited to terms as topical subject. It seems reasonable to add fields such as "typeOf"
or "isA" in order to apply additional terms (beyond "topical subject"), while still using the same (original,
large) list of terms. Types of "publisher (corporateBody)" and "painting (object)" seem obvious. Applying both
term and type pairs will explicitly categorize any database record, even without using an
ontology. However, it is unclear how this somewhat loosely coupled description will affect our ability to
reason about database records. This also requires adding fields to the CPF database schema, which carries
serious baggage.


### Ontology and term interact to create search facets

In general, a search for a parent term should include all child terms as specified by the ontology. As a
multilingual example, searching for the Spanish term "ropa" (clothes) will include "cinturón" (belt),
which has the English term "belt (clothing)". This works well as long as the ontology is complete. Note that,
because this is a controlled vocabulary, the Spanish "ropa" has the same ID as the English "clothes", and the
search is performed based on ID number, not text string.
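
The ID-based search described above can be sketched as follows. Everything here is illustrative: the numeric IDs, the `labels` and `children` mappings, the record names, and the function names are hypothetical stand-ins for whatever the real database would provide.

```python
# (language, label) -> term ID; both languages map to the same ID.
labels = {
    ("es", "ropa"): 101,
    ("en", "clothes"): 101,
    ("es", "cinturón"): 102,
    ("en", "belt (clothing)"): 102,
}

# Ontology edges: parent term ID -> list of child term IDs.
children = {101: [102]}

# Hypothetical identity records and the term IDs they are marked up with.
records = {
    "record A": {102},
    "record B": {101},
    "record C": {999},
}


def expand(term_id, children):
    """Return term_id plus all of its descendants (iterative depth-first walk)."""
    found, stack = set(), [term_id]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def search(label, lang):
    """Resolve the label to an ID, then match records on the ID or any child ID."""
    wanted = expand(labels[(lang, label)], children)
    return sorted(name for name, subjects in records.items() if subjects & wanted)
```

Because matching happens on IDs, `search("ropa", "es")` and `search("clothes", "en")` return the same records, including the one marked up only with the child term. If an edge is missing from `children`, the corresponding record is silently missed, which is the completeness caveat noted above.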

Interestingly, we might be able to apply Markov matrices to identities marked up via ontology, with the same
sort of relatedness building that occurs with a flat vocabulary list.