From ab62c169fa7e5971ae5dd40f8714c20eafe931ff Mon Sep 17 00:00:00 2001 From: Tom Laudeman <twl8n@shannon.Village.Virginia.EDU> Date: Fri, 7 Aug 2015 09:15:04 -0400 Subject: [PATCH] copied from Documentation --- Vocabulary-properties-and-ontologies.md | 165 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 165 insertions(+) create mode 100644 Vocabulary-properties-and-ontologies.md diff --git a/Vocabulary-properties-and-ontologies.md b/Vocabulary-properties-and-ontologies.md new file mode 100644 index 0000000..b143177 --- /dev/null +++ b/Vocabulary-properties-and-ontologies.md @@ -0,0 +1,165 @@ + +### Introduction + + +There are two related yet distinct issues: controlled vocabularies and ontologies. (Or Markov matrices instead +of the ontologies; described below.) The vocabularies are a list of properties and can easily be accommodated +in a single SQL table. Policies need to be developed for create, update, and delete of property entries. It +might also be sensible to design the properties with multilingual labels (and definitions). By multilingual I +mean: multiple labels where each label is specific to a specific language, but the labels all have the same +meaning. All properties for all vocabularies probably have identical data structure, thus there is only a +single extensive list of properties. Each property has a unique id, and needs a definition. By and large, the +property table is simply a dictionary-like structure with labels and definitions. + +Properties do vary by type, where type examples are: topical subject, gender, function, etc. + +### Property domain + + +By design, each property has an explicit definition. The Wikipedia clarifies this issue since some (ambiguous) +terms only result in a disambiguation page. 
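The property table described in the introduction could look roughly like the following. This is only a minimal sketch, assuming SQLite through Python's standard `sqlite3` module. The table and column names (`property`, `property_label`, `lang`) and the helper functions are hypothetical, not taken from any existing schema, and the example definition is made up for illustration. The label pair is borrowed from the belt discussion later in this document.

```python
import sqlite3

# Hypothetical schema: one row per property, plus one row per
# (property, language) label so that labels can be multilingual.
SCHEMA = """
CREATE TABLE property (
    id         INTEGER PRIMARY KEY,  -- unique id for each property
    type       TEXT NOT NULL,        -- e.g. 'topical subject', 'gender', 'function'
    definition TEXT NOT NULL         -- every property needs an explicit definition
);
CREATE TABLE property_label (
    property_id INTEGER NOT NULL REFERENCES property(id),
    lang        TEXT NOT NULL,       -- language code, e.g. 'en', 'es'
    label       TEXT NOT NULL,
    PRIMARY KEY (property_id, lang)  -- at most one label per language
);
"""


def make_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_property(conn, ptype, definition, labels):
    """Insert one property and its per-language labels; return the new id."""
    cur = conn.execute(
        "INSERT INTO property (type, definition) VALUES (?, ?)",
        (ptype, definition),
    )
    pid = cur.lastrowid
    conn.executemany(
        "INSERT INTO property_label (property_id, lang, label) VALUES (?, ?, ?)",
        [(pid, lang, label) for lang, label in labels.items()],
    )
    return pid


def label_for(conn, pid, lang):
    """Return the label of a property in one language, or None if missing."""
    row = conn.execute(
        "SELECT label FROM property_label WHERE property_id = ? AND lang = ?",
        (pid, lang),
    ).fetchone()
    return row[0] if row else None


conn = make_db()
pid = add_property(
    conn,
    "topical subject",
    "A band worn around the waist.",  # illustrative definition only
    {"en": "belt (clothing)", "es": "cinturon"},
)
print(label_for(conn, pid, "es"))
```

Keeping labels in their own table means that adding a language is a matter of adding rows rather than columns. Multilingual definitions, if wanted, could follow the same pattern.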
+ +The properties "automotive" and "automobile" are closely related, and in common use they are interchangeable +in many situations. "automotive" + "seat belt" and "automobile" + "seat belt" differ in subtle ways. The +difference is subtle enough to guarantee that some database identity records will be coded one way, and some +another way. If there are any errors where one or the other term was used, then due to ambiguity, any +reasoning about these closely related terms is likely to be erroneous. + +Wikipedia has both "automotive" and "automobile", but "lap belt" redirects to "seat belt", which seems +sensible. The term "seat belt" could be defined as "webbing designed to retain an occupant or object in a +seat". There is no single property "automotive--seat belts", but the separate ability to create subject lists +allows for "automotive" + "seat belt", as well as "roller coaster" + "seat belt" and "aircraft" + "seat belt". + +Curatorial question: Is a seat belt simply a complex term of "belt" and "seat"? I think not. The word "belt" +is (in English) applied to clothing, machinery belts, webbing, and others. Arguably, "seat belt" is simply +English-centric, where the word "belt" has multiple different meanings, some of which could just as well be rope +or chain. Perhaps the guiding curatorial policy is "avoid confusion" while attempting to be broad and language +agnostic. Properties which might naturally have multiple meanings should be avoided. Thus, we avoid the +property "belt", which isn't so much broadly inclusive as it describes mutually exclusive things. The word +"automotive" is broad and inclusive. Thus properties for "belt (mechanical)", "seat belt", "belt (clothing)" +make sense. The definitions would be explicit, and adding "(clothing)" merely disambiguates the label. + +In Spanish, a "fan belt" is "correa del ventilador", not "cinturon del ventilador". The Spanish word +"cinturon" is a clothing belt. 
This difference between languages clarifies which words are properties, and which words +are simply usages specific to English (or any language). My translations may be a bit clumsy, so I hope +the conclusion is clear: + +``` +cinturon == belt (clothing) +correa de transmisión == belt (mechanical) +``` + +Even Spanish has issues with belt. The word "correa" by itself is "strap". + +Is it reasonable to have a property "airplane" as well as "aircraft"? It is (again demonstrated by the +Wikipedia), but at the same time there is some burden on archivists to choose "jet (aircraft)" + "seat belt" +when speaking of a Boeing 747, and "aircraft" + "glider (aircraft)" + "seat belt" when referring to a glider, +although I'm not convinced that aircraft + glider adds useful facets of information. It doesn't hurt to add +the extra "aircraft". + +When using an ontology, the ontology needs to clarify sub-classes where "jet (aircraft)" and "glider" are more +specific examples of "aircraft". In fact, using a broad category is superfluous (but not really harmful) when +the narrow property is supplied. + +It would be just as useful to search/analysis to have a separate topical subject "glider (aircraft)" as +compared to subject being a list of 1 to n elements. In fact, from a computational point of view, there is +probably no difference between subject as a list of properties and a list of singleton subjects. These are +equivalent: + +``` +subject: Aircraft +subject: Seat belts +subject: glider (aircraft) +``` +``` +subject: Aircraft + Seat belts + glider (aircraft) +``` + +In fact, the first example appears to be clearer and easier to analyze. If this is true, then we could +apply a universal rule: There are no complex properties, although multiple properties are allowed (and +encouraged). + + +Properties are not themselves complex. 
"automotive" is broad, and if the added specificity of "parts" is +desired then a second topic "parts" (Parts: components of a larger entity) needs to be used. Component lists +are in the domain of ontologies, not properties. There is no single property "automotive--parts", or +"automotive--paintings", and this needs to be enforced by database design, user interface and policy. + +### Proper entities are not properties + +The ontology linkage handles issues such as "automobiles" + "detroit". There is no property "detroit", +although there is a CPF entity for "Detroit, MI USA". It is possible to conflate CPF entities in the user +interface to enable the construction of a topical subject "automobiles" + "detroit", although that is a bad +idea. The data should be as well-constructed as possible. When searching for subject and place, it is +reasonable to either parse the place name, or have the user explicitly choose a place name. + +Consider what would happen if (and I'm opposed to this) all CPF entities were imported into the property +table. That would be denormalization, data duplication, and would only end in tears. + + +### Use Markov models instead of an ontology + +Ontologies are difficult to create, and there is little agreement about them, both in structure and +content. There are several to choose from, but the properties they use are (speaking frankly) a huge +mess. Linking each aspect of an entity record's properties to the ontology is an onerous task, and fraught +with error largely because the linking is often a judgement call. + +A technology exists that is easy to implement, powerful, and tractable in real life. + +We can create Markov matrices of the property terms. Multiplying Markov matrices causes them to converge, +which reveals property relatedness as it exists in the data. The effect is quite powerful and obviates the need +for a hand-created ontology. 
Missing relations (known to exist, but not discovered by the Markov convergence +because no records actually contain the desired relation) are easily rectified by either of two methods. The +first would be to add the correct relations to existing records. The second works by creating non-public +special records containing related terms and making the special records available to the Markov modeling +process. The whole Markov solution is only 2 or 3 pages of code, so we can write it and evaluate the +effectiveness. + +See: Everything is miscellaneous. + +https://en.wikipedia.org/wiki/Everything_Is_Miscellaneous + +http://www.youtube.com/watch?v=WHeta_YZ0oE + +http://www.youtube.com/watch?v=x3wOhXsjPYM + + +### Ontology uses property, but is a separate problem + +The alternative to the Markov relation discovery is an ontology that relates properties both in relatedness, +and as a hierarchy from broad to narrow. As far as I can tell, such a network does not yet exist, and it would +be time consuming to create since it has to be done by hand, by humans. There are existing ontologies, but to +use them we would be forced to use their property lists, and (sadly) the property lists I've seen are both +incomplete and poorly constructed. + +When describing two very different things, some properties would be the same. For example the topical subject +of a publisher, and of a work of art. The publisher creates books of automobiles, especially cars which have +been artistically painted. The work of art is a painting of an automobile. + +``` +Publisher subject:Automobiles subject:Painting (fine art) +Painting subject:Automobiles subject:Painting (fine art) +``` + +The underlying properties are the same in both branches of the ontology, but the ontological relationship is +quite different because one is a corporate body, and the other is an object. It is not the domain of a +property to know how it is applied to a database record. 
Also, the larger context of what is being described +changes how the description is perceived. In any case, the use of properties is sufficient for search and +discovery. Data beyond property is necessary to make assertions about the records. + +The example above is limited to properties as topical subject. It seems reasonable to apply additional +properties "typeOf", still using the same (original, large) list of properties. Types of "publisher (corporateBody)" and +"painting (object)" seem obvious. Applying both property and type pairs will explicitly categorize any +database record, even without using an ontology. However, it is unclear how this somewhat loosely coupled +description will affect the ability to reason about database records. + +It also seems reasonable to constrain some properties to be used only as certain types. A gender property is +nonsense in the context of a topical subject. On the other hand, "painting (fine art)" could be both a subject +and a typeOf. The conservative approach is to limit each property to a single type. + +### Ontology and property interact to create search facets + +A search for "aircraft" should turn up "glider (aircraft)" even if the record in question lacks "aircraft" as +a specific topical subject. In general, a search for a parent property should include all child properties as +specified by the ontology. Searching for the Spanish term "ropa" will include "cinturon" which has the English +label "belt (clothing)". -- libgit2 0.27.1
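The parent-to-child expansion described in the search facets section can be sketched as a transitive walk over a broader-to-narrower mapping. This is a minimal sketch, assuming the ontology is available as a plain dictionary. `NARROWER`, `expand`, and `search` are hypothetical names, and the two records are made up for illustration.

```python
# Hypothetical broader -> narrower links; a real ontology would supply these.
NARROWER = {
    "aircraft": ["jet (aircraft)", "glider (aircraft)"],
}


def expand(term, narrower=NARROWER):
    """Return the term together with all of its descendants."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the mapping
        seen.add(current)
        stack.extend(narrower.get(current, []))
    return seen


def search(term, records, narrower=NARROWER):
    """Return ids of records having at least one subject in the expanded set."""
    wanted = expand(term, narrower)
    return [rid for rid, subjects in records.items() if wanted & set(subjects)]


# Made-up records: record id -> list of topical subjects.
records = {
    1: ["glider (aircraft)", "seat belt"],
    2: ["automotive", "seat belt"],
}

# Record 1 matches "aircraft" although it lacks that subject explicitly.
print(search("aircraft", records))
```

Because the expansion happens at query time, records only need to carry the narrow property. This is consistent with the earlier point that adding the broad category is superfluous but harmless.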