[om-list] Knowledge Processing

Mark Butler butlerm at middle.net
Mon Feb 5 01:44:11 EST 2001


Hello everybody,

  Sorry about the delay in documenting the data model we talked about in our
last meeting.  I will work on it this week.

I am also thinking about writing a program that has potential general
applicability to what we are doing.  The program would process GEDCOM (The
genealogical data communications format invented by the LDS Church) and would
perform one of two operations: merging two data files or subtracting one data
file from the other.
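
To make the two operations concrete, here is a rough sketch of what I have in
mind, in Python.  The same_person() matcher is the probabilistic comparison
described further below; the dict-like record layout is only an assumption for
the illustration, not the actual program design.

    # Rough sketch of the two file-level operations.  same_person() is the
    # probabilistic matcher described later in this message; records are
    # assumed to be dict-like only for the sake of the illustration.

    def merge(file_a, file_b, same_person):
        """Union of two record sets, combining records judged identical."""
        result = list(file_a)
        for b in file_b:
            match = next((a for a in result if same_person(a, b)), None)
            if match is None:
                result.append(b)      # genuinely new record
            else:
                match.update(b)       # fold the duplicate's details in
        return result

    def subtract(file_a, file_b, same_person):
        """Records of file_a that have no counterpart in file_b."""
        return [a for a in file_a
                if not any(same_person(a, b) for b in file_b)]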

GEDCOM is a hierarchical object representation format similar to XML, except
easier to parse.  There is a standard schema used by nearly every genealogy
program in the world to import and export information.
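
For anyone who hasn't seen it, each GEDCOM line is just a level number, an
optional cross-reference id, a tag, and a value, so the hierarchy can be
rebuilt with a simple level-based stack.  A minimal sketch (the dictionary
node layout is my own choice, not anything the standard prescribes):

    # Minimal GEDCOM line parser: rebuilds the tree from the level number
    # that starts each line.  The dict-based node layout is my own choice.

    def parse_gedcom(lines):
        root = {"tag": "ROOT", "value": "", "children": []}
        stack = [root]                     # stack[level] is the current parent
        for line in lines:
            parts = line.strip().split(" ", 2)
            if len(parts) < 2:
                continue                   # skip blank or malformed lines
            level = int(parts[0])
            if parts[1].startswith("@"):   # record line, e.g. "0 @I1@ INDI"
                tag = parts[2] if len(parts) > 2 else ""
                value = parts[1]           # keep the cross-reference id
            else:
                tag = parts[1]
                value = parts[2] if len(parts) > 2 else ""
            node = {"tag": tag, "value": value, "children": []}
            stack = stack[:level + 1]      # pop back to this line's parent
            stack[-1]["children"].append(node)
            stack.append(node)
        return root

    sample = ["0 @I1@ INDI",
              "1 NAME John /Smith/",
              "1 BIRT",
              "2 DATE 12 MAR 1820"]
    tree = parse_gedcom(sample)
    print(tree["children"][0]["children"][1]["children"][0]["value"])  # 12 MAR 1820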

The tricky part about merging / subtracting such information is that there are
no common unique identifiers for historical individuals (e.g. a social
security number applicable to all people who have ever lived).  In fact, we
are barely getting to the stage in PAF 5 where every new individual is
assigned a globally unique serial number so that two files of the same origin
can be merged back together.

Such globally unique identifiers are a great idea for knowledge
representation, particularly if you plan to have a distributed database or
more than one user sharing data with each other.

Unfortunately, in the real world, particularly the world of genealogical and
historical research, there are no reliable surrogate keys for identifying that
two records actually refer to the same real person or object.  

Indeed, one of the main weaknesses of the classical object model is that there
is usually an unspoken assumption that the in-computer object has a 1:1
relationship with a corresponding real world object, when everyone knows that
is often not the case.

A C++ or Java "object" is not an object at all.  Rather, it is usually a
*record* that describes an object, which is a completely different thing.  As
soon as I load an object into memory and start making changes, I have at
least three versions of the object:

  1. The real object
  2. The object in the database
  3. The object in memory

The dominant performance problem for most intensive multi-user database
applications is that as soon as a collection of objects is loaded into memory
it becomes stale unless a transaction is held open where each object in the
collection is locked, which is impractical for a collection of thousands of
records.  

This could be solved by a standard distributed lock & change notification
protocol (i.e. the database sends you real time updates whenever any of the
records you have copies of are changed by other users), but there isn't one
yet.
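
Purely as a thought experiment (no such standard exists, and every name here
is made up), the client side of such a protocol might amount to little more
than a subscription keyed by record id:

    # Thought-experiment sketch: the database pushes updates for any record
    # id a client has registered interest in, so cached copies can be
    # refreshed instead of locked.  All names here are hypothetical.

    from collections import defaultdict

    class ChangeNotifier:
        def __init__(self):
            self.watchers = defaultdict(list)   # record id -> callbacks

        def watch(self, record_id, callback):
            """A client registers interest in a record it has cached."""
            self.watchers[record_id].append(callback)

        def record_changed(self, record_id, new_version):
            """Server side: push the new version to every watcher."""
            for callback in self.watchers[record_id]:
                callback(record_id, new_version)

    # Usage: refresh a local cache entry instead of holding a lock open.
    cache = {}
    notifier = ChangeNotifier()

    def refresh(record_id, new_version):
        cache[record_id] = new_version

    notifier.watch("@I42@", refresh)
    notifier.record_changed("@I42@", {"NAME": "John /Smith/"})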

But back to the main topic: in genealogy you often have records that are only
partially complete, have date inaccuracies, have spelling errors, contain
mistakes, and sometimes worse.  The only good way to merge two records
together is to calculate the likelihood that they are identical based on sound
statistical inference, and then only merge records whose probability of error
falls below a very strict threshold.

In genealogy, any automated merger that merges records with a one percent
error rate is hopelessly flawed.  As you can imagine, many people care a great
deal whether someone really is their sixth great-grandparent or not.  If the
error rate is one in a hundred, the probability of six chained identifications
all being correct drops to only 94%.  By the time you get to the twenty-fourth
generation, the probability the relationship is accurate drops to 78%, which
is pathetic.
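
The arithmetic, for anyone who wants to check it: the probability that a chain
of n independent identifications is entirely correct is (1 - e)^n, where e is
the per-identification error rate.

    # Probability that a chain of n independent identifications is all
    # correct, given a per-identification error rate e.
    def chain_confidence(e, n):
        return (1 - e) ** n

    print(chain_confidence(0.01, 6))    # ~0.941, six generations
    print(chain_confidence(0.01, 24))   # ~0.786, twenty-four generations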

I think a better goal for an automated merger is a 10^-12 (one in a trillion)
error rate.  But you may ask, if your data is fuzzy, inaccurate, and error
prone to begin with, how is an error rate much better than 10^-4 even
possible?  I believe the answer is found in the transitive assignment of
probabilities via graph traversal of the relationships to other records you
are merging.

For example, suppose that instead of matching individual records per se, we
matched on the combination of an individual's record, his father's record, and
his mother's record.  Assuming the information is available, and each
individual record matches with a false-match probability of 10^-4, the
probability that all three would spuriously match another set of three with
the same relationships drops all the way to 10^-12, assuming the matches are
independent and we trust the relationship information.
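
Spelled out as a calculation (the independence of the three matches is the big
assumption here):

    # False-match probability when the individual, his father, and his mother
    # must all match the corresponding records in the other file.  Assumes the
    # three matches are statistically independent.
    def combined_false_match(*per_record_error_rates):
        p = 1.0
        for e in per_record_error_rates:
            p *= e
        return p

    print(combined_false_match(1e-4, 1e-4, 1e-4))   # roughly 1e-12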

If our data is even spottier, or there are random errors in the relationship
data, we can extend this principle by traversing to grandparents, children,
grandchildren, and so on, calculating the probability that two records match,
adjusted for the probability that the relationships themselves are incorrect.
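
One way to fold in unreliable links, as a rough sketch of my own rather than a
worked-out algorithm: a neighboring record's evidence can never be stronger
than the link connecting it, so treat each neighbor's useful error
contribution as roughly its own false-match probability plus the probability
the link itself is wrong.

    # Rough sketch: combine evidence from related records, but cap the value
    # of each neighbor by the chance the relationship itself is wrong.  My
    # own illustration, not a worked-out algorithm.
    def adjusted_false_match(candidate_error, neighbors):
        """neighbors: list of (neighbor_false_match, link_error) pairs."""
        p = candidate_error
        for neighbor_error, link_error in neighbors:
            # A neighbor match only helps if the link is actually real.
            p *= min(1.0, neighbor_error + link_error)
        return p

    # Individual plus both parents, with a 1-in-1000 chance that each parent
    # link was recorded incorrectly.
    print(adjusted_false_match(1e-4, [(1e-4, 1e-3), (1e-4, 1e-3)]))  # ~1.2e-10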

Now of course, on the attribute level, we need algorithms to calculate the
probability that two names are the same except for mis-spellings, that
identical names of various types actually refer to the same person or place,
that two dates are mis-recordings of the same date, and so on.
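
Just to give the flavor of those attribute-level tests (the thresholds here
are invented for the example, not calibrated against real data):

    # Crude attribute-level comparisons: an edit-distance style test for
    # spelling variants and a tolerance window for mis-recorded dates.
    # Thresholds are invented for the example, not calibrated.
    from difflib import SequenceMatcher

    def names_probably_same(a, b, threshold=0.85):
        """True if two name spellings are close enough to be variants."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    def years_probably_same(year_a, year_b, tolerance=2):
        """Allow a couple of years of slack for mis-recorded dates."""
        return abs(year_a - year_b) <= tolerance

    print(names_probably_same("Elisabeth Smith", "Elizabeth Smith"))   # True
    print(years_probably_same(1820, 1821))                             # True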

I plan on writing a program to do this, but I believe all the algorithms
developed will have general applicability to the field of knowledge
representation.  You could use a program like this to merge two knowledge
models based on a collection of domain-specific identification plug-ins.
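
The plug-in boundary could be as thin as a single scoring function.  A
hypothetical sketch of what the interface might look like (the names are mine
and nothing here is settled):

    # Hypothetical plug-in interface: each domain plug-in only has to estimate
    # the probability that two records denote the same real-world thing.
    class IdentificationPlugin:
        domain = "abstract"

        def match_probability(self, record_a, record_b):
            """Return P(record_a and record_b denote the same entity)."""
            raise NotImplementedError

    class BiographicalPlugin(IdentificationPlugin):
        domain = "person"

        def match_probability(self, a, b):
            # Placeholder: would combine the name, date, and relationship
            # evidence described above.
            return 0.0

    def merge_knowledge_models(model_a, model_b, plugins, threshold=1 - 1e-12):
        """Pair up records whose best plug-in score clears the threshold."""
        pairs = []
        for a in model_a:
            for b in model_b:
                if any(p.match_probability(a, b) >= threshold for p in plugins):
                    pairs.append((a, b))
        return pairs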

The one I just described could be a biographical plug-in to determine whether
two records refer to the same person.  You could also have a geographical
plug-in, a historical event plug-in, a book/journal plug-in, and so forth. 
The most useful would be an English plug-in that had a dictionary of words,
definitions, and synonyms with contextual identification probabilities to
allow it to choose definitions by context.  In many instances it is trivial to
extract context just from the preceding and following words: "garbage in
garbage out" vs. "the garbage man", or "four score" vs. "the score is", or
"super bowl" vs. "cereal bowl" vs. "bowl a seven".

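A toy version of that contextual lookup, just to show how little machinery the
easy cases need (the tiny sense table is invented for the example):

    # Toy contextual sense lookup: pick the definition whose known neighboring
    # words appear next to the target word.  The sense table is invented.
    SENSES = {
        "bowl": {
            "football game": {"super"},
            "dish":          {"cereal", "soup"},
            "bowling":       {"seven", "strike", "pins"},
        },
    }

    def pick_sense(word, prev_word, next_word):
        context = {prev_word, next_word}
        best, best_hits = None, 0
        for sense, cues in SENSES.get(word, {}).items():
            hits = len(cues & context)
            if hits > best_hits:
                best, best_hits = sense, hits
        return best                       # None if no contextual cue matched

    print(pick_sense("bowl", "super", "party"))    # football game
    print(pick_sense("bowl", "cereal", "on"))      # dish
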
I think making a giant distributed database is a great idea, but I believe
that we need a way to populate it with data from other databases and merge
that data together while keeping track of how reliable it is and how that
reliability is affected by the merging process.

<begin aside> The classical application for technology like this, although
much more fuzzy, is document topic matching.  If you go to
http://www.google.com and search for keywords, they have links called "Similar
pages" that bring up other pages that match the statistical signature of the
first page.

This works very well if the similar pages share keywords that aren't very
common in combination with each other.  However, if your context depends on
the ordering of the words or their grammatical relationship to each other,
then you are mostly out of luck.  For example, finding a web page specifically
about "Windows console mode FTP utilities" isn't very easy using current
search engines, because all the words are much too common, even in combination
on the same web page.  FAQ pages have hit rates that are much too high because
the search engines do not care how close together on the page the words are,
let alone whether they are actually related to each other.  And of course, as
far as I know, none of them do pluralization or verb tense changes, let alone
synonym transformation, either. <end aside>
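
A proximity score of the kind I just wished for is not complicated; even
something as simple as the smallest window of the page that contains all the
query words would help (again, just a sketch):

    # Sketch of a proximity score: the size of the smallest window of the page
    # containing every query word.  Small windows suggest the words are
    # actually related, not merely co-resident on the page.
    def smallest_window(page_words, query_words):
        positions = [i for i, w in enumerate(page_words) if w in query_words]
        best = None
        for i in positions:
            seen, j = set(), i
            while j < len(page_words) and seen != set(query_words):
                if page_words[j] in query_words:
                    seen.add(page_words[j])
                j += 1
            if seen == set(query_words):
                best = (j - i) if best is None else min(best, j - i)
        return best      # None if some query word never appears on the page

    page = "list of console mode ftp utilities for windows".split()
    print(smallest_window(page, {"windows", "console", "ftp"}))   # 6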

In any case, there is a real and immediate need for a high quality GEDCOM
merging tool, so I am going to write one.  Perhaps we can later merge the
algorithms into a more general tool.  A specialized tool is easier to start
with because the data format is already well defined and covers a relatively
narrow domain, it doesn't need a graphical user interface, and most
importantly, I can do it all in memory instead of having to use a database.

- Mark



