[om-list] Knowledge Processing

Tom and other Packers TomP at Burgoyne.Com
Thu Feb 8 15:36:22 EST 2001


Mark

    I am very interested in this project of yours, and am sorry it has taken
so long for me to respond.

    I will be very interested in being a part of this, if I am able -- if
only to watch what you're doing, and perhaps to see what comes of any of my
very vague suggestions that you take up.  If not, I'll be interested in
seeing you generalise it sometime in the future.

    I have two suggestions for this process of "identification": geometric,
and semi-symbolic.  I intend to try them both in my "Informatica" (which is
a name I will probably have to change), and probably in our OM as well.

    By geometric, I'm referring to the same "distances in many-dimensional
state space" which we talked about when Luke and I were at your house not
long ago.

    By the way, Luke, I have modified my original ideas about there being
three types of "queries", in two ways.

    (1) The third type of "query", called "distance from
prototype/archetype", can and probably should be generalised to something
more like "distance from centroid/centre-of-mass" in the same mathematical
state/classification space (i.e. a functional space where the domain is a
many-dimensional state or list of features, and where the range is
classification values for many classes); a rough sketch of this kind of
query follows below.  But, if we model psychology well enough, the centre
of mass should turn out to be very close to the conceptual "prototype"
anyway.

    (2) There is a fourth type of query.  Just as the third type is a fuzzy
version of the first type (the first type being the idealistic-inductive or
extensional-definition-based query), the fourth type is a fuzzy version of
the second type (the second type being the idealistic-deductive or
intensional-definition-based query).

    More about all this later.
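
    Here is the rough sketch I mentioned of the centroid version of the
query, in Python.  The class names, feature vectors, and query point are
all invented for illustration; the real feature space would come from
whatever state/classification model we end up with.

import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each class is represented by the examples already classified under it.
examples = {
    "bird":   [[0.9, 0.1, 0.8], [0.8, 0.2, 0.9]],
    "mammal": [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1]],
}

def classify(query):
    """Rank the classes by the query's distance from each class centroid."""
    scores = {name: distance(query, centroid(pts))
              for name, pts in examples.items()}
    return sorted(scores.items(), key=lambda item: item[1])

print(classify([0.85, 0.15, 0.7]))   # nearest centroid comes first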

    My next suggestion is "semi-symbolic" (semi-symbolic/semi-geometric is
how I think of it), or, in other words, graph theory.  Mark, I strongly
suggest we/you look into what, in graph theory, are called "isomorphisms"
and "homeomorphisms".  They are very similar to a part of this process of
identification.  Two reasons: (1) if there are algorithms for determining
homeomorphisms in weighted, directed graphs, then our job is easier.  (2) If
not, then our job will involve more work, but it will be work that advances
the field of graph theory, and should be published.

    The way I see it, the process of homeomorphism-finding should be
specialised to apply to a graph whose edges (internodes) are not just
weighted and directed, but are of differing sub-types, corresponding to
different dimensions in space; and every relation or attribute in the
database (such as your GEDCOM files) would be encoded as nodes and
internodes.
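
    To make that a little more concrete, here is a toy Python sketch of the
kind of graph I have in mind -- edges carry a direction, a weight, and a
sub-type -- plus a brute-force check for embedding a small pattern graph
into a data graph.  The node names and edge types are invented, and a real
homeomorphism finder would have to match paths rather than single edges;
this only shows the shape of the problem.

from itertools import permutations

class Graph:
    def __init__(self):
        self.nodes = set()
        self.edges = set()          # (src, dst, edge_type, weight)

    def add_edge(self, src, dst, edge_type, weight=1.0):
        self.nodes.update((src, dst))
        self.edges.add((src, dst, edge_type, weight))

    def has_edge(self, src, dst, edge_type):
        return any(e[0] == src and e[1] == dst and e[2] == edge_type
                   for e in self.edges)

def embeds(pattern, data):
    """True if some mapping of pattern nodes onto data nodes preserves
    every typed, directed edge of the pattern (weights ignored)."""
    p_nodes = sorted(pattern.nodes)
    for image in permutations(sorted(data.nodes), len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if all(data.has_edge(mapping[s], mapping[d], t)
               for s, d, t, _ in pattern.edges):
            return True
    return False

data = Graph()
data.add_edge("john", "mary", "child_of")
data.add_edge("john", "robert", "child_of")

pattern = Graph()
pattern.add_edge("X", "Y", "child_of")

print(embeds(pattern, data))    # True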

    About probabilities, I am very impressed with the "Dempster-Shafer
Theory of Evidence", which we talked about in my AI class last semester.  I
think we could use that.  Did you say you had the same AI book?  I didn't
like that book much, and it doesn't go into enough detail on anything,
including Dempster-Shafer.  So I'll look for a better source.
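
    In the meantime, here is a bare-bones Python sketch of Dempster's rule
of combination, just to show the mechanics.  The hypotheses and mass values
are invented for illustration.

def combine(m1, m2):
    """Dempster's rule: combine two mass functions (dicts frozenset -> mass)."""
    combined = {}
    conflict = 0.0
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            a = b & c
            if a:
                combined[a] = combined.get(a, 0.0) + mass_b * mass_c
            else:
                conflict += mass_b * mass_c
    if conflict >= 1.0:
        raise ValueError("total conflict; Dempster's rule is undefined")
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}

same   = frozenset(["same_person"])
diff   = frozenset(["different_person"])
either = same | diff                      # the whole frame (ignorance)

# Evidence from a name match and from a birth-date match.
m_name = {same: 0.6, either: 0.4}
m_date = {same: 0.7, diff: 0.1, either: 0.2}

print(combine(m_name, m_date))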

Ciao for now
tomp

----- Original Message -----
From: "Mark Butler" <butlerm at middle.net>
To: "One Model List" <om-list at onemodel.org>
Sent: Sunday, February 04, 2001 11:44 PM
Subject: [om-list] Knowledge Processing


Hello everybody,

  Sorry about the delay in documenting the data model we talked about in our
last meeting.  I will work on it this week.

I am also thinking about writing a program that has potential general
applicability to what we are doing.  The program would process GEDCOM (the
genealogical data communications format invented by the LDS Church) and
would perform one of two operations: merging two data files or subtracting
one data file from the other.

GEDCOM is a hierarchical object representation format similar to XML, except
easier to parse.  There is a standard schema used by nearly every genealogy
program in the world to import and export information.
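
To give a feel for the format, here is a minimal Python sketch that parses
GEDCOM's level-numbered lines into a tree.  The sample record is made up,
but the level / tag / value layout is how the format works.

# Minimal sketch of parsing GEDCOM's level-numbered lines into a tree.
sample = """\
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1850
"""

def parse(text):
    root = {"tag": "ROOT", "value": None, "children": []}
    stack = [(-1, root)]                      # (level, node) pairs
    for line in text.splitlines():
        parts = line.split(" ", 2)
        level = int(parts[0])
        if len(parts) > 1 and parts[1].startswith("@"):
            # cross-reference lines look like "0 @I1@ INDI"
            tag, value = parts[2], parts[1]
        else:
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else None
        node = {"tag": tag, "value": value, "children": []}
        while stack[-1][0] >= level:          # pop back up to the parent level
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root

tree = parse(sample)
print(tree["children"][0]["tag"])             # INDI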

The tricky part about merging / subtracting such information is that there
are no common unique identifiers for historical individuals (e.g. a social
security number applicable to all people who have ever lived).  In fact, we
are barely getting to the stage in PAF 5 where every new individual is
assigned a globally unique serial number so that two files of the same
origin can be merged back together.

Such global unique identifiers are a great idea for knowledge
representation, particularly if you plan to have a distributed database or
more than one user sharing data with each other.

Unfortunately, in the real world, particularly the world of genealogical
and historical research, there are no reliable surrogate keys for
identifying that two records actually refer to the same real person or
object.

Indeed, one of the main weaknesses of the classical object model is that
there is usually an unspoken assumption that the in-computer object has a
1:1 relationship with a corresponding real world object, when everyone
knows that is often not the case.

A C++ or Java "object" is not an object at all.  Rather it is usually a
*record* that describes an object, which is a completely different thing.
As soon as I load an object into memory and start making changes, I now
have at least three versions of the object: (1) the real object, (2) the
object in the database, and (3) the object in memory.

The dominant performance problem for most intensive multi-user database
applications is that as soon as a collection of objects is loaded into
memory it becomes stale unless a transaction is held open where each object
in the collection is locked, which is impractical for a collection of
thousands of records.

This could be solved by a standard distributed lock & change notification
protocol (i.e. the database sends you real time updates whenever any of the
records you have copies of are changed by other users), but there isn't one
yet.
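
As a rough sketch of the shape such a protocol could take (hypothetical
Python, not an existing standard; the class and method names are made up):

from collections import defaultdict

class ChangeNotifier:
    """Clients register the record ids they hold copies of and get a
    callback whenever another user commits a change to one of them."""

    def __init__(self):
        self.watchers = defaultdict(list)      # record_id -> callbacks

    def subscribe(self, record_id, callback):
        self.watchers[record_id].append(callback)

    def commit(self, record_id, new_version):
        # After the database applies the change, push it to every client
        # holding a copy so their cached object never goes stale.
        for callback in self.watchers[record_id]:
            callback(record_id, new_version)

notifier = ChangeNotifier()
notifier.subscribe("person-42", lambda rid, v: print("refresh", rid, v))
notifier.commit("person-42", {"name": "John Smith", "rev": 7})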

But back to the main topic: in genealogy you often have records that are
only partially complete, have date inaccuracies, have spelling errors,
contain mistakes, and sometimes worse.  The only good way to merge two
records together is to calculate the likelihood that they are identical
based on sound statistical inference, and then merge only those records
whose probable error is extremely low.

In genealogy, any automated merger that merges records with a one percent
error rate is hopelessly flawed.  As you can imagine, many people really
care whether someone is actually their sixth great grandparent or not.
With a one-in-a-hundred error rate, the probability of six chained
identifications all being correct drops to only about 94%.  By the time you
get to the twenty-fourth generation, the probability that the relationship
is accurate drops to about 78%, which is pathetic.
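
Spelling the arithmetic out (assuming a one percent per-identification
error rate and independent errors):

# Chained identification accuracy at a 1% per-link error rate.
per_link = 0.99
print(per_link ** 6)     # ~0.941 -- sixth great grandparent
print(per_link ** 24)    # ~0.786 -- twenty-fourth generation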

I think a better goal for an automated merger is a 10^-12 (one in a
trillion) error rate.  But you may ask, if your data is fuzzy, inaccurate,
and error prone to begin with, how is an error rate much better than 10^-4
even possible?  I believe the answer is found in the transitive assignment
of probabilities via graph traversal of the relationships to other records
you are merging.

For example, suppose that instead of matching individual records per se, we
matched based on the combination of an individual's record, his father's,
and his mother's.  Assuming the information is available, and each
individual record matches with a confidence of 1 - (10^-4), the probability
that all three would spuriously match another set of three with the same
relationships drops all the way to 10^-12 (i.e. a confidence of
1 - (10^-12)), assuming we trust the relationship information.

If our data is even spottier, and in the presence of random errors in the
relationship data, we can extend this principle by traversing to
grandparents, children, grandchildren and so on, calculating match
probabilities that are adjusted for the probability that the relationships
themselves are incorrect.
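
Here is a rough Python sketch of that calculation.  The independence
assumption and the way relationship reliability is folded in are both
simplifications, and the numbers are only illustrative.

def combined_false_match(record_error_rates, relationship_reliability=1.0):
    """Probability that a whole family cluster is a spurious match.

    record_error_rates       -- false-match probability of each record alone
    relationship_reliability -- chance the relationship links are correct
    """
    false_match = 1.0
    for p in record_error_rates:
        false_match *= p            # independent errors must all coincide
    # If a relationship link is wrong, the extra records tell us nothing,
    # so fall back to the weakest single-record error rate for that case.
    fallback = max(record_error_rates)
    return (relationship_reliability * false_match
            + (1.0 - relationship_reliability) * fallback)

# Individual, father, and mother each match alone with error 10^-4.
print(combined_false_match([1e-4, 1e-4, 1e-4]))            # 1e-12
print(combined_false_match([1e-4, 1e-4, 1e-4], 0.999))     # ~1e-7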

Now of course, on the attribute level, we need algorithms to calculate the
probability that two names are the same except for mis-spellings, that
identical names of various types actually refer to the same person or place,
that two dates are mis-recordings of the same date, and so on.
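
At the crudest level, something like the following would be a starting
point (Python; the similarity score and year tolerance are arbitrary, and a
real name comparison would want phonetic matching as well):

from difflib import SequenceMatcher

def name_similarity(a, b):
    """Rough 0..1 similarity score tolerant of misspellings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def years_compatible(year_a, year_b, tolerance=2):
    """True if two recorded years could plausibly be the same event."""
    return abs(year_a - year_b) <= tolerance

print(name_similarity("Jonathon Smith", "Jonathan Smyth"))   # ~0.86
print(years_compatible(1850, 1851))                          # True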

I plan on writing a program to do this, but I believe all the algorithms
developed will have general applicability to the field of knowledge
representation.  You could use a program like this to merge two knowledge
models based on a collection of domain-specific identification plug-ins.
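
Roughly, the plug-in interface could look something like this (a Python
sketch; the class and method names are placeholders, and the biographical
stub is deliberately trivial):

from abc import ABC, abstractmethod

class IdentificationPlugin(ABC):
    """Estimates the probability that two records denote the same thing."""

    @abstractmethod
    def match_probability(self, record_a, record_b):
        """Return a probability in [0, 1] that the two records match."""

class BiographicalPlugin(IdentificationPlugin):
    def match_probability(self, record_a, record_b):
        # A real implementation would combine name, date, place, and
        # relationship evidence; this stub only checks for identical names.
        return 1.0 if record_a.get("name") == record_b.get("name") else 0.1

def merge_candidates(plugin, a, b, max_error=1e-12):
    """Merge only when the estimated error rate is below the target."""
    return plugin.match_probability(a, b) >= 1.0 - max_error

print(merge_candidates(BiographicalPlugin(),
                       {"name": "John Smith"}, {"name": "John Smith"}))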

The one I just described could be a biographical plug-in to determine
whether two records refer to the same person.  You could also have a
geographical plug-in, a historical event plug-in, a book/journal plug-in,
and so forth.  The most useful would be an English plug-in that had a
dictionary of words, definitions, and synonyms with contextual
identification probabilities to allow it to choose definitions by context.
In many instances it is trivial to extract context just from preceding and
following words: "garbage in garbage out" vs. "the garbage man" or "four
score" vs. "the score is", or "super bowl" vs. "cereal bowl" vs. "bowl a
seven".

I think making a giant distributed database is a great idea, but I believe
that we need a way to populate it with data from other databases and merge
that data together while keeping track of how reliable it is and how that
reliability is affected by the merging process.

<begin aside> The classical application for technology like this, although
much more fuzzy, is document topic matching.  If you go to
http://www.google.com and search for keywords, they have links called
"Similar pages" that bring up other pages that match the statistical
signature of the first page.

This works very well if the similar pages share keywords that aren't very
common in combination with each other.  However, if your context depends on
the ordering of the words or their grammatical relationship to each other,
then you are mostly out of luck.  For example, finding a web page
specifically about "Windows console mode FTP utilities" isn't very easy
using current search engines, because all the words are much too common,
even in combination on the same web page.  FAQ pages have hit rates that
are much too high because the search engines do not care how close on the
page the words are, let alone whether they are actually related to each
other.  And of course, as far as I know, none of them do pluralization or
verb tense changing, let alone synonym transformation, either. <end aside>

In any case, there is a real and immediate need for a high-quality GEDCOM
merging tool, so I am going to write one.  Perhaps we can later merge the
algorithms into a more general tool.  A specialized tool is easier to start
because the data format is already well defined and in a relatively narrow
domain, it doesn't need a graphical user interface, and most importantly, I
can do it all in memory instead of having to use a database.

- Mark

_______________________________________________
om-list mailing list
om-list at onemodel.org
http://www.pairlist.net/mailman/listinfo/om-list