[om-list] Graph Traversal Specifications
Mark Butler
butlerm at middle.net
Wed Dec 3 22:44:36 EST 2003
Lately I have been working on a mini graph traversal language: given a
starting object and a traversal specification written in the language,
it produces the set of related objects. I need this to do simple things
like access direct attributes, but it turns out to be easily extended to
more sophisticated relations.
My current design is based on items and connections between pairs of
items. Items are used to represent classes, objects, and standard
relationships between objects and classes. Connections are used to
connect object items to relation items, tagging each object as a
subject, object, context, etc., much like the parts of a sentence.
The system is largely typeless, in that every item / object has a common
format. Attributes are accessed through algorithmic traversals from the
attributed item. More radically, object attributes are not explicitly
represented in the system at all. Rather, class attributes are
represented as functions (graph traversal specifications) that return
zero or more related objects given a source object.
A typical attribute function might be children(x), a function that given
x returns the set of the children of x. There is a lot of freedom in
how one might represent family relationships in relational form, and the
graph traversal specification must match the chosen representation.
Given a beget(father,mother,child) relation, to find the children of x,
I have to enumerate the 'beget' predicate relationships where x is one
of the first two arguments, and then enumerate the referents of the
third argument. The current design can easily represent shared /
compressed relationships like beget(dad,mom,{child1,child2,....childN})
although that form must be split if each sub-relationship has distinct
attributes, such as the birth date for each child.
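The item/connection encoding of the split beget form might look
something like this rough Python sketch (Item, Connection, connect, and
the role tags are illustrative stand-ins, not my actual C++ types):

```python
# Rough sketch of the item/connection model (illustrative names only).
class Item:
    """A typeless node: may stand for a class, an object, or a
    relationship instance."""
    def __init__(self, label):
        self.label = label        # for readability only
        self.connections = []     # list of Connection

class Connection:
    """Tags a related item with its role (arg1, arg2, arg3, class...)."""
    def __init__(self, role, target):
        self.role = role
        self.target = target

def connect(relation, role, obj):
    # forward edge: relation item -> object item, tagged with the role
    relation.connections.append(Connection(role, obj))
    # reverse edge: object item -> relation item, tagged with '-role'
    obj.connections.append(Connection('-' + role, relation))

# Encode beget(dad, mom, child1) and beget(dad, mom, child2) as two
# relationship items -- the split form that allows per-child attributes:
beget = Item('beget')
dad, mom = Item('dad'), Item('mom')
child1, child2 = Item('child1'), Item('child2')
for child in (child1, child2):
    rel = Item('beget instance')
    connect(rel, 'class', beget)
    connect(rel, 'arg1', dad)
    connect(rel, 'arg2', mom)
    connect(rel, 'arg3', child)
```

The reverse '-role' edges are what make moves like -arg1 in the
traversal language cheap.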
In any case, children(x) can be easily represented as a string of graph
traversal and node test instructions, for example:
start x ; start on node x
move -arg1 ; move to relationship(s) where x is first argument
test beget ; test to see if we are on a 'beget' relationship
move arg3 ; move to referent of argument 3
return ; current node is child of x
The first thing to recognize is that for this to work, any branch whose
test or traversal fails must be aborted, for example if x never
participated in any relationships as argument 1, or if none of those
were 'beget'-type relationships. The second thing to recognize is that each move
instruction can result in an arbitrary number of graph traversals, each
of which must be handled independently. It is as if each move
instruction forks a sub-process to handle each matching graph edge, and
aggregates the results.
Of course I can't create new processes or threads to handle this type of
parallelism, so I just save state on some sort of stack instead.
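The fork-and-aggregate semantics can be sketched with exactly that kind
of explicit worklist in place of real processes (the dict-based items
and the instruction encoding here are illustrative, not my actual
representation):

```python
# Sketch of the fork-and-aggregate semantics with an explicit worklist
# (dict-based items and the instruction encoding are illustrative).
def traverse(start, program):
    """Run (op, operand) instructions from start; return every item
    reached by a 'return' instruction."""
    results = []
    worklist = [(start, 0)]          # saved state: (current node, pc)
    while worklist:
        node, pc = worklist.pop()
        op, arg = program[pc]
        if op == 'move':
            # fork one continuation per matching edge; zero matching
            # edges simply aborts this branch
            for neighbor in node.get(arg, []):
                worklist.append((neighbor, pc + 1))
        elif op == 'test':
            if node.get('kind') == arg:  # a failed test aborts the branch
                worklist.append((node, pc + 1))
        elif op == 'return':
            results.append(node)
    return results

# children(x) against a one-child beget graph:
child = {'kind': 'person'}
rel = {'kind': 'beget', 'arg3': [child]}
x = {'kind': 'person', '-arg1': [rel]}
children = [('move', '-arg1'), ('test', 'beget'),
            ('move', 'arg3'), ('return', None)]
# traverse(x, children) -> [child]
```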
Currently every move instruction gets converted into a related item
index range scan on the set of connections to related items. The
indexes are currently by (arg_selector,item_class) so a move,test pair
converts into a single range scan. Then for each matching item of the
first range scan (a 'beget' relationship item in this case), a second
related item index range scan is run, returning items that are connected
as argument 3 - the children.
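The per-object (arg_selector,item_class) index and the two range scans
might be sketched like so (the sorted-list index, the class names, and
the assumption that children are 'person'-class items are mine, not the
real layout):

```python
import bisect

# Sketch of the per-object (arg_selector, item_class) index
# (illustrative; the real structure differs).
class Item:
    def __init__(self, item_class):
        self.item_class = item_class
        self._index = []  # sorted: ((selector, target class), id, target)

    def add(self, selector, target):
        key = (selector, target.item_class)
        bisect.insort(self._index, (key, id(target), target))

    def scan(self, selector, item_class):
        """Range scan over connections keyed (selector, item_class)."""
        lo = bisect.bisect_left(self._index, ((selector, item_class),))
        out = []
        for key, _, target in self._index[lo:]:
            if key != (selector, item_class):
                break
            out.append(target)
        return out

def children(x):
    # nested loops: the outer scan finds 'beget' relationships where x
    # is argument 1; the inner scan finds each one's argument 3 referents
    return [c for rel in x.scan('-arg1', 'beget')
              for c in rel.scan('arg3', 'person')]
```

Note how the move,test pair (move -arg1 ; test beget) collapses into
the single outer scan, because the index key already carries the item
class.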
In relational databases, this type of query execution plan is known as
'nested loops' and is a common way to execute a join of two well indexed
tables. The big difference is that relational databases join using
global high cardinality indexes, whereas here every object has its own
low cardinality index of connections to related objects. (Of course
relational databases could be implemented this way in principle, but as
far as I know no one ever has; current technology is heavily entrenched.
What happens in practice is that object oriented database people develop
relational (i.e. SQL) interfaces to non-relational data, using
'relational' in the conventional strongly-typed set-of-rows sense, of
course.)
Now for my purposes, I need this query to execute in microseconds, not
milliseconds (as from a client server database), so I am loading
everything into memory. The advantage of strongly typed languages
(C++,Java) is that single valued attribute access occurs at a fixed
offset from the item header in a matter of nanoseconds.
Multi-valued access is much slower, of course. By using the same system
for single and multivalued attributes, there is a significant
performance penalty (~20x) similar to that of systems like CLOS
(probably worse). The advantage is flexibility, which is ideal for my
applications and for general purpose knowledge modeling systems.
One of my problems is that there is an increasing mismatch between C++
and my high level application code. The low level stuff (query engine
and the like) works very well in C++, but the high level stuff
practically needs its own language. I like C# style syntax, but the big
difference is the requirement to create classes at runtime and execute
code written in terms of those classes, all on top of the indexed
item-connection meta framework. I am hoping to write an interpreter
roughly similar to what you might see in Python (first pass conversion
to AST or byte code, interpret AST or byte code directly). This is
all high level code, so the 10x interpretation penalty won't be a problem.
- Mark