[om-list] Graph Traversal Specifications
Mark Butler
butlerm at middle.net
Wed Dec 3 22:44:36 EST 2003
Lately I have been working on a mini graph traversal language: given a
starting object and a traversal specification written in the language,
it produces the set of related objects. I need this to do simple things
like access direct attributes, but it turns out to be easily extended to
more sophisticated relations.
My current design is based on items and connections between pairs of
items. Items are used to represent classes, objects, and standard
relationships between objects and classes. Connections are used to
connect object items to relation items, tagging each object as a
subject, object, context, etc., much like the parts of a sentence.
The system is largely typeless, in that every item / object has a common
format. Attributes are accessed through algorithmic traversals from the
attributed item. More radically, object attributes are not explicitly
represented in the system at all. Rather, class attributes are
represented as functions (graph traversal specifications) that return
zero or more related objects given a source object.
A typical attribute function might be children(x), a function that given
x returns the set of the children of x. There is a lot of freedom in
how one might represent family relationships in relational form, and the
graph traversal specification must match the chosen representation.
Given a beget(father,mother,child) relation, to find the children of x,
I have to enumerate the 'beget' predicate relationships where x is one
of the first two arguments, and then enumerate the referents of the
third argument. The current design can easily represent shared /
compressed relationships like beget(dad,mom,{child1,child2,....childN})
although that form must be split if each sub-relationship has distinct
attributes, such as the birth date for each child.
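The item/connection encoding of the split beget form might look
something like this rough Python sketch (Item, Connection, connect, and
the role tags are illustrative stand-ins, not my actual C++ types):

```python
# Rough sketch of the item/connection model (illustrative names only).
class Item:
    """A typeless node: may stand for a class, an object, or a
    relationship instance."""
    def __init__(self, label):
        self.label = label        # for readability only
        self.connections = []     # list of Connection

class Connection:
    """Tags a related item with its role (arg1, arg2, arg3, class...)."""
    def __init__(self, role, target):
        self.role = role
        self.target = target

def connect(relation, role, obj):
    # forward edge: relation item -> object item, tagged with the role
    relation.connections.append(Connection(role, obj))
    # reverse edge: object item -> relation item, tagged with '-role'
    obj.connections.append(Connection('-' + role, relation))

# Encode beget(dad, mom, child1) and beget(dad, mom, child2) as two
# relationship items -- the split form that allows per-child attributes:
beget = Item('beget')
dad, mom = Item('dad'), Item('mom')
child1, child2 = Item('child1'), Item('child2')
for child in (child1, child2):
    rel = Item('beget instance')
    connect(rel, 'class', beget)
    connect(rel, 'arg1', dad)
    connect(rel, 'arg2', mom)
    connect(rel, 'arg3', child)
```

The reverse '-role' edges are what make moves like -arg1 in the
traversal language cheap.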
In any case, children(x) can be easily represented as a string of graph
traversal and node test instructions, for example:
start x ; start on node x
move -arg1 ; move to relationship(s) where x is first argument
test beget ; test to see if we are on a 'beget' relationship
move arg3 ; move to referent of argument 3
return ; current node is child of x
The first thing to recognize is that for this to work, any branch whose
test or traversal fails must be aborted, for example if x never
participated in any relationships as argument 1, or if none of those
were 'beget'-type relationships. The second thing to recognize is that each move
instruction can result in an arbitrary number of graph traversals, each
of which must be handled independently. It is as if each move
instruction forks a sub-process to handle each matching graph edge, and
aggregates the results.
Of course I can't create new processes or threads to handle this type of
parallelism, so I just save state on some sort of stack instead.
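The fork-and-aggregate semantics can be sketched with exactly that kind
of explicit worklist in place of real processes (the dict-based items
and the instruction encoding here are illustrative, not my actual
representation):

```python
# Sketch of the fork-and-aggregate semantics with an explicit worklist
# (dict-based items and the instruction encoding are illustrative).
def traverse(start, program):
    """Run (op, operand) instructions from start; return every item
    reached by a 'return' instruction."""
    results = []
    worklist = [(start, 0)]          # saved state: (current node, pc)
    while worklist:
        node, pc = worklist.pop()
        op, arg = program[pc]
        if op == 'move':
            # fork one continuation per matching edge; zero matching
            # edges simply aborts this branch
            for neighbor in node.get(arg, []):
                worklist.append((neighbor, pc + 1))
        elif op == 'test':
            if node.get('kind') == arg:  # a failed test aborts the branch
                worklist.append((node, pc + 1))
        elif op == 'return':
            results.append(node)
    return results

# children(x) against a one-child beget graph:
child = {'kind': 'person'}
rel = {'kind': 'beget', 'arg3': [child]}
x = {'kind': 'person', '-arg1': [rel]}
children = [('move', '-arg1'), ('test', 'beget'),
            ('move', 'arg3'), ('return', None)]
# traverse(x, children) -> [child]
```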
Currently every move instruction gets converted into a related item
index range scan on the set of connections to related items. The
indexes are currently by (arg_selector,item_class) so a move,test pair
converts into a single range scan. Then for each matching item of the
first range scan (a 'beget' relationship item in this case), a second
related item index range scan is run, returning items that are connected
as argument 3 - the children.
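The per-object (arg_selector,item_class) index and the two range scans
might be sketched like so (the sorted-list index, the class names, and
the assumption that children are 'person'-class items are mine, not the
real layout):

```python
import bisect

# Sketch of the per-object (arg_selector, item_class) index
# (illustrative; the real structure differs).
class Item:
    def __init__(self, item_class):
        self.item_class = item_class
        self._index = []  # sorted: ((selector, target class), id, target)

    def add(self, selector, target):
        key = (selector, target.item_class)
        bisect.insort(self._index, (key, id(target), target))

    def scan(self, selector, item_class):
        """Range scan over connections keyed (selector, item_class)."""
        lo = bisect.bisect_left(self._index, ((selector, item_class),))
        out = []
        for key, _, target in self._index[lo:]:
            if key != (selector, item_class):
                break
            out.append(target)
        return out

def children(x):
    # nested loops: the outer scan finds 'beget' relationships where x
    # is argument 1; the inner scan finds each one's argument 3 referents
    return [c for rel in x.scan('-arg1', 'beget')
              for c in rel.scan('arg3', 'person')]
```

Note how the move,test pair (move -arg1 ; test beget) collapses into
the single outer scan, because the index key already carries the item
class.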
In relational databases, this type of query execution plan is known as
'nested loops' and is a common way to execute a join of two well indexed
tables. The big difference is that relational databases join using
global high cardinality indexes, whereas here every object has its own
low cardinality index of connections to related objects. (Of course
relational databases could be implemented this way in principle, but as
far as I know no one ever has; current technology is heavily entrenched.
What happens in practice is that object oriented database people develop
relational (i.e. SQL) interfaces to non-relational data, using
'relational' in the conventional strongly-typed set-of-rows sense, of
course.)
Now for my purposes, I need this query to execute in microseconds, not
milliseconds (as from a client server database), so I am loading
everything into memory. The advantage of strongly typed languages
(C++,Java) is that single valued attribute access occurs at a fixed
offset from the item header in a matter of nanoseconds.
Multi-valued access is much slower, of course. By using the same system
for single and multivalued attributes, there is a significant
performance penalty (~20x) similar to that of systems like CLOS
(probably worse). The advantage is flexibility, which is ideal for my
applications and for general purpose knowledge modeling systems.
One of my problems is that there is an increasing mismatch between C++
and my high level application code. The low level stuff (query engine
and the like) works very well in C++, but the high level stuff
practically needs its own language. I like C# style syntax, but the big
difference is the requirement to create classes at runtime and execute
code written in terms of those classes, all on top of the indexed
item-connection meta framework. I am hoping to write an interpreter
roughly similar to what you might see in Python (first pass conversion
to AST or byte code, interpret AST or byte code directly). This is
all high level code, so the 10x interpretation penalty won't be a problem.
- Mark