[om-list] Re: Fast subset enumeration in object-relation models

Sat Oct 25 21:11:57 EDT 2003

Hello everybody,

In order to solve the subset enumeration performance in object-relation 
model problem, I have decided to implement an in memory arbitrary degree 
tree structure for the purpose of indexing the attributes and relations 
of each object.  My tree structure resembles a B-tree, except  instead 
of multi-key pages arranged in a tree structure, it has doubly linked 
lists arranged in a tree structure, where every list may contain either 
branches (sub-lists) or leaves (end items). 

B-trees are designed for disk usage, this structure for RAM usage. 
Traditional B-trees have branching factor limited by the disk page size, 
but are often as high as 100, which is ideal when you have a 10 ms 
access time for each page.   I can choose any branching factor 
convenient - there is a non-linear space / time tradeoff, however.

 Initial indications are that a branching factor of eight or so is 
ideal.  With a branching factor of eight, there is only a 15% overhead 
for interior nodes, but one of 1000 relations can be accessed in less 
than eight steps, and one of eight million items, in less than 40 steps, 
provided of course that you have an exact key or narrow key range to 
begin with.   Hash tables are even more efficient provided you know in 
advance how many items will be indexed, but are out of the question due 
to the inability to do key range queries, which are critical in most 
general purpose applications.

I might also mention that I have decided to store both ends of 
relationships in the same reference index for each referenced item.  It 
saves memory and solves a major problem of not knowing which index to 
search when relationships can be represented in both forward and 
backward senses, e.g. child(x,y) and parent(y,x) respectively.  I plan 
to find specific well known relationships quickly by assigning them 
multi-part relation tags, e.g. (SI_ATTRIBUTE,ATTRIBUTE_CHILD) and 
grouping entries together in the index.

This is vaguely similar to the Common LISP Object System (CLOS) method 
of attribute access, except I believe CLOS used lists instead of trees 
for attribute enumeration.  Any such meta-system of course is going to 
be much slower than hard coded C or Java structures, but such structures 
are not remotely as flexible. For family history at least, flexibility 
is king.

  - Mark

I wrote:

> Fast subset enumeration in object-relation models
> ...
> Although there are other aspects of knowledge modelling that may have 
> an impact on choice of data structures, fully polymorphic object 
> relation models - those that use a class hierarchly with only one or 
> two roots (Object and/or Relation, respectively) have a serious 
> performance issue.   Traditionally if you want fast access to an 
> attribute, you store it inline.  If you move attributes out into 
> separate objects, you typically have to traverse a list for every 
> access.  This is a bit of a problem, one of the reasons why LISP 
> hasn't taken over the world yet, for example.
>
> For many purposes, the list traversal penalty is acceptable as long as 
> the list rarely exceeds ~100 items.  When a list gets into the 
> thousands of entries, more sophisticated data structures need to be 
> adopted. A very common case is where a list is can old items that are 
> members of a class or any of its sub-classes, and you want to quickly 
> enumerate only the ones belonging to a particular sub-class.
>
> A simple way to represent a set is to use a doubly linked list, one 
> entry for each set member.  This is insufficient to allow fast access 
> by sub-class, so one alternative might by to store a set as 
> partitioned set of sets, one for each sub class.  Then when we want to 
> scan the members of the set that are of a certain sub-class, we could 
> scan the top level list looking for the subset that contains that sub 
> class, and then scan that subset.  Properly speaking, you would need a 
> forest of subsets rather than a simple lists, but either way massive 
> (10-1000x) performance improvements can be obtained by reducing a list 
> search from O(n) to O(k) where n is the list size, and k is the number 
> of elements of the target sub class.
>
> A typical example is dealing with an 'object' that represents a 
> universal, or a class.  A class typically has only one or two base 
> classes, a half dozen attributes, but perhaps millions of instances.  
> Any system that needs fast access to base classes and class attributes 
> is going to fail miserably if the instance relations have to be 
> scanned  to find the more fundamental  relationships.  I believe what 
> I have proposed will solve the problem nicely and have started an 
> implementation in C++ that I am going to use in my family history 
> system.  The trick of course is to encapsulate it all nicely, so that 
> the ugly details are not apparent in higher level code.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://six.pairlist.net/pipermail/om-list/attachments/20031025/ceee0d62/attachment.html