[om-list] Re: Fast subset enumeration in object-relation models
Mark Butler
butlerm at middle.net
Sat Oct 25 21:11:57 EDT 2003
Hello everybody,
In order to solve the subset enumeration performance in object-relation
model problem, I have decided to implement an in memory arbitrary degree
tree structure for the purpose of indexing the attributes and relations
of each object. My tree structure resembles a B-tree, except instead
of multi-key pages arranged in a tree structure, it has doubly linked
lists arranged in a tree structure, where every list may contain either
branches (sub-lists) or leaves (end items).
B-trees are designed for disk usage, this structure for RAM usage.
Traditional B-trees have branching factor limited by the disk page size,
but are often as high as 100, which is ideal when you have a 10 ms
access time for each page. I can choose any branching factor
convenient - there is a non-linear space / time tradeoff, however.
Initial indications are that a branching factor of eight or so is
ideal. With a branching factor of eight, there is only a 15% overhead
for interior nodes, but one of 1000 relations can be accessed in less
than eight steps, and one of eight million items, in less than 40 steps,
provided of course that you have an exact key or narrow key range to
begin with. Hash tables are even more efficient provided you know in
advance how many items will be indexed, but are out of the question due
to the inability to do key range queries, which are critical in most
general purpose applications.
I might also mention that I have decided to store both ends of
relationships in the same reference index for each referenced item. It
saves memory and solves a major problem of not knowing which index to
search when relationships can be represented in both forward and
backward senses, e.g. child(x,y) and parent(y,x) respectively. I plan
to find specific well known relationships quickly by assigning them
multi-part relation tags, e.g. (SI_ATTRIBUTE,ATTRIBUTE_CHILD) and
grouping entries together in the index.
This is vaguely similar to the Common LISP Object System (CLOS) method
of attribute access, except I believe CLOS used lists instead of trees
for attribute enumeration. Any such meta-system of course is going to
be much slower than hard coded C or Java structures, but such structures
are not remotely as flexible. For family history at least, flexibility
is king.
- Mark
I wrote:
> Fast subset enumeration in object-relation models
> ...
> Although there are other aspects of knowledge modelling that may have
> an impact on choice of data structures, fully polymorphic object
> relation models - those that use a class hierarchly with only one or
> two roots (Object and/or Relation, respectively) have a serious
> performance issue. Traditionally if you want fast access to an
> attribute, you store it inline. If you move attributes out into
> separate objects, you typically have to traverse a list for every
> access. This is a bit of a problem, one of the reasons why LISP
> hasn't taken over the world yet, for example.
>
> For many purposes, the list traversal penalty is acceptable as long as
> the list rarely exceeds ~100 items. When a list gets into the
> thousands of entries, more sophisticated data structures need to be
> adopted. A very common case is where a list is can old items that are
> members of a class or any of its sub-classes, and you want to quickly
> enumerate only the ones belonging to a particular sub-class.
>
> A simple way to represent a set is to use a doubly linked list, one
> entry for each set member. This is insufficient to allow fast access
> by sub-class, so one alternative might by to store a set as
> partitioned set of sets, one for each sub class. Then when we want to
> scan the members of the set that are of a certain sub-class, we could
> scan the top level list looking for the subset that contains that sub
> class, and then scan that subset. Properly speaking, you would need a
> forest of subsets rather than a simple lists, but either way massive
> (10-1000x) performance improvements can be obtained by reducing a list
> search from O(n) to O(k) where n is the list size, and k is the number
> of elements of the target sub class.
>
> A typical example is dealing with an 'object' that represents a
> universal, or a class. A class typically has only one or two base
> classes, a half dozen attributes, but perhaps millions of instances.
> Any system that needs fast access to base classes and class attributes
> is going to fail miserably if the instance relations have to be
> scanned to find the more fundamental relationships. I believe what
> I have proposed will solve the problem nicely and have started an
> implementation in C++ that I am going to use in my family history
> system. The trick of course is to encapsulate it all nicely, so that
> the ugly details are not apparent in higher level code.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://six.pairlist.net/pipermail/om-list/attachments/20031025/ceee0d62/attachment.html
More information about the om-list
mailing list