[om-list] Databases

Fri May 25 20:17:48 EDT 2001

Hello everybody,

I have a new project to build a unified genealogical repository for one of my
clients.  The current repository is a cluster of small PAF databases with
inter-database links.  It is getting out of control because we have too many
duplicate individuals in different files.

As before, I will need to use an advanced fuzzy matching technique to merge
all the duplicate individuals together while maintaining a high level of data
quality.

What is new is that I need a non-PAF database in which to permanently store
the resulting clean data.  For reasons I have documented before, relational
databases are not candidates.

One interesting thing is that we will still use PAF to do data entry, so work
will have to be done using a check in / check out system as follows:

1. Check out a set of related records, place in PAF file
2. User edits checked out data set using PAF
3. Analyze changes, resolve conflicts, check into repository

You may be aware that the CVS source code control system works in a manner
very similar to this.  The main advantage of this type of protocol is that it
allows disconnected operation, which makes it scale well to a large number of
remote users.
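
The three steps above can be sketched roughly as follows.  This is only an
illustration, not the design I have in mind: the Repository class, its
in-memory dictionary of records, and the field names are all hypothetical,
and step 2 (editing in PAF) is simulated by changing a copy directly.  The
check-in compares the base snapshot, the edited copy, and the current
repository state field by field, the way a three-way merge would.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical in-memory store: record id -> dict of field values."""
    records: dict = field(default_factory=dict)

    def check_out(self, ids):
        # Step 1: copy the requested records; this copy is the base snapshot.
        return {rid: dict(self.records[rid]) for rid in ids}

    def check_in(self, base, edited):
        # Step 3: compare base, edited copy, and current repository state.
        # Assumes edits only add or change fields (no deletions).
        conflicts = []
        for rid, new in edited.items():
            old = base[rid]
            cur = self.records[rid]
            for key, value in new.items():
                if value == old.get(key):
                    continue  # the user did not touch this field
                if cur.get(key) == old.get(key):
                    cur[key] = value  # only the user changed it: accept
                elif cur.get(key) != value:
                    # both sides changed it differently: report, do not apply
                    conflicts.append((rid, key, cur.get(key), value))
        return conflicts


repo = Repository({"I1": {"name": "John Smith", "birth": "1850"}})
base = repo.check_out(["I1"])

# Step 2: stands in for the user editing the checked-out set in PAF.
edited = {rid: dict(rec) for rid, rec in base.items()}
edited["I1"]["birth"] = "1851"

# Meanwhile another user checks in a change to a different field.
repo.records["I1"]["name"] = "John A. Smith"

conflicts = repo.check_in(base, edited)
```

Because the two users changed different fields, both changes survive and no
conflict is reported.  A conflict is recorded only when the same field was
changed to different values on both sides since the check out.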

As you might imagine, I think it would be very helpful for the One Model
system to be able to operate this way as well.  I have thought about tweaking
FramerD to do this, but after glancing at their source code, I don't think it
would be particularly difficult to write something similar that I can actually
maintain.

THE REAL VALUE OF SOFTWARE

One lesson I wish more open source people would learn is that the real value
of any sufficiently complex piece of software is largely tied up in how easy
it is to comprehend and update, which means that C++ or Java are to be
strongly preferred over C.

A classic example is the case of two web browsers: Mozilla and Konqueror.  The
Mozilla project has taken three years to do in C what the KDE Konqueror
developers did much better in C++ and Qt in six months.

After much perusal of the Postgres source code and much discussion with the
Postgres developers, I tend to think that one of these days the whole thing is
going to have to be redone in C++ to become a worthy competitor to the likes
of Oracle.  Ten years ought to do it.

If you have a staff of a hundred programmers, you can probably use plain old
C, but I think it is a rare case that a very complex open source project can
maintain sanity with a large number of developers without using the highly
structured development techniques encouraged by object-oriented languages.

The best part about well-written code is that it is self-documenting.
Unfortunately, most open source software projects produce very little internal
technical documentation, so it always comes down to how readable the source
is, and object-oriented languages tend to produce much more readable source.

The reason why LISP and Scheme will never become general purpose programming
languages is that the source code for any sufficiently complex function is
about as easy to read as assembly language.  I can see them being used as a
middle layer to a higher level language, though.  

You could, for example, translate Java to LISP and then use a LISP runtime
engine instead of the JVM - the whole problem with the JVM, of course, is that
it is much too Java-specific.  The main problem with Java is that it is too
low-level a language for business applications.  A good business language
should meld seamlessly with a database technology.  Every standard language
still lags far behind most database 4GLs in this respect.

In C++ or Java, everyone still has to:

1. Prepare the statement
2. Check for an error
3. Bind parameters
4. Bind result columns
5. Execute / Fetch
6. Check for an error

That is pretty pathetic when you really want nice code like:

persistent Car * x;

x = Cars[license_nbr == "123 ABC" && state == "UT"];

if(x)
 {
  x->renewal_date = sysdate + 365;
  commit;
 }
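
For contrast, here is roughly what the same lookup and update looks like
against a conventional database API.  This is a sketch in Python with its
standard sqlite3 module, neither of which is named above; the cars table and
its columns are assumptions made up to mirror the wished-for snippet.  The
DB-API folds prepare, bind, and execute into one call, but the SQL strings,
parameter tuples, explicit fetch, error handling, and commit all remain.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical schema and sample row, mirroring the Car example above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (license_nbr TEXT, state TEXT, renewal_date TEXT)"
)
conn.execute("INSERT INTO cars VALUES ('123 ABC', 'UT', '2001-05-25')")

try:
    # Prepare + bind parameters + execute, then fetch the result row.
    row = conn.execute(
        "SELECT rowid FROM cars WHERE license_nbr = ? AND state = ?",
        ("123 ABC", "UT"),
    ).fetchone()

    if row is not None:
        # Equivalent of: x->renewal_date = sysdate + 365; commit;
        renewal = (date.today() + timedelta(days=365)).isoformat()
        conn.execute(
            "UPDATE cars SET renewal_date = ? WHERE rowid = ?",
            (renewal, row[0]),
        )
        conn.commit()
except sqlite3.Error:
    # The error check that the language should be doing for us.
    conn.rollback()
    raise
```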

- Mark