[om-list] My Free Time Project -- Re: second letter

Thomas L. Packer at home ThomasAndMegan at Middle.Net
Sat Aug 17 15:07:55 EDT 2002


Mark,

    Sounds good.

    I'm starting to experiment with internet programming.  The first step
was easy -- too easy, probably.  I searched the internet and found a
function in VS.NET (which, incidentally, has no documentation in VS.NET: I
can't find it, its containing class, or its namespace mentioned in the
help files of Visual Studio).  Using the example I found on the internet,
it took me about five minutes to make a "web browser" of the simplest sort
that displays whatever web page corresponds to the URL you type in.  I
think it uses an IE DLL, so all the hard parts are taken care of by
Microsoft.  But it makes one feel like he's accomplished *something* to see
a live web page displayed in his own little window.
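
    Stripped down, the whole thing amounts to something like the sketch
below.  (The class names here are placeholders for whatever the managed
wrapper around the IE engine is actually called; as I said, I can't find
the real one in the help files, so don't take these names as gospel.)

    using System;
    using System.Windows.Forms;

    public class MiniBrowser : Form
    {
        private TextBox addressBox = new TextBox();
        private Button goButton = new Button();
        private WebBrowser browser = new WebBrowser();

        public MiniBrowser()
        {
            Width = 640;
            Height = 480;
            addressBox.Width = 400;
            goButton.Text = "Go";
            goButton.Left = addressBox.Right + 5;
            browser.Top = addressBox.Bottom + 5;
            browser.Width = 600;
            browser.Height = 380;

            // Show whatever page corresponds to the URL typed in.
            goButton.Click += new EventHandler(OnGoClicked);

            Controls.Add(addressBox);
            Controls.Add(goButton);
            Controls.Add(browser);
        }

        private void OnGoClicked(object sender, EventArgs e)
        {
            browser.Navigate(addressBox.Text);
        }

        [STAThread]
        static void Main()
        {
            Application.Run(new MiniBrowser());
        }
    }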

    The problem is that I can't get any information back from this
function, such as whether the requested web page timed out.
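
    I suspect the fix is to go one level lower and make the HTTP request
myself instead of letting the browser control do it.  Something like this
sketch (assuming the System.Net classes behave the way I think they do)
would tell me the status code, or throw a WebException on a time-out:

    using System;
    using System.Net;

    class PageProbe
    {
        static void Main(string[] args)
        {
            string url = args.Length > 0 ? args[0] : "http://www.example.com/";
            try
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                request.Timeout = 3000;  // give up after three seconds

                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                {
                    // Status code, headers, content type, etc. are all available here.
                    Console.WriteLine(url + ": " + (int)response.StatusCode + " "
                                      + response.StatusDescription);
                }
            }
            catch (WebException ex)
            {
                // Time-outs, name-resolution failures, 404s, and so on land here.
                Console.WriteLine(url + ": " + ex.Status);
            }
        }
    }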

    For fun, I wrote a 4-level nested loop in which I can cycle through all
the IP numbers in the universe and display each web page one at a time,
with a three-second wait before going on to the next one.  (I'm assuming
that I understand how IP numbers work: I increment each of the four
numbers in "xxx.xxx.xxx.xxx" from 0 to 255.)  This probably sounds pretty
silly, but I figure I have to start with what I know, and experiment, and
learn.
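
    In code it's nothing fancier than four nested loops building up each
address as a string.  (Incidentally, 256^4 is about 4.3 billion addresses,
so at three seconds apiece this little program would run for a few
centuries; but like I said, it's a starting point.)  A sketch, with a
WriteLine standing in for the browser-control call:

    using System;
    using System.Threading;

    class CycleAllAddresses
    {
        static void Main()
        {
            // Walk every possible dotted-quad address, lowest to highest.
            for (int a = 0; a <= 255; a++)
                for (int b = 0; b <= 255; b++)
                    for (int c = 0; c <= 255; c++)
                        for (int d = 0; d <= 255; d++)
                        {
                            string address = a + "." + b + "." + c + "." + d;

                            // In the real program this is where the browser
                            // control gets told to navigate to the address.
                            Console.WriteLine("http://" + address + "/");

                            Thread.Sleep(3000);  // three-second wait before the next one
                        }
        }
    }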

    It's apparent that most of the numbers out there don't actually point
to anything.  So the next step is to find better functions in Visual C#
.NET that tell me more about the IP numbers I am requesting, so I can
programmatically test each of them, make a record of the ones that work,
and maybe build a list of the real web pages, so that the next time I want
to cycle through all the web pages in the universe I can do it much more
concisely and quickly.
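
    The simplest test I can think of is to try opening a TCP connection to
port 80 with a short time-out and write down the addresses that answer.  A
sketch (the 192.0.2.x block below is just a placeholder range, and the
one-second time-out is a guess at what's reasonable):

    using System;
    using System.IO;
    using System.Net;
    using System.Net.Sockets;

    class FindLiveAddresses
    {
        static void Main()
        {
            using (StreamWriter log = new StreamWriter("live_addresses.txt"))
            {
                // Probe one small block; the full sweep would just wrap this
                // in the other three loops from before.
                for (int d = 0; d <= 255; d++)
                {
                    string address = "192.0.2." + d;   // placeholder block
                    if (AnswersOnPort80(address))
                    {
                        Console.WriteLine(address + " answers on port 80");
                        log.WriteLine(address);
                    }
                }
            }
        }

        // True if something accepts a TCP connection on port 80 within one second.
        static bool AnswersOnPort80(string address)
        {
            Socket socket = new Socket(AddressFamily.InterNetwork,
                                       SocketType.Stream, ProtocolType.Tcp);
            try
            {
                IAsyncResult result = socket.BeginConnect(
                    new IPEndPoint(IPAddress.Parse(address), 80), null, null);
                return result.AsyncWaitHandle.WaitOne(1000, false) && socket.Connected;
            }
            catch (SocketException)
            {
                return false;
            }
            finally
            {
                socket.Close();
            }
        }
    }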

    The point of all this is just to learn more about how the internet is
put together, and eventually to go on to semantically searching and
categorising some web pages.  I bet there are internet resources that will
give me a more concise listing of all the web pages on the internet.  Do
you know of any?  I'm sure I'll want to make use of DNS at some point.
Any suggestions are welcome.
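
    One thing I can already do with DNS is ask for the name registered to
an address that answers, which should help separate real, named web
servers from everything else.  A sketch (the default address is just a
placeholder):

    using System;
    using System.Net;
    using System.Net.Sockets;

    class ReverseLookup
    {
        static void Main(string[] args)
        {
            string address = args.Length > 0 ? args[0] : "192.0.2.1";  // placeholder
            try
            {
                // Ask DNS which host name, if any, is registered for this address.
                IPHostEntry entry = Dns.GetHostEntry(address);
                Console.WriteLine(address + " -> " + entry.HostName);
            }
            catch (SocketException)
            {
                Console.WriteLine(address + " has no DNS entry");
            }
        }
    }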

    I also looked briefly into regular expressions, since the CS people who
are interested in linguistics and web information extraction seem to love
them.  But I think I want to write my own.  I just looked at PHP's regular
expression functions, and I'm guessing that most languages use something
similar: a little bit ugly.  I want something more flexible, more uniform,
simple, and scalable.  I'm thinking of defining my own in the following
way, which should allow me to use them for everything from simple, short
character-string matching all the way up to abstract sentence structure,
grammatical rules in arbitrary (natural and artificial) languages, etc.:

    Recursive Definition:

    Basis Case:  An ASP ("Abstract Sequence Pattern" or "Abstract String
Pattern") is a character or null.

    Recursive Case:  An ASP is the combination of two or more simpler ASPs
using one or more of the predefined ASP operators.

    So far, I can think of only three operators (functions) to perform on
ASPs to create larger ASPs.  They are the "or", the "unordered_and", and
the "ordered_and".

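    To make the recursion concrete, here is roughly the data structure I
have in mind, sketched in C#.  (Everything here is hypothetical; the names
and shapes will certainly change.)

    using System;

    // Basis case: an ASP is a single character, or null (the empty pattern).
    // Recursive case: an ASP is two or more simpler ASPs combined by an operator.
    abstract class Asp { }

    class CharAsp : Asp
    {
        public readonly char Value;
        public CharAsp(char value) { Value = value; }
    }

    class NullAsp : Asp { }

    enum AspOperator { Or, UnorderedAnd, OrderedAnd }

    class CompositeAsp : Asp
    {
        public readonly AspOperator Operator;
        public readonly Asp[] Parts;
        public CompositeAsp(AspOperator op, params Asp[] parts)
        {
            Operator = op;
            Parts = parts;
        }
    }

    class Example
    {
        static void Main()
        {
            // $char = (a or b or ... or z), built programmatically.
            Asp[] letters = new Asp[26];
            for (int i = 0; i < 26; i++)
                letters[i] = new CharAsp((char)('a' + i));
            Asp charAsp = new CompositeAsp(AspOperator.Or, letters);

            Console.WriteLine(charAsp.GetType().Name);  // just proves it builds
        }
    }
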
    Example of using "or":  To make an ASP wild-card that represents any
single alphabetic character, I would take the "or" or union of all
characters:  $char = {a or b or c or d or e .... X or Y or Z}.  (My
notation will probably change to make expressions simpler.)  Then you
could use this ASP to find any three-character word: find( $char$char$char ).
I guess I should add a quantifier operator: find( (3)$char ).
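
    For comparison, the existing regular-expression way of saying
find( (3)$char ) is short but cryptic, which is part of what I don't like
about it.  In C# it comes out roughly like this:

    using System;
    using System.Text.RegularExpressions;

    class ThreeLetterWords
    {
        static void Main()
        {
            string text = "The cat sat on the mat near two old dogs.";

            // \b[A-Za-z]{3}\b is roughly find( (3)$char ): exactly three
            // alphabetic characters standing alone as a word.
            foreach (Match match in Regex.Matches(text, @"\b[A-Za-z]{3}\b"))
                Console.WriteLine(match.Value);  // The, cat, sat, the, mat, two, old
        }
    }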

    Example of using "unordered and":  This will be useful after we've
built up some abstract ASPs, such as these two: $char and $word.  I could
define a set (an ASP) using "unordered and" that represents all of the
single-character words, such as "I" and "a".  (This is just an example of
using "unordered and"; there will probably be better ways of defining
single-character words.)  So $OneLetterWord = ($char uand $word).

    Example of using "ordered and":  This will be the most interesting
operator.  It's the one that carries the idea of order or sequence.  Most
simply, you can define a word as the ordered sequence of its constituent
letters: $Butterfly = (B oand(1) u oand(1) t oand(1) t oand(1) e oand(1) r
oand(1) f oand(1) l oand(1) y).  Notation for this operator will probably
change, but you get the idea.  I will probably substitute something much
simpler for this particular expression, such as $Butterfly = "Butterfly".
But there could be much power in the general use of this operator.  If you
generalise the number parameter and the units parameter (not shown), then
you can specify any arbitrary distance, such as "two words" or "two to
four instances of the noun ASP" or "from negative five to positive three
instances of the noun-phrase ASP".
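
    Regular expressions already have a narrow version of that distance idea
in their repetition quantifiers -- "two to four word-like tokens in a row"
comes out like the sketch below -- but they only count forward over literal
text, which is why I want the generalised number and units parameters.

    using System;
    using System.Text.RegularExpressions;

    class RepetitionExample
    {
        static void Main()
        {
            string text = "one two three four five";

            // (?:[A-Za-z]+\s*){2,4} : two to four word-like tokens in a row,
            // a crude analogue of "two to four instances of the $word ASP".
            Match match = Regex.Match(text, @"(?:[A-Za-z]+\s*){2,4}");
            Console.WriteLine(match.Value);  // greedy, so it grabs "one two three four "
        }
    }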

    Does this not sound useful?  Maybe there are already other language
definition languages (metalanguages) out there.  But this is just too much
fun not to try out myself before I find them.  And I think it will be much
more flexible and useful than regular expressions.

    Then I will hopefully write some machine learning algorithms to
discover and define useful sets of ASPs for various things like ...
whatever ... language learning, grammar learning, etc.

    Big dreams.  We'll see.

    This ASP definition language may end up being my first version of MTL.

    I'm also learning XML and XSL and will hopefully make use of them in
making my first web-page and/or bibliography database and categorised
display system.
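
    The rough plan is to keep the bibliography data in an XML file and use
an XSL stylesheet to turn it into the HTML display.  The transformation
step itself appears to be only a few lines in C# (a sketch; the file names
are hypothetical and the exact class names may differ by framework
version):

    using System;
    using System.Xml.Xsl;

    class TransformBibliography
    {
        static void Main()
        {
            // bibliography.xml holds the data, bibliography.xsl describes the
            // display, and bibliography.html is the page that gets generated.
            XslCompiledTransform transform = new XslCompiledTransform();
            transform.Load("bibliography.xsl");
            transform.Transform("bibliography.xml", "bibliography.html");

            Console.WriteLine("Wrote bibliography.html");
        }
    }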

    More big dreams: I want to make a hierarchical display format, like I
sort of told you about before, where a given node can have more than one
parent, which I think will make browsing through the data faster and
easier for people.  It will be easier for the person seeking a node (a web
page or book title) because they won't have to go down the exact branch of
the category tree that I first thought of when I defined the categories.
Instead, every leafnode in the tree will have a fuzzy classification value
for every category in the tree.  And every sub-category within a
branch-path will also have fuzzy classifications in terms of multiple
parent categories.  Then when someone searches down a branch of
categories, each time they select a new branch in the left pane of the
window, the flat list (table) in the right pane will be re-sorted by
relevance, defined in terms of the combination of (1) each leafnode's
(book title's, or web page's) classifications, and (2) that class's
membership in its parent classifications, all the way up to whatever
class/category the user clicked on.
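
    As a tiny worked example with made-up numbers (and assuming that
multiplying memberships along the path is a sensible way to combine them):
say a book belongs 0.8 to "AI", and "AI" belongs 0.9 to "Computer Science"
but only 0.4 to "Mathematics".

    using System;

    class RelevanceSketch
    {
        static void Main()
        {
            // Made-up fuzzy membership values for one book and two branch-paths.
            double bookInAi = 0.8;          // the book belongs 0.8 to "AI"
            double aiInCompSci = 0.9;       // "AI" belongs 0.9 to "Computer Science"
            double aiInMathematics = 0.4;   // "AI" belongs 0.4 to "Mathematics"

            // One simple way to combine memberships along a branch-path is to
            // multiply them from the leafnode up to the clicked category.
            Console.WriteLine("Under Computer Science: {0:F2}", bookInAi * aiInCompSci);      // 0.72
            Console.WriteLine("Under Mathematics:      {0:F2}", bookInAi * aiInMathematics);  // 0.32
        }
    }

So the same book shows up under both branches, just sorted higher under the
one it really belongs to.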

    In this way I think I can output a beautiful list of all leafnodes
(such as book titles) sorted in a way that makes sense, given a person's
chosen category path, without leaving anything out on account of similar
(though not mutually exclusive) categories found in other branch-paths of
the tree.

    Suggestion request:  I'm thinking I will probably have to use PHP or
something in order to display a tree graph in the left pane of a web page
that expands and contracts and updates the table view in the right pane.
Any suggestions?

    My first term at BYU is over, by the way, which is why I have time to do
this stuff.

    Also, I just applied to the MS in CS program.  My fingers are crossed.

    ciao,
tomp

~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Víðar sum quem nihil obstat.
www.Ontolog.Com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~




