kasykflow - Flow of information when using Kasyk
This document describes the flow of information when using Kasyk. It could be considered as a very short "Kasyk for Dummies", a book that might get written someday.
The Kasyk search engine uses the concept of a "document": a document contains properties and text. Properties of a document are items of (generally) non-textual information that are associated with the document (such as a filename or a last modified date). At least one of the properties of a document should uniquely identify the document. This identifying property serves two purposes:
Properties can also be used to limit searching in your information to a specific set of documents. And they allow you to easily obtain other information relevant to a document in the Kasyk hitlist XML.
To create a search engine for your information, it is necessary to know how that information can be extracted programmatically. Or even better, automatically (for instance with a program that will dump the information into a textfile, or even better, save it as XML). Many standard document and database systems provide such a functionality, so this sounds a lot more difficult than it generally is.
But if there is no such functionality available, then it is necessary to know the way the information is stored, so that a program can be made (by you or someone that can help you) that will allow you to dump the information into a textfile or save it as XML.
Once your information is stored in a format that an external program can easily access, it is necessary to decide what constitutes a "document". In general, this is very easy: a document can for instance be a file, a single web page, a single record from a database or one email message in a mail box. Sometimes, especially when interfacing with relational database systems, a document may need to be put together from different records (for instance when translating from numbers to associated strings).
During this process, decisions need to be made on the following issues:
Once it is decided which properties are to be associated with a document, the types of the properties need to be defined. Lazily, you can make them all of type "string", but because of the performance of the searching operations, it is important to define the type of each property as optimally as possible.
Decide whether the property is always a whole number (for instance, when it is an ID out of a relational database, or when the document was published), or a string (name of a file, or a URL, or the name of the author), or a floating point value (for instance, the relative importance of the document) or a flag (true/false) value (whether or not it is public, whether or not it is spam, whether or not it is a faq, whether or not it is a press release). Specify this type in the type attribute of the <property> container of the Kasyk configuration XML.
If a specific property can only have a fixed number of different values, for instance a status property that can only have 5 "states" (inactive, active, bronze, silver, gold), then performance will increase significantly if the value "keyed" is specified in the value attribute of the <property> container of the Kasyk configuration XML.
In this example, the strings corresponding to the different states are internally converted to numbers:
inactive => 1
active => 2
bronze => 3
silver => 4
gold => 5
And these numbers can be compared to other numbers orders of magnitude faster than the strings could be compared to other strings.
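The effect of keyed values can be illustrated with a minimal Python sketch. It is not Kasyk's internal implementation: the names build_key_table and encode_status are hypothetical, and the numbering simply follows the example above.

```python
def build_key_table(states):
    """Map each distinct string to a small integer, starting at 1."""
    return {state: number for number, state in enumerate(states, start=1)}


# The five states from the example above, in the same order.
STATUS_KEYS = build_key_table(["inactive", "active", "bronze", "silver", "gold"])


def encode_status(status):
    """Return the integer that stands in for the status string."""
    return STATUS_KEYS[status]


# Comparing two documents now compares two integers instead of two strings.
same_status = encode_status("gold") == encode_status("gold")
```

The string-to-number lookup happens once per value; every later comparison works on the integers only.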
Decide what should happen when a specific document does not have a value for a particular property. If you would like the property to be empty in that case (for the definition of "empty" that applies to the type of the property), then you don't need to do anything. If, however, you would like a specific "default" value to be inserted in that case, you can specify the default attribute in the appropriate <property> container.
Taking the example from the previous paragraph: if you would like a document to have a status of "active" when it has no specific value for this property, then you should specify the value active in the default attribute of the <property> container for that property.
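As a sketch of how such a <property> container could be generated, the following Python builds one with the standard library. The attribute names (name, type, value, default) are the ones described above; the property name "status", its type "string" and the helper property_element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def property_element(name, type_, **attributes):
    """Build one <property> element for the <creation> container."""
    element = ET.Element("property", {"name": name, "type": type_})
    for key, value in attributes.items():
        element.set(key, value)
    return element


# Hypothetical "status" property: keyed values, "active" when no value is given.
status = property_element("status", "string", value="keyed", default="active")
status_xml = ET.tostring(status, encoding="unicode")
```

Printing status_xml shows the element as it would appear inside <creation>.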
Once the properties of a document are specified, the text of the document should be considered. Usually a document has a name or a title. Words that occur inside such an area of the text, should usually be considered more important than words that occur in the rest of the text. Words that occur in paragraph headers, should usually also be considered more important.
In the Kasyk configuration XML, one or more <texttype> containers can be specified. They define which areas of text can be expected in the text of documents, and which weight the words inside those areas of text should have (by default).
Now that all of this is figured out, it is possible to create the Kasyk configuration XML that is right, especially the <creation> container. Of course there are many other aspects of the Kasyk configuration XML that haven't been touched yet, but these can be adapted later in the Kasyk configuration XML. Once the configuration with a proper <creation> container is determined, the next step can be taken.
<kasyk:config xmlns:kasyk="http://www.kasyk.org/1.0">
<creation>
<property name="filename" type="string" value="unique"/>
<property name="changed" type="number" hitlist="no"/>
<texttype name="title" weight="2.5" hitlist="yes"/>
</creation>
</kasyk:config>
This configuration is used in the test-suite a lot. It records the filename as the uniquely identifying property of a document. It records the modification time of the document, but doesn't list it in the hitlist. It handles the title of a document specially, by making the words in a title 2.5 times as important as words that occur in the rest of the text. And it returns the title as a property in the hitlist as well.
Now that the initial configuration of your Kasyk index has been figured out, a decision needs to be made where to "store" your Kasyk index on your file system. In practice, 4 to 5 times the amount of space of your original information, needs to be reserved for the Kasyk index (well, compared to the Kasyk document sequence XML, but that will be explained later on). The Kasyk initializer (kasyknew) needs to be able to create a directory and store the Kasyk configuration XML in it. It's as simple as that.
Once the Kasyk index is created, refer to that directory (either absolutely or relatively) to refer to the Kasyk index. They've become synonyms.
$ kasyknew /kasyk/testing testconfig <ENTER>
$
Create a Kasyk index in the "/kasyk" directory called "testing" with the Kasyk initializer (kasyknew) executable. Use the contents of the file "testconfig" as the Kasyk configuration XML. As there are no further messages displayed, creation of the Kasyk index was successful.
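When index creation is part of a larger script, the same invocation can be wrapped in Python. This is a minimal sketch: the command line is the one shown above, the function names are hypothetical, and it assumes that kasyknew reports failure with a non-zero exit status, which this document does not state.

```python
import subprocess


def kasyknew_command(index_dir, config_file):
    """Return the argument list for the kasyknew invocation shown above."""
    return ["kasyknew", index_dir, config_file]


def create_index(index_dir, config_file):
    """Run kasyknew; raises CalledProcessError on a non-zero exit status.

    Assumption: kasyknew signals failure through its exit status.
    """
    subprocess.run(kasyknew_command(index_dir, config_file), check=True)
```

Calling create_index("/kasyk/testing", "testconfig") corresponds to the command above.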
Now that a Kasyk index has been created, the Kasyk document sequence XML needs to be created that corresponds to the information that needs to be indexed. Users of the programming language Perl are in luck: many transformations of commonly used data-formats, such as email, mailboxes, Word files, PDF files and relational databases, have been tackled already using that programming language. In that case it is just a matter of getting the right modules from CPAN.
Users of other programming languages and those who have some very special needs, need (to ask someone to) write a script (or program) that will do the conversion to the Kasyk document sequence XML. But maybe there is someone out there in the world who has tackled this already! Time to subscribe yourself to the Kasyk Users mailing list (users-subscribe@kasyk.org) and see whether one of the other subscribers has tackled this problem before. But of course you should always search the mailing list archive first (http://www.kasyk.org/search/users.html).
<kasyk:docseq xmlns:kasyk="http://www.kasyk.org/1.0">
<document>
<properties>
<filename>foo</filename>
<changed>1027889748</changed>
</properties>
<text>
<title>This is Foo</title>
This is the text of Foo!
</text>
</document>
<document>
<properties>
<filename>bar</filename>
<changed>1027889824</changed>
</properties>
<text>
<title>This is Bar</title>
This is the text of Bar!
</text>
</document>
</kasyk:docseq>
Of course, this is a very simple example, but it should get the idea across. Such a document sequence is usually stored in a file (which however, is not necessary).
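For those not using Perl, a document sequence like the one above can also be produced with a few lines of Python and the standard library. This is a sketch under assumptions: the record layout (dicts with filename, changed, title and body) and the name build_docseq are hypothetical, while the element names are taken from the example.

```python
import xml.etree.ElementTree as ET

KASYK_NS = "http://www.kasyk.org/1.0"
ET.register_namespace("kasyk", KASYK_NS)


def build_docseq(records):
    """Build document sequence XML from dicts describing each document."""
    root = ET.Element(f"{{{KASYK_NS}}}docseq")
    for record in records:
        document = ET.SubElement(root, "document")
        properties = ET.SubElement(document, "properties")
        ET.SubElement(properties, "filename").text = record["filename"]
        ET.SubElement(properties, "changed").text = str(record["changed"])
        text = ET.SubElement(document, "text")
        title = ET.SubElement(text, "title")
        title.text = record["title"]
        # The body follows the closing </title> tag inside <text>.
        title.tail = record["body"]
    return ET.tostring(root, encoding="unicode")


docseq_xml = build_docseq([
    {"filename": "foo", "changed": 1027889748,
     "title": "This is Foo", "body": "This is the text of Foo!"},
    {"filename": "bar", "changed": 1027889824,
     "title": "This is Bar", "body": "This is the text of Bar!"},
])
```

Building the XML through a library rather than by string concatenation means characters such as & and < in the source text are escaped automatically.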
Now that a Kasyk index has been configured and a matching document sequence has been created, the "indexing" can start. Indexing is the process in which the documents are taken apart by Kasyk and stored in a special format inside the index directory. This indexing process needs to be done only once for each document. Or more specifically, once each time a document is updated.
$ kasykindex /kasyk/testing sample
$
Index the Kasyk document sequence XML in the file "sample" into the Kasyk index "/kasyk/testing". Please note that if the document sequence is very small, no message will be shown at all. Larger document sequences will cause Kasyk to report status every now and then until indexing has been completed.
$ myconversionprogram | kasykindex /kasyk/testing
$
Index the standard output of "myconversionprogram" into the Kasyk index "/kasyk/testing". Please note that in this case the Kasyk document sequence XML only exists while the indexing is taking place. It therefore does not take up any additional diskspace.
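A conversion program used in such a pipe only has to write the document sequence to standard output. The following Python sketch shows one way to do that while writing one document at a time. The record layout and the name write_docseq are hypothetical; the element names follow the document sequence example shown earlier.

```python
import sys
from xml.sax.saxutils import escape


def write_docseq(records, out=sys.stdout):
    """Write documents one by one, so the whole sequence never sits in memory."""
    out.write('<kasyk:docseq xmlns:kasyk="http://www.kasyk.org/1.0">\n')
    for record in records:
        out.write("<document>\n<properties>\n")
        out.write(f"<filename>{escape(record['filename'])}</filename>\n")
        out.write(f"<changed>{int(record['changed'])}</changed>\n")
        out.write("</properties>\n<text>\n")
        out.write(f"<title>{escape(record['title'])}</title>\n")
        # escape() replaces &, < and > so the output stays well-formed.
        out.write(escape(record["body"]) + "\n")
        out.write("</text>\n</document>\n")
    out.write("</kasyk:docseq>\n")
```

Because records can be any iterable, including a generator that reads from a database cursor, the script can feed the indexer without creating an intermediate file.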
The Kasyk query XML of a query can be very simple, or it can get very complex if complex Kasyk constraint expressions are used.
<kasyk:query>foo</kasyk:query>
This searches for the word "foo". As the type of search is not specified, it is determined by the setting of the type attribute of the <searching> container in the Kasyk configuration XML. If that is not set either, then the search will become "fuzzy" if "fuzzy" searches are possible, else it will become "exact". The type of search that was performed can be determined by the type attribute in the Kasyk hitlist XML.
Please check out the Kasyk query XML later to find out all of the possibilities when specifying a query.
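A simple query can also be generated programmatically. The Python sketch below wraps a search string in a query element and declares the namespace in the same way as the other Kasyk XML in this document. That the query is accepted in exactly this form is an assumption, and build_query is a hypothetical name.

```python
import xml.etree.ElementTree as ET

KASYK_NS = "http://www.kasyk.org/1.0"
ET.register_namespace("kasyk", KASYK_NS)


def build_query(words):
    """Wrap a search string in a namespaced query element."""
    element = ET.Element(f"{{{KASYK_NS}}}query")
    element.text = words  # special characters are escaped on output
    return ET.tostring(element, encoding="unicode")


query_xml = build_query("foo")
```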
The simplest way of obtaining a Kasyk hitlist XML for a specific Kasyk query XML, is by calling the Kasyk searcher (kasyk) executable.
$ kasyk /kasyk/testing <ENTER>
<kasyk:query>foo</kasyk:query> <ENTER>
<kasyk:hitlist xmlns:kasyk="http://www.kasyk.org/1.0">
<header type="fuzzy" hits="1" first="1" last="1"
updated="1049397396" pass1="1" documents="2">
<note class="Info" id="NoNamespace">
Namespace specification missing, assuming 'http://www.kasyk.org/1.0'
</note>
</header>
<hit ordinal="1">
<preview> <b>Foo</b> This is the text of <b>Foo!</b></preview>
<properties>
<filename>foo</filename>
<title>This is Foo</title>
</properties>
</hit>
</kasyk:hitlist>
$
A very simple example in which the Kasyk query XML is entered interactively on the command line. Usually it doesn't happen this way, but for the sake of the demonstration, we've done it this way here.
In this query a search for "foo" is done in a fuzzy manner (since there is no type of search specified and there is no default type of search specified in the Kasyk configuration XML either). Note that there is only one document found, and that there were 2 documents in the Kasyk index in total (the documents attribute in the <header> container).
Because no namespace was specified in the query, a warning is added to the Kasyk hitlist XML to indicate that "http://www.kasyk.org/1.0" has been assumed. In any situation other than testing, it's always wise to specify the namespace, as it will ensure that the version of the Kasyk searcher (kasyk) will understand the version of the Kasyk query XML.
In this example whitespace has been added to the Kasyk hitlist XML for clarity.
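An application will usually parse the hitlist rather than display it. The following Python sketch reads a hitlist shaped like the one above with the standard library. The sample is abbreviated (the <note> is left out), and parse_hitlist and the layout of its return value are hypothetical.

```python
import xml.etree.ElementTree as ET

KASYK_NS = "http://www.kasyk.org/1.0"

# Abbreviated copy of the hitlist above; the <note> is omitted for brevity.
SAMPLE_HITLIST = """\
<kasyk:hitlist xmlns:kasyk="http://www.kasyk.org/1.0">
<header type="fuzzy" hits="1" first="1" last="1"
 updated="1049397396" pass1="1" documents="2"/>
<hit ordinal="1">
<preview> <b>Foo</b> This is the text of <b>Foo!</b></preview>
<properties>
<filename>foo</filename>
<title>This is Foo</title>
</properties>
</hit>
</kasyk:hitlist>
"""


def parse_hitlist(xml_text):
    """Extract the header counts and the hits from hitlist XML."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{KASYK_NS}}}hitlist":
        raise ValueError(f"unexpected root element: {root.tag}")
    header = root.find("header")
    results = []
    for hit in root.findall("hit"):
        properties = {child.tag: child.text for child in hit.find("properties")}
        # itertext() drops the <b> highlighting and keeps only the text.
        preview = "".join(hit.find("preview").itertext()).strip()
        results.append({"ordinal": int(hit.get("ordinal")),
                        "preview": preview,
                        "properties": properties})
    return {"type": header.get("type"),
            "hits": int(header.get("hits")),
            "documents": int(header.get("documents")),
            "results": results}


result = parse_hitlist(SAMPLE_HITLIST)
```

The type value in the result tells the application which kind of search was actually performed.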
Now that it is verified that there is a working Kasyk index and that it is possible to get Kasyk hitlist XML from it, it is time to consider performance. Generally, if a Kasyk index only needs to process a Kasyk query XML once an hour, then using Kasyk searcher (kasyk) for that lone query poses no performance problems. If a Kasyk index needs to process a Kasyk query XML once a minute or more, then it is generally advisable to start a Kasyk server (kasykd) for it. Anything in between these two extremes could work without problems with either solution.
In order to run a Kasyk server (kasykd), decide on the "location" where the Kasyk server (kasykd) should be running. Currently, a location is an IP-number and a port number, separated by a colon (:). If the Kasyk server (kasykd) is only to be accessible from the same computer, the special IP-number 127.0.0.1 (also known as "localhost") can be used. If the Kasyk server (kasykd) needs to be accessible from other hosts, then another IP-number needs to be specified (one that is of course "owned" by the computer on which the Kasyk server (kasykd) is to run).
Also decide on a port number on which the Kasyk server (kasykd) will be running. There is no recommended port number for Kasyk servers (yet), so pick a port number that you are sure will not interfere with anything else. Please note that for any port number < 1024, it is necessary to have root privileges on the host running Kasyk server (kasykd). And if that Kasyk server is not to be accessible from outside your internal network, make sure that any connections on that port number are restricted by the configuration of the firewall(s).
Put the IP-number and port number together in a string, separated by a colon (:). If the IP-number and the colon are omitted, the IP-number 127.0.0.1 (localhost) will be assumed. This will be the location on which the Kasyk server (kasykd) will be running.
Specify the location in the location attribute of the <searching> container of the Kasyk configuration XML or with the --location option when starting Kasyk server (kasykd).
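The location rules above can be summarised in a short Python sketch. It only illustrates the format described in this document and is not the parser used by kasykd; the names parse_location and needs_root are hypothetical.

```python
def parse_location(location):
    """Split "ip:port" or "port" into (ip, port); the IP defaults to 127.0.0.1."""
    ip, separator, port_text = location.rpartition(":")
    if not separator:
        # IP-number and colon omitted: localhost is assumed.
        ip = "127.0.0.1"
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return ip, port


def needs_root(port):
    """Ports below 1024 need root privileges on the host, as noted above."""
    return port < 1024
```

For example, parse_location("3333") yields the IP-number 127.0.0.1 with port 3333.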
$ kasykd /kasyk/testing <ENTER>
$
No message means success.
$ kasykd --location="3333" /kasyk/testing <ENTER>
$
Again, no message means success. In this case the server will start running on the IP-number "127.0.0.1" and use port "3333".
$ kasykd --server /kasyk/testing <ENTER>
<kasyk:server xmlns:kasyk="http://www.kasyk.org/1.0"
location="127.0.0.1:3333"
pid="24742"/>
$
If correct Kasyk server info XML is output, then the Kasyk server (kasykd) is running correctly. Please note that the XML has had whitespace added for clarity.
There are basically four situations where it is handy to use the Kasyk caching query server (kasykcqd).
With a search service that allows for a "window" on the result of a query (by first offering hits 1 through 10, then 11 through 20, etc.), redoing the query each time needlessly spends resources, because the entire result will always be calculated again, even if only 10 of its hits are presented to the user.
By using the Kasyk caching query server (kasykcqd) as a front end for a search engine, a Kasyk query XML will only be done once and its resulting Kasyk hitlist XML will be kept in RAM. Each time the window is moved (for instance if the user requests the "next" 10 hits), the hits will be obtained from the cache in RAM rather than by redoing the query. Needless to say, this is a lot faster than redoing the query, and it uses far fewer resources.
If a Kasyk index needs to be accessible as a search service but is not accessed often enough to warrant running a separate Kasyk server (kasykd) for it (particularly from the standpoint of RAM usage), while the speed of a Kasyk server is still necessary, then it is advisable to use a so-called "ad hoc" Kasyk server. Instead of specifying the location of a local Kasyk server, specify the local Kasyk index (directory) itself.
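The idea of serving windows from a cached result can be sketched in a few lines of Python. This illustrates the principle only and says nothing about how kasykcqd is implemented; the class HitlistCache and the stand-in search function are hypothetical.

```python
class HitlistCache:
    """Run each query once and serve every later window from memory."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable: query text -> list of hits
        self._results = {}

    def window(self, query, first, last):
        """Return hits first..last (1-based, inclusive)."""
        if query not in self._results:
            self._results[query] = self._run_query(query)
        return self._results[query][first - 1:last]


calls = []


def fake_search(query):
    """Stand-in for a real search; records how often it is invoked."""
    calls.append(query)
    return [f"{query}-hit-{n}" for n in range(1, 26)]


cache = HitlistCache(fake_search)
page_one = cache.window("foo", 1, 10)
page_two = cache.window("foo", 11, 20)  # served from memory, no second search
```

A real cache would also need to expire entries, for instance when the index is updated; that is left out here.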
When a query comes in with the Kasyk caching query server (kasykcqd), a single Kasyk searcher (kasyk) will be started by the Kasyk caching query server (kasykcqd). Communication between the Kasyk caching query server (kasykcqd) and the Kasyk searcher (kasyk) is done by bi-directional pipes rather than sockets. This means that unlike the Kasyk server (kasykd), there is no asynchronous processing of multiple queries (which is generally not a problem in such a situation). But there is the advantage of Kasyk searcher (kasyk) not having to re-read the necessary contents of the Kasyk index each time a query is made. After a certain amount of time of no queries coming in for that Kasyk index, the pipes are closed and the Kasyk searcher (kasyk) for that index will stop and free all the resources it was using.
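The life cycle described above (start on the first query, keep running while queries arrive, stop after an idle period) can be sketched in Python as follows. The pipes themselves are not modelled; the class AdHocSearcher, its parameters and the idle period are hypothetical and only illustrate the behaviour.

```python
import time


class AdHocSearcher:
    """Start a searcher on first use and stop it after an idle period."""

    def __init__(self, start, stop, idle_seconds, clock=time.monotonic):
        self._start = start              # callable that starts the searcher
        self._stop = stop                # callable that stops it again
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._handle = None
        self._last_used = None

    def query(self, text):
        """Answer a query, starting the searcher if it is not running."""
        if self._handle is None:
            self._handle = self._start()
        self._last_used = self._clock()
        return self._handle(text)

    def reap_if_idle(self):
        """Stop the searcher when no query arrived within the idle period."""
        if self._handle is None:
            return False
        if self._clock() - self._last_used < self._idle_seconds:
            return False
        self._stop(self._handle)
        self._handle = None
        return True
```

Queries are handled one at a time, which matches the remark above that there is no asynchronous processing in this setup.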
As with the Kasyk server (kasykd), specify a "location" on which the Kasyk caching query server (kasykcqd) will be running. Create a Kasyk caching query server configuration XML that specifies which Kasyk providers will be accessed by which index names. After that, starting the Kasyk caching query server (kasykcqd) is pretty easy.
$ kasykcqd config.xml <ENTER>
$
No message means success.
$ kasykcqd --location="3334" config.xml <ENTER>
$
Again, no message means success. In this case the server will start running on the IP-number "127.0.0.1" and use port "3334".
$ kasykcqd --server config.xml <ENTER>
<kasyk:server xmlns:kasyk="http://www.kasyk.org/1.0"
location="127.0.0.1:3334"
pid="24745"/>
$
If correct Kasyk server info XML is output, then the Kasyk caching query server (kasykcqd) is running correctly. Please note that the XML has had whitespace added for clarity.
The final step is of course integrating Kasyk into your application. Many programming languages / environments nowadays support XML in various ways. You may be comfortable with one of them already. Using the programming language / environment you're most comfortable with is always the best way to go, unless it somehow limits you in what you are trying to achieve.
If you don't (want to) know (more) about XML, and you are more or less proficient in Perl, please use the Kasyk.pm Perl modules from CPAN. With them it is possible to use Kasyk without having to really know anything about XML in general, or the Kasyk XML in particular.
See http://www.kasyk.nl/flow.html for the most up-to-date version of this information.
Copyright © 2003 Dijkmat BV
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Kasyk Information: Kasyk version 1.0.0, generated on Tue Nov 25 12:09:47 2003.