kasykfuture - The future of Kasyk
This document describes future developments of the Kasyk search engine software itself. You may, however, be more interested in the vision behind Kasyk.
The way queries can currently be specified still leaves room for improvement.
Currently it is impossible to search within the documents referred to in a Kasyk hitlist XML. It can be done externally in different ways (the in() function in Kasyk constraint expressions was developed particularly for a similar purpose), but a more generic way should be made available.
One way this is envisaged is by using a specialized "bitmap" feature if you will. The Kasyk hitlist XML should output a specially constructed sequence of characters describing the time the index was last "updated" and the "docnum" values of the documents in the hitlist, which we will refer to as the hitlist "bitmap" (for lack of a better word). This "bitmap" should then later be usable as part of a constraint.
The reason the "updated" value is part of this "bitmap" is that incremental indexing can change the "docnum" value of a document. This could inadvertently cause documents to be excluded from the query. An incremental indexing run on the Kasyk index changes the "updated" value of the index, and thereby invalidates any hitlist bitmaps that were issued previously.
Nearly all of the elements needed for this feature are already available. Only one is missing: the ability to use the "docnum" of a document as part of a constraint.
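The format of the hitlist "bitmap" has not been decided. Purely as an illustration of the idea, the following Python sketch packs an "updated" stamp and a set of "docnum" values into one character sequence, and refuses to unpack it when the index has been updated since. The function names, the "stamp:payload" layout and the use of base64 are assumptions made for this example; none of this is part of Kasyk. Docnum values are assumed to be non-negative integers.

```python
import base64


def encode_hitlist_bitmap(updated: int, docnums: list[int]) -> str:
    """Pack the index 'updated' stamp and the hit docnums into one token.

    Hypothetical layout: '<updated>:<base64 of a bit array>', where bit n
    is set when docnum n is part of the hitlist.
    """
    # One bit per docnum up to the highest docnum in the hitlist.
    bits = bytearray(max(docnums, default=-1) // 8 + 1)
    for n in docnums:
        bits[n // 8] |= 1 << (n % 8)
    payload = base64.urlsafe_b64encode(bytes(bits)).decode("ascii")
    return f"{updated}:{payload}"


def decode_hitlist_bitmap(token: str, current_updated: int) -> set[int]:
    """Return the docnums in the token, or raise if the index has changed."""
    stamp, _, payload = token.partition(":")
    if int(stamp) != current_updated:
        # Incremental indexing may have renumbered documents, so the
        # docnums in this token can no longer be trusted.
        raise ValueError("stale hitlist bitmap: index was updated after it was issued")
    bits = base64.urlsafe_b64decode(payload.encode("ascii"))
    return {
        i * 8 + b
        for i, byte in enumerate(bits)
        for b in range(8)
        if byte & (1 << b)
    }
```

A constraint on "docnum" would then only need to test membership in the decoded set. A bit array is compact for dense hitlists; for sparse hitlists over a large index, a sorted list of docnums would probably be smaller.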
Kasyk currently supports synonyms only very indirectly. The <any> container in Kasyk query XML could be considered a way of specifying synonyms. In addition, words with accented characters (as specified with the <exact> and <fuzzy> containers in the Kasyk configuration XML) are already handled internally as synonyms.
It should become possible in Kasyk query XML to specify synonymous words and specify the weight of each word relative to the "main" word.
Currently Kasyk supports fuzzy searching by matching patterns (actually three-character sequences called "trigrams") rather than words. Although this works very well in most cases, especially when enhanced by exact word support, there are borderline cases in which words that are similar as far as pattern matching is concerned do not look similar to a reader of the natural language.
Other search engine software has often resorted to the "stemming" of words to simulate fuzzy searching. Although this has the advantage of being a lot less resource intensive, it has the disadvantage of being natural language centric (whereas the pattern matching approach is completely natural language independent: it doesn't even know the concept of a "word").
It would seem that a "best of both worlds" situation could be created with Kasyk. Since Kasyk keeps a dictionary of words that have been encountered in the documents that were indexed, it can use that dictionary to create yet another Kasyk index (a sub-index if you will) in which each "document" consists of a single word. That sub-index would then be used to "fuzzy" search all of the words in the Kasyk query XML and generate a list of synonymous words from that, and then use "exact" searching on those words.
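The sub-index idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the way words are padded and split into trigrams, the Jaccard similarity measure, the 0.5 threshold and all function names are choices made for this example, and do not describe how Kasyk builds or scores trigrams.

```python
from collections import defaultdict


def trigrams(word: str) -> set[str]:
    """Split a word into three-character sequences (padding is an assumption)."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_word_index(dictionary: list[str]) -> dict[str, set[str]]:
    """Build the 'sub-index': each dictionary word is treated as one document."""
    index: dict[str, set[str]] = defaultdict(set)
    for word in dictionary:
        for gram in trigrams(word):
            index[gram].add(word)
    return index


def expand(word: str, index: dict[str, set[str]], threshold: float = 0.5):
    """Fuzzy-search the sub-index and return (word, similarity) pairs.

    The result is the list of words that would then be searched "exactly".
    """
    query = trigrams(word)
    candidates: set[str] = set()
    for gram in query:
        candidates |= index.get(gram, set())
    scored = []
    for cand in candidates:
        grams = trigrams(cand)
        # Jaccard similarity: shared trigrams over all trigrams of both words.
        score = len(query & grams) / len(query | grams)
        if score >= threshold:
            scored.append((cand, score))
    # Best match first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


index = build_word_index(["index", "indexes", "indexing", "query", "queries"])
for match, score in expand("indexs", index):
    print(f"{match}\t{score:.3f}")
```

The similarity value of each match could double as the weight of that word relative to the word in the query, which ties in with the weighted synonyms described earlier.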
Of course, there is always room for improvement. Of the issues that are currently open, a number are technical in nature, some are managerial, and some are both.
Kasyk currently offers an incremental indexing feature, but it needs access to a local Kasyk index directory. Indexing can only be performed by one process at a time, and starting up an indexing operation carries significant overhead. Apart from the Kasyk indexer (kasykindex) executable, a Kasyk caching indexing server (kasykcid) needs to be developed, which will act as an intermediary for the indexer, much like the Kasyk caching query server (kasykcqd) does for the Kasyk searcher (kasyk).
This Kasyk caching indexing server (kasykcid) will allow multiple external connections to feed it Kasyk document sequence XML. At regular intervals (or when the received data exceeds a threshold), a Kasyk indexer (kasykindex) process is started and the received data is indexed. This can be especially useful in logging situations of any kind. Of course, access control for the Kasyk caching indexing server (kasykcid) is even more important than it is for the Kasyk caching query server (kasykcqd).
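The batching behaviour described here (index when enough data has arrived, or when the oldest data has waited long enough) can be sketched as follows in Python. Since kasykcid does not exist yet, everything in this example is hypothetical: the class name, the default thresholds and the flush callback. In a real server the callback would be the place to start a kasykindex process on the collected data; its invocation is deliberately left out here. The sketch is single-threaded and ignores networking and access control.

```python
import time
from typing import Callable


class IndexingBuffer:
    """Collect document sequence XML fragments and hand them over in batches."""

    def __init__(
        self,
        flush: Callable[[list[bytes]], None],
        max_bytes: int = 1_000_000,   # assumed size threshold
        max_age: float = 60.0,        # assumed interval in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_bytes = max_bytes
        self._max_age = max_age
        self._clock = clock
        self._chunks: list[bytes] = []
        self._size = 0
        self._oldest: float | None = None  # arrival time of the first chunk

    def add(self, xml_fragment: bytes) -> None:
        """Store one fragment received from a client connection."""
        if self._oldest is None:
            self._oldest = self._clock()
        self._chunks.append(xml_fragment)
        self._size += len(xml_fragment)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush when the size or age threshold is reached; call periodically."""
        if not self._chunks:
            return False
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._oldest >= self._max_age
        if not (too_big or too_old):
            return False
        batch, self._chunks, self._size, self._oldest = self._chunks, [], 0, None
        self._flush(batch)  # e.g. write the batch out and start one indexer run
        return True
```

Because only one indexing process may run at a time, handing over whole batches from a single place keeps the startup overhead to one indexer run per batch rather than one per client.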
See the file TODO in the distribution for a more complete and up-to-date list of things that are still considered worthwhile to do in Kasyk.
If you have any ideas about the further development of Kasyk, please do not hesitate to send them to ideas@kasyk.org.
Kasyk home, vision behind Kasyk, Kasyk introduction, Kasyk flow of information, Kasyk history.
See http://www.kasyk.nl/future.html for the most up-to-date version of this information.
Copyright © 2003 Dijkmat BV
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Kasyk Information: Kasyk version 1.0.0, generated on Tue Nov 25 12:09:47 2003.