NAME

kasykfuture - The future of Kasyk


DESCRIPTION

This document describes planned developments of the Kasyk search engine software itself. For the broader picture, you may be more interested in the vision behind Kasyk.


Improvements to the basic Kasyk functionality

The way queries can currently be specified still leaves room for improvement.

Searching within the result of a query

Currently it is impossible to search within the documents referred to in a Kasyk hitlist XML. It can be done externally in different ways (the in() function in Kasyk constraint expressions was developed particularly for a similar purpose), but a more generic way should be made available.

One way to do this is a specialized "bitmap" feature. The Kasyk hitlist XML would output a specially constructed sequence of characters that encodes the time the index was last "updated" and the "docnum" values of the documents in the hitlist. We will refer to this sequence as the hitlist "bitmap" (for lack of a better word). The "bitmap" could then be used later as part of a constraint.

The "updated" value is part of the "bitmap" because incremental indexing can change the "docnum" value of a document, which would inadvertently cause documents to be excluded from the query. An incremental indexing run changes the "updated" value of the Kasyk index and thereby invalidates any hitlist bitmaps that were issued previously.

Almost all of the elements needed for this feature are already available. The one missing piece is the ability to use the "docnum" of a document as part of a constraint.
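
The idea can be illustrated with a minimal Python sketch. It is not Kasyk code: the function names, the token layout (timestamp, colon, compressed bit set) and the use of a Unix timestamp for "updated" are all assumptions made for illustration only.

```python
import base64
import zlib


def encode_bitmap(updated, docnums):
    """Pack the index "updated" time and the hit docnums into one token.

    Hypothetical layout: b"<updated>:" followed by a zlib-compressed bit set
    in which bit n is set when docnum n is part of the hitlist.
    """
    size = max(docnums, default=-1) + 1
    bits = bytearray((size + 7) // 8)
    for n in docnums:
        bits[n // 8] |= 1 << (n % 8)
    payload = f"{updated}:".encode("ascii") + zlib.compress(bytes(bits))
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_bitmap(token, current_updated):
    """Return the set of docnums, or raise if the index changed since."""
    payload = base64.urlsafe_b64decode(token.encode("ascii"))
    stamp, _, packed = payload.partition(b":")
    if int(stamp) != current_updated:
        # An incremental indexing run may have renumbered documents.
        raise ValueError("stale hitlist bitmap: index was updated")
    bits = zlib.decompress(packed)
    return {i for i in range(len(bits) * 8) if bits[i // 8] & (1 << (i % 8))}
```

A constraint on "docnum" would then only need to test membership in the decoded set, and a token issued before an incremental indexing run would be rejected instead of silently matching the wrong documents.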

Specifying synonyms of words

Kasyk currently supports synonyms only very indirectly. The <any> container in Kasyk query XML could be considered a way of specifying synonyms. In addition, words with accented characters (as specified with the <exact> and <fuzzy> containers in the Kasyk configuration XML) are already handled internally as synonyms.

It should become possible in Kasyk query XML to specify synonymous words and specify the weight of each word relative to the "main" word.
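
What relative weights could mean in practice is sketched below in Python. This is only an assumption about one possible scoring rule (the "main" word counts as 1.0, each synonym as its given fraction); the function name and the data layout are hypothetical and do not describe existing Kasyk behaviour.

```python
def score_document(doc_words, main_word, synonyms):
    """Sum per-word weights over a document.

    synonyms maps each synonymous word to its weight relative to the
    "main" word, which always has weight 1.0 (an assumed convention).
    """
    weights = {**synonyms, main_word: 1.0}
    return sum(weights.get(word, 0.0) for word in doc_words)
```

For example, with "car" as the main word and "auto" weighted at 0.5, a document containing "car" twice and "auto" once would score 2.5 under this rule.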

Alternate ways of fuzzy searching

Currently Kasyk supports fuzzy searching by matching patterns (actually three-character sequences called "trigrams") rather than words. Although this works very well in most cases, especially when enhanced by exact word support, there are borderline cases where words that are similar in terms of pattern matching do not look similar to a reader of the natural language.

Other search engine software has often resorted to the "stemming" of words to simulate fuzzy searching. Although this has the advantage of being a lot less resource intensive, it has the disadvantage of being natural language centric, whereas the pattern matching approach is completely natural language independent: it doesn't even know the concept of a "word".

It would seem that a "best of both worlds" situation could be created with Kasyk. Since Kasyk keeps a dictionary of the words encountered in the indexed documents, it can use that dictionary to create yet another Kasyk index (a sub-index) in which each "document" consists of a single word. That sub-index would then be used to "fuzzy" search each of the words in the Kasyk query XML and generate a list of synonymous words, which would in turn be looked up with "exact" searching.
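
The following Python sketch illustrates the sub-index idea under stated assumptions. It is not how Kasyk builds or scores trigrams: the padding of words with spaces, the Jaccard similarity measure and the threshold value are choices made here for illustration, and all function names are hypothetical.

```python
def trigrams(word):
    """Return the set of three-character sequences of a padded word."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_subindex(dictionary):
    """Map each trigram to the dictionary words that contain it."""
    index = {}
    for word in dictionary:
        for gram in trigrams(word):
            index.setdefault(gram, set()).add(word)
    return index


def expand(term, index, threshold=0.5):
    """Fuzzy-match one query term against the word sub-index.

    Returns the dictionary words whose trigram sets are similar enough
    (Jaccard similarity >= threshold) to be searched for exactly.
    """
    grams = trigrams(term)
    shared_counts = {}
    for gram in grams:
        for word in index.get(gram, ()):
            shared_counts[word] = shared_counts.get(word, 0) + 1
    matches = []
    for word, shared in shared_counts.items():
        union = len(grams) + len(trigrams(word)) - shared
        if shared / union >= threshold:
            matches.append(word)
    return sorted(matches)
```

With a dictionary of "search", "searching", "searches", "march" and "index", expanding the term "search" yields the first three words, which could then be combined in a single exact query. The expensive pattern matching runs against the comparatively small dictionary rather than against the documents themselves.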


Technical and managerial issues

Of course, there is always room for improvement. Some of the issues that are currently open are technical in nature, some are managerial, and some are both.

More platforms need to be able to run Kasyk

Currently, Kasyk is only tested on Linux-based platforms. Kasyk's predecessor has been running on HP-UX and Solaris platforms, but this has not yet been confirmed for Kasyk. Emerging platforms such as Mac OS X need to be supported as well. Some work has been done to have Kasyk's predecessor run under Windows, but that area definitely needs a thorough investigation for Kasyk.

Access control

Currently, any host that can open a socket to a Kasyk server can run queries on that server. This leaves access control to local or remote firewall configuration. It might be necessary to create some sort of access control mechanism in Kasyk itself.

Shared libraries

Kasyk currently does not use shared libraries for its own library. This has been a more or less conscious choice, made to make it easier to have multiple versions of Kasyk co-exist on the same host. But shared libraries may have enough to offer in resource sharing to offset the possible confusion when different versions of Kasyk are being used simultaneously.

Continuous incremental indexing (from external sources)

Kasyk currently offers an incremental indexing feature, but it needs access to a local Kasyk index directory. Indexing can only be performed by one process at a time, and starting up an indexing operation has a significant overhead. In addition to the Kasyk indexer (kasykindex) executable, a Kasyk caching indexing server (kasykcid) needs to be developed, which will act as an intermediary, much like the Kasyk caching query server (kasykcqd) does for the Kasyk searcher (kasyk).

This Kasyk caching indexing server (kasykcid) will allow multiple external connections to feed it Kasyk document sequence XML. At regular intervals (or when the received data exceeds a threshold) a Kasyk indexer (kasykindex) process is started and the received data is indexed. This can be especially useful in logging situations of any kind. Of course, access control for the Kasyk caching indexing server (kasykcid) is even more important than it is for the Kasyk caching query server (kasykcqd).
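
The batching behaviour described above (index when an interval has passed or when a size threshold is exceeded) could look roughly like the Python sketch below. Since kasykcid does not exist yet, everything here is hypothetical: the class name, the parameter names and their defaults are assumptions, and the flush callback merely stands in for starting a Kasyk indexer (kasykindex) process.

```python
import time


class IndexBuffer:
    """Collect document XML chunks and hand them over in batches."""

    def __init__(self, flush, max_bytes=1_000_000, max_age=60.0,
                 clock=time.monotonic):
        self._flush = flush          # called with the list of buffered chunks
        self._max_bytes = max_bytes  # size threshold that triggers indexing
        self._max_age = max_age      # seconds the oldest chunk may wait
        self._clock = clock
        self._chunks = []
        self._size = 0
        self._oldest = None

    def feed(self, xml_chunk):
        """Accept data from any connection; flush if a limit is reached."""
        if self._oldest is None:
            self._oldest = self._clock()
        self._chunks.append(xml_chunk)
        self._size += len(xml_chunk)
        self.poll()

    def poll(self):
        """Flush when the buffer is too big or too old; return True if so."""
        if not self._chunks:
            return False
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._oldest >= self._max_age
        if not (too_big or too_old):
            return False
        batch, self._chunks = self._chunks, []
        self._size, self._oldest = 0, None
        self._flush(batch)
        return True
```

A server loop would call feed() for every chunk received on any connection and poll() from a timer, so that the start-up overhead of an indexing operation is paid once per batch rather than once per document.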


And more...

See the file TODO in the distribution for a more complete and up-to-date list of things that are still considered worthwhile to do in Kasyk.

If you have any ideas about the further development of Kasyk, please do not hesitate to send them to ideas@kasyk.org.


SEE ALSO

Kasyk home, vision behind Kasyk, Kasyk introduction, Kasyk flow of information, Kasyk history.

See http://www.kasyk.nl/future.html for the most up-to-date version of this information.


COPYRIGHT

Copyright © 2003 Dijkmat BV

This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Kasyk Information: Kasyk version 1.0.0, generated on Tue Nov 25 12:09:47 2003.