Kasyk Source Internals Glossary
This document attempts to provide a glossary of the terms that are used in the source of the Kasyk search engine software. It is by no means complete yet, so additions / corrections / further clarifications are always welcome.
Some of the descriptions here might actually be meaningful for people other than the Kasyk development team.
An alphabetical list of terms used with the Kasyk search engine software.
The term "ad hoc" is usually used in conjunction with server. An "ad hoc" Kasyk server is a copy of kasyk that is started by kasykcqd whenever a query comes in that needs to be handled by a local index directory for which there is no kasykd running locally.
It provides the "best of both worlds" with regards to resource usage, at the expense of a little more delay the first time a query comes in for that "ad hoc" server.
An attribute is a piece of extra information that is associated with an XML Gcontainer and is identified by a name and a string between quotes. In the container <kasyk:hitlist hits="42">, "hits" is the name of the attribute and the value of the attribute is "42".
The barn is a set of files that contain a copy of all document XML that was received during indexing. It is used for preview extraction as well as retrieving information for ranking purposes.
An as yet not implemented version of kasykindex in which it is possible to deliver document sequences for indexing on a (remote) socket rather than as a locally available file.
A version of the Kasyk server kasykd that caches the result of a query so that the same query does not need to use resources for searching again. This is particularly important if paging through a hitlist at e.g. 10 hits at a time. The caching query server is also known as kasykcqd.
A chunk is a partial document list (a list of document ids associated with a texttype and a word or a trigram. Chunks are cached in memory as long as possible and flushed out to disk when they're full or when room needs to be made in the cache for other chunks of document lists of more frequently occurring words or trigrams.
See configuration file.
The configuration file is a file that should contain a <kasyk:config> XML container. It is needed when an index is created during indexing.
See Kasyk configuration XML for the complete definition.
A constraint in the context of the Kasyk search engine software, is a condition that is applied to all documents in a Kasyk index when searching. Only when the evaluation of the condition results in a true value for a particular document, will the contents of that document be considered for further investigation and if there is sufficient score for the document, be included in the final hitlist.
Conditions (currently) can only be applied to a property. The following property types and values are considered to be true when used in a unary context:
property type considered true if ======================================== flag set (1) number not 0 float not 0 string not empty
Please note that number, float and string types of properties can also be used in binary contexts, e.g. "number == 1" and "datestamp < 20030101".
The basic element> of XML. These are six examples of containers:
<container/> <container></container> <container>text</container>
<container attribute="value"/> <container attribute="value"></container> <container attribute="value">text</container>
Containers can contain text and other containers. Extra information can be added to containers by so called attributes.
Section in the configuration file XML that is checked when an index is created. See also indexing and searching.
A document is said to be marked "deleted" if it should not be checked for any query at all anymore. This can occur by (incrementally) indexing a document that has a property that has a unique value, and the value of that property in that document already occurs in another (already indexed) document in the index. This later document is then considered to be an update of the older document. Since "physical" deletion out of the rather complicated datastructures of an index is basically not possible, a special flag (actually the first in the flags bitmap associated with the older document) is set to indicate that the document has been deleted.
Future versions of Kasyk may actually zero out the copy of the document out of the barn, but removal of words and trigrams from the document lists will never be done: for that you should use kasykopt.
Distributed searching is the process in which more than one Kasyk server is sent a query and the resulting hitlists are combined into a single hitlist, with the appropriate ranking performed.
Being able to do proper distributed searching is high on the wishlist of Kasyk.
Abbreviation of document level only.
Deprecated way of describing a document number.
See document level only.
See document number.
See document list.
See document sequence.
A document is simply the logical unit of a piece of information indexed by the Kasyk search engine software. It does not need to be an actual file, it can just as well be a record in a database. This all depends on the filter programm that was used to create the document XML.
A document is presented to the indexer as a <document> XML container that is part of a document sequence. An example of a document is:
<document> <properties> <myid>42</myid> <!-- uniquely identifies this document --> <example>1</example> <!-- flag indicating this document is an example --> </properties> <text> Hello world! </text> </document>
A <document> container is supposed to contain an <properties> container which describes the properties associated with this document, of which one is usually known to be an property marked unique.
The XML of a document generally doesn't exist in the correct form to be ready for inclusion in a document sequence. Usually an external filter program creates the document XML out of other sources of information.
Deprecated way of describing a document number.
An old way of specifying exact searching.
A list of document ids in ascending order, spread out over one or more chunks. Associated with a texttype and either a word> or a trigram.
Document lists are spread out on disk in semi-random order (depending on many factors), causing a noticable slowdown in document list decoding.
One of the functions of the kasykopt program is to make sure that the chunks of a document list are located in consecutive blocks on disk. This process is also referred to as list straightening.
Document lists are stored in a highly compressed format that takes advantage of the fact that the list of document ids is always in ascending order. The advantage being that in some cases less than 1 bit on average is needed to store the document number of a document (which is otherwise stored in at least a 4-byte integer).
The document number of a document is the ordinal number that is assigned to the document during indexing. The first document in the first document sequence that is presented to an indexer, gets number 1. The second number 2, etc. etc.
Please note that the document number of a document, as perceived by the user, can change when incremental indexing is being performed. If a document contains (at least) one property that is marked unique, then only one document can exist in the index with a particular value of that property. If a second document sequence is delivered to the indexer and that document sequence contains a document with a "unique" property set to a value that already occurs in the index, then the original document is marked as "deleted>". The new version of the same document is stored in the index as if it was a new document, with a new (higher) document id. Thus it will appear to the (external) user that the document number has changed.
Therefore document number values are only guaranteed to be valid until the next (incremental) indexing.
The actual document number of a document is only available as part of the information returned in a hitlist and can as yet not be used as part of a constraint.
A document sequence is basically a <kasyk:docseq> XML container that contains at least one document. It is presented to the indexer which then adds it to a Kasyk index.
An example of a document sequence with 3 documents would be:
<kasyk:docseq> <kasyk:document> <!-- contents of document 1 --> </kasyk:document> <kasyk:document> <!-- contents of document 2 --> </kasyk:document> <kasyk:document> <!-- contents of document 3 --> </kasyk:document> </kasyk:docseq>
See the Kasyk document sequence XML for the complete definition.
One way of describing an XML container.
Exact searching only searches for whole words from the search query in the Kasyk index. Misspelled words will not be found this way, unless some external processes provide stemming or similar actions to provide a list of likely misspellings of a word. See fuzzy searching for an alternate way to allow for finding misspelled words.
With exact searching, locations of words inside a document are recorded, allowing for giving greater importance to words that are closer to each other in a document.
The phrase "exact assist" applies to fuzzy searching only. It increases the score of a document when a word is actually known in the index that matches the sequence of trigrams found.
The "expat" library (as available from http://www.libexpat.org ) is currently used by Kasyk to handle all issues with regards to reading XML. Although an exquisite and very efficient library, it is considered to move to the "libxml2" library (as available from the GNOME project at http://www.gnome.org ) because that library allows for easy interfacing with the "libxslt" library, which in turn would allow for server-side transformations of hitlists.
Moving to "libxml2" would also allow for more character encodings when indexing, removing the need for an extra encoding conversion if the input documents were not already in UTF-8 or ISO-LATIN-1.
A "filter" is an external program that processes externally available information (e.g. a mailbox, set of HTML-files, a number of records out of a relational database) and creates a document sequence out of this information. By indexing this document sequence, a Kasyk index is created or updated. It is then possible to perform searching operations on that index, which effectively allows you to search your original externally available information.
An example of a filter would convert this email message:
From: liz@dijkmat.nl To: you@somewhere.com Subject: Hello world Date: February 2nd, 2003
This is a test of email indexing.
to the following /document:
<kasyk:document> <properties> <date>20030202</date> <from>liz@dijkmat.nl</from> </properties> <text> <to>you@somewhere.com</to> <subject>Hello world</subject> This is a test of email indexing. </text> </kasyk:document>
Several customizable filters are available written in Perl (check out the Kasyk.pm and associated modules on CPAN).
Indication in a query of the ordinal number of the first hit to be returned in the hitlist.
See also last to indicate the ordinal number of the last hit to be returned in the hitlist.
Fuzzy searching only searches for trigrams from the search query in the Kasyk index. It is completely oblivious to the concept of a word, it only recognizes patterns (sequences of characters), of which whitespace (or more accurately a word boundary) is one.
Fuzzy searching allows for automatically finding misspelled words without the need to provide a pre-calculated list of expected misspellings (stemming) but does this at the expense of a very much higher need for resources.
A hit is an XML container that part of a hitlist, which in turn is the result of a searching operation. Each hit in a hitlist refers to a single document in the index.
An example of a hit is:
<hit docnum="42" score="value" percent="value"> <properties> <date>20030202</date> <from>liz@dijkmat.nl</from> </properties> <preview>is a <b>test</b> of email</preview> </hit>
Please note that the <preview> container shows a part of the text that was considered to be significant and reason for the ranking of the document in the hitlist.
See the Kasyk hitlist XML for the complete definition.
A hitlist is the result of the searching process. It basically consists of a <kasyk:hitlist> XML container with 0 or more hits. And example of a hitlist would be:
<kasyk:hitlist> <header> <!-- errors and warnings come here, if any --> </header> <hit docnum="42"> <!-- information about the first hit --> </hit> <hit docnum="24"> <!-- information about the second hit --> </kasyk:hitlist>
See the Kasyk hitlist XML for the complete definition.
The maximum number of hits that actually could be returned in a hitlist. Actual number of hits may be limited by first and last XML attributes.
Incremental indexing adds one or more document sequences to an already existing index. This can be used if new information needs to be added to the index, or if existing information needs to be updated in the index.
Within the context of the Kasyk search engine software, an index is the logical entity in which a compiled, fast searchable version of one or more document sequences is stored. The process is usually referred to as indexing. Once the document sequences have been indexed, is it possible to perform searching operations on the index.
The result of the indexing process, is a number of files which are stored in the index directory.
The index directory is a directory on a file system in which all of the files that make up an index are located. As such, the index directory is the unique identifier for an index.
See index directory.
At the moment there is only one indexing program: kasykindex. This means that only document sequences that are locally available to the indexer, can be processed by the indexer.
It is envisioned that in the future it should be possible to create a caching indexing server that would accept document sequences on a socket and add these to an index immediately, or after an amount of time has elapsed, or when a certain number of documents or bytes have been queued for indexing.
Indexing is the process in which document sequences are processed and a Kasyk index is created. A configuration file, which defines the way the index is structured, is needed when a new Kasyk index is created (with kasyknew). Otherwise specifying the index directory is enough to indicate the index to be indexing for.
If fuzzy searching capabilities are required, then trigrams are extracted from the document sequence. If exact searching is (also) required, then words are (also) extracted from the document sequence.
In all cases, properties are also extracted and stored with the index. The properties of a document are (optionally) returned with the hit in a hitlist and can in many cases also be used to as a constraint when performing searching operations.
There is also an indexing container in the configuration file XML that is checked whenever indexing occurs.
The searching program that allows searching locally available Kasyk indexes. It is run for each query and is therefore inefficient in terms of CPU usage. It is however efficient in terms of memory usage, as it will not take any memory when it is not in use (unlike the server version of this program: Gkasykd).
The kasyk program expects query GXML on stdin (or from a file) and outputs the resulting hitlist XML on stdout. It is intended as the work horse for ad hoc servers and as a debugging tool for developers.
The acronym KASYK stands for "Knowhow About Searching Your Knowledge". It is inspired by Elizabeth Mattijsen's mothers (maiden) name "Ingeborg Rosa Kasik".
The caching query server program. Once started, it keeps running and listens on an address:port combination for incoming connections. Once connected, it expects query XML on its input socket. Once the query is received, it checks whether the same query is already stored in its cache. If it is not in the cache, it then performs a searching operation on the indicated Kasyk provider and stores the resulting hitlist in its cache (possibly removing the results of older queries if the cache has been filled up). Then the hitlist is returned from the cache on the socket (possibly limiting the hitlist to the actual hits requested).
The searching program that allows searching locally available Kasyk indexes by external programs through a socket. It is started once and keeps running until it is told to stop. It is therefore very efficient in terms of CPU usage. It however inefficient in terms of memory usage, as it will use memory even when it is not being used (unlike the "per query" version of this program: kasyk).
The kasykd program expects query XML on its socket and outputs the resulting hitlist XML on the same socket.
The indexing program that takes one or more files with document sequences and creates or updates an index with that.
The Kasyk index optimizer. Deletes documents from the index that have been replaced by other documents during incremental indexing. Also performs list straightening on the document lists to increase performance of document list decoding.
Within the context of the Kasyk search engine software, an property is said to be "keyed" when multiple occurences of the same value are stored only once and assigned a number, which is then associated with a document.
Currently keyed properties can be used in constraints.
Other types of storage of property values are unique and notkeyed.
Indication in a query of the ordinal number of the last hit to be returned in the hitlist.
See also first to indicate the ordinal number of the first hit to be returned in the hitlist.
The list straightening process ensures that the chunks of a document list are located in consecutive blocks on disk. This significantly speeds up decoding the document list of a word or trigram, especially for larger indexes where most of the document lists are spread out over multiple chunks.
This process is performed by the Kasyk Index Optimizer kasykopt.
Either a "server:port" or a local index directory specification of a provider.
The maximum number of hits that could be returned as the result of a query.
The maximum number of hits that can be considered for inclusion in a hitlist before the final ranking. Can never be less than the maxhits value.
Usually the maxpass1hits value in a query is about 5 times the maxhits value. It is als possible to indicate the value "unlimited" to allow considering all possible documents for inclusion in the final hitlist.
See pass1hits in the hitlist to find out how many documents were actually initially considered for inclusion.
Short for "maximum (number of) outstanding queries". Used in the Kasyk caching query server. See also noq.
Short for "(current) number (of) outstanding queries". Used in the Kasyk caching query server. See also moq.
Within the context of the Kasyk search engine software, a property is said to be "notkeyed" when all occurences of values are stored seperately each time, without checking for values that may occur multiple times.
Currently "notkeyed" properties can not be used in constraints. The usefullness of this type of property is typically for information that is handy to have as part of a hit, such as a Subject: of an email message.
Other types of storage of property values are unique and keyed.
The actual number of documents that were considered for inclusion in a hitlist. Can never be more than what was specified with maxpass1hits in a query.
Part of the hit XML container. It contains a representation of the part of the document that is most likely what was searched for with the query.
The process of extracting a preview for a document using the information stored in the barn.
See preview extraction.
A property, in the context of the Kasyk search engine software, is a piece of extra information that is associated with a document. In that context it is specified as an XML container in the <properties> container of a document. A property can be a flag (true/false), a numeric value (0..2G-1), a floating-point value (range and precision determined by the environment for which the software was compiled) and a string (an arbitrary length sequence of Unicode characters that do not contain a null-byte).
See document for an example of the use of Kasyk properties.
A provider, in the context of the Kasyk search engine software, is the specification of a Kasyk server or a local Kasyk index directory. It should be considered the entity that will provide the hitlist associated with a particular query.
Providers can currently only be specified in the configuration of the caching query server, either referencing a kasykd or an ad hoc server.
The "qip" stands for Quantized Index Position. Rather than remembering the position of a word or trigram in a document sequence accurate to the actual byte position, resolution is reduced by dividing the document sequence into "blocks" of X bytes (where X is 128 by default) and using the ordinal number of the block as the position of the word or trigram.
By using qips, it is possible to index document sequences of up to 256 Gbyte before rollover occurs in the 4-byte integers that keep qip positions.
Although this concept was used as a space saving measure, it turned out that the complexity introduced by qips does not warrant the space saving. Future effort will be directed at replacing the qip concept by absolute positions in the document sequence (which is something different than reducing the qip size to 1 byte). This will result either in reducing the maximum size of a document sequence to 2Gb per index (which may not really be a problem when /"distributed searching" becomes available), or by using 8-byte integers for keeping absolute positions (which starts to make more and more sense with the arrival of affordable native 64-bit processors in the coming years). Using 8-byte integers will basically make the document sequence size limitless (2GB times 2GB) by today's (2003) standards.
Within the context of the Kasyk search engine software, a query is an XML container that indicates what needs to be searched for in an index.
An example of a simple query would be:
<kasyk:query>test</kasyk:query>
which would do a fuzzy search for the word "test". See the Kasyk query XML for the complete definition.
Within the context of the Kasyk search engine software, ranking is a two step process that is used to order the hits in a hitlist.
The first step in ranking occurs when the initial hitlist is made. A basic score for a document that matches the constraint is calculated depending on the query and the contents of the document. An initial position is assigned with this score. Any documents at or below that position are moved one position down. If the position of the lowest document now exceeds the maxpass1hits value, it is disregarded for further inclusion.
Once all documents have been inspected, the resulting list is sorted according to any other constraints that may be required by the query. In the end, only the first maxhits hits will actually be returned in the hitlist.
During this final part of the ranking process, information is currently retrieved from the barn. In the future, when qips have been eradicated from Kasyk, it should be possible to perform ranking without having to go to the barn.
A resource, in the context of the Kasyk search engine software, is the specification of one or more /indexes handled by a /"caching query server".
The process of presenting a query XML to a Kasyk search engine (Gkasyk, kasykd or kasykcqd) and receiving the resulting hitlist and doing something with that (e.g. rendering it to HTML using an XSLT transformation engine).
There is also a <searching> container in the configuration file XML that is checked whenever searching occurs.
Within the context of the Kasyk search engine software, a server is either kasykd or kasykcqd running on a dedicated address:port or a kasyk ad hoc server communicating through pipes. It allows you to do searching operations only.
In the future, when the caching indexing server becomes a reality, this concept might also include indexing operations.
Stemming is a process in which "derived words" of a word are calculated using various methods. Stemming allows you to use the fast exact searching on a limited set of synonym / misspelled words, as opposed to the slower fuzzy searching which is based on trigrams.
A very simple way of stemming in the English language, is appending "s", "ing" and "ed". The word "wait" would then become:
(wait,waits,waiting,waited)
Of course this only works for so many words in the English language, and is completely useless for most other natural languages.
Stemming is currently not implemented in Kasyk itself and must therefore currently be done by an external process. Please also note that it is currently not possible to specify synonyms for words in a query as such. This should however be added in the future, possibly by using the fuzzy capabilities of Kasyk.
A Kasyk server that is handling requests for a Kasyk caching query server. Can be part of a collection of subservers handling requests to the same index.
A texttype could be considered a piece of extra information that is associated with a word or trigram. Currently, Kasyk allows for a maximum of 32 different texttypes, each of which can be assigned a different weight when searching.
The document XML indicates what texttype needs to be associated with a word or trigram. Take for example:
<kasyk:docseq>
<document>
<properties>
<id>42</id>
</properties>
<text>
<title>Hello World</title>
This is a "Hello World" example.
</text>
</document>
<document>
<properties>
<id>43</id>
</properties>
<text>
<title>Another World</title>
This is yet another "Hello World" example.
</text>
</document>
</kasyk:docseq>
This would be indexed in words as (in alphabetical order):
Word Texttype Document list ================================================ a (default) 1 another title 2 example (default) 1,2 hello (default) 1,2 hello title 1 is (default) 1,2 this (default) 1,2 world (default) 1,2 world title 1,2 yet (default) 2
Please note that each texttype causes a seperate document list to be created, which is very resource intensive. Alternate ways of being able to handle this will need to be investigated.
Also, please note that for the simplicity of the example, document ids were used instead of qips.
A "trigram" is a sequence of three characters (and since UTF-8 is used as the encoding, not necessarily 3 bytes).
If a Kasyk index is created with "fuzzy" searching capabilities, then during indexing each word will be broken into sequences of three characters. Take for instance the word "azimuth". When exact searching would be used, then only one entry would be stored: the word "azimuth".
With fuzzy indexing, the following entries would be stored for "azimuth":
" az" "azi" "zim" "imu" "mut" "uth" "th "
As you can see, the capability of fuzzy searching forces a multiple of entries to be stored in the Kasyk index. In this case, instead of 1 entry, 7 entries would be stored. When a fuzzy search is performed, the same process is used on the search query string and each of the trigrams is searched seperately.
Please note that currently a space is stored as part of the trigram (" az" and "th ", but this should be more accurately considered as a word boundary character, as the space may actually not appear in the original document (e.g. if the word "azimuth-controller" would occur in the document).
A way of presenting all of the characters / ideographs of all of the languages in the world (and then some) in a unified manner. Kasyk internally uses the "smallest" encoding called UTF-8 for representing characters and ideographs in an index.
Unicode is basically the successor to ASCII. More information is available from the Unicode Consortium (http://www.unicode.org ).
Within the context of the Kasyk search engine software, a property is said to be "unique" when a value of an property can only occur once and only once in an index.
Currently unique and keyed properties can be used in constraints.
Other types of storage of property values are keyed and notkeyed.
The variable length encoding of /Unicode that is used internally by the Kasyk search engine software. In UTF-8, characters can either be 1, 2, 3 or 4 bytes (and in some extraordinary cases, even more). The eXtensible Markup Language (XML>) depends heavily on Unicode.
A Kasyk index that appears through a caching query server to the outside world as a real index, but is in fact a combination of a real index and a constraint.
A weight can be assigned to a texttype in a query. With this it is possible to make certain texttypes (e.g. title) more important than other texttypes (e.g. default), causing a higher ranking of hits that have the requested word or trigram as part of the document. It can also be used to exclude certain texttypes from searching by giving them a weight of 0.
It is also possible to assign a default weight to each texttype, as conceptually the weight of a texttype is more a configuration file matter of the index. Which of course is overridable in the query.
A word is a string of characters that is surrounded by a word boundary at each end. It is the entity that is used for exact searching as opposed to the trigram which is used for fuzzy searching.
The virtual of real character that denotes the beginning and the end of a word. The following word boundaries are currently recognized by the Kasyk search engine software:
beginning of text end of text punctation (;-'",.: etc.) whitespace (space, tab, vertical tab, newline, carriage return)
Although it is possible to provide your own mappings to what signifies e.g. a punctionation or an alphanumeric ("word") character, (see the as yet undocumented "utf8data" container in the configuration file) it is as yet impossible to have punctuation be acceptable as a "word" without it becoming part of other words if there is no whitespace inbetween. This problem should be solved before it can be made possible to sensibly search source code.
The eXtensible Markup Language. A way of presenting information in a flexible and uniform manner. Uses the Unicode character encodings for representing characters and ideographs. For more information, see the website of the World Wide Web Consortium at http://www.w3.org/XML/ .
The eXtensible Stylesheet Language: Transformations. A way of transforming XML into other XML or other types of ordered information, such as HTML and plain text. For more information, see the website of the World Wide Web Consortium at http://www.w3.org/TR/xslt .
All of the .c and .h files in the "src" directory.
See http://www.kasyk.nl/internal/kasykglossary.html for the most up-to-date version of this information.
Copyright © 2003 Dijkmat BV
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Kasyk Internal Information: Kasyk version 1.0.0, generated on Tue Nov 25 12:09:47 2003.