kasykconfigxml - description of the Kasyk configuration XML (<kasyk:config>)
The Kasyk configuration XML specifies the way a Kasyk index is created, how documents are indexed into that index, and how queries to that index are handled. It is usually stored in a file, but can also be generated "on the fly" while being passed to the Kasyk indexer (kasykindex).
A simplified overview of the configuration XML looks like this:
<kasyk:config xmlns:kasyk="http://www.kasyk.org/1.0">
  <messagelog .../>
  <creation>
    <!-- containers specific to index creation -->
  </creation>
  <indexing ...>
    <!-- containers specific to document indexing -->
  </indexing>
  <searching ...>
    <!-- containers specific to hitlist generation -->
  </searching>
</kasyk:config>
Apart from their sub-containers, none of the containers in the configuration XML is supposed to contain anything but whitespace: anything else will cause a warning to be issued to the message log.
The outer <kasyk:config> container indicates that this is an XML specification of the configuration of a Kasyk index. The namespace specification is necessary to verify that the version of the XML being processed is compatible with the version of Kasyk. It does not contain any further attributes.
The <messagelog> container specifies the name of the file in which messages will be logged. If it is not specified, then either the KASYK_MESSAGELOG environment variable must be set with the filename, or the --messagelog option must be specified, to cause messages to be saved to a file.
The <messagelog> container must have one attribute, name, which specifies the name of the file to which messages will be logged. If the filename is relative, i.e. does not contain any slashes, the file will be opened in the index directory. Prefix the filename with "./" if you want the logfile to be opened in your "own" current directory.
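As a minimal sketch of the two filename forms described above (the filename "messages.log" is only an illustration, not a name Kasyk prescribes):

```xml
<!-- no slashes: the file is opened in the index directory -->
<messagelog name="messages.log"/>

<!-- "./" prefix: the file is opened in your own current directory -->
<messagelog name="./messages.log"/>
```

Only one <messagelog> container would appear in an actual configuration; the two forms are shown side by side for comparison.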
The <creation> container (and the containers within it) specify how the Kasyk index will be set up the first time a Kasyk document sequence XML is indexed.
The <creation> container itself does not have any attributes. The <creation> container can contain the following sub-containers: <exact>, <fuzzy>, <property> and <texttype>.
If the <exact> container is specified, it indicates that the Kasyk index will support "exact" searches. Only words that exactly match words in a particular Kasyk query XML are considered relevant.
If the <fuzzy> container is specified, it indicates that the Kasyk index will support "fuzzy" searches. Fuzzy searches consider words to be relevant if the words in the Kasyk query XML are similar to words found in the Kasyk index.
It is possible to have support for both exact and fuzzy searches in the same Kasyk index. In fact, if both are present, the accuracy and quality of "fuzzy" searches is significantly improved by being able to match against "exact" words in the Kasyk index.
The <exact> and <fuzzy> containers can both have one (optional) attribute: accent. Its value can be "preserve", "normalize" or "both".
With the setting "both", a word with accents is stored in the Kasyk index twice: in its original version (with the accents preserved) and in its normalized version (without any accents).
For example, if the word "voilà" is indexed, it is stored in the index as both "voilà" and "voila". During searching, if the word "voila" (without accent) is searched for, both "voila" and "voilà" will be found. If the word "voilà" (accented) is searched for in an exact fashion, only occurrences of "voilà" (with the accent) will be found. A fuzzy search for "voilà" will treat "voila" as "further away" and therefore less significant in that case.
In this way a search using non-accented text will find both non-accented and accented forms of the words. A search using explicitly accented text will find only the accented forms of the words that were indexed.
It should be noted, however, that in the text preview provided as part of the Kasyk hitlist XML, accented characters are displayed in their original form. The lines discussed above relate only to the searching of the text as a consequence of how it was indexed, not to how the preview is generated.
By default, a setting of "both" is assumed for the accent attribute for both exact and fuzzy indexes. If the Kasyk index is created allowing for both exact and fuzzy searching, it is recommended that the setting of the accent attribute is the same for both the <exact> and the <fuzzy> container.
If both the <exact> and <fuzzy> containers are absent, it is assumed that both have been specified with the value "both" for the accent attribute.
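A small sketch of a <creation> container that enables both search types explicitly, using the accent value taken from the DTD at the end of this page. It should be equivalent to leaving both containers out, since "both" is the stated default:

```xml
<creation>
  <!-- same accent setting on both containers, as recommended above -->
  <exact accent="both"/>
  <fuzzy accent="both"/>
</creation>
```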
A <property> container specifies the characteristics of a particular per-document property. Up to 256 different properties can be specified that may occur with a document as part of a Kasyk document sequence XML.
These per-document properties are reported as part of the search results for a query (the Kasyk hitlist XML) unless the hitlist attribute is set to "no" (or "0").
Properties can also be used as part of a "constraint" in the Kasyk query XML, an expression built up out of property names and values that restrict the set of allowable documents returned as the result of a query.
The type of the property is specified with the type attribute. The following types are currently available: "flag", "number", "float" and "string".
If no type attribute is specified, the property will be assumed to be of type "flag". If the property is of type flag, no other attributes apart from possibly the default attribute, need to be specified.
It should be noted that the values of properties cannot be searched in the way the text of a document can be. The values of these properties can only be used as part of a constraint in the Kasyk query XML, restricting the set of valid documents to return in the Kasyk hitlist XML, or be returned as part of the Kasyk hitlist XML, giving information about the document found.
The value attribute indicates how the values of the property will be stored, whether it will serve as a unique identification for the document and whether the values will be indexed for easier constraint matching and lower storage resource needs.
The word "unique" with the value attribute implies that only one document in the Kasyk index can have any particular value for this property. If, during indexing, a second document is presented in the Kasyk document sequence XML with a value for this property that is already associated with another document in the index, then the existing document is automatically marked as "deleted". This feature allows for updating existing documents in a Kasyk index when indexing incrementally.
A property that is marked "unique" is automatically also marked as "keyed". In general there is exactly one property marked "unique". Usually the property marked "unique" corresponds to a filename, a URL or an ID from a relational database system.
In the current release of Kasyk, the word "keyed" means that all possible values encountered for this property are keyed to a list of possible property values. This makes particular sense when there is a limited list of possible values for the property. A side-effect of using "keyed" is that values are "shared" amongst documents. All documents having the same string value all point to the same piece of text, reducing memory requirements if the strings are generally long and duplicated.
In standard relational database terms, you could consider this property to be "indexed". But since indexing means something entirely different within the context of Kasyk, that word is not used in the configuration XML to indicate this behaviour.
In the current implementation of Kasyk, only "string" type properties can be marked as "keyed".
If no value attribute is specified, the property will be assumed to have "notkeyed" value(s).
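The three settings of the value attribute could be combined as in the following sketch. The property names "url", "language" and "summary" are hypothetical and chosen only to match the typical uses described above:

```xml
<creation>
  <!-- identifies the document; "unique" implies "keyed" -->
  <property name="url" type="string" value="unique"/>
  <!-- limited list of values, shared amongst documents -->
  <property name="language" type="string" value="keyed"/>
  <!-- no value attribute: assumed to be "notkeyed" -->
  <property name="summary" type="string"/>
</creation>
```

With such a configuration, re-indexing a document with an already known "url" value would mark the older document as deleted, which is what makes incremental updates possible.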
The multiplicity attribute indicates whether a document in the Kasyk document sequence XML can have only one value, or more than one value, associated with that property.
A particular document can have at most one value associated with this property if "1" is specified. If no value for the property is specified, the value specified with the default attribute will be assumed. If there is no default value specified, then false, 0, 0.0 or "" (the empty string) will be assumed, depending on the type of the property.
A "filename" property, which indicates the original filename of a document, is an example of a property with multiplicity 1.
A particular document can have more than one value associated with this property if "*" is specified. If no value for the property is specified, it is just that: there are no values for that property (not even a default value). A "keyword" or "author" property is a good example of a property that can have more than one value associated with a document.
In the Kasyk document sequence XML stream being indexed, the multiple values of such a property are defined by repeating the <pname>...</pname> specification, e.g.:
<document>
<properties>
<pname>value1</pname>
<pname>value2</pname>
<pname>value3</pname>
</properties>
...
</document>
It should be noted that properties that have a multiplicity of "*" cannot be used in constraints in the current version of Kasyk. They can however be returned as part of the Kasyk hitlist XML.
In a situation where you can have multiple values to a property and you wish to be able to constrain searches on this property, you have basically two ways of getting around this limitation.
If the number of different values of this property is not too large (usually up to 30 different values or so), it may make sense to define a flag type property for each of the different values. For instance:
<property name="category" type="string" multiplicity="*"/>
with as possible values:
<document>
  <properties>
    <category>foo</category>
    <category>bar</category>
    <category>baz</category>
  </properties>
  <text><!-- text here --></text>
</document>
would then become in the configxml:
<property name="cat_foo"/>
<property name="cat_bar"/>
<property name="cat_baz"/>
and in the document:
<document>
  <properties>
    <cat_foo>1</cat_foo>
    <cat_bar>1</cat_bar>
    <cat_baz>1</cat_baz>
  </properties>
  <text><!-- text here --></text>
</document>
and this would allow a constraint such as:
(cat_foo | cat_bar) & !cat_baz
which would select all documents that have flag property "cat_foo" or "cat_bar" set, and which do not have the "cat_baz" property set.
For a higher number of different possible values of the property, it may make sense to use a single, delimited string for the property. Using the same example as above, this would use this configuration:
<property name="categories" type="string"/>
i.e. a single, unkeyed "string" type property. The value of this property then encodes all the values that apply to a document. Choose a character that will never be part of any of the values, and concatenate the values applicable to a particular document with that special character in between, as well as at the beginning and at the end. For example, if the special character is ":", then the above example document would become:
<document>
  <properties>
    <categories>:foo:bar:baz:</categories>
  </properties>
  <text><!-- text here --></text>
</document>
and the associated constraint would then be (using the like operator):
(categories like ":foo:" | categories like ":bar:") & !categories like ":baz:"
which would select all documents that have the conceptual category "foo" or "bar" set, and which do not have the conceptual "baz" category set.
Please note that the above constraint can be further optimized if you can be sure of an order to the conceptual categories, e.g. an alphabetical order such as in this example:
<document>
  <properties>
    <categories>:bar::baz::foo:</categories>
  </properties>
  <text><!-- text here --></text>
</document>
Please note that you will need the delimiting character before as well as after each conceptual category. But then the constraint can be simplified to:
categories like ":foo:*:baz:" & !categories like ":baz:"
resulting in significantly better performance.
From a performance point of view, the approach of using "flag" properties is better. But since there is a limit to the number of different "flag" properties Kasyk can handle in an index, this may not be feasible. And it has the disadvantage that the configuration needs to be changed if a new flag property needs to be added (which would need a full re-index).
The approach of using delimited strings has the advantage that it is much more flexible: new values can be added to the index without any changes to the configuration. It has the disadvantage of requiring significantly more resources, in disk space as well as in CPU during searches, which may be a problem for very large indexes (more than 500 Mbyte of document sequence XML). However, this can be circumvented by applying even more tricks, such as using single letters for the conceptual categories (in which case you wouldn't need the delimiting characters anymore).
If no multiplicity attribute is specified, it will be assumed that the property will have at most "1" (one) value, which allows the property to be used in constraints.
The default attribute indicates the value that should be assumed for the property if it is not specified in a document. If no default attribute is specified, no default value is assumed (which results in either false, 0 or the empty string, depending on the property type).
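A brief sketch of the default attribute on properties with multiplicity "1" (the assumed multiplicity when none is given). The property names and default values here are hypothetical:

```xml
<creation>
  <!-- documents that do not specify a language are treated as "en" -->
  <property name="language" type="string" default="en"/>
  <!-- no default attribute: a missing value is taken to be 0 -->
  <property name="priority" type="number"/>
</creation>
```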
For a flag-type property, the only valid settings for the other attributes are 'value="notkeyed" multiplicity="1"'. Other variations for a flag type property will return an error.
The <texttype> container specifies the name and other properties of an allowable "text type" inside the <text> container of a document in a Kasyk document sequence XML. Such a nested container can have text in it that is indexed in a special manner, depending on the setting of the attributes.
So, for instance:
<texttype name="title" weight="2.5" hitlist="yes"/>
allows the use of the following in the text of a document being indexed:
<text>
  <title>A short history of KASYK</title>
  Although KASYK is now released as version 1.0.0, it is a completely proven
  system. Approximately 10 man years have already been put...
</text>
Any word occurring in the <title> container will, by default, be considered 2.5 times as important as words that occur in the rest of the text. Because of the setting of the hitlist attribute, the text will also be returned as a property with the same name in the Kasyk hitlist XML of the query. See the Kasyk query XML for other ways in which the <texttype> containers can be used.
Please note that in the current implementation of Kasyk, only 31 different texttypes are available in a single Kasyk index. This limit may be raised in the future.
Please also note that a value of "yes" or "1" for the hitlist attribute effectively defines a "notkeyed string *" property with the same name. In that case, you cannot define a property with the same name.
The <indexing> container (and the containers within it) specify the various actions that need to be performed when certain events occur during indexing of a Kasyk document sequence XML.
The <indexing> container itself can have two (optional) attributes: save and cache.
The optional save attribute indicates whether a copy of the Kasyk document sequence XML should be kept inside the Kasyk index. The default is "yes". If you are short on disk space, you can specify "no" or "0" to not have the document sequence XML saved.
Keeping copies of the document sequence is mainly intended for backup purposes as well as for allowing developers to easily re-create sequences of events for debugging purposes.
A memory cache is used during indexing to increase performance. The cache attribute (roughly) specifies the amount of memory to allocate for that cache. Its value, the "cachesize", is a number which can be suffixed with K, M or G, indicating kilobytes, megabytes or gigabytes. If only a number is used for "cachesize", it is taken to be the size in bytes.
By default a value of "10M" (10 Megabyte) is used, but normally a much larger value is specified, dependent somewhat on how large the index is and how much physical memory you want to be available to the indexing process.
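For illustration, an <indexing> container that skips saving the document sequence XML and raises the cache above the default might look like this. The value "256M" is an arbitrary example, not a recommendation from the Kasyk documentation:

```xml
<!-- do not keep a copy of the document sequence XML; use a 256 megabyte cache -->
<indexing save="no" cache="256M">
</indexing>
```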
During indexing of the <properties> container of a document in a Kasyk document sequence XML, it is possible to find containers in there that are unknown. The <unknownproperty> container specifies what to do in those situations.
If no <unknownproperty> container is specified, the "log" setting of the action attribute is assumed.
During indexing of the <properties> container of a document in a Kasyk document sequence XML, it is possible to find containers that are nested inside other property containers. The <nestedproperty> container specifies what to do in those situations.
If no <nestedproperty> container is specified, the "log" setting of the action attribute is assumed.
During indexing of the <text> container of a document in a Kasyk document sequence XML, it is possible to find containers that are unknown. The <unknowntexttype> container specifies what to do in those situations.
The text attribute specifies what should be done with the text of the container of which the name is unknown. It is only applicable if the action attribute is not "stop".
If the value of the text attribute is "ignore", then the text of the container will be ignored. If the value of the text attribute is "default", then the text of the container will be added as if it is part of the "outer" <text> container in the document.
If no <unknowntexttype> container is specified, the "log" setting of the action attribute, and the "default" setting of the text attribute is assumed.
During indexing of the <text> container of a document in a Kasyk document sequence XML, it is possible to find containers that are too deeply nested. The <nestedtexttype> container specifies what to do in those situations.
The text attribute specifies what should be done with the text of the container that is nested too deeply. It is only applicable if the action attribute is not "stop".
If the value of the text attribute is "ignore", then the text of the container will be ignored. If the value of the text attribute is "inherit", then the text of the container will be added as if it is part of the "parent" text container in the document (ignoring the too deeply nested container specification).
If no <nestedtexttype> container is specified, the "log" setting of the action attribute, and the "inherit" setting of the text attribute is assumed.
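Putting the four containers together, the following sketch spells out the settings that are assumed when the containers are absent, according to the descriptions above. Writing them out explicitly should therefore behave the same as an empty <indexing> container:

```xml
<indexing>
  <unknownproperty action="log"/>
  <nestedproperty action="log"/>
  <!-- text of an unknown container is added to the outer <text> container -->
  <unknowntexttype action="log" text="default"/>
  <!-- text of a too deeply nested container is added to its parent -->
  <nestedtexttype action="log" text="inherit"/>
</indexing>
```

Starting from this explicit form makes it easier to tighten a single setting, for example changing one action to "stop" as in the client configuration shown in the examples.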
The <searching> container (and the containers within it) specify the various actions that need to be performed when certain events occur during searching the Kasyk index (the process of transforming a Kasyk query XML into a Kasyk hitlist XML).
The <searching> container itself can have five (optional) attributes: type, name, default, location and cache.
The type attribute specifies the type of search that should be performed if no type attribute has been specified in the Kasyk query XML. It can be either specified as "fuzzy" or "exact".
Please note that if the configuration XML only has an <exact> container in the <creation> container, then all searches will become "exact", and that if there is only a <fuzzy> container in the <creation> container, all searches will become "fuzzy". Specification of this attribute therefore only makes sense if the <exact> and <fuzzy> containers are either both present in or both absent from the <creation> container (in which case the Kasyk index is capable of doing both "exact" and "fuzzy" searches).
The name attribute sets the name with which the Kasyk index is to be identified (to match the <index> container in the Kasyk query XML). If no name is specified, then the index is assumed to be unnamed and only queries without an <index> container will be allowed to be processed.
If a name is specified, then it depends on the setting of the default attribute whether queries without an <index> container will be allowed to be processed. If a name is specified, then it should match the name specified in the <index> container of the Kasyk query XML.
The default attribute specifies whether it is allowed to put queries to this Kasyk index that do not have an <index> container specified in the Kasyk query XML. If the value "yes" or "1" is specified, queries do not need to have an <index> container to be processed.
If the value is "no" or "0", then the action depends on the setting of the name attribute. If there is no name specified for the index, then queries without an <index> container will always fail to be processed. In this case the process creating the query must know the name of the index to allow queries to be processed.
If there is a name specified for the index, then the name specified in the <index> container of the Kasyk query XML must match the name of the index to allow queries to be processed.
The default value for the default attribute is "yes" if there is no name specified for the index and "no" if there is a name specified for the index.
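As a sketch of how the type, name and default attributes interact, consider the following fragment. The index name "news" is hypothetical:

```xml
<!-- named index that still accepts queries without an <index> container;
     queries that do not specify a type are handled as "exact" searches -->
<searching name="news" default="yes" type="exact"/>
```

Leaving out default="yes" here would, per the rules above, make the default attribute "no", so that only queries whose <index> container names "news" would be processed.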
The location attribute is important for the Kasyk server (kasykd) only. It specifies the default location on which the server will be listening.
The value specified here can either be a combination of "hostname:portnumber", or if there is no colon (":") found, only a portnumber (assuming "localhost" for the hostname in that case).
Any setting here can be overridden with the --location option of Kasyk server (kasykd). If there is no default location specified, the --location option must be specified when starting Kasyk server (kasykd).
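The two accepted forms of the location attribute can be sketched as follows. The hostname "search.example.org" is a placeholder; the port number 3333 is borrowed from the client configuration in the examples:

```xml
<!-- explicit hostname and port number -->
<searching location="search.example.org:3333"/>

<!-- port number only: "localhost" is assumed for the hostname -->
<searching location="3333"/>
```

Only one <searching> container would appear in an actual configuration; the two forms are shown side by side for comparison.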
A memory cache is used during searching to increase performance. The cache attribute (roughly) specifies the amount of memory to allocate for that cache. Its value, the "cachesize", is a number which can be suffixed with K, M or G, indicating kilobytes, megabytes or gigabytes. If only a number is used for "cachesize", it is taken to be the size in bytes.
By default a value of "10M" (10 Megabyte) is used, but normally a much larger value is specified, dependent somewhat on how large the index is and how much physical memory you want to be available to the searching process. Values of "100M" are not uncommon.
If preview text is being returned in the Kasyk hitlist XML, words that are present in the preview text and that are relevant to the original search terms are highlighted in the preview. By default, Kasyk will put such terms in a <b> container, causing a word to be highlighted thus: "<b>word</b>".
The <highlight> container allows you to specify a different "name" for the container to be used, or even to use different strings for "before" and "after" the word to be highlighted in the preview (which can be handy if the Kasyk hitlist XML is later converted to a format that is not XML).
Please note that you can either just specify the name attribute or either or both of the before and after attributes.
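A minimal sketch using only the name attribute; the container name "em" is an arbitrary choice. Going by the description of the default <b> container, a highlighted word would then presumably appear as "<em>word</em>" in the preview:

```xml
<searching>
  <!-- use <em> instead of the default <b> container for highlighted words -->
  <highlight name="em"/>
</searching>
```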
The <clients> container specifies settings related to client connections of the Kasyk server (kasykd).
The number of connections allowed is printed to the message logfile as the first message after startup of the Kasyk server (kasykd).
The <threads> container specifies settings related to a multi-threading Kasyk server (kasykd). In all other situations, the contents of this container are ignored.
To determine if the Kasyk server (kasykd) has been compiled multi-threaded, use the --version option. The number of threads allowed is also printed to the message logfile by the Kasyk server (kasykd) as the first message after startup.
The <querylog> container specifies the name of the file to which queries will be logged. Queries are either logged as Kasyk successful query XML containers if the XML was valid, or as Kasyk error XML if the XML of the query was invalid.
The <querylog> container must have one attribute, name, which specifies the name of the file to which queries will be logged. If the filename is relative, i.e. does not contain any slashes, the file will be opened in the index directory. Prefix the filename with "./" if you want the logfile to be opened in your "own" current directory.
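For instance, a query log kept next to the index could be configured as in this sketch. The filename "query.log" matches the one used in the client configuration in the examples, but any name would do:

```xml
<searching>
  <!-- no slashes in the name: the file is opened in the index directory -->
  <querylog name="query.log"/>
</searching>
```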
These are some examples of Kasyk configuration XML.
The following configuration is used in much of the test-suite.
<kasyk:config xmlns:kasyk="http://www.kasyk.org/1.0">
  <creation>
    <property name="filename" type="string" value="unique"/>
    <property name="changed" type="number" hitlist="no"/>
    <texttype name="title" weight="2.5" hitlist="yes"/>
  </creation>
</kasyk:config>
It uses a unique property "filename" (so that documents are keyed to the name of the file from which they are generated). It stores the modification time of the files in a property named "changed", which is not returned in the Kasyk hitlist XML. The title of each document is 2.5 times as important as the rest of the text in a document, and is also returned as a property in the Kasyk hitlist XML.
A more or less unaltered configuration of a client of Dijkmat BV.
<kasyk:config xmlns:kasyk="http://www.kasyk.org/1.0">
  <logfile name="index.log"/>
  <creation>
    <property name="id" type="number" value="unique"/>
    <property name="type" type="string" value="keyed"/>
    <property name="region" type="string" value="keyed" hitlist="no"/>
    <property name="language" type="string" value="keyed" hitlist="no"/>
    <property name="created" type="string"/>
    <property name="published" type="string"/>
    <property name="updated" type="string"/>
    <property name="pr" hitlist="no"/>
    <property name="editorial" hitlist="no"/>
    <property name="postponed" hitlist="no"/>
    <property name="cancelled" hitlist="no"/>
    <property name="category" type="number" multiplicity="*"/>
    <texttype name="name" weight="5" hitlist="yes"/>
    <texttype name="slogan" weight="2.5"/>
    <texttype name="summary" weight="1.5"/>
    <texttype name="author" weight="2.5" hitlist="yes"/>
    <texttype name="keywords" weight="10"/>
  </creation>
  <indexing>
    <unknowntexttype action="log"/>
    <nestedtexttype action="log" text="ignore"/>
    <unknownproperty action="stop"/>
    <nestedproperty action="stop"/>
  </indexing>
  <searching cache="100M" location="3333">
    <querylog name="query.log"/>
  </searching>
</kasyk:config>
This is an attempt at a Document Type Definition for the Kasyk configuration XML.
<!DOCTYPE kasyk:config [
<!ELEMENT kasyk:config (messagelog?, creation?, indexing?, searching?)>
<!ELEMENT messagelog EMPTY>
<!ATTLIST messagelog name CDATA #REQUIRED>
<!ELEMENT creation (exact?, fuzzy?, texttype*, property*)>
<!ELEMENT exact EMPTY>
<!ATTLIST exact accent (preserve|normalize|both) #IMPLIED>
<!ELEMENT fuzzy EMPTY>
<!ATTLIST fuzzy accent (preserve|normalize|both) #IMPLIED>
<!ELEMENT texttype EMPTY>
<!ATTLIST texttype name CDATA #REQUIRED
weight CDATA #IMPLIED
hitlist (yes|no|0|1) #IMPLIED>
<!ELEMENT property EMPTY>
<!ATTLIST property name CDATA #REQUIRED
type (flag|number|float|string) #IMPLIED
value (unique|keyed|notkeyed) #IMPLIED
multiplicity CDATA #IMPLIED
hitlist (yes|no|0|1) #IMPLIED
default CDATA #IMPLIED>
<!ELEMENT indexing (unknownproperty?,
nestedproperty?,
unknowntexttype?,
nestedtexttype?)>
<!ATTLIST indexing cache CDATA #IMPLIED
save (yes|no|0|1) #IMPLIED>
<!ELEMENT unknownproperty EMPTY>
<!ATTLIST unknownproperty action (log|dontlog|stop) #IMPLIED>
<!ELEMENT nestedproperty EMPTY>
<!ATTLIST nestedproperty action (log|dontlog|stop) #IMPLIED>
<!ELEMENT unknowntexttype EMPTY>
<!ATTLIST unknowntexttype action (log|dontlog|stop) #IMPLIED
text (ignore|default) #IMPLIED>
<!ELEMENT nestedtexttype EMPTY>
<!ATTLIST nestedtexttype action (log|dontlog|stop) #IMPLIED
text (ignore|inherit) #IMPLIED>
<!ELEMENT searching (highlight?, clients?, threads?, querylog?)>
<!ATTLIST searching type (exact|fuzzy) #IMPLIED
cache CDATA #IMPLIED
name CDATA #IMPLIED
default (yes|no|0|1) #IMPLIED
location CDATA #IMPLIED>
<!ELEMENT highlight EMPTY>
<!ATTLIST highlight name CDATA #IMPLIED
on CDATA #IMPLIED
off CDATA #IMPLIED>
<!ELEMENT clients EMPTY>
<!ATTLIST clients connected CDATA #IMPLIED
pending CDATA #IMPLIED>
<!ELEMENT threads EMPTY>
<!ATTLIST threads worker CDATA #IMPLIED core CDATA #IMPLIED>
<!ELEMENT querylog EMPTY>
<!ATTLIST querylog name CDATA #REQUIRED>
]>
Kasyk home, Kasyk document sequence XML, Kasyk query XML, Kasyk hitlist XML, Kasyk successful query XML, Kasyk error XML, Kasyk initializer (kasyknew), Kasyk indexer (kasykindex), Kasyk searcher (kasyk), Kasyk server (kasykd), Kasyk caching query server (kasykcqd), Kasyk configuration handler (kasykconfig).
See http://www.kasyk.nl/xml/kasykconfigxml.html for the most up-to-date version of this information.
Copyright © 2003 Dijkmat BV
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Kasyk XML Information: Kasyk version 1.0.0, XML version http://www.kasyk.org/1.0, generated on Tue Nov 25 12:09:47 2003.