kasykqueryxml - description of the Kasyk query XML (<kasyk:query>)
The Kasyk query XML specifies the parameters with which a search action should be performed on a Kasyk index, either directly with Kasyk searcher (kasyk) or Kasyk server (kasykd), or indirectly with Kasyk caching query server (kasykcqd). The result of a query is always in the form of Kasyk hitlist XML.
A very simple query XML looks like this:
<kasyk:query>free format +query !specification</kasyk:query>
A more complex query XML with a lot of bells and whistles looks like this:
<kasyk:query
 xmlns:kasyk="http://www.kasyk.org/1.0"
 type="exact"
 maxhits="200"
 maxpass1hits="unlimited"
 first="21"
 last="30"
 showinternal="all"
>
 <index>life</index>
 <constraint>meaning == 42</constraint>
 free form query
</kasyk:query>
The outer <kasyk:query> container indicates that this is an XML specification of a query used for searching the contents of a Kasyk index. The namespace specification is optional: if specified, it is used to verify that the version of the XML being processed is compatible with the version of Kasyk. If it is not specified, "http://www.kasyk.org/1.0" will be assumed and a note of class "Info" will be added to the Kasyk hitlist XML.
The text of the outer <kasyk:query> container contains a free-form list of the words that should be searched for. Please note that all word prefixes apply to exact searches only: they are usually ignored when a fuzzy search is applied.
The current syntax may be augmented in the future.
A word that has the prefix "!" (exclamation mark) must not occur in a document: any document that contains the word will not be considered for inclusion in the initial result set (when using exact searching).
Earlier versions of Kasyk used the prefix "-" to mark words that should not occur in a document. It was found however that many people use hyphens in open search queries. Using the "-" prefix changed the query in a way that many people didn't expect, so instead the "!" was chosen (which is much less likely to occur at the beginning of a word in a natural language query).
When a word is prefixed with "!" in a fuzzy search, that word will be ignored when determining the trigrams of the words being searched for.
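As an illustration of the "!" prefix, here is a minimal sketch of an exact query. The search words are made up for the example; the only behaviour relied upon is the one described above, namely that a document containing the word "rhubarb" is kept out of the initial result set.

```xml
<!-- Exact search: documents containing "rhubarb" are not considered -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0" type="exact">
 apple pie !rhubarb
</kasyk:query>
```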
The <kasyk:query> container itself contains a large number of optional attributes. They are all listed here.
The id attribute is optional. The identifier that is specified will be placed in the Kasyk hitlist XML that is returned by either Kasyk searcher (kasyk), Kasyk server (kasykd) or Kasyk caching query server (kasykcqd). This allows you to match the query with the resulting hitlist in situations where queries and hitlists are exchanged asynchronously (which can occur when using Kasyk server (kasykd) or Kasyk caching query server (kasykcqd)).
The id attribute also serves as an implicit flag: it indicates that the client would like to keep the connection alive with the Kasyk provider. There is no guarantee however that the Kasyk provider will actually keep the connection open: server configuration of the Kasyk server may cause the connection to be broken immediately or after a period of inactivity.
No identifier is assumed if the id attribute is not specified. In that case the Kasyk provider will always break the connection after the Kasyk hitlist XML has been served.
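A query carrying an identifier could look like the sketch below. The value "query-0017" is a hypothetical identifier chosen for the example; the text above does not prescribe any particular format for it.

```xml
<!-- The id is echoed in the hitlist and also asks for a kept-alive connection -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0" id="query-0017">
 free form query
</kasyk:query>
```

A client that sends several queries over one connection can use the identifier in each returned hitlist to pair it with the query that produced it.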
The type attribute is optional. It indicates what type of search you want to perform. Two types of queries are currently defined: "fuzzy" and "exact".
If the type attribute is not specified, the type of search that is specified with the type attribute in the <searching> container of the Kasyk configuration XML will be assumed. If that is not specified either, a "fuzzy" search will be assumed.
The maxhits attribute is optional. It specifies the maximum number of hits that will be part of the final search result (before being further limited by the first and last attributes). The actual number of hits that could have been returned is specified by the hits attribute of the Kasyk hitlist XML.
If the maxhits attribute is not specified, the same value as (implicitly) specified with the maxpass1hits attribute will be assumed. It doesn't make sense to specify any value larger than the value specified with the maxpass1hits attribute.
The string "maxpass1hits" can be used to indicate the number of documents specified by the maxpass1hits attribute (which is effectively the same as omitting the maxhits attribute).
The maxpass1hits attribute is optional. It specifies the maximum number of hits that will be part of the result set of the first pass through all of the documents of the Kasyk index. It is mainly intended as a way to limit the resources a search query may use, particularly in very large Kasyk indexes. The actual number of documents that were part of the first pass is specified by the pass1hits attribute of the Kasyk hitlist XML.
If it is not specified, 1000 will be assumed. The string "unlimited" can be used to indicate the number of documents in the Kasyk index being queried. This feature should however be used with caution, especially with larger Kasyk indexes, as it directly relates to the amount of memory required to process the query.
In the event that the value specified with the maxpass1hits attribute was not large enough to keep all documents that initially match in the initial result set, a note of class "Info" is added to the Kasyk hitlist XML to indicate that either the query is not specific enough, or the value for the maxpass1hits attribute should be made larger.
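The relation between the two limits can be sketched as follows. The numbers are illustrative only: the first pass keeps at most 5000 documents, of which at most 200 end up in the final search result.

```xml
<!-- maxhits should not exceed maxpass1hits -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0"
 maxpass1hits="5000"
 maxhits="200">
 free form query
</kasyk:query>
```

Writing maxhits="maxpass1hits", or leaving maxhits out altogether, would make the final result as large as the first pass allows.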
The first and last attributes are optional. They indicate the ordinal number of the first and last hit you want returned in the Kasyk hitlist XML. They in fact provide a "window" on the hitlist (which is very handy if you want to page through a search result with 10 hits displayed at a time).
If no first attribute is specified, the number "1" will be assumed. If no last attribute is specified, the (implicit) value of the maxhits attribute will be assumed.
The first and last attributes are also returned in the Kasyk hitlist XML, mostly for convenience (allowing determination of "next" and "previous" page). The value of the last attribute may have been adapted if it would become larger than the hits attribute in the Kasyk hitlist XML.
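To page through a result with n hits at a time, page p corresponds to first = (p - 1) * n + 1 and last = p * n. A sketch for the third page of 10 hits per page (the page size is just an example):

```xml
<!-- Page 3 with 10 hits per page: (3 - 1) * 10 + 1 = 21 up to 3 * 10 = 30 -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0"
 first="21"
 last="30">
 free form query
</kasyk:query>
```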
The fuzzylevel attribute is optional and only makes sense if the type attribute has been (implicitly) set to "fuzzy". It indicates the measure of fuzziness that will be allowed while searching the Kasyk index. The value "0" means least fuzzy (although still not "exact"). The value "3" means the most fuzzy possible.
The value "1" will be assumed if the fuzzylevel attribute is not specified.
The highlight attribute is optional and only makes sense if the type attribute has been (implicitly) set to "fuzzy". It indicates the number of characters in a word that should match before the whole word will be highlighted in the <preview> container of the Kasyk hitlist XML.
The value "3" will be assumed if the highlight attribute is not specified.
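A fuzzy query that overrides both defaults might be written as in this sketch. The misspelled search words and the chosen values are illustrative assumptions, not recommendations.

```xml
<!-- Fuzzier than the default level "1"; highlight words matching on 4 characters -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0"
 type="fuzzy"
 fuzzylevel="2"
 highlight="4">
 serch engin
</kasyk:query>
```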
The showpreview attribute is optional. It specifies whether the <preview> container should be placed in the <hit> container of each document in the Kasyk hitlist XML. The values "yes" and "1" indicate that the <preview> container should be added, the values "no" and "0" indicate that the <preview> container should not be added.
Previews are added to the Kasyk hitlist XML by default if the showpreview attribute is not specified.
The showproperties attribute is optional. It specifies whether the <properties> container (which contains the properties of a document) should be placed in the <hit> container of each document in the Kasyk hitlist XML. The values "yes" and "1" indicate that the <properties> container should be added, the values "no" and "0" indicate that the <properties> container should not be added.
Properties are added to the Kasyk hitlist XML by default if the showproperties attribute is not specified.
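For example, a client that only needs the properties of each document could switch the previews off, as in this sketch:

```xml
<!-- Each <hit> gets a <properties> container but no <preview> container -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0"
 showpreview="no"
 showproperties="yes">
 free form query
</kasyk:query>
```

Since properties are shown by default, showproperties="yes" is spelled out here only for clarity.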
The showinternal attribute is optional. It allows a number of internal aspects of the Kasyk search engine to be exposed in the Kasyk hitlist XML. The following values can currently be specified:
If the value specified in the showinternal attribute contains the string "docnum", then the internal document number of the document will be added as the docnum attribute to each hit in the Kasyk hitlist XML.
The document number currently has no meaning outside of the Kasyk search engine internals. It is mainly intended for debugging by the developers.
If the value specified in the showinternal attribute contains the string "score", then the internal score of the document will be added as the score attribute to each hit in the Kasyk hitlist XML.
The score currently has no meaning outside of the Kasyk search engine internals. It is mainly intended for debugging by the developers.
If the value specified in the showinternal attribute contains the string "percentage", then the internal percentage score of the document will be added as the percentage attribute to each hit in the Kasyk hitlist XML.
The percentage currently has no meaning outside of the Kasyk search engine internals. It is mainly intended for debugging by the developers.
If the value specified in the showinternal attribute contains the string "timing", then the timing information of the query will be added as the timing attribute to the <header> container in the Kasyk hitlist XML.
The timing information currently has no meaning outside of the Kasyk search engine internals. It is mainly intended for debugging by the developers.
If the value specified in the showinternal attribute contains the string "provider", then the information of the provider of the Kasyk hitlist XML will be added as the <provider> container to the <header> container in the Kasyk hitlist XML.
The provider information currently has no meaning outside of the Kasyk search engine internals. It is mainly intended for debugging by the developers.
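Because each value is recognised when the attribute contains the corresponding string, several values can presumably be combined in one attribute. The sketch below asks for the score of each hit and the timing of the query; separating the values with a space is an assumption made for readability, not something stated above.

```xml
<!-- Assumption: values separated by a space -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0"
 showinternal="score timing">
 free form query
</kasyk:query>
```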
The updated attribute is optional. It is intended to ensure that when you move the "window" on the Kasyk hitlist XML (by changing the first and last attributes on an otherwise identical query), you are still looking at the same Kasyk hitlist XML. If the Kasyk index on which the query has been performed is updated while you are moving the "window", you get a note indicating that the Kasyk hitlist XML may have changed.
When the updated attribute is specified, it should contain the same value as the updated attribute in the <header> container that was returned with the previous Kasyk hitlist XML. If the Kasyk provider finds that the index has been updated (by comparing the updated attribute in the query with its internal "updated" variable) and that you are not looking at the first page of the Kasyk hitlist XML (that is, the first attribute has a value larger than "1"), it will add a <note> container of class "Info" to the <header> container of the Kasyk hitlist XML indicating that the Kasyk index was updated and that it may be wise to start inspecting the Kasyk hitlist XML from the beginning again.
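A request for the second page of a result could then look like the sketch below. The string VALUE-FROM-PREVIOUS-HEADER is a placeholder: it stands for whatever the updated attribute of the <header> container in the previous hitlist contained, and its actual format is not described here.

```xml
<!-- Second page (hits 11 to 20); updated is copied verbatim from the previous hitlist -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0"
 first="11"
 last="20"
 updated="VALUE-FROM-PREVIOUS-HEADER">
 free form query
</kasyk:query>
```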
The text inside the <index> container specifies which Kasyk index should be queried with this provider. This is most useful when used in conjunction with Kasyk caching query server (kasykcqd), which may serve more than one Kasyk index, each of which is identified by a name.
If the <index> container is absent or empty, the "default" index will be assumed. Depending on the configuration of Kasyk searcher (kasyk), Kasyk server (kasykd) or Kasyk caching query server (kasykcqd), the "default" index may be served or not.
A <note> container of class "Query" will be added to the <header> container of the Kasyk hitlist XML if the (implicitly) requested index is not being served.
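Selecting a named index might look like this sketch. The index name "manuals" is hypothetical; it would have to match the name under which the provider serves the index.

```xml
<!-- Query the index served under the (hypothetical) name "manuals" -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0">
 <index>manuals</index>
 free form query
</kasyk:query>
```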
The text of the <constraint> container is considered to be an expression that restricts the set of documents that will become part of the initial result set. The expression describes various properties and the values they must have for a document to be considered for the initial result set of a Kasyk hitlist XML. It does not operate on the text of a document.
A property of a document in a Kasyk index is a named quantity having a flag (boolean), numeric or string typed value. The set of allowable properties in any particular Kasyk index is specified by the <property> containers in the Kasyk configuration XML of that Kasyk index.
If the <constraint> container is absent or empty, all documents of the Kasyk index will be considered for inclusion in the initial result set.
See Kasyk constraint expressions for a complete description of <constraint> expressions.
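A minimal sketch of a constrained query, reusing the expression from the example at the top of this page. It assumes that a numeric property named "meaning" has been defined with a <property> container in the Kasyk configuration XML of the index; other operators are deliberately left out because they are described elsewhere.

```xml
<!-- Only documents whose "meaning" property equals 42 enter the initial result set -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0">
 <constraint>meaning == 42</constraint>
 free form query
</kasyk:query>
```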
Each <texttype> container allows you to change the relative weight of a type of text in this query. The name attribute specifies the name of the texttype to which the different weight should be applied. The weight attribute specifies the floating point value representing the weight of words in that text type (to override any setting for this texttype in the Kasyk configuration XML).
If no <texttype> containers are specified, the weights of the texttypes will be as specified in the Kasyk configuration XML.
The following special texttype names may be specified: "*" (all texttypes) and "" (the empty name, indicating the default, untyped text).
The value specified with the weight attribute is a weight expressed as a floating point value, "1" being the "normal" or "default" value. This weight indicates how the importance of words and patterns found in a texttype relates to words and patterns found in other texttypes. A weight of "0", for example, indicates that the nominated texttype is not to be considered for inclusion in the initial result set. A weight of "2.5" indicates that words and patterns found in text of that type are to be thought of as worth two and a half times as much as words found in other text types that have (the default) values of "1".
The sequence of <texttype> containers is applied in the order they are presented in the query. By default (if no <texttype> elements are present) all text types are searched.
When the first <texttype> container is encountered, it restricts the search to the nominated texttype only. Any following <texttype> elements add to the set of allowable texttypes.
Some examples:
<texttype name="title"/>
Search only title text, using the default weight of "1". All other text is excluded from the search.
<texttype name="title" weight="2"/>
<texttype name="*"/>
Search all texttypes, with words found in the <title> container in the <text> container of a document having twice the scoring weight of words found in other texttypes.
<texttype name="title"/>
<texttype name=""/>
Search only title text and default (untyped) text. Ignore all other texttypes.
<texttype name="*"/>
<texttype name="" weight="0"/>
Search all text types except the default (untyped) texttype.
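The fragments above are meant to be placed inside a <kasyk:query> container. A complete query using the second example could look like this sketch (the search words are arbitrary):

```xml
<!-- Title words count double; all other texttypes keep the default weight "1" -->
<kasyk:query xmlns:kasyk="http://www.kasyk.org/1.0" type="exact">
 <texttype name="title" weight="2"/>
 <texttype name="*"/>
 free form query
</kasyk:query>
```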
The directory test/query contains a number of subdirectories, each of which holds files that contain a single query each. Most of the queries are legal; some of them are not. They are used to test the functioning of the Kasyk search engine.
This is an attempt at a Document Type Definition for the Kasyk query XML.
<!DOCTYPE kasyk:query [
<!-- Mixed content: the free form query text may surround the child elements.
     A DTD cannot also express "at most one index and one constraint". -->
<!ELEMENT kasyk:query (#PCDATA | index | texttype | constraint)*>
<!ATTLIST kasyk:query
 xmlns:kasyk CDATA #IMPLIED
 id CDATA #IMPLIED
 type (fuzzy|exact) #IMPLIED
 maxhits CDATA #IMPLIED
 maxpass1hits CDATA #IMPLIED
 first CDATA #IMPLIED
 last CDATA #IMPLIED
 fuzzylevel CDATA #IMPLIED
 highlight CDATA #IMPLIED
 showpreview (yes|no|0|1) #IMPLIED
 showproperties (yes|no|0|1) #IMPLIED
 showinternal CDATA #IMPLIED
 updated CDATA #IMPLIED>
<!ELEMENT index (#PCDATA)>
<!ELEMENT texttype EMPTY>
<!ATTLIST texttype
 name CDATA #REQUIRED
 weight CDATA #IMPLIED>
<!ELEMENT constraint (#PCDATA)>
]>
Kasyk home, Kasyk configuration XML, Kasyk document sequence XML, Kasyk hitlist XML, Kasyk initializer (kasyknew), Kasyk indexer (kasykindex), Kasyk searcher (kasyk), Kasyk server (kasykd), Kasyk caching query server (kasykcqd), Kasyk configuration handler (kasykconfig).
See http://www.kasyk.nl/xml/kasykqueryxml.html for the most up-to-date version of this information.
Copyright © 2003 Dijkmat BV
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Kasyk XML Information: Kasyk version 1.0.0, XML version http://www.kasyk.org/1.0, generated on Tue Nov 25 12:09:47 2003.