NAME
Odeum - the inverted API of QDBM
SYNOPSIS
#include <depot.h>
DESCRIPTION
Odeum is the API that handles an inverted index. An inverted index is a data structure used to retrieve the list of documents that contain a given word, where the words were extracted from a population of documents. An inverted index makes it easy to build a full-text search system. Odeum provides an abstract data structure consisting of the words and attributes of a document. It is used both when an application stores a document in a database and when it retrieves documents from a database.
Odeum does not provide methods to extract the text from the original data of a document; that should be implemented by the application. Although Odeum provides utilities to extract words from a text, they are oriented to languages such as English, whose words are separated by space characters. If an application handles languages such as Japanese, which need morphological analysis or N-gram analysis, or if it performs more specialized analysis of natural languages such as stemming, it can adopt its own analysis method. A search result is expressed as an array whose elements are structures composed of the ID number of a document and its score. In order to search with two or more words, Odeum provides utilities for set operations.
Odeum is implemented on top of Curia, Cabin, and Villa. Odeum creates a database under a directory name, and several databases of Curia and Villa are placed in the specified directory. For example, `casket/docs', `casket/index', and `casket/rdocs' are created when the database directory is named `casket'. `docs' is a database directory of Curia: the key of each record is the ID number of a document, and the value is its attributes, such as the URI. `index' is a database directory of Curia: the key of each record is the normalized form of a word, and the value is an array whose elements are pairs of the ID number of a document including the word and its score. `rdocs' is a database file of Villa: the key of each record is the URI of a document, and the value is its ID number.
In order to use Odeum, you should include `depot.h', `cabin.h', `odeum.h' and `stdlib.h' in the source files. Usually, the following description will be near the beginning of a source file.
#include <depot.h>
#include <cabin.h>
#include <odeum.h>
#include <stdlib.h>
A pointer to `ODEUM' is used as a database handle. A database handle is opened
with the function `odopen' and closed with `odclose'. You should not refer
directly to any member of the handle. If a fatal error occurs in a database,
any access method via the handle except `odclose' will not work and will return
an error status. Although a process is allowed to use multiple database handles
at the same time, two or more handles of the same database should not be used.
A pointer to `ODDOC' is used as a document handle. A document handle is opened
with the function `oddocopen' and closed with `oddocclose'. You should not
refer directly to any member of the handle. A document consists of attributes
and words. Each word is expressed as a pair of a normalized form and an
appearance form.
Odeum also assigns the error code to the external variable `dpecode'. The
function `dperrmsg' is used in order to get the message of the error code.
Structures of `ODPAIR' type are used in order to handle results of search.
- typedef struct { int id; int score; } ODPAIR;
- `id' specifies the ID number of a document. `score' specifies the score calculated from the number of searching words in the document.
- ODEUM *odopen(const char *name, int omode);
- `name' specifies the name of a database directory. `omode' specifies the connection mode: `OD_OWRITER' as a writer, `OD_OREADER' as a reader. If the mode is `OD_OWRITER', the following may be added by bitwise or: `OD_OCREAT', which means it creates a new database if one does not exist, and `OD_OTRUNC', which means it creates a new database regardless of whether one exists. Either `OD_OREADER' or `OD_OWRITER' can be combined by bitwise or with `OD_ONOLCK', which means it opens a database directory without file locking, or `OD_OLCKNB', which means locking is performed without blocking. The return value is the database handle or `NULL' if it is not successful. While connecting as a writer, an exclusive lock is applied to the database directory. While connecting as a reader, a shared lock is applied to the database directory. The thread blocks until the lock is acquired. If `OD_ONOLCK' is used, the application is responsible for exclusion control.
- int odclose(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is true, else, it is false. Because the region of a closed handle is released, it becomes impossible to use the handle. Updates to a database are assured to be written when the handle is closed. If a writer opens a database but does not close it appropriately, the database will be broken.
- int odput(ODEUM *odeum, const ODDOC *doc, int wmax, int over);
- `odeum' specifies a database handle connected as a writer. `doc' specifies a document handle. `wmax' specifies the max number of words to be stored in the document database. If it is negative, the number is unlimited. `over' specifies whether the data of the duplicated document is overwritten or not. If it is false and the URI of the document is duplicated, the function returns as an error. If successful, the return value is true, else, it is false.
- int odout(ODEUM *odeum, const char *uri);
- `odeum' specifies a database handle connected as a writer. `uri' specifies the string of the URI of a document. If successful, the return value is true, else, it is false. False is returned when no document corresponds to the specified URI.
- int odoutbyid(ODEUM *odeum, int id);
- `odeum' specifies a database handle connected as a writer. `id' specifies the ID number of a document. If successful, the return value is true, else, it is false. False is returned when no document corresponds to the specified ID number.
- ODDOC *odget(ODEUM *odeum, const char *uri);
- `odeum' specifies a database handle. `uri' specifies the string of the URI of a document. If successful, the return value is the handle of the corresponding document, else, it is `NULL'. `NULL' is returned when no document corresponds to the specified URI. Because the handle of the return value is opened with the function `oddocopen', it should be closed with the function `oddocclose'.
- ODDOC *odgetbyid(ODEUM *odeum, int id);
- `odeum' specifies a database handle. `id' specifies the ID number of a document. If successful, the return value is the handle of the corresponding document, else, it is `NULL'. `NULL' is returned when no document corresponds to the specified ID number. Because the handle of the return value is opened with the function `oddocopen', it should be closed with the function `oddocclose'.
- int odgetidbyuri(ODEUM *odeum, const char *uri);
- `odeum' specifies a database handle. `uri' specifies the string of the URI of a document. If successful, the return value is the ID number of the document, else, it is -1. -1 is returned when no document corresponds to the specified URI.
- int odcheck(ODEUM *odeum, int id);
- `odeum' specifies a database handle. `id' specifies the ID number of a document. The return value is true if the document exists, else, it is false.
- ODPAIR *odsearch(ODEUM *odeum, const char *word, int max, int *np);
- `odeum' specifies a database handle. `word' specifies a searching word. `max' specifies the max number of documents to be retrieved. `np' specifies the pointer to a variable to which the number of the elements of the return value is assigned. If successful, the return value is the pointer to an array, else, it is `NULL'. Each element of the array is a pair of the ID number and the score of a document, and the elements are sorted in descending order of their scores. Even if no document corresponds to the specified word, it is not an error and a dummy array is returned. Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call if it is no longer in use. Note that each element of the array of the return value can be data of a deleted document.
- int odsearchdnum(ODEUM *odeum, const char *word);
- `odeum' specifies a database handle. `word' specifies a searching word. If successful, the return value is the number of documents including the word, else, it is -1. Because this function does not read the entity of the inverted index, it is faster than `odsearch'.
- int oditerinit(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is true, else, it is false. The iterator is used in order to access every document stored in a database.
- ODDOC *oditernext(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is the handle of the next document, else, it is `NULL'. `NULL' is returned when no document remains to be fetched from the iterator. It is possible to access every document by calling this function repeatedly. However, this is not assured if the database is updated during the iteration. Besides, the order of this traversal is arbitrary, so it is not assured that the order of storing matches the order of traversal. Because the handle of the return value is opened with the function `oddocopen', it should be closed with the function `oddocclose'.
- int odsync(ODEUM *odeum);
- `odeum' specifies a database handle connected as a writer. If successful, the return value is true, else, it is false. This function is useful when another process uses the connected database directory.
- int odoptimize(ODEUM *odeum);
- `odeum' specifies a database handle connected as a writer. If successful, the return value is true, else, it is false. Elements of the deleted documents in the inverted index are purged.
- char *odname(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is the pointer to the region of the name of the database, else, it is `NULL'. Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call if it is no longer in use.
- double odfsiz(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is the total size of the database files, else, it is -1.0.
- int odbnum(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is the total number of the elements of the bucket arrays, else, it is -1.
- int odbusenum(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is the total number of the used elements of the bucket arrays, else, it is -1.
- int oddnum(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is the number of the documents stored in the database, else, it is -1.
- int odwnum(ODEUM *odeum);
- `odeum' specifies a database handle. If successful, the return value is the number of the words stored in the database, else, it is -1. Because of the I/O buffer, the return value may be less than the actual number.
- int odwritable(ODEUM *odeum);
- `odeum' specifies a database handle. The return value is true if the handle is a writer, false if not.
- int odfatalerror(ODEUM *odeum);
- `odeum' specifies a database handle. The return value is true if the database has a fatal error, false if not.
- int odinode(ODEUM *odeum);
- `odeum' specifies a database handle. The return value is the inode number of the database directory.
- time_t odmtime(ODEUM *odeum);
- `odeum' specifies a database handle. The return value is the last modified time of the database.
- int odmerge(const char *name, const CBLIST *elemnames);
- `name' specifies the name of a database directory to create. `elemnames' specifies a list of names of element databases. If successful, the return value is true, else, it is false. If two or more documents which have the same URI come in, the first one is adopted and the others are ignored.
- int odremove(const char *name);
- `name' specifies the name of a database directory. If successful, the return value is true, else, it is false. A database directory can contain databases of other APIs of QDBM; they are also removed by this function.
- ODDOC *oddocopen(const char *uri);
- `uri' specifies the URI of a document. The return value is a document handle. The ID number of a new document is not defined. It is defined when the document is stored in a database.
- void oddocclose(ODDOC *doc);
- `doc' specifies a document handle. Because the region of a closed handle is released, it becomes impossible to use the handle.
- void oddocaddattr(ODDOC *doc, const char *name, const char *value);
- `doc' specifies a document handle. `name' specifies the string of the name of an attribute. `value' specifies the string of the value of the attribute.
- void oddocaddword(ODDOC *doc, const char *normal, const char *asis);
- `doc' specifies a document handle. `normal' specifies the string of the normalized form of a word. Normalized forms are treated as keys of the inverted index. If the normalized form of a word is an empty string, the word is not reflected in the inverted index. `asis' specifies the string of the appearance form of the word. Appearance forms are used after the document is retrieved by an application.
- int oddocid(const ODDOC *doc);
- `doc' specifies a document handle. The return value is the ID number of a document.
- const char *oddocuri(const ODDOC *doc);
- `doc' specifies a document handle. The return value is the string of the URI of a document.
- const char *oddocgetattr(const ODDOC *doc, const char *name);
- `doc' specifies a document handle. `name' specifies the string of the name of an attribute. The return value is the string of the value of the attribute, or `NULL' if no attribute corresponds.
- const CBLIST *oddocnwords(const ODDOC *doc);
- `doc' specifies a document handle. The return value is a list handle containing the words in normalized form.
- const CBLIST *oddocawords(const ODDOC *doc);
- `doc' specifies a document handle. The return value is a list handle containing the words in appearance form.
- CBMAP *oddocscores(const ODDOC *doc, int max, ODEUM *odeum);
- `doc' specifies a document handle. `max' specifies the max number of keywords to get. `odeum' specifies a database handle with which the IDF for weighting is calculated. If it is `NULL', it is not used. The return value is a map handle containing keywords and their scores. Scores are expressed as decimal strings. Because the handle of the return value is opened with the function `cbmapopen', it should be closed with the function `cbmapclose' if it is no longer in use.
- CBLIST *odbreaktext(const char *text);
- `text' specifies the string of a text. The return value is a list handle containing the words in appearance form. Words are separated by space characters and by delimiters such as period, comma and so on. Because the handle of the return value is opened with the function `cblistopen', it should be closed with the function `cblistclose' if it is no longer in use.
- char *odnormalizeword(const char *asis);
- `asis' specifies the string of the appearance form of a word. The return value is the string of the normalized form of the word. ASCII alphabetic characters are converted to lower case. Words composed of only delimiters are treated as empty strings. Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call if it is no longer in use.
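The normalization rule described for `odnormalizeword' can be pictured with a small stand-alone sketch. It does not call QDBM: `sketch_normalize' and the delimiter set `SKETCH_DELIMS' are hypothetical names invented for illustration, and the actual delimiter set used by Odeum is not listed in this page, so treat the details as assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical delimiter set; the set Odeum really uses is not given in the text. */
static const char *SKETCH_DELIMS = ".,;:!?\"'()[]{}<>-";

/* Return a malloc'ed normalized copy of `asis': ASCII letters are lowered,
   and a word made only of delimiters becomes an empty string.
   The caller releases the result with free(). Returns NULL if malloc fails. */
char *sketch_normalize(const char *asis){
  size_t len, i;
  int only_delims = 1;
  char *normal;
  assert(asis != NULL);
  len = strlen(asis);
  /* check whether every character is a delimiter */
  for(i = 0; i < len; i++){
    if(!strchr(SKETCH_DELIMS, asis[i])){
      only_delims = 0;
      break;
    }
  }
  if(!(normal = malloc(len + 1))) return NULL;
  if(only_delims){
    normal[0] = '\0';
    return normal;
  }
  /* lower only single-byte ASCII characters and copy the rest unchanged */
  for(i = 0; i < len; i++){
    unsigned char c = (unsigned char)asis[i];
    normal[i] = (c < 0x80) ? (char)tolower(c) : (char)c;
  }
  normal[len] = '\0';
  return normal;
}
```

As with the real function, the returned region comes from `malloc', so the caller is expected to `free' it after passing the normalized form on (for example to `oddocaddword').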
- ODPAIR *odpairsand(ODPAIR *apairs, int anum, ODPAIR *bpairs, int bnum, int *np);
- `apairs' specifies the pointer to the former document array. `anum' specifies the number of the elements of the former document array. `bpairs' specifies the pointer to the latter document array. `bnum' specifies the number of the elements of the latter document array. `np' specifies the pointer to a variable to which the number of the elements of the return value is assigned. The return value is the pointer to a new document array whose elements commonly belong to the specified two sets. Elements of the array are sorted in descending order of their scores. Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call if it is no longer in use.
- ODPAIR *odpairsor(ODPAIR *apairs, int anum, ODPAIR *bpairs, int bnum, int *np);
- `apairs' specifies the pointer to the former document array. `anum' specifies the number of the elements of the former document array. `bpairs' specifies the pointer to the latter document array. `bnum' specifies the number of the elements of the latter document array. `np' specifies the pointer to a variable to which the number of the elements of the return value is assigned. The return value is the pointer to a new document array whose elements belong to both or either of the specified two sets. Elements of the array are sorted in descending order of their scores. Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call if it is no longer in use.
- ODPAIR *odpairsnotand(ODPAIR *apairs, int anum, ODPAIR *bpairs, int bnum, int *np);
- `apairs' specifies the pointer to the former document array. `anum' specifies the number of the elements of the former document array. `bpairs' specifies the pointer to the latter document array. `bnum' specifies the number of the elements of the latter document array. `np' specifies the pointer to a variable to which the number of the elements of the return value is assigned. The return value is the pointer to a new document array whose elements belong to the former set but not to the latter set. Elements of the array are sorted in descending order of their scores. Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call if it is no longer in use.
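The three set operations above share one pattern: build a new array from two arrays of ID/score pairs, then sort it by descending score. The following is a minimal stand-alone sketch of that pattern, not the QDBM implementation. `SKPAIR', `sk_combine' and the `SK_*' constants are hypothetical names. Two details are assumptions that this page does not state: scores of a document found in both arrays are summed, and ties are broken by ascending ID.

```c
#include <assert.h>
#include <stdlib.h>

/* Local stand-in with the same members as the ODPAIR structure. */
typedef struct { int id; int score; } SKPAIR;

enum { SK_AND, SK_OR, SK_NOTAND };

/* qsort comparator: descending score; ascending ID on ties (assumed rule). */
static int sk_cmp(const void *a, const void *b){
  const SKPAIR *x = a, *y = b;
  if(x->score != y->score) return (x->score < y->score) ? 1 : -1;
  return (x->id > y->id) - (x->id < y->id);
}

/* Linear search by ID; fine for a sketch, too slow for large arrays. */
static const SKPAIR *sk_find(const SKPAIR *pairs, int num, int id){
  int i;
  for(i = 0; i < num; i++){
    if(pairs[i].id == id) return pairs + i;
  }
  return NULL;
}

/* Combine two pair arrays; the result is malloc'ed and *np gets its length. */
SKPAIR *sk_combine(int op, const SKPAIR *ap, int anum,
                   const SKPAIR *bp, int bnum, int *np){
  SKPAIR *res;
  int i, n = 0;
  assert(anum >= 0 && bnum >= 0 && np != NULL);
  /* +1 so that an empty result is still a valid (dummy) array */
  if(!(res = malloc(sizeof(*res) * (size_t)(anum + bnum + 1)))) return NULL;
  for(i = 0; i < anum; i++){
    const SKPAIR *hit = sk_find(bp, bnum, ap[i].id);
    if(op == SK_AND && !hit) continue;      /* AND keeps common elements */
    if(op == SK_NOTAND && hit) continue;    /* NOTAND drops common elements */
    res[n].id = ap[i].id;
    res[n].score = ap[i].score + (hit ? hit->score : 0);  /* assumed: sum */
    n++;
  }
  if(op == SK_OR){
    /* OR also takes the elements that appear only in the latter array */
    for(i = 0; i < bnum; i++){
      if(!sk_find(ap, anum, bp[i].id)) res[n++] = bp[i];
    }
  }
  qsort(res, (size_t)n, sizeof(*res), sk_cmp);
  *np = n;
  return res;
}
```

With the real API the same flow would be two `odsearch' calls followed by `odpairsand', `odpairsor' or `odpairsnotand', and a `free' for each of the three arrays.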
- void odpairssort(ODPAIR *pairs, int pnum);
- `pairs' specifies the pointer to a document array. `pnum' specifies the number of the elements of the document array. The elements of the array are sorted in place in descending order of their scores.
- double odlogarithm(double x);
- `x' specifies a number. The return value is the natural logarithm of the number. If the number is equal to or less than 1.0, the return value is 0.0. This function is useful when an application calculates the IDF of search results.
- double odvectorcosine(const int *avec, const int *bvec, int vnum);
- `avec' specifies the pointer to one array of numbers. `bvec' specifies the pointer to the other array of numbers. `vnum' specifies the number of elements of each array. The return value is the cosine of the angle between the two vectors. This function is useful when an application calculates the similarity of documents.
- void odsettuning(int ibnum, int idnum, int cbnum, int csiz);
- `ibnum' specifies the number of buckets for inverted indexes. `idnum' specifies the division number of the inverted index. `cbnum' specifies the number of buckets for dirty buffers. `csiz' specifies the maximum size in bytes of the memory used for dirty buffers. The default setting is equivalent to `odsettuning(32749, 7, 262139, 8388608)'. This function should be called before opening a handle.
- void odanalyzetext(ODEUM *odeum, const char *text, CBLIST *awords, CBLIST *nwords);
- `odeum' specifies a database handle. `text' specifies the string of a text. `awords' specifies a list handle into which the appearance forms are stored. `nwords' specifies a list handle into which the normalized forms are stored. If it is `NULL', it is ignored. Words are separated by space characters and by delimiters such as period, comma and so on.
- void odsetcharclass(ODEUM *odeum, const char *spacechars, const char *delimchars, const char *gluechars);
- `odeum' specifies a database handle. `spacechars' specifies a string containing space characters. `delimchars' specifies a string containing delimiter characters. `gluechars' specifies a string containing glue characters.
- ODPAIR *odquery(ODEUM *odeum, const char *query, int *np, CBLIST *errors);
- `odeum' specifies a database handle. `query' specifies the text of the query. `np' specifies the pointer to a variable to which the number of the elements of the return value is assigned. `errors' specifies a list handle into which error messages are stored. If it is `NULL', it is ignored. If successful, the return value is the pointer to an array, else, it is `NULL'. Each element of the array is a pair of the ID number and the score of a document, and the elements are sorted in descending order of their scores. Even if no document corresponds to the specified condition, it is not an error and a dummy array is returned. Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call if it is no longer in use. Note that each element of the array of the return value can be data of a deleted document.
The query is written in a simple expression language with the following grammar:
expr ::= subexpr ( op subexpr )*
subexpr ::= WORD
subexpr ::= LPAREN expr RPAREN
Operators are "&" (AND), "|" (OR), and "!" (NOTAND). You can use parentheses to
group sub-expressions together in order to change the order of operations. The
given query is broken up using the function `odanalyzetext', so if you want to
specify different text breaking rules, make sure that you at least set "&",
"|", "!", "(", and ")" to be delimiter characters. Consecutive words are treated
as having an implicit "&" operator between them, so "zed shaw" is actually
"zed & shaw".
The encoding of the query text should be the same as the encoding of the target
documents. Moreover, each of the space characters, delimiter characters, and
glue characters should be a single byte.
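To make the grammar concrete, here is a minimal stand-alone sketch of a recursive-descent evaluator for it. It does not use QDBM. The toy index `SK_INDEX', which maps each word to a bit set of document IDs, and the functions `sk_eval', `sk_expr' and `sk_subexpr' are hypothetical names invented for illustration. The sketch reads the grammar literally: all operators have equal precedence and are applied left to right. That reading is an assumption and may differ from what `odquery' does internally.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical toy index: bit n of `docs' is set if document n has the word. */
typedef struct { const char *word; unsigned docs; } SKENTRY;

static const SKENTRY SK_INDEX[] = {
  { "zed",  0x3u },   /* documents 0 and 1 */
  { "shaw", 0x6u },   /* documents 1 and 2 */
  { "qdbm", 0x8u }    /* document 3 */
};

/* Look up a word of the given length; unknown words match no document. */
static unsigned sk_lookup(const char *word, size_t len){
  size_t i;
  for(i = 0; i < sizeof(SK_INDEX) / sizeof(SK_INDEX[0]); i++){
    if(strlen(SK_INDEX[i].word) == len && !strncmp(SK_INDEX[i].word, word, len))
      return SK_INDEX[i].docs;
  }
  return 0;
}

static void sk_skip(const char **pp){
  while(isspace((unsigned char)**pp)) (*pp)++;
}

static unsigned sk_expr(const char **pp);

/* subexpr ::= WORD | LPAREN expr RPAREN */
static unsigned sk_subexpr(const char **pp){
  const char *start;
  unsigned set;
  sk_skip(pp);
  if(**pp == '('){
    (*pp)++;
    set = sk_expr(pp);
    sk_skip(pp);
    if(**pp == ')') (*pp)++;   /* a missing ")" is tolerated in this sketch */
    return set;
  }
  start = *pp;
  while(**pp && !isspace((unsigned char)**pp) && !strchr("&|!()", **pp)) (*pp)++;
  return sk_lookup(start, (size_t)(*pp - start));
}

/* expr ::= subexpr ( op subexpr )*, evaluated left to right */
static unsigned sk_expr(const char **pp){
  unsigned set = sk_subexpr(pp);
  for(;;){
    char op = '&';             /* consecutive words imply "&" */
    unsigned rhs;
    sk_skip(pp);
    if(**pp == '\0' || **pp == ')') return set;
    if(strchr("&|!", **pp)) op = *(*pp)++;
    rhs = sk_subexpr(pp);
    if(op == '&') set &= rhs;
    else if(op == '|') set |= rhs;
    else set &= ~rhs;          /* "!" is NOTAND: in the former, not the latter */
  }
}

/* Evaluate a query string and return the bit set of matching documents. */
unsigned sk_eval(const char *query){
  assert(query != NULL);
  return sk_expr(&query);
}
```

With this toy index, `sk_eval("zed shaw")' and `sk_eval("zed & shaw")' give the same bit set, and parentheses change the result of a mixed query such as "zed & (shaw | qdbm)" compared with "zed & shaw | qdbm". A real application would replace the bit sets with `ODPAIR' arrays and the bit operations with `odpairsand', `odpairsor' and `odpairsnotand'.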
SEE ALSO
qdbm(3), depot(3), curia(3), relic(3), hovel(3), cabin(3), villa(3), ndbm(3), gdbm(3)
2004-04-22