glimpseindex - index whole file systems to be searched by glimpse
Glimpse (which stands for GLobal IMPlicit SEarch) is a popular UNIX
indexing and query system that allows you to search through a large set of
files very quickly. Glimpseindex is the indexing program for glimpse. Glimpse
supports most of
agrep's options (
agrep is our powerful version
of
grep) including approximate matching (e.g., finding misspelled
words), Boolean queries, and even some limited forms of regular expressions.
It is used in the same way, except that you don't have to specify file names.
So, if you are looking for a
needle anywhere in your file system, all
you have to do is say
glimpse needle and all lines containing
needle will appear preceded by the file name. See man glimpse for
details on how to use glimpse.
Glimpseindex provides three indexing options: a tiny index (2-3% of the total
size of all files), a small index (7-8%) and a medium-size index (20-30%).
Search times are normally better with larger indexes (although unless files
are quite large, the small index is just about as good as the medium one). To
index all your files, you say
glimpseindex ~ for tiny index (where ~
stands for the home directory),
glimpseindex -o ~ for small index, and
glimpseindex -b ~ for medium.
Please submit bug reports or comments at
http://webglimpse.net/bugzilla/ Mail
[email protected] with SUBSCRIBE WGUSERS in the message body to be
added to the webglimpse mailing list, where glimpse discussion is also
directed. HTML version of these manual pages can be found in
http://webglimpse.net/docs/glimpseindexhelp.html Also, see the glimpse home
pages in
http://webglimpse.net/glimpse/
glimpseindex [
-abEfFiInostT -w number -dD
filename(s) -H directory -M number
-S number ]
directory_name[s]
Glimpseindex builds an index of all text files in all the directories
specified and all their subdirectories (recursively). It is also possible to
build several separate indexes (possibly even overlapping). The simplest way
to index your files is to say
glimpseindex -o ~
The index consists of several files (described in detail below), all with the
prefix
.glimpse_ stored in the user's home directory (unless otherwise
specified with the -H option). Files with one of the following suffixes are
not indexed: ".o", ".gz", ".Z", ".z",
".hqx", ".zip", ".tar". (Unless the -z option is
used, see below.) In addition, glimpseindex attempts to determine whether a
file is a text file and does not index files that it thinks are not text
files. Numbers are not indexed unless the -n option is used. It is possible to
prevent specified files from being indexed by adding their names to the
.glimpse_exclude file (described below). The -o option builds a larger index
than without it (typically about 7-8% vs. 2-3% without -o) allowing for a
faster search (1-5 times faster). The -b builds an even larger index and
allows an even faster search some of the time (-b is helpful mostly when large
files are present). There is an incremental indexing option
-f, which
updates an existing index by determining which files have been created or
modified since the index was built and adding them to the index (see -f).
Glimpseindex is reasonably fast, taking about 20 minutes to index 15,000 files
of about 200MB (on an Dec Alpha 233) and 2-4 minutes to update an existing
index. (Your mileage may vary.) It is also possible to increment the index by
adding a specific file (the -a option).
Once an index is built, searching for
pattern is as easy as saying
glimpse pattern
(See man glimpse for all glimpse's options and features.)
Glimpse does not automatically index files. You have to tell it to do it. This
can be done manually, but a better way is to set it to run every night. It is
probably a good idea to run glimpseindex manually for the first time to be
sure it works properly. The following is a simple script to run glimpseindex
every night. We assume that this script is stored in a file called
glimpse.script:
glimpseindex -o -t -w 5000 ~ >& .glimpse_out
at -m 0300 glimpse.script
(It might be interesting to collect all the outputs of glimpse by changing
>& to >>& so that the file .glimpse_out maintains a history.
In this case the file must be created before the first time >>& is
used. If you use ksh, replace '>&' with '2>&1'.)
Glimpseindex stores the names of all the files that it indexed in the file
.glimpse_filenames. Each file is listed by its full path name as obtained at
the time the files were indexed. For example, /usr1/udi/file1. Glimpse uses
this full name when it performs the search, so the name must match the current
name. This may become a problem when the indexing and the search are done from
different machines (e.g., through NFS), which may cause the path names to be
different. For example, /tmp_mnt/R/xxx/xxx/usr1/udi/file1. (The same is true
for several other .glimpse files. See below.)
Glimpseindex does not follow symbolic links unless they are explicitly included
in the .glimpse_include file (described below).
Glimpseindex makes an effort to identify non-text files such as binary files,
compressed files, uuencoded files, postscript files, binhex files, etc. These
files are automatically not indexed. In addition, all files whose names end
with `.o', `.gz', `.Z', `.z', `.hqx', `.zip', or `.tar' will not be indexed
(unless they are specifically included in .glimpse_include - see below).
The options for glimpseindex are as follows:
- -a
- adds the given file[s] and/or directories to an existing
index. Any given directory will be traversed recursively and all files
will be indexed (unless they appear in .glimpse_exclude; see below). Using
this option is generally much faster than indexing everything from
scratch, although in rare cases the index may not be as good. If for some
reason the index is full (which can happen unless -o or -b are used)
glimpseindex -a will produce an error message and will exit without
changing the original index.
- -b
- builds a medium-size index (20-30% of the size of all
files), allowing faster search. This option forces glimpseindex to store
an exact (byte level) pointer to each occurrence of each word (except for
some very common words belonging to the stop list).
- -B
- uses a hash table that is 4 times bigger (256k entries
instead of 64K) to speed up indexing. The memory usage will increase
typically by about 2 MB. This option is only for indexing speed; it does
not affect the final index.
- -d filename(s)
- deletes the given file(s) from the index.
- -D filename(s)
- deletes the given file(s) from the list of file names, but
not from the index. This is much faster than -d, and the file(s) will not
be found by glimpse. However, the index itself will not become
smaller.
- -E
- does not run a check on file types. Glimpse normally
attempts to exclude non-text files, but this attempt is not always
perfect. With -E, glimpseindex indexes all files, except those that are
specifically excluded in .glimpse_exclude and those whose file names end
with one of the excluded suffixes.
- -f
- incremental indexing. glimpseindex scans all files
and adds to the index only those files that were created or modified after
the current index was built. If there is no current index or if this
procedure fails, glimpseindex automatically reverts to the default
mode (which is to index everything from scratch). This option may create
an inefficient index for several reasons, one of which is that deleted
files are not really deleted from the index. Unless changes are small,
mostly additions, and -o is used, we suggest to use the default mode as
much as possible.
- -F
- Glimpseindex receives the list of files to index from
standard input.
- -H directory
- Put or update the index and all other .glimpse files
(listed below) in "directory". The default is the home
directory. When glimpse is run, the -H option must be used to direct
glimpse to this directory, because glimpse assumes that the index is in
the home directory (see also the -H option in glimpse).
- -i
- Make .glimpse_include (SEE GLIMPSEINDEX FILES) take
precedence over .glimpse_exclude, so that, for example, one can exclude
everything (by putting *) and then explicitly include files.
- -I
- Instead of indexing, only show (print to standard out) the
list of files that would be indexed. It is useful for filtering purposes.
("glimpseindex -I dir | glimpseindex -F" is the same as
"glimpseindex dir".)
- -M x
- Tells glimpseindex to use x MB of memory for temporary
tables. The more memory you allow the faster glimpseindex will run. The
default is x=2. The value of x must be a positive integer. Glimpseindex
will need more memory than x for other things, and glimpseindex may
perform some 'forks', so you'll have to experiment if you want to use this
option. WARNING: If x is too large you may run out of swap space.
- -n
- Index numbers as well as text. The default is not to index
numbers. This is useful when searching for dates or other identifying
numbers, but it may make the index very large if there are lots of
numbers. In general, glimpseindex strips away any non-alphabetic
character. For example, the string abc123 will be indexed as abc if the -n
option is not used and as abc123 if it is used. Glimpse provides warnings
(in .glimpse_messages) for all files in which more than half the words
that were added to the index from that file had digits in them (this is an
attempt to identify data files that should probably not be indexed). One
can use the .glimpse_exclude file to exclude data files or any other
files. (See GLIMPSEINDEX FILES.)
- -o
- Build a small index rather than tiny (meaning 7-9% of the
sizes of all files - your mileage may vary) allowing faster search. This
option forces glimpseindex to allocate one block per file (a block usually
contains many files). A detailed explanation of how blocks affect glimpse
can be found in the glimpse article. (See also LIMITATIONS.)
- -R
- Recompute .glimpse_filenames_index from .glimpse_filenames.
The file .glimpse_filenames_index speeds up processing. Glimpseindex
usually computes it automatically. However, if for some reason one wants
to change the path names of the files listed in .glimpse_filenames, then
running glimpseindex -R recomputes .glimpse_filenames_index. This is
useful if the index is computed on one machine, but is used on another
(with the same hierarchy). The names of the files listed in
.glimpse_filenames are used in runtime, so changing them can be done at
any time in any way (as long as just the names not the content is
changed). This is not really an option in the regular sense; rather, it is
a program by itself, and it is meant as a post-processing step. (Available
only from version 3.6.)
- -s
- supports structured queries. This option was added to
support the Harvest project and it is applicable mostly in that context.
See STRUCTURED QUERIES below for more information and also
http://harvest.sourceforge.net/ for more information about the Harvest
project.
- -S k
- The number k determines the size of the stop-list.
The stop-list consists of words that are too common and are not indexed
(e.g., 'the' or 'and'). Instead of having a fixed stop-list, glimpseindex
figures out the words that are too common for every index separately. The
rules are different for the different indexing options. The tiny index
contains all words (the savings from a stop-list are too small to bother).
The small index (-o), the number k is a percentage threshold. A word will
be in the stop list if it appears in at least k% of all files. The default
value is 80%. (If there are less than 256 files, then the stop-list is not
maintained.) The medium index (-b) counts all occurrences of all words,
and a word is added to the stop-list if it appears at least k times per
MByte. The default value is 500. A query that includes a stop list word is
of course less efficient. (See also LIMITATIONS below.)
- -t
- (A new option in version 3.5.) The order in which files are
indexed is determined by scanning the directories, which is mostly
arbitrary. With the -t option, combined with either -o and -b, the indexed
files are stored in reversed order of modification age (younger files
first). Results of queries are then automatically returned in this order.
Furthermore, glimpse can filter results by age; for example, asking to
look at only files that are at most 5 days old.
- -T
- builds the turbo file. Starting at version 3.0, this is the
default, so using this option has no effect.
- -w k
- Glimpseindex does a reasonable, but not a perfect, job of
determining which files should not be indexed. Sometimes a large text file
should not be indexed; for example, a dictionary may match most queries.
The -w option stores in a file called .glimpse_messages (in the same
directory as the index) the list of all files that contribute at least
k new words to the index. The user can look at this list of files
and decide which should or should not be indexed. The file
.glimpse_exclude contains files that will not be indexed (see more below).
We recommend to set k to about 1000. This is not an exact measure.
For example, if the same file appears twice, then the second copy will not
contribute any new words to the dictionary (but if you exclude the first
copy and index again, the second copy will contribute).
- -X
- (starting at version 4.0B1) Extract titles from HTML pages
and add the titles to the index (in .glimpse_filenames). (This feature was
added to improve the performance of WebGlimpse.) Works only on files whose
names end with .html, .htm, .shtml, and .shtm. (see
glimpse.h/EXTRACT_INFO_SUFFIX to add to these suffixes.) The routine to
extract titles is called extract_info, in index/filetype.c. This feature
can be modified in various ways to extract info from many filetypes. The
titles are appended to the corresponding filenames with a space separator.
Glimpseindex assumes that filenames don't have spaces in them.
- -z
- Allow customizable filtering, using the file
.glimpse_filters to perform the programs listed there for each match. The
best example is compress/decompress. If .glimpse_filters include the line
*.Z uncompress <
(separated by tabs) then before indexing any file that matches the pattern
"*.Z" (same syntax as the one for .glimpse_exclude) the command
listed is executed first (assuming input is from stdin, which is why
uncompress needs <) and its output (assuming it goes to stdout) is
indexed. The file itself is not changed (i.e., it stays compressed). Then
if glimpse -z is used, the same program is used on these files on the fly.
Any program can be used (we run 'exec'). For example, one can filter out
parts of files that should not be indexed. Glimpseindex tries to apply all
filters in .glimpse_filters in the order they are given. For example, if
you want to uncompress a file and then extract some part of it, put the
compression command (the example above) first and then another line that
specifies the extraction. Note that this can slow down the search because
the filters need to be run before files are searched.
All files used by glimpse are located at the directory(ies) where the index(es)
is (are) stored and have .glimpse_ as a prefix. The first two files
(.glimpse_exclude and .glimpse_include) are optionally supplied by the user.
The other files are built and read by glimpse.
- .glimpse_exclude
- contains a list of files that glimpseindex is explicitly
told to ignore. In general, the syntax of .glimpse_exclude/include is the
same as that of agrep (or any other grep). The lines in the
.glimpse_exclude file are matched to the file names, and if they match,
the files are excluded. Notice that agrep matches to parts of the string!
e.g., agrep /ftp/pub will match /home/ftp/pub and /ftp/pub/whatever. So,
if you want to exclude /ftp/pub/core, you just list it, as is, in the
.glimpse_exclude file. If you put "/home/ftp/pub/cdrom" in
.glimpse_exclude, every file name that matches that string will be
excluded, meaning all files below it. You can use ^ to indicate the
beginning of a file name, and $ to indicate the end of one, and you can
use * and ? in the usual way. For example /ftp/*html will exclude
/ftp/pub/foo.html, but will also exclude /home/ftp/pub/html/whatever; if
you want to exclude files that start with /ftp and end with html use
^/ftp*html$ Notice that putting a * at the beginning or at the end is
redundant (in fact, in this case glimpseindex will remove the * when it
does the indexing). No other meta characters are allowed in
.glimpse_exclude (e.g., don't use .* or # or |). Lines with * or ? must
have no more than 30 characters. Notice that, although the index itself
will not be indexed, the list of file names (.glimpse_filenames) will be
indexed unless it is explicitly listed in .glimpse_exclude.
- .glimpse_filters
- See the description above for the -z option.
- .glimpse_include
- contains a list of files that glimpseindex is explicitly
told to include in the index even though they may look like
non-text files. Symbolic links are followed by glimpseindex only if they
are specifically included here. The syntax is the same as the one for
.glimpse_exclude (see there). If a file is in both .glimpse_exclude and
.glimpse_include it will be excluded unless -i is used.
- .glimpse_filenames
- contains the list of all indexed file names, one per line.
This is an ASCII file that can also be used with agrep to search for a
file name leading to a fast find command. For example,
glimpse 'count#\.c$' ~/.glimpse_filenames
will output the names of all (indexed) .c files that have 'count' in their
name (including anywhere on the path from the index). Setting the
following alias in the .login file may be useful:
alias findfile 'glimpse -h !:1 ~/.glimpse_filenames'
- .glimpse_index
- contains the index. The index consists of lines, each
starting with a word followed by a list of block numbers (unless the -o or
-b options are used, in which case each word is followed by an offset into
the file .glimpse_partitions where all pointers are kept). The block/file
numbers are stored in binary form, so this is not an ASCII file.
- .glimpse_messages
- contains the output of the -w option (see above).
- .glimpse_partitions
- contains the partition of the indexed space into blocks
and, when the index is built with the -o or -b options, some part of the
index. This file is used internally by glimpse and it is a non-ASCII
file.
- .glimpse_statistics
- contains some statistics about the makeup of the index.
Useful for some advanced applications and customization of glimpse.
Glimpse can search for Boolean combinations of "attribute=value" terms
by using the Harvest SOIF parser library (in glimpse/libtemplate). To search
this way, the index must be made by using the -s option of glimpseindex (this
can be used in conjunction with other glimpseindex options). For glimpse and
glimpseindex to recognize "structured" files, they must be in SOIF
format. In this format, each value is prefixed by an attribute-name with the
size of the value (in bytes) present in "{}" after the name of the
attribute. For example, The following lines are part of an SOIF file:
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
Any string can serve as an attribute name. Glimpse
"pattern;type=Directory-Listing" will search for "pattern"
only in files whose type is "Directory-Listing". The file itself is
considered to be one "object" and its name/url appears as the first
attribute with an "@" prefix; e.g., @FILE {
http://xxx... } The
scope of Boolean operations changes from records (lines) to whole files when
structured queries are used in glimpse (since individual query terms can look
at different attributes and they may not be "covered" by the
record/line). Note that glimpse can only search for patterns in the value
parts of the SOIF file: there are some attributes (like the TTL, MD5, etc.)
that are interpreted by Harvest's internal routines. See RFC 2655 for more
detailed information of the SOIF format.
- 1.
- U. Manber and S. Wu, "GLIMPSE: A Tool to Search
Through Entire File Systems," Usenix Winter 1994 Technical
Conference (best paper award), San Francisco (January 1994), pp.
23-32. Also, Technical Report #TR 93-34, Dept. of Computer Science,
University of Arizona, October 1993 (a postscript file is available by
anonymous ftp at ftp://webglimpse.net/pub/glimpse/TR93-34.ps).
- 2.
- S. Wu and U. Manber, "Fast Text Searching Allowing
Errors," Communications of the ACM 35 (October 1992),
pp. 83-91.
agrep(1),
ed(1),
ex(1),
glimpse(1),
glimpseserver(1),
grep(1V),
sh(1),
csh(1).
The index of glimpse is word based. A pattern that contains more than one word
cannot be found in the index. The way glimpse overcomes this weakness is by
splitting any multi-word pattern into its set of words and looking for all of
them in the index. For example,
glimpse 'linear programming' will first
consult the index to find all files containing both
linear and
programming, and then apply agrep to find the combined pattern. This is
usually an effective solution, but it can be slow for cases where both words
are very common, but their combination is not.
The index of glimpse stores all patterns in lower case. When glimpse searches
the index it first converts all patterns to lower case, finds the appropriate
files, and then searches the actual files using the original patterns. So, for
example,
glimpse ABCXYZ will first find all files containing abcxyz in
any combination of lower and upper cases, and then searches these files
directly, so only the right cases will be found. One problem with this
approach is discovering misspellings that are caused by wrong cases. For
example,
glimpse -B abcXYZ will first search the index for the best
match to abcxyz (because the pattern is converted to lower case); it will find
that there are matches with no errors, and will go to those files to search
them directly, this time with the original upper cases. If the closest match
is, say AbcXYZ, glimpse may miss it, because it doesn't expect an error.
Another problem is speed. If you search for "ATT", it will look at
the index for "att". Unless you use -w to match the whole word,
glimpse may have to search all files containing, for example,
"Seattle" which has "att" in it.
There is no size limit for simple patterns and simple patterns with Boolean AND
or OR. More complicated patterns are currently limited to approximately 30
characters. Lines are limited to 1024 characters. Records are limited to 48K,
and may be truncated if they are larger than that. The limit of record length
can be changed by modifying the parameter Max_record in agrep.h.
Each line in .glimpse_exclude or .glimpse_include that contains a * or a ? must
not exceed 30 characters length.
Glimpseindex does not index words of size > 64.
A medium-size index (-b) may lead to actually slower query times if the files
are all very small.
Under -b, it may be impossible to make the stop list empty. Glimpseindex is
using the "sort" routine, and all occurrences of a word appear at
some point on one line. Sort is limiting the size of lines it can handle (the
value depends on the platform; ours is 16KB). If the lines are too big, the
word is added to the stop list.
Please submit bug reports or comments at
http://webglimpse.net/bugzilla/
(Only in version 3.6 and above.)
exit status 0: terminated normally;
exit status 1: glimpseindex errors (e.g., bad option combos, no files were
indexed, etc.)
exit status 2: system errors (e.g., write failed, sort failed, malloc failed).
Udi Manber and Burra Gopal, Department of Computer Science, University of
Arizona, and Sun Wu, the National Chung-Cheng University, Taiwan. Now
maintained by Golda Velez at Internet WorkShop (Email:
[email protected])