glimpse - search quickly through entire file systems
Glimpse (which stands for GLobal IMPlicit SEarch) is a very popular UNIX
indexing and query system that allows you to search through a large set of
files very quickly. Glimpse supports most of agrep's options (agrep is our
powerful version of grep) including approximate matching (e.g., finding
misspelled words), Boolean queries, and even some limited forms of regular
expressions. It is used in the same way, except that you don't have to
specify file names. So, if you are looking for a needle anywhere in your file
system, all you have to do is say glimpse needle and all lines containing
needle will appear preceded by the file name.
To use glimpse you first need to index your files with glimpseindex. For
example, glimpseindex -o ~ will index everything at or below your home
directory. See man glimpseindex for more details.
Glimpse is also available for web sites, as a set of tools called
WebGlimpse. (The old glimpseHTTP is no longer supported and is not
recommended.) See
http://webglimpse.net/ for more information.
Glimpse includes all of agrep and can be used instead of agrep by giving one
or more file names at the end of the command. This causes glimpse to ignore
the index and run agrep as usual. For example, glimpse -1 pattern file is the
same as agrep -1 pattern file. Agrep is distributed as a self-contained
package within glimpse, and can be used separately. We added a new option to
agrep: -r recursively searches the directory and everything below it (see
agrep options below); it is used only when glimpse reverts to agrep.
Mail
[email protected] with SUBSCRIBE wgusers in the body to be added to
the Webglimpse users mailing list. This is now the location where glimpse
questions are also discussed. Bugs can be reported at
http://webglimpse.net/bugzilla/. An HTML version of these manual pages can be
found at http://webglimpse.net/docs/glimpsehelp.html. Also, see the glimpse
home page at http://webglimpse.net/glimpse
glimpse - [almost all letters]
pattern
We start with simple ways to use glimpse and describe all the options in detail
later on. Once an index is built, using glimpseindex, searching for
pattern is as easy as saying
glimpse pattern
The output of glimpse is similar to that of agrep (or any other grep). The
pattern can be any legal agrep pattern, including a regular expression or a
Boolean query (e.g., searching for Tucson AND Arizona is done by glimpse
'Tucson;Arizona').
The speed of glimpse depends mainly on the number and sizes of the files that
contain a match and only to a second degree on the total size of all indexed
files. If the pattern is reasonably uncommon, then all matches will be
reported in a few seconds even if the indexed files total 500MB or more. Some
information on how glimpse works and a reference to a detailed article are
given below.
Most of the options of agrep (and of other greps) are supported, including
approximate matching. For example,
glimpse -1 'Tuson;Arezona'
will output all lines containing both patterns allowing one spelling error in
any of the patterns (either insertion, deletion, or substitution), which in
this case is definitely needed.
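To make the notion of "one spelling error" concrete, the following Python sketch counts the minimum number of insertions, deletions, and substitutions needed to match a pattern against some substring of a line. It is a textbook dynamic-programming model written for this illustration, not agrep's own implementation, and the names min_errors and line_matches are hypothetical.

```python
def min_errors(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    m = len(pattern)
    # prev[i] = errors needed to match pattern[:i] before any text is read.
    prev = list(range(m + 1))
    best = prev[m]
    for ch in text:
        cur = [0] * (m + 1)  # row 0 stays 0: a match may start anywhere
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (pattern[i - 1] != ch)
            extra_text_char = prev[i] + 1
            missing_text_char = cur[i - 1] + 1
            cur[i] = min(substitution, extra_text_char, missing_text_char)
        prev = cur
        best = min(best, prev[m])
    return best


def line_matches(line, patterns, max_errors):
    """Model of an AND query: every pattern must match within max_errors."""
    return all(min_errors(p, line) <= max_errors for p in patterns)


line = "Tucson, Arizona"
print(min_errors("Tuson", line), min_errors("Arezona", line))
print(line_matches(line, ["Tuson", "Arezona"], 1))
```

In this model both misspelled patterns need exactly one error to match the line, so the AND query succeeds with one error allowed and fails with none.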
glimpse -w -i 'parent'
specifies case insensitive (-i) and match on complete words (-w). So 'Parent'
and 'PARENT' will match, 'parent/child' will match, but 'parenthesis' or
'parents' will not match. (Starting at version 3.0, glimpse can be much faster
when these two options are specified, especially for very large indexes. You
may want to set an alias especially for "glimpse -w -i".)
The -F option provides a pattern that must match the file name. For example,
glimpse -F '\.c$' needle
will find the pattern needle in all files whose name ends with .c.
(Glimpse will first check its index to determine which files may contain the
pattern and then run agrep on the file names to further limit the search.) The
-F option should not be put at the end after the main pattern (e.g.,
"glimpse needle -F hay" is incorrect).
- -#
- # is an integer between 1 and 8 specifying the
maximum number of errors permitted in finding the approximate matches (the
default is zero). Generally, each insertion, deletion, or substitution
counts as one error. It is possible to adjust the relative cost of
insertions, deletions and substitutions (see -I -D and -S options). Since
the index stores only lower case characters, errors of substituting upper
case with lower case may be missed (see LIMITATIONS). Allowing errors in
the match requires more time and can slow down the match by a factor of
2-4. Be very careful when specifying more than one error, as the number of
matches tends to grow very quickly.
- -a
- prints attribute names. This option applies only to Harvest
SOIF structured data (used with glimpseindex -s). (See
http://harvest.sourceforge.net/ for more information about the Harvest
project.)
- -A
- used for glimpse internals.
- -b
- prints the byte offset (from the beginning of the file) of
the end of each match. The first character in a file has offset 0.
- -B
- Best match mode. (Warning: -B sometimes misses matches. It
is safer to specify the number of errors explicitly.) When -B is specified
and no exact matches are found, glimpse will continue to search until the
closest matches (i.e., the ones with minimum number of errors) are found,
at which point the following message will be shown: "the best match
contains x errors, there are y matches, output them? (y/n)" This
message refers to the number of matches found in the index. There may be
many more matches in the actual text (or there may be none if -F is used
to filter files). When the -#, -c, or -l options are specified, the -B
option is ignored. In general, -B may be slower than -#, but not by very
much. Since the index stores only lower case characters, errors of
substituting upper case with lower case may be missed (see
LIMITATIONS).
- -c
- Display only the count of matching records. Only files with
count > 0 are displayed.
- -C
- tells glimpse to send its queries to
glimpseserver.
- -d 'delim'
- Define delim to be the separator between two
records. The default value is '$', namely a record is by default a line.
delim can be a string of size at most 8 (with possible use of ^ and
$), but not a regular expression. Text between two delim's, before
the first delim, and after the last delim is considered as
one record. For example, -d '$$' defines paragraphs as records and -d
'^From ' defines mail messages as records. glimpse matches
each record separately. This option does not currently work with
regular expressions. The -d option is especially useful for Boolean
AND queries, because the patterns need not appear in the same line but in
the same record. For example, glimpse -F mail -d '^From '
'glimpse;arizona;announcement' will output all mail messages (in their
entirety) that have the 3 patterns anywhere in the message (or the
header), assuming that files with 'mail' in their name contain mail
messages. If you want the scope of the record to be the whole file, use
the -W option. Glimpse warning: Use this option with care. If the
delimiter is set to match mail messages, for example, and glimpse finds
the pattern in a regular file, it may not find the delimiter and will
therefore output the whole file. (The -t option - see below - can be used
to put the delim at the end of the record.) Performance
Note: Agrep (and glimpse) resorts to more complex search when the -d
option is used. The search is slower and unfortunately no more than 32
characters can be used in the pattern.
- -Dk
- Set the cost of a deletion to k (k is a
positive integer). This option does not currently work with regular
expressions.
- -e pattern
- Same as a simple pattern argument, but useful when
the pattern begins with a `-'.
- -E
- prints the lines in the index (as they appear in the index)
which match the pattern. Used mostly for debugging and maintenance of the
index. This is not an option that a user needs to know about.
- -f file_name
- this option has a different meaning for agrep than for
glimpse: In glimpse, only the files whose names are listed in
file_name are matched. (The file names have to appear as in
.glimpse_filenames.) In agrep, the file_name contains the list of the
patterns that are searched. (Starting at version 3.6, this option for
glimpse is much faster for large files.)
- -F file_pattern
- limits the search to those files whose name (including the
whole path) matches file_pattern. This option can be used in a
variety of applications to provide limited search even for one large
index. If file_pattern matches a directory, then all files with
this directory on their path will be considered. To limit the search to
actual file names, use $ at the end of the pattern. file_pattern
can be a regular expression and even a Boolean pattern. This option is
implemented by running agrep file_pattern on the list of file names
obtained from the index. Therefore, searching the index itself takes the
same amount of time, but limiting the second phase of the search to only a
few files can speed up the search significantly. For example,
glimpse -F 'src#\.c$' needle
will search for needle in all .c files with src somewhere along the path.
The -F file_pattern must appear before the search pattern (e.g.,
glimpse needle -F '\.c$' will not work). It is possible to use some of
agrep's options when matching file names. In this case all options as well
as the file_pattern should be in quotes. (-B and -v do not work very well
as part of a file_pattern.) For example,
glimpse -F '-1 \.html' pattern
will allow one spelling error when matching .html to the file names (so
".htm" and ".shtml" will match as well).
glimpse -F '-v \.c$' counter
will search for 'counter' in all files except for .c files.
- -g
- prints the file number (its position in the
.glimpse_filenames file) rather than its name.
- -G
- Output the (whole) files that contain a match.
- -h
- Do not display filenames.
- -H directory_name
- searches for the index and the other .glimpse files in
directory_name. The default is the home directory. This option is
useful, for example, if several different indexes are maintained for
different archives (e.g., one for mail messages, one for source code, one
for articles).
- -i
- Case-insensitive search — e.g., "A" and
"a" are considered equivalent. Glimpse's index stores all
patterns in lower case (see LIMITATIONS below). Performance Note:
When -i is used together with the -w option, the search may become much
faster. It is recommended to have -i and -w as defaults, for example,
through an alias. We use the following alias in our .cshrc file
alias glwi 'glimpse -w -i'
- -Ik
- Set the cost of an insertion to k (k is a
positive integer). This option does not currently work with regular
expressions.
- -j
- If the index was constructed with the -t option, then -j
will output the files' last modification dates in addition to everything
else. There are no major performance penalties for this option.
- -J host_name
- used in conjunction with glimpseserver (-C) to connect to
one particular server.
- -k
- No symbol in the pattern is treated as a meta character.
For example, glimpse -k 'a(b|c)*d' will find the occurrences of a(b|c)*d
whereas glimpse 'a(b|c)*d' will find substrings that match the regular
expression 'a(b|c)*d'. (The only exception is ^ at the beginning of the
pattern and $ at the end of the pattern, which are still interpreted in
the usual way. Use \^ or \$ if you need them verbatim.)
- -K port_number
- used in conjunction with glimpseserver (-C) to connect to
one particular server at the specified TCP port number.
- -l
- Output only the names of files that contain a match. This
option differs from the -N option in that the files themselves are
searched, but the matching lines are not shown.
- -L x | x:y | x:y:z
- if one number is given, it is a limit on the total number
of matches. Glimpse outputs only the first x matches. If -l is used (i.e.,
only file names are sought), then the limit is on the number of files;
otherwise, the limit is on the number of records. If two numbers are given
(x:y), then y is an added limit on the total number of files. If three
numbers are given (x:y:z), then z is an added limit on the number of
matches per file. If any of the x, y, or z is set to 0, it means to ignore
it (in other words 0 = infinity in this case); for example, -L 0:10 will
output all matches to the first 10 files that contain a match. This option
is particularly useful for servers that need to limit the amount of
output provided to clients.
- -m
- used for glimpse internals.
- -M
- used for glimpse internals.
- -n
- Each matching record (line) is prefixed by its record
(line) number in the file. Performance Note: To compute the
record/line number, agrep needs to search for all record delimiters (or
line breaks), which can slow down the search.
- -N
- searches only the index (so the search is faster). If the index was built
with -o or -b, then the result is the number of files that have a potential
match plus a prompt to ask if you want to see the file names. (If -y is
used, then there is no prompt and the names of the files will be shown.)
This could be a way to get the matching file names without even having
access to the files themselves. However, because only the index is
searched, some potential matches may not be real matches. In other words,
with -N you will not miss any file but you may get extra files. For
example, since the index stores everything in lower case, a case-sensitive
query may match a file that has only a case-insensitive match. Boolean
queries may match a file that has all the keywords but not in the same
line (indexing with -b allows glimpse to figure out whether the keywords
are close, but it cannot figure out from the index whether they are
exactly on the same line or in the same record without looking at the
file). If the index was not built with -o or -b, then this option outputs
the number of blocks matching the pattern. This is useful as an
indication of how long the search will take. All files are partitioned
into usually 200-250 blocks. The file .glimpse_statistics contains
the total number of blocks (or glimpse -N a will give a pretty good
estimate; only blocks with no occurrences of 'a' will be missed).
- -o
- the opposite of -t: the delimiter is not output at the
tail, but at the beginning of the matched record.
- -O
- the file names are not printed before every matched record;
instead, each filename is printed just once, and all the matched records
within it are printed after it.
- -p
- (from version 4.0B1 only) Supports reading a compressed set
of filenames. The -p option allows you to utilize compressed
`neighborhoods' (sets of filenames) to limit your search, without
uncompressing them. Added mostly for WebGlimpse. The usage is:
"-p filename:X:Y:Z" where "filename" is the file with
compressed neighborhoods, X is an offset into that file (usually 0, must
be a multiple of sizeof(int)), Y is the length glimpse must access from
that file (if 0, then whole file; must be a multiple of sizeof(int)), and
Z must be 2 (it indicates that "filename" has the sparse-set
representation of compressed neighborhoods: the other values are for
internal use only). Note that any colon ":" in filename must be
escaped using a backslash (\).
- -P
- used for glimpse internals.
- -q
- prints the offsets of the beginning and end of each matched
record. The difference between -q and -b is that -b prints the offsets of
the actual matched string, while -q prints the offsets of the whole record
where the match occurred. The output format is @x{y}, where x is the
beginning offset and y is the end offset.
- -Q
- when used together with -N glimpse not only displays the
filename where the match occurs, but the exact occurrences (offsets) as
seen in the index. This option is relevant only if the index was built
with -b; otherwise, the offsets are not available in the index. This
option is ignored when used without -N.
- -r
- This option is an agrep option and it will be ignored in
glimpse, unless glimpse is used with a file name at the end which makes it
run as agrep. If the file name is a directory name, the -r option will
search (recursively) the whole directory and everything below it. (The
glimpse index will not be used.)
- -R k
- defines the maximum size (in bytes) of a record. The
maximum value (which is the default) is 48K. Defining the maximum to be
lower than the default may speed up some searches.
- -s
- Work silently, that is, display nothing except error
messages. This is useful for checking the error status.
- -Sk
- Set the cost of a substitution to k (k is a
positive integer). This option does not currently work with regular
expressions.
- -t
- Similar to the -d option, except that the delimiter is
assumed to appear at the end of the record. Glimpse will output the
record starting from the end of delim to (and including) the next
delim. (See warning for the -d option.)
- -T directory
- Use directory as a place where temporary files are
built. (Glimpse produces some small temporary files usually in /tmp.) This
option is useful mainly in the context of structured queries for the
Harvest project, where the temporary files may be non-trivial, and the
/tmp directory may not have enough space for them.
- -U
- (starting at version 4.0B1) Interprets an index created
with the -X or the -U option in glimpseindex. Useful mostly for WebGlimpse
or similar web applications. When glimpse outputs matches, it will display
the filename, the URL, and the title automatically.
- -v
- (This option is an agrep option and it will be ignored in
glimpse, unless glimpse is used with a file name at the end which makes it
run as agrep.) Output all records/lines that do not contain a
match. (Glimpse does not support the NOT operator yet.)
- -V
- prints the current version of glimpse.
- -w
- Search for the pattern as a word, i.e., surrounded
by non-alphanumeric characters. For example, glimpse -w car will
match car, but not characters and not car10. The non-alphanumeric
characters must surround the match; they cannot be counted as errors. This
option does not work with regular expressions. Performance Note:
When -w is used together with the -i option, the search may become much
faster. The -w will not work with $, ^, and _ (see BUGS below). It is
recommended to have -i and -w as defaults, for example, through an alias.
We use the following alias in our .cshrc file
alias glwi 'glimpse -w -i'
- -W
- The default for Boolean AND queries is that they cover one
record (the default for a record is one line) at a time. For example,
glimpse 'good;bad' will output all lines containing both 'good' and 'bad'.
The -W option changes the scope of Booleans to be the whole file. Within a
file glimpse will output all matches to any of the patterns. So, glimpse
-W 'good;bad' will output all lines containing 'good' or 'bad', but
only in files that contain both patterns. The NOT operator '~' can be used
only with -W. It is described later on. The OR operator is essentially
unaffected (unless it is in combination with the other Boolean
operations). For structured queries, the scope is always the whole
attribute or file.
- -x
- The pattern must match the whole line. (This option is
translated to -w when the index is searched and it is used only when the
actual text is searched. It is of limited use in glimpse.)
- -X
- (from version 4.0B1 only) Output the names of files that
contain a match even if these files have been deleted since the index was
built. Without this option glimpse will simply ignore these files.
- -y
- Do not prompt. Proceed with the match as if the answer to
any prompt is y. Servers (or any other scripts) using glimpse will
probably want to use this option.
- -Y k
- If the index was constructed with the -t option, then -Y k
will output only matches to files that were created or modified within the
last k days. There are no major performance penalties for this
option.
- -z
- Allow customizable filtering, using the file
.glimpse_filters to perform the programs listed there for each match. The
best example is compress/decompress. If .glimpse_filters includes the line
*.Z uncompress <
(separated by tabs) then before indexing any file that matches the pattern
"*.Z" (same syntax as the one for .glimpse_exclude) the command
listed is executed first (assuming input is from stdin, which is why
uncompress needs <) and its output (assuming it goes to stdout) is
indexed. The file itself is not changed (i.e., it stays compressed). Then
if glimpse -z is used, the same program is used on these files on the fly.
Any program can be used (we run 'exec'). For example, one can filter out
parts of files that should not be indexed. Glimpseindex tries to apply all
filters in .glimpse_filters in the order they are given. For example, if
you want to uncompress a file and then extract some part of it, put the
compression command (the example above) first and then another line that
specifies the extraction. Note that this can slow down the search because
the filters need to be run before files are searched. (See also
glimpseindex.)
- -Z
- No op. (It's useful for glimpse's internals. Trust
us.)
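The way the three numbers of the -L option combine can be modelled in a few lines of Python. This is only a sketch of the rules stated under -L for the case where matching records are output (not the -l case); the order in which the limits are checked is an assumption, and parse_limits and apply_limits are hypothetical names, not part of glimpse.

```python
def parse_limits(spec):
    """Parse 'x', 'x:y' or 'x:y:z'; 0 or a missing field means no limit."""
    fields = [int(f) for f in spec.split(":")]
    if not 1 <= len(fields) <= 3:
        raise ValueError("expected x, x:y or x:y:z")
    fields += [0] * (3 - len(fields))
    return tuple(f if f > 0 else None for f in fields)


def apply_limits(matches_by_file, spec):
    """matches_by_file: list of (file_name, [matching records]) in search order."""
    max_total, max_files, max_per_file = parse_limits(spec)
    output, files_used = [], 0
    for name, records in matches_by_file:
        if not records:
            continue  # files without a match do not count toward y
        if max_files is not None and files_used >= max_files:
            break
        files_used += 1
        for record in records[:max_per_file]:  # z: limit per file
            if max_total is not None and len(output) >= max_total:
                return output  # x: limit on the total number of matches
            output.append((name, record))
    return output


results = [("a.txt", ["a1", "a2", "a3"]), ("b.txt", []),
           ("c.txt", ["c1", "c2"]), ("d.txt", ["d1"])]
print(apply_limits(results, "0:2"))
```

With the limit 0:2 the model returns every match from the first two files that contain a match, which mirrors the -L 0:10 example in the option's description.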
The characters `$', `^', `*', `[', `]', `|', `(', `)', `!', and `\' can cause
unexpected results when included in the pattern, as these characters are also
meaningful to the shell. To avoid these problems, enclose the entire pattern
in single quotes, i.e., 'pattern'. Do not use double quotes (").
glimpse supports a large variety of patterns, including simple strings,
strings with classes of characters, sets of strings, wild cards, and regular
expressions (see LIMITATIONS).
- Strings
- Strings are any sequence of characters, including the
special symbols `^' for beginning of line and `$' for end of line. The
following special characters (`$', `^', `*', `[', `|', `(', `)',
`!', and `\') as well as the following meta characters
special to glimpse (and agrep): `;', `,', `#',
`<', `>', `-', and `.', should be
preceded by `\' if they are to be matched as regular characters. For
example, \^abc\\ corresponds to the string ^abc\, whereas ^abc corresponds
to the string abc at the beginning of a line.
- Classes of characters
- a list of characters inside [] (in order) corresponds to
any character from the list. For example, [a-ho-z] is any character
between a and h or between o and z. The symbol `^' inside [] complements
the list. For example, [^i-n] denotes any character in the character set
except the characters 'i' through 'n'. The symbol `^' thus has two meanings, but
this is consistent with egrep. The symbol `.' (don't care) stands for any
symbol (except for the newline symbol).
- Boolean operations
- Glimpse supports an `AND' operation denoted by the
symbol `;', an `OR' operation denoted by the symbol `,', a limited version
of a 'NOT' operation (starting at version 4.0B1) denoted by the symbol
`~', or any combination. For example, glimpse 'pizza;cheeseburger'
will output all lines containing both patterns. glimpse -F 'gnu;\.c$'
'define;DEFAULT' will output all lines containing both 'define' and
'DEFAULT' (anywhere in the line, not necessarily in order) in files whose
name contains 'gnu' and ends with .c. glimpse
'{political,computer};science' will match 'political science' or
'science of computers'. The NOT operation works only together with the -W
option and it generally applies only to the whole file rather than to
individual records. Its output may sometimes seem counterintuitive. Use
with care. glimpse -W 'fame;~glory' will output all lines
containing 'fame' in all files that contain 'fame' but do not contain
'glory'. This is the most common use of NOT, and in this case it works as
expected. glimpse -W '~{fame;glory}' will be limited to files that
do not contain both words, and will output all lines containing one of
them.
- Wild cards
- The symbol '#' is used to denote a sequence of any number
(including 0) of arbitrary characters (see LIMITATIONS). The symbol # is
equivalent to .* in egrep. In fact, .* will work too, because it is a
valid regular expression (see below), but unless this is part of an actual
regular expression, # will work faster. (Currently glimpse is experiencing
some problems with #.)
- Combination of exact and approximate matching
- Any pattern inside angle brackets <> must match the
text exactly even if the match is with errors. For example,
<mathemat>ics matches mathematical with one error (replacing the
last s with an a), but mathe<matics> does not match mathematical no
matter how many errors are allowed. (This option is buggy at the
moment.)
- Regular expressions
- Since the index is word based, a regular expression must
match words that appear in the index for glimpse to find it. Glimpse first
strips all non-alphabetic characters from the regular expression, and
searches the index for all remaining words. It then applies the regular
expression matching algorithm to the files found in the index. For
example, glimpse 'abc.*xyz' will search the index for all files
that contain both 'abc' and 'xyz', and then search directly for 'abc.*xyz'
in those files. (If you use glimpse -w 'abc.*xyz', then 'abcxyz' will not
be found, because glimpse will think that abc and xyz need to match
whole words.) The syntax of regular expressions in glimpse is in
general the same as that for agrep. The union operation `|', Kleene
closure `*', and parentheses () are all supported. Currently '+' is not
supported. Regular expressions are currently limited to approximately 30
characters (generally excluding meta characters). Some options (-d, -w,
-t, -x, -D, -I, -S) do not currently work with regular expressions. The
maximal number of errors for regular expressions that use '*' or '|' is 4.
(See LIMITATIONS.)
- Structured queries
- Glimpse supports some form of structured queries using
Harvest's SOIF format. See STRUCTURED QUERIES below for details.
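The difference between the default record scope and the -W file scope of Boolean queries, including the NOT operator, can be modelled as follows. This Python sketch handles only a flat AND list with optional `~' terms and uses plain substring tests; it ignores OR, braces, regular expressions, and approximate matching, and all function names are hypothetical.

```python
def parse_and_query(query):
    """Split 'a;b;~c' into (required, forbidden) term lists."""
    required, forbidden = [], []
    for term in query.split(";"):
        if term.startswith("~"):
            forbidden.append(term[1:])
        else:
            required.append(term)
    return required, forbidden


def search_records(lines, query):
    """Default scope: a line is output only if it contains every term."""
    required, _ = parse_and_query(query)
    return [ln for ln in lines if all(t in ln for t in required)]


def search_whole_file(lines, query):
    """-W scope: the file as a whole must satisfy the query."""
    required, forbidden = parse_and_query(query)
    text = "\n".join(lines)
    if not all(t in text for t in required):
        return []
    if any(t in text for t in forbidden):
        return []
    # Within a qualifying file, output lines matching any required term.
    return [ln for ln in lines if any(t in ln for t in required)]


doc = ["good day", "bad luck", "neutral"]
print(search_records(doc, "good;bad"))
print(search_whole_file(doc, "good;bad"))
```

In this model the record-scoped query finds nothing because no single line holds both words, while the file-scoped query outputs both the 'good' line and the 'bad' line.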
(Run "glimpse '^glimpse' this-file" to get a list of all examples,
some of which were given earlier.)
- glimpse -F 'haystack.h$' needle
- finds all occurrences of needle in all files whose name ends with haystack.h.
- glimpse -2 -F html Anestesiology
- outputs all occurrences of Anestesiology with up to two errors in
files with html somewhere in their full name.
- glimpse -l -F '\.c$' variablename
- lists the names of all .c files that contain variablename
(the -l option lists file names rather than output the matched
lines).
- glimpse -F 'mail;1993' 'windsurfing;Arizona'
- finds all lines containing windsurfing and
Arizona in all files having `mail' and '1993' somewhere in their
full name.
- glimpse -F mail 't.j@#uk'
- finds all mail addresses (search only files with mail
somewhere in their name) from the uk, where the login name ends with t.j,
where the . stands for any one character. (This is very useful to find a
login name of someone whose middle name you don't know.)
- glimpse -F mbox -h -G . > MBOX
- concatenates all files whose name matches `mbox' into one
big one.
Glimpse includes an optional new compression program, called cast, which
allows glimpse (and agrep) to search the compressed files without having to
decompress them. The search is actually significantly faster when the files
are compressed. However, we have not tested cast as thoroughly as we would
have liked, and a mishap in a compression algorithm can cause loss of data,
so at this point we recommend using cast very carefully. We do not support or
maintain cast. (Unless you specifically use cast, the default is to ignore
it.)
All files used by glimpse are located at the directory(ies) where the index(es)
is (are) stored and have .glimpse_ as a prefix. The first two files
(.glimpse_exclude and .glimpse_include) are optionally supplied by the user.
The other files are built and read by glimpse.
- .glimpse_exclude
- contains a list of files that glimpseindex is explicitly
told to ignore. In general, the syntax of .glimpse_exclude/include is the
same as that of agrep (or any other grep). The lines in the
.glimpse_exclude file are matched to the file names, and if they match,
the files are excluded. Notice that agrep matches to parts of the string!
e.g., agrep /ftp/pub will match /home/ftp/pub and /ftp/pub/whatever. So,
if you want to exclude /ftp/pub/core, you just list it, as is, in the
.glimpse_exclude file. If you put "/home/ftp/pub/cdrom" in
.glimpse_exclude, every file name that matches that string will be
excluded, meaning all files below it. You can use ^ to indicate the
beginning of a file name, and $ to indicate the end of one, and you can
use * and ? in the usual way. For example /ftp/*html will exclude
/ftp/pub/foo.html, but will also exclude /home/ftp/pub/html/whatever; if
you want to exclude files that start with /ftp and end with html use
^/ftp*html$ Notice that putting a * at the beginning or at the end is
redundant (in fact, in this case glimpseindex will remove the * when it
does the indexing). No other meta characters are allowed in
.glimpse_exclude (e.g., don't use .* or # or |). Lines with * or ? must
have no more than 30 characters. Notice that, although the index itself
will not be indexed, the list of file names (.glimpse_filenames) will be
indexed unless it is explicitly listed in .glimpse_exclude.
- .glimpse_filters
- See the description above for the -z option.
- .glimpse_include
- contains a list of files that glimpseindex is explicitly
told to include in the index even though they may look like
non-text files. Symbolic links are followed by glimpseindex only if they
are specifically included here. If a file is in both .glimpse_exclude and
.glimpse_include it will be excluded.
- .glimpse_filenames
- contains the list of all indexed file names, one per line.
This is an ASCII file that can also be used with agrep to search for a
file name leading to a fast find command. For example,
glimpse 'count#\.c$' ~/.glimpse_filenames
will output the names of all (indexed) .c files that have 'count' in their
name (including anywhere on the path from the index). Setting the
following alias in the .login file may be useful:
alias findfile 'glimpse -h !:1 ~/.glimpse_filenames'
- .glimpse_index
- contains the index. The index consists of lines, each
starting with a word followed by a list of block numbers (unless the -o or
-b options are used, in which case each word is followed by an offset into
the file .glimpse_partitions where all pointers are kept). The block/file
numbers are stored in binary form, so this is not an ASCII file.
- .glimpse_messages
- contains the output of the -w option of glimpseindex (see glimpseindex(1)).
- .glimpse_partitions
- contains the partition of the indexed space into blocks
and, when the index is built with the -o or -b options, some part of the
index. This file is used internally by glimpse and it is a non-ASCII
file.
- .glimpse_statistics
- contains some statistics about the makeup of the index.
Useful for some advanced applications and customization of glimpse.
- .glimpse_turbo
- An added data structure (used under glimpseindex -o or -b
only) that helps to speed up queries significantly for large indexes. Its
size is 0.25MB. Glimpse will work without it if needed.
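The matching rules given above for .glimpse_exclude (substring matching, optional ^ and $ anchors, and the * and ? wild cards) can be tried out with a small Python model. It translates each line into a Python regular expression; this is an approximation for experimenting with patterns, not the code glimpseindex runs, and exclude_regex and is_excluded are hypothetical names.

```python
import re


def exclude_regex(line):
    """Translate one .glimpse_exclude line into a Python regular expression."""
    anchored_start = line.startswith("^")
    anchored_end = line.endswith("$")
    body = line[1:] if anchored_start else line
    body = body[:-1] if anchored_end else body
    parts = []
    for ch in body:
        if ch == "*":
            parts.append(".*")   # any sequence of characters
        elif ch == "?":
            parts.append(".")    # any single character
        else:
            parts.append(re.escape(ch))
    prefix = "^" if anchored_start else ""
    suffix = "$" if anchored_end else ""
    return re.compile(prefix + "".join(parts) + suffix)


def is_excluded(path, exclude_lines):
    """True if any exclude line matches some part of the path."""
    return any(exclude_regex(line).search(path) for line in exclude_lines)


print(is_excluded("/home/ftp/pub/html/whatever", ["/ftp/*html"]))
print(is_excluded("/home/ftp/pub/html/whatever", ["^/ftp*html$"]))
```

The two calls reproduce the example from the .glimpse_exclude description: the unanchored pattern also catches the path under /home, while the anchored one does not.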
Glimpse can search for Boolean combinations of "attribute=value" terms
by using the Harvest SOIF parser library (in glimpse/libtemplate). To search
this way, the index must be made by using the -s option of glimpseindex (this
can be used in conjunction with other glimpseindex options). For glimpse and
glimpseindex to recognize "structured" files, they must be in SOIF
format. In this format, each value is prefixed by an attribute-name with the
size of the value (in bytes) present in "{}" after the name of the
attribute. For example, the following lines are part of an SOIF file:
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
Any string can serve as an attribute name. Glimpse
"pattern;type=Directory-Listing" will search for "pattern"
only in files whose type is "Directory-Listing". The file itself is
considered to be one "object" and its name/url appears as the first
attribute with an "@" prefix; e.g., @FILE { http://xxx... }. The
scope of Boolean operations changes from records (lines) to whole files when
structured queries are used in glimpse (since individual query terms can look
at different attributes and they may not be "covered" by the
record/line). Note that glimpse can only search for patterns in the value
parts of the SOIF file: there are some attributes (like the TTL, MD5, etc.)
that are interpreted by Harvest's internal routines. See RFC 2655 for more
detailed information on the SOIF format.
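The attribute layout described above can be illustrated with a minimal Python sketch. It is not the Harvest SOIF parser: the function names are hypothetical, it assumes a single space or tab after the colon, and it treats an attribute=value test as a plain substring check on the value.

```python
import re

# Header of one attribute: the name, then the value size in bytes inside {}.
# Assumption: exactly one space or tab separates the colon from the value.
ATTR_HEADER = re.compile(rb"([^\s{}]+)\{(\d+)\}:[ \t]")


def parse_soif_attributes(data):
    """Collect attribute -> value pairs from the body of one SOIF object."""
    attributes = {}
    pos = 0
    while True:
        header = ATTR_HEADER.search(data, pos)
        if header is None:
            break
        size = int(header.group(2))
        start = header.end()
        # The declared size, not the newline, delimits the value, so a value
        # may itself contain newlines.
        value = data[start:start + size]
        attributes[header.group(1).decode()] = value.decode()
        pos = start + size
    return attributes


def attribute_matches(attributes, name, pattern):
    """Toy attribute=value test: substring search in the named value only."""
    return pattern in attributes.get(name, "")


sample = (
    b"type{17}: Directory-Listing\n"
    b"md5{32}: 3858c73d68616df0ed58a44d306b12ba\n"
)
attrs = parse_soif_attributes(sample)
print(attrs["type"])
print(attribute_matches(attrs, "type", "Directory-Listing"))
```

Because the value is read by byte count, the sketch works on bytes rather than on decoded text.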
- 1.
- U. Manber and S. Wu, "GLIMPSE: A Tool to Search
Through Entire File Systems," Usenix Winter 1994 Technical
Conference (best paper award), San Francisco (January 1994), pp.
23-32. Also, Technical Report #TR 93-34, Dept. of Computer Science,
University of Arizona, October 1993 (a postscript file is available by
anonymous ftp at ftp://webglimpse.net/pub/glimpse/TR93-34.ps).
- 2.
- S. Wu and U. Manber, "Fast Text Searching Allowing
Errors," Communications of the ACM 35 (October 1992),
pp. 83-91.
agrep(1),
ed(1),
ex(1),
glimpseindex(1),
glimpseserver(1),
grep(1),
sh(1),
csh(1).
The index of glimpse is word-based. A pattern that contains more than one word
cannot be found in the index. The way glimpse overcomes this weakness is by
splitting any multi-word pattern into its set of words and looking for all of
them in the index. For example,
glimpse 'linear programming' will first
consult the index to find all files containing both
linear and
programming, and then apply agrep to find the combined pattern. This is
usually an effective solution, but it can be slow for cases where both words
are very common, but their combination is not.
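The two-step lookup described above can be modeled with a toy Python sketch. The file names and contents are made up, and the dictionary stands in for the real index, which is stored quite differently; only the idea of intersecting per-word candidates and then scanning them is taken from the text.

```python
import re

# Hypothetical stand-in for a set of indexed files.
FILES = {
    "notes.txt": "Chapter 3 covers linear programming.",
    "todo.txt": "Finish programming the linear solver.",
    "misc.txt": "Nothing relevant here.",
}


def build_word_index(files):
    """Map each lower-cased word to the set of files that contain it."""
    index = {}
    for name, text in files.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def search_phrase(phrase, files, index):
    """Intersect the per-word candidates, then scan only those files."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return []
    candidates = set(files)
    for word in words:
        candidates &= index.get(word, set())
    # Second step: the combined pattern is looked for in the candidates only.
    return sorted(name for name in candidates if phrase in files[name])


index = build_word_index(FILES)
print(sorted(index["linear"] & index["programming"]))
print(search_phrase("linear programming", FILES, index))
```

Both notes.txt and todo.txt survive the index step, but only notes.txt contains the combined pattern. When two words are common and their combination is rare, the candidate set stays large and the scan dominates the cost.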
As was mentioned in the section on PATTERNS above, some characters serve as meta
characters for glimpse and need to be preceded by '\' to search for them. The
most common examples are the characters '.' (which stands for a wild card),
and '*' (the Kleene closure). So, "glimpse ab.de" will match abcde,
but "glimpse ab\.de" will not, and "glimpse ab*de" will
not match ab*de, but "glimpse ab\*de" will. The meta character - is
translated automatically to a hyphen unless it appears between [] (in which
case it denotes a range of characters).
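The four cases above can be reproduced with Python's re module, used here only as an analogy: agrep's pattern syntax is not identical to Python's, but '.', '*' and the backslash escape behave the same way in these examples. The helper name is hypothetical.

```python
import re


def found(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None


print(found(r"ab.de", "abcde"))    # '.' acts as a wild card
print(found(r"ab\.de", "abcde"))   # an escaped '.' only matches a literal dot
print(found(r"ab*de", "ab*de"))    # '*' repeats the preceding 'b'
print(found(r"ab\*de", "ab*de"))   # an escaped '*' matches a literal star
```

On a shell command line the pattern usually needs single quotes as well, for example 'ab\.de', since an unquoted backslash is typically consumed by the shell before glimpse sees it.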
The index of glimpse stores all patterns in lower case. When glimpse searches
the index it first converts all patterns to lower case, finds the appropriate
files, and then searches the actual files using the original patterns. So, for
example,
glimpse ABCXYZ will first find all files containing abcxyz in
any combination of lower and upper case, and then search these files
directly, so only occurrences with the right case will be found. One problem with this
approach is discovering misspellings that are caused by wrong cases. For
example,
glimpse -B abcXYZ will first search the index for the best
match to abcxyz (because the pattern is converted to lower case); it will find
that there are matches with no errors, and will go to those files to search
them directly, this time with the original upper cases. If the closest match
is, say, AbcXYZ, glimpse may miss it, because it doesn't expect an error.
Another problem is speed. If you search for "ATT", it will look at
the index for "att". Unless you use -w to match the whole word,
glimpse may have to search all files containing, for example,
"Seattle" which has "att" in it.
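The speed problem can be seen in a small Python model of the two phases. The file names and contents are made up, and the lower-cased substring test is only a stand-in for the index lookup, which works on words rather than raw text.

```python
# Hypothetical file contents, made up for the example.
FILES = {
    "phone.txt": "Call ATT about the bill.",
    "travel.txt": "Flight to Seattle on Monday.",
    "misc.txt": "Nothing relevant here.",
}


def candidate_files(pattern, files):
    """Index phase: compare in lower case, so case information is lost."""
    folded = pattern.lower()
    return sorted(name for name, text in files.items() if folded in text.lower())


def confirmed_files(pattern, files):
    """Scan phase: search only the candidates, using the original case."""
    return [
        name for name in candidate_files(pattern, files)
        if pattern in files[name]
    ]


print(candidate_files("ATT", FILES))
print(confirmed_files("ATT", FILES))
```

The file mentioning Seattle is selected by the index phase and rejected only after it has been scanned, which is the extra work the text warns about.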
There is no size limit for simple patterns and simple patterns within Boolean
expressions. More complicated patterns, such as regular expressions, are
currently limited to approximately 30 characters. Lines are limited to 1024
characters. Records are limited to 48K, and may be truncated if they are
larger than that. The limit of record length can be changed by modifying the
parameter Max_record in agrep.h.
Glimpseindex does not index words of size > 64.
In some rare cases, regular expressions using * or # may not match correctly.
A query that contains no alphanumeric characters is not recommended (unless
glimpse is used as agrep and the file names are provided). This is an
understatement.
The notion of "match to the whole word" (the -w option) can be tricky
sometimes. For example, glimpse -w 'word$' will not match 'word' appearing at
the end of a line, because the extra '$' makes the pattern more than just one
simple word. The same thing can happen with ^ and with _. To be on the safe
side, use the -w option only when the patterns are actual words.
Please send bug reports or comments to
[email protected].
Exit status is 0 if any matches are found, 1 if none, 2 for syntax errors or
inaccessible files.
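A script that calls glimpse can branch on these values. The following Python sketch is one possible wrapper, not part of glimpse: the function names are hypothetical, and run_glimpse assumes that glimpse is on the PATH and that an index already exists.

```python
import subprocess

# Meanings taken from the exit status description above.
STATUS_MEANING = {
    0: "matches found",
    1: "no matches",
    2: "syntax error or inaccessible files",
}


def describe_status(code):
    """Translate an exit status into a short message."""
    return STATUS_MEANING.get(code, "unexpected status %d" % code)


def run_glimpse(pattern):
    """Run glimpse on one pattern and return (exit status, output).

    Assumption: glimpse is installed and the default index is in place.
    This function is defined for illustration and is not called below.
    """
    result = subprocess.run(["glimpse", pattern], capture_output=True, text=True)
    return result.returncode, result.stdout


for code in (0, 1, 2):
    print(code, describe_status(code))
```

Passing the pattern as a separate list element avoids shell quoting problems with meta characters.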
Udi Manber and Burra Gopal, Department of Computer Science, University of
Arizona, and Sun Wu, the National Chung-Cheng University, Taiwan. Now
maintained by Golda Velez at Internet WorkShop (Email:
[email protected])