The New Sniff
A whole new search engine.
By: Ariel Faigon, June 21, 1996
Introduction
In the past two weeks, the Sniff search engine was rewritten from scratch. Most of the previous search limitations are now gone, and numerous enhancements were added. The code of the indexer and search engine is now 100% original; no third-party legacy code remains in either.
Changes from a User Perspective
- Digits, underscores, and dashes are included in indexed words. No more "Alphabetic query only" limitations exist. You may search (quickly) for terms like: 6.2, Key-O-Matic, _exit, or R5000.
The old index, based on an old Lycos beta release, used trigraphs of characters for word lookups. The main disadvantage of its 26x26x26 lookup table of trigraphs was that only alphabetic terms could be indexed.
- Sniff now supports the boolean logic modifiers AND, OR, and NOT, so you may refine or expand your search as much as you wish. Use OR to increase the number of hits, or use NOT as a prefix to a word in order to weed out irrelevant matches containing that word. For example:
bug OR problem NOT compiler AND 6.2
Here's the full updated Search Tips Document.
The old search engine supported only OR style queries. The AND was implied by attempting to weigh multiple matches higher in the list of hits. However, this didn't always work well; sometimes combined (AND) matches were not scored high enough, depending on the number of appearances of words in a document.
- Did the old Sniff use to give you too many irrelevant hits? No more. The default conjunction is AND, so you are much less likely to get hundreds of irrelevant matches than you were with the old Sniff. The lookup is a perfect hash function, so you get only those pages that actually contain all the terms you're looking for.
- Sniff now supports shorter terms. The old search used to ignore anything shorter than three letters. You may now search for C2 or PI.
- Scoring is significantly better. Top items are more relevant than they used to be.
The index itself now includes accurate weighting information for each word/document pair, in WAIS (relative frequency) style, which it didn't use to have. In addition, extra weight is given to words in titles, in URLs, and very early in a document.
- The output is cleaner and less cluttered. Hits are separated by empty lines rather than by horizontal rules, and the body of each hit is indented.
- The pager navigation bar at the end of each hit-list has friendly [prev] and [next] links added. It also works in all cases, including quoted searches.
- The output for each hit is more informative: it includes the last-modified date of documents (when that date is given by the server) and their sizes. It also gives the list of words that matched each hit (useful in case you are doing an OR'ed query). The header also includes a summary of the number of hits for each of your query words.
The old search didn't have per-hit matching word information. It was difficult to tell which of the query words matched each hit.
- When searching for a quoted expression such as "Cosmo 3D", the search is faster than it used to be. Moreover, you can combine a quoted expression with other words in a query.
When given a quoted expression query, the old search bypassed the inverted index and did a simple grep on the raw data. The new search does better: it consults the inverted index for the separate words and only then checks whether the full quoted expression appears in the documents found. Note that this still has a few limitations: only one quoted expression is allowed in a query, and the OR and NOT modifiers are ignored.
- The output in case no hits were found is more helpful: it contains a summary of the words that did appear, and for those words that didn't, you get a list of possible alternatives (similar words). Try, for example, searching for blablah.
- New HTML extensions: frames and client-side imagemaps are now supported by the robot.
- Restricted domain (e.g. Cray, Silicon Sales) searching is now supported and easily configurable via a table.
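The lookup behaviour described in the list above (default AND, the OR and NOT modifiers, relative-frequency weights, and quoted expressions checked only after the index lookup) can be sketched roughly as follows. This is a minimal illustration in Python, not the perl5 code of Sniff itself. The toy documents, the function names, the strict left-to-right evaluation of modifiers, and the plain sum of weights used for ranking are all assumptions made for the example; title, URL, and early-position boosts are left out.

```python
from collections import defaultdict

# Hypothetical toy corpus; the ids and text are made up for illustration.
DOCS = {
    "doc1": "compiler bug in release 6.2",
    "doc2": "known problem with 6.2 installer",
    "doc3": "bug report for 5.3",
    "doc4": "cosmo 3d problem on 6.2",
}

def build_index(docs):
    """Map each word to {doc_id: relative frequency of the word in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for word in set(words):
            index[word][doc_id] = words.count(word) / len(words)
    return index

def search(index, docs, query):
    """Evaluate a query left to right (assumed); AND is the default conjunction."""
    hits = None                      # set of matching doc ids so far
    scores = defaultdict(float)      # summed weights of the matched words
    op = "AND"
    for token in query.split():
        if token in ("AND", "OR", "NOT"):
            op = token
            continue
        postings = index.get(token.lower(), {})
        matched = set(postings)
        if op == "NOT":
            # NOT removes every document containing the word.
            hits = (set(docs) if hits is None else hits) - matched
        elif op == "OR" and hits is not None:
            hits |= matched
        else:
            hits = matched if hits is None else hits & matched
        if op != "NOT":
            for doc_id, weight in postings.items():
                scores[doc_id] += weight
        op = "AND"                   # fall back to the default conjunction
    # Highest score first; ties are broken by document id.
    return sorted(hits or set(), key=lambda d: (-scores[d], d))

def phrase_search(index, docs, phrase):
    """Narrow candidates through the index, then confirm the exact phrase."""
    candidates = set(docs)
    for word in phrase.lower().split():
        candidates &= set(index.get(word, {}))
    # Simple substring check on the raw text of the candidates only.
    return sorted(d for d in candidates if phrase.lower() in docs[d].lower())

INDEX = build_index(DOCS)
print(search(INDEX, DOCS, "bug OR problem NOT compiler AND 6.2"))  # -> ['doc2', 'doc4']
print(phrase_search(INDEX, DOCS, "Cosmo 3D"))                      # -> ['doc4']
```

With the default conjunction, a query such as `bug 6.2` returns only documents that contain both words, which is what keeps the hit list short compared with an OR-only engine.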
Implementation Notes
- There's no more Lycos code. The code is 100% original. We're free to do with it whatever we wish.
- Notwithstanding the enhanced functionality, the code is now 4 times smaller (about 1000 lines), clearer, cleaner, and easier to change or enhance. It is written in perl5, using Berkeley-DB for the stored index. This replaces over 4000 lines of difficult-to-understand C code, perl, and C-shell scripts. In addition, the code is now more cohesive: for example, the index inverter, which used to be implemented in 4 programs, is now a single script.
- I even put in full English word stemming, but disabled it by default after realizing that stemming makes things worse when dealing with technical data, where you want to be specific (e.g. stemming converts SGI to SGY, and programmable [logic] to [software] program). Anyway, the stemming hooks are included if I ever wish to activate them.
- The new code does a perfect hashing of complete words rather than prefixes. The word definition is a regular expression that is easy to change and adapt to new needs.
- Configuration and tuning data is now completely separated from the main code.
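To illustrate what "the word definition is a regular expression" can mean in practice, here is a small sketch in Python (the real code is perl5, and its actual pattern is not shown in this note). The pattern `WORD_RE`, the helper `words`, and the `ALPHA_ONLY` stand-in for the old alphabetic-only, three-letter-minimum behaviour are hypothetical; in particular, treating an inner dot as part of a word, so that 6.2 stays one term, is an assumption.

```python
import re

# Hypothetical word definition: runs of letters, digits and underscores,
# optionally joined by single dashes or dots (so "6.2" stays one word).
WORD_RE = re.compile(r"[A-Za-z0-9_]+(?:[-.][A-Za-z0-9_]+)*")

# Rough stand-in for the old limits: alphabetic only, three letters or more.
# This is not the trigraph table itself, just a pattern with similar limits.
ALPHA_ONLY = re.compile(r"[A-Za-z]{3,}")

def words(text, pattern=WORD_RE):
    """Return the lowercased indexable words that the pattern finds in text."""
    return [w.lower() for w in pattern.findall(text)]

sample = "Call _exit on the R5000 under 6.2, or try Key-O-Matic."

print(words(sample))
# -> ['call', '_exit', 'on', 'the', 'r5000', 'under', '6.2', 'or', 'try', 'key-o-matic']
print(words(sample, ALPHA_ONLY))
# -> ['call', 'exit', 'the', 'under', 'try', 'key', 'matic']
```

Because the pattern is a single value passed to the tokenizer, adapting the word definition means editing one expression rather than the indexing logic.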
Improvement Summary
- Search terms are more general (digits, dashes, underscores)
- Boolean logic is supported (AND, OR, NOT)
- Far fewer irrelevant hits are produced (default is ANDing terms)
- Hit ordering is better (WAIS style weighting plus additional tuning included in index)
- No copyright restrictions on the implementation (100% original code)
- Implementation is more flexible and easy to change (perl5, Berkeley-DB)
- New HTML extensions (frames, client-side imagemaps) are supported
- Friendlier failure handling: possible alternative keywords are given if no hits were found.
- Restricted domain searching is supported.
Thanks to Shiraaz Bhabha for the cool graphics
Feedback to Ariel Faigon ariel@engr.sgi.com