Inverse Document Frequency and Web Search Engines

Report
Authors: Prey, Kevin, Department of Computer Science, University of Virginia; French, James, Department of Computer Science, University of Virginia; Powell, Allison, Department of Computer Science, University of Virginia; Viles, Charles, Department of Computer Science, University of Virginia
Abstract:

Full-text searching over a database of moderate size often uses the inverse document frequency, idf = log(N/df), as a component in term weighting functions used for document indexing and retrieval. However, in very large databases (e.g. internet search engines), the collection size (N) could dominate the idf value, decreasing the usefulness of idf as a term weighting component. In this short paper we examine the properties of idf in the context of internet search engines. The observed idf values may also shed light on the indexed content of the WWW. For example, if the internet search engines we survey index random samples of the WWW, we would expect similar idf values for the same term across the different search engines.
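
The concern about N dominating idf can be illustrated with a minimal sketch. It is not taken from the report: the function names (idf, relative_spread, idf_by_engine), the choice of natural logarithm, and every document count are assumptions made here for illustration. It rests only on the identity log(N/df) = log(N) - log(df), so two terms keep the same absolute idf gap as N grows while that gap shrinks relative to the idf values themselves.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Return idf = log(N/df) using the natural log (the base is an assumption)."""
    if n_docs <= 0 or not 0 < doc_freq <= n_docs:
        raise ValueError("require 0 < df <= N")
    return math.log(n_docs / doc_freq)


def relative_spread(n_docs: int, doc_freqs: list[int]) -> float:
    """Gap between the largest and smallest idf, as a fraction of the largest idf."""
    values = [idf(n_docs, df) for df in doc_freqs]
    return (max(values) - min(values)) / max(values)


def idf_by_engine(counts: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map each engine name to the idf of one term, given (N, df) for that engine."""
    return {name: idf(n, df) for name, (n, df) in counts.items()}


if __name__ == "__main__":
    # Hypothetical document frequencies for a rarer and a more common term.
    doc_freqs = [10, 1000]
    for n_docs in (10_000, 10**9):
        print(n_docs, round(relative_spread(n_docs, doc_freqs), 3))

    # Hypothetical engines that index the same fraction of documents
    # containing a term; their idf values for that term coincide.
    print(idf_by_engine({
        "engine_a": (1_000_000, 1_000),
        "engine_b": (50_000_000, 50_000),
    }))
```

With these made-up counts the two terms differ by log(100) at either collection size, but that difference is a much smaller share of the idf value when N is 10^9 than when N is 10,000. The last call reflects the abstract's final point: engines whose indexes contain a term in the same proportion yield the same idf for it, whatever their absolute sizes.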

Rights:
All rights reserved (no additional license for public reuse)
Language:
English
Source Citation:

Prey, Kevin, James French, Allison Powell, and Charles Viles. "Inverse Document Frequency and Web Search Engines." University of Virginia Dept. of Computer Science Tech Report (2001).

Publisher:
University of Virginia, Department of Computer Science
Published Date:
2001