public interface DocLengthTable
Interface representing an object that keeps track of the length of each document in the collection. Document lengths are measured in number of terms. The number of documents n is provided by getDocCount(), and the documents are consecutively numbered, starting from d + 1, where d is the value provided by getDocnoOffset().
The notion of docno offset is necessary for large document collections that are partitioned, where docnos need to be consecutively numbered across partitions. For example, the first English segment of ClueWeb09 contains 50,220,423 documents, has a docno offset of 0, and contains documents numbered from 1 to 50,220,423. The second segment has a docno offset of 50,220,423 and begins with docno 50,220,424. By convention, docnos are numbered starting at one because it is impossible to code zero using certain schemes (e.g., gamma codes).
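The docno-to-position mapping described above can be sketched with a small array-backed table. This is only an illustration under stated assumptions: the interface is restated from the method signatures in this document so the example compiles on its own, and `ArrayDocLengthTable` is a hypothetical class, not part of the library.

```java
// Restated from the method signatures in this document so the sketch is self-contained.
interface DocLengthTable {
    float getAvgDocLength();
    int getDocCount();
    int getDocLength(int docno);
    int getDocnoOffset();
}

// Hypothetical array-backed implementation, for illustration only.
final class ArrayDocLengthTable implements DocLengthTable {
    // lengths[i] holds the length (in terms) of docno (docnoOffset + 1 + i).
    private final int[] lengths;
    private final int docnoOffset;
    private final float avgDocLength;

    ArrayDocLengthTable(int[] lengths, int docnoOffset) {
        this.lengths = lengths.clone();
        this.docnoOffset = docnoOffset;
        long total = 0;
        for (int len : lengths) {
            total += len;
        }
        this.avgDocLength = lengths.length == 0 ? 0.0f : (float) total / lengths.length;
    }

    @Override
    public float getAvgDocLength() {
        return avgDocLength;
    }

    @Override
    public int getDocCount() {
        return lengths.length;
    }

    @Override
    public int getDocnoOffset() {
        return docnoOffset;
    }

    @Override
    public int getDocLength(int docno) {
        // Docnos in this partition run from offset + 1 to offset + n.
        int index = docno - docnoOffset - 1;
        if (index < 0 || index >= lengths.length) {
            throw new IllegalArgumentException("docno out of range: " + docno);
        }
        return lengths[index];
    }
}
```

With an offset of 50,220,423, as in the second ClueWeb09 segment above, the first valid docno is 50,220,424 and maps to array position 0; the offset value itself is not a valid docno in that partition.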
Modifier and Type | Method and Description |
---|---|
float | getAvgDocLength(): Returns the average document length. |
int | getDocCount(): Returns the number of documents in the collection. |
int | getDocLength(int docno): Returns the length of a document. |
int | getDocnoOffset(): Returns the docno offset d; the first docno in this collection is d + 1. |
float getAvgDocLength()
int getDocCount()
int getDocLength(int docno)
int getDocnoOffset()
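A caller that needs to visit every document can derive the valid docno range from these four methods. The following is a minimal sketch, assuming the numbering convention described above (docnos from d + 1 to d + n); the interface is restated so the example compiles on its own, and `DocLengthStats` and `longestDocno` are hypothetical names, not part of the library.

```java
// Restated from the method signatures in this document so the sketch is self-contained.
interface DocLengthTable {
    float getAvgDocLength();
    int getDocCount();
    int getDocLength(int docno);
    int getDocnoOffset();
}

// Hypothetical helper, for illustration only.
final class DocLengthStats {
    private DocLengthStats() {}

    // Returns the docno of the longest document, or -1 if the table is empty.
    // On ties, the smallest docno wins.
    static int longestDocno(DocLengthTable table) {
        int first = table.getDocnoOffset() + 1;                   // d + 1
        int last = table.getDocnoOffset() + table.getDocCount();  // d + n
        int best = -1;
        int bestLength = -1;
        for (int docno = first; docno <= last; docno++) {
            int length = table.getDocLength(docno);
            if (length > bestLength) {
                bestLength = length;
                best = docno;
            }
        }
        return best;
    }
}
```

The loop bounds are inclusive on both ends, which follows from docnos starting at one rather than zero.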