public class TermPositions extends Object implements Writable
A Hadoop Writable that encodes the positions of term occurrences within a document. Term
occurrences are represented as an array of ints, where each int is a term position. These
objects serve as intermediate values in building document-sorted inverted indexes.
In serialized form, term positions are represented as first-order differences (i.e., position gaps or p-gaps) using Gamma encoding. As an example, let's say a term has a term frequency of 5, at token positions [3, 53, 58, 90, 101]. Such an object would be encoded as the following sequence of ints: 3, 50, 5, 32, 11, each of which is expressed using Gamma codes. Every int except the first represents the difference between the previous term position and the current term position.
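The gap computation in the example above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: `PGapSketch`, `toGaps`, `fromGaps`, and `eliasGamma` are hypothetical names, and `eliasGamma` shows the textbook Elias gamma code for positive integers, which may differ in detail from the bit layout this class writes.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the TermPositions API.
public class PGapSketch {

    // Converts the first tf ascending positions into first-order differences (p-gaps).
    // The first gap equals the first position itself.
    static int[] toGaps(int[] positions, int tf) {
        int[] gaps = new int[tf];
        int prev = 0;
        for (int i = 0; i < tf; i++) {
            gaps[i] = positions[i] - prev;
            prev = positions[i];
        }
        return gaps;
    }

    // Inverts toGaps by accumulating a running sum of the gaps.
    static int[] fromGaps(int[] gaps) {
        int[] positions = new int[gaps.length];
        int running = 0;
        for (int i = 0; i < gaps.length; i++) {
            running += gaps[i];
            positions[i] = running;
        }
        return positions;
    }

    // Textbook Elias gamma code for n >= 1: (bit length - 1) zeros, then n in binary.
    // Assumption: the real class may use a variant of this code.
    static String eliasGamma(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        String binary = Integer.toBinaryString(n);
        StringBuilder code = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) {
            code.append('0');
        }
        return code.append(binary).toString();
    }

    public static void main(String[] args) {
        int[] positions = {3, 53, 58, 90, 101};
        int[] gaps = toGaps(positions, positions.length);
        System.out.println(Arrays.toString(gaps));           // [3, 50, 5, 32, 11]
        System.out.println(Arrays.toString(fromGaps(gaps))); // [3, 53, 58, 90, 101]
        for (int gap : gaps) {
            System.out.println(gap + " -> " + eliasGamma(gap));
        }
    }
}
```

Small gaps get short codes under this scheme, which is why storing differences rather than absolute positions tends to shrink the serialized form.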
Constructor and Description |
---|
TermPositions(): Creates an empty TermPositions object. |
TermPositions(int[] pos, short tf): Creates a TermPositions object with initial parameters. |
Modifier and Type | Method and Description |
---|---|
TermPositions | clone(): Returns a shallow copy of this object. |
static TermPositions | create(byte[] bytes): Factory method for creating TermPositions objects from their raw serialized form. |
static TermPositions | create(DataInput in): Factory method for creating TermPositions objects by reading from a DataInput. |
int | getEncodedSize(): Returns the size (in bits) of the serialized form of this object. |
int[] | getPositions(): Returns the array of term positions. |
short | getTf(): Returns the term frequency. |
void | readFields(DataInput in): Deserializes this object from the given DataInput. |
byte[] | serialize(): Serializes this object and returns the raw serialized form in a byte array. |
void | set(int[] pos, short tf): Sets the term positions and term frequency of this object. |
String | toString(): Generates a human-readable String representation of this object. |
void | write(DataOutput out): Serializes this object to the given DataOutput. |
public TermPositions()
Creates an empty TermPositions object.
public TermPositions(int[] pos, short tf)
Creates a TermPositions object with initial parameters. Note that the length of the term positions array does not need to equal the term frequency; this supports reusing arrays whose size does not match.
Parameters:
pos - array of term positions
tf - the term frequency
public TermPositions clone()
public static TermPositions create(byte[] bytes) throws IOException
Factory method for creating TermPositions objects.
Parameters:
bytes - raw serialized form
Returns:
TermPositions object
Throws:
IOException
public static TermPositions create(DataInput in) throws IOException
Factory method for creating TermPositions objects.
Parameters:
in - source to read from
Returns:
TermPositions object
Throws:
IOException
public int getEncodedSize()
public int[] getPositions()
public short getTf()
public void readFields(DataInput in) throws IOException
Specified by:
readFields in interface Writable
Parameters:
in - data source
Throws:
IOException
public byte[] serialize() throws IOException
Throws:
IOException
public void set(int[] pos, short tf)
Parameters:
pos - array of term positions
tf - the term frequency
public String toString()
public void write(DataOutput out) throws IOException
Specified by:
write in interface Writable
Parameters:
out - where to write the serialized representation
Throws:
IOException
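The write/readFields contract documented above can be sketched with a self-contained stand-in that uses only java.io. `PositionsRoundTrip` is a hypothetical class that mirrors the method names on this page; it is not the real TermPositions, does not depend on Hadoop, and, as a simplifying assumption, writes each gap as a plain 4-byte int instead of a Gamma code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical stand-in for illustration; NOT the real TermPositions class.
public class PositionsRoundTrip {
    private int[] positions = new int[0];
    private short tf = 0;

    // The array may be longer than tf; only the first tf entries are serialized here.
    public void set(int[] pos, short tf) {
        this.positions = pos;
        this.tf = tf;
    }

    public int[] getPositions() { return positions; }

    public short getTf() { return tf; }

    // Writes tf, then each position as a gap from the previous one.
    // Assumption: plain ints instead of Gamma codes, to keep the sketch short.
    public void write(DataOutput out) throws IOException {
        out.writeShort(tf);
        int prev = 0;
        for (int i = 0; i < tf; i++) {
            out.writeInt(positions[i] - prev);
            prev = positions[i];
        }
    }

    // Reads tf, then rebuilds absolute positions by summing the gaps.
    public void readFields(DataInput in) throws IOException {
        tf = in.readShort();
        positions = new int[tf];
        int running = 0;
        for (int i = 0; i < tf; i++) {
            running += in.readInt();
            positions[i] = running;
        }
    }

    // Serializes into an in-memory byte array by delegating to write().
    public byte[] serialize() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        write(out);
        out.flush();
        return bytes.toByteArray();
    }

    // Factory that rebuilds an object from its raw serialized form.
    public static PositionsRoundTrip create(byte[] raw) throws IOException {
        PositionsRoundTrip p = new PositionsRoundTrip();
        p.readFields(new DataInputStream(new ByteArrayInputStream(raw)));
        return p;
    }

    public static void main(String[] args) throws IOException {
        int[] buffer = {3, 53, 58, 90, 101, 0, 0, 0}; // reused array, longer than tf
        PositionsRoundTrip original = new PositionsRoundTrip();
        original.set(buffer, (short) 5);

        PositionsRoundTrip copy = create(original.serialize());
        System.out.println(copy.getTf());
        System.out.println(Arrays.toString(copy.getPositions()));
    }
}
```

In this sketch serialize() and create(byte[]) are thin wrappers around write() and readFields(), so the stream-based and byte-array-based paths share one encoding; whether the real class is structured the same way is not stated on this page.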