
In this assignment you'll be implementing indexing and retrieval for lexical vectors (i.e., bag of words). Specifically, you will build:
Download corpus-enwiki-sample.txt.gz.
The file is 716 MB and has an MD5 checksum of feb030231d6746d3a3c7c59d8c96a0bc.
Once you download the corpus, uncompress it.
The file should be 1.9 GB and have an MD5 checksum of 4fe0c91d72b237fd938406550ad7ec60.
The corpus is a sample from Wikipedia. For this assignment, we'll be treating each sentence from Wikipedia as a separate "document", so in fact we'll be indexing and searching Wikipedia sentences.
Look at the first three lines of the file:
0 Anarchism.0004 Anarchism is usually considered a far-left ideology and much of anarchist economics and anarchist legal philosophy reflects anti-authoritarian interpretations of communism , collectivism , syndicalism , mutualism , or participatory economics .
1 Anarchism.0027 By the time of the French Revolution , some such as the Enraged Ones began to use the term positively in opposition to Jacobin centralisation of power , seeing `` revolutionary government '' as oxymoronic .
2 Anarchism.0037 His aversion to the imposition of a rules-based society led him to denounce as a manifestation of the people 's `` mental enslavement '' the foundations of law , property rights and even the institution of marriage .
The format is the docid (uniquely numbered), followed by a tab, followed by the actual "document" content. The content contains the article title, followed by a tab, followed by a sentence. However, treat the article title and the sentence together as "the document" (for the purposes of indexing).
There are 10,805,214 "documents" (i.e., lines) in corpus-enwiki-sample.txt.
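Given the layout above (docid, tab, article title, tab, sentence), each corpus line can be pulled apart with two tab splits. A minimal helper might look like the following; the name `parse_line` is our own, not part of the stub:

```python
def parse_line(line):
    """Split one corpus line into (docid, title, sentence).

    Layout: docid TAB title TAB sentence; the title and sentence
    together form "the document" for indexing purposes.
    """
    docid, title, sentence = line.rstrip("\n").split("\t", 2)
    return int(docid), title, sentence
```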
It might be unwieldy to work with the entire corpus all at once, especially if you are debugging and want a faster iteration cycle.
Feel free to create a smaller subset of the corpus for this purpose.
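One way to create such a subset is to copy the first n lines of the uncompressed corpus to a new file. The function name and default below are illustrative, not part of the stub:

```python
from itertools import islice

def make_subset(src, dst, n=100_000):
    """Copy the first n lines of the corpus at `src` to a smaller file at `dst`."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(islice(fin, n))
```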
Your first task is to implement the inverted indexing algorithm here, in PySpark, using RDDs. The inverted indexing algorithm you'll implement buffers all the postings and then sorts based on docid. There are inefficiencies associated with this approach, as we discussed in lecture, but that's fine for this assignment.
We provide a stub build_index_uncompressed.py for you to get started.
To guide you through the implementation and to make sure you remain on the right track, you must use this stub code.
Please do not change anything other than the parts specified in the stub code.
In this first implementation, we will build an inverted index that does not use any compression. The entire index will be in plain text and human readable. The text representation of each postings list should look like the following:
term df docid1:tf1 docid2:tf2 ...
That is, each postings list is contained on a line, and each line starts with the term, which is the output key in PySpark.
The term is followed by a tab.
The remainder of the line is the postings list itself (the PySpark value).
Each postings list begins with the df, or document frequency.
The document frequency is the count of the number of documents that contain the term.
After the df, there should be exactly that number of docid:tf tuples (space delimited).
That is, if df=10, there should be exactly 10 docid:tf tuples, each denoting a document that contains the term.
The tf specifies the number of times that the term appears in the document.
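Putting the format rules together, one way to render a single postings list as a line of text is sketched below. The helper name is ours; `postings` is assumed to be a list of (docid, tf) pairs already sorted by docid:

```python
def format_postings_line(term, postings):
    """Render one postings list in the plain-text index format:
    term TAB df docid1:tf1 docid2:tf2 ..."""
    df = len(postings)  # document frequency = number of (docid, tf) pairs
    body = " ".join(f"{docid}:{tf}" for docid, tf in postings)
    return f"{term}\t{df} {body}"
```

For example, a term appearing once in document 3 and twice in document 7 would yield a line whose value part reads `2 3:1 7:2`.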
Your task is to write an indexer that takes a corpus and generates an inverted index in exactly the format specified above.
That is, the output of your PySpark job should be an index directory, and in that index directory there should be a bunch of part files (e.g., part-00000).
Each part file should be a plain-text file, and each line should exactly be in the format above.
Across all the part files in your index directory, there should be one line for each unique term in the vocabulary.
The following is the desired I/O behavior of the program you'll write. We'll issue the following command:
spark-submit build_index_uncompressed.py -input corpus-enwiki-sample.txt -output index-enwiki-uncompressed
This command will execute your program.
The result of executing your program should be an index directory index-enwiki-uncompressed/ that conforms to the specification outlined above.
Use the tokenize function in the stub Python script to tokenize the documents.
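Logically, the buffer-then-sort algorithm amounts to the plain-Python sketch below; in your PySpark job the same dataflow would map onto RDD transformations (e.g., a flatMap emitting (term, (docid, tf)) pairs, grouping by term, and sorting each group by docid). The function names here are illustrative, not part of the stub:

```python
from collections import Counter, defaultdict

def build_index(docs, tokenize):
    """docs: iterable of (docid, text). Returns {term: [(docid, tf), ...]}."""
    index = defaultdict(list)
    for docid, text in docs:
        # Count term frequencies within this document, then buffer one
        # (docid, tf) posting per distinct term.
        for term, tf in Counter(tokenize(text)).items():
            index[term].append((docid, tf))
    # Sort each buffered postings list by docid.
    for postings in index.values():
        postings.sort()
    return dict(index)
```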
Next, you'll write a simple boolean retrieval engine (i.e., search engine) that performs boolean retrieval using the plain-text inverted index you've just built.
Your boolean retrieval engine will operate a term at a time, as discussed in class and described here. However, we'll make a significant simplification in the queries. Instead of supporting arbitrarily nested boolean expressions, you will assume that all queries are interpreted from left to right in a greedy manner. That is:
a OR b AND c
Will be interpreted as:
(a OR b) AND c
That is, you can read the query from left to right, and greedily apply the operator (OR, AND) to the operands.
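The greedy left-to-right interpretation can be sketched as follows. For clarity this sketch operates on docid sets rather than traversing postings lists term at a time (which is what your actual engine should do), and it assumes a well-formed alternating term/operator query; all names are our own:

```python
def evaluate_query(query, fetch_docids):
    """Greedy left-to-right evaluation of AND/OR.

    fetch_docids(term) -> set of docids containing the term.
    """
    tokens = query.split()
    result = fetch_docids(tokens[0])
    i = 1
    while i < len(tokens):
        op, term = tokens[i], tokens[i + 1]
        operand = fetch_docids(term)
        # Apply the operator immediately to the running result.
        result = result & operand if op == "AND" else result | operand
        i += 2
    return result
```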
We provide a stub boolean_retrieval_uncompressed.py for you to get started.
To guide you through the implementation and to make sure you remain on the right track, you must use this stub code.
Please do not change anything other than the parts specified in the stub code.
The following is the desired I/O behavior of the program you'll write. We'll issue the following command:
python boolean_retrieval_uncompressed.py -corpus corpus-enwiki-sample.txt -index index-enwiki-uncompressed -query "waterloo AND canada"
This command will execute your program. The output of executing your program should be, printed to stdout, the results of the query, each result on a line. That is, if there are 10 results to the query, the output should be precisely 10 lines. The results should be sorted in ascending order of docid. Please use the following snippet of code to make the results easy to skim and verify:
sentence_trimmed = sentence[:100] + "..." if len(sentence) > 100 else sentence
print(f"{docid:8} {title:72} {sentence_trimmed}")
For each result, you'll print out the docid, the Wikipedia article "title" (e.g., Anarchism.0004 above), and the sentence.
For the sentence, remove all newlines.
Based on the above snippet: if the resulting string is 100 characters or fewer, print the entire string;
if the string is longer than 100 characters, print only the first 100 characters, followed by the string "...".
So the output will look something like this:
379526 University of Waterloo.0155 St. Paul 's University College is a university college founded by members of the United Church of Ca...
379529 University of Waterloo.0174 The 2016 -- 2017 Times Higher Education World University Rankings placed Waterloo 173rd in the world...
379530 University of Waterloo.0185 The 2016 ARWU ranked for the field of engineering and computer science the university 51 -- 75 and s...
379531 University of Waterloo.0190 In the same year , the ARWU also ranked Waterloo 48th in the world , and second in Canada in the eng...
379532 University of Waterloo.0201 In the Times ' 2018 ranking for the subject of physical sciences , Waterloo placed 96th in the world...
We leave plenty of space for the Wikipedia article "title" because there are some long ones...
Now we're going to add compression to the inverted indexing algorithm. Specifically:
We provide a stub build_index_compressed.py for you to get started.
To guide you through the implementation and to make sure you remain on the right track, you must use this stub code.
Please do not change anything other than the parts specified in the stub code.
Your implementation should use the same algorithm as in the uncompressed version (PySpark, with RDDs).
The only difference is that each postings list should be a bytearray, not a str.
The postings lists should be identical from the logical perspective — that is, consisting of the df and (docid, tf) tuples.
Thus, the entire inverted index should be a dict from terms (i.e., strs) to bytearrays.
The output of your indexer should be a single file, not a directory.
Hint: Take the index and pickle it.
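A minimal sketch of that hint, assuming the final index is a dict from term (str) to bytearray; the helper names are ours, not the stub's:

```python
import pickle

def save_index(index, path):
    """Serialize the whole {term: bytearray} index to a single file."""
    with open(path, "wb") as f:
        pickle.dump(index, f)

def load_index(path):
    """Load the pickled index back into memory for retrieval."""
    with open(path, "rb") as f:
        return pickle.load(f)
```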
The following is the desired I/O behavior of the program you'll write. We'll issue the following command:
spark-submit build_index_compressed.py -input corpus-enwiki-sample.txt -output index-enwiki-compressed
This command will execute your program.
The result of executing your program should be a single file named index-enwiki-compressed that conforms to the specification outlined above.
Use the tokenize function in the stub Python script to tokenize the documents.
Use the vint_encode function in the stub Python script to perform vbyte encoding.
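You should use the stub's `vint_encode` as instructed. Purely for intuition, one common variable-byte convention is sketched below: 7 payload bits per byte, with the high bit marking the final byte of an integer. The stub's encoding may use a different convention, so treat this as an illustration, not the stub's implementation:

```python
def vbyte_encode(n):
    """Encode a non-negative int as a variable-byte sequence.
    Low 7 bits per byte; the high bit is set on the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)  # continuation byte: high bit clear
        n >>= 7
    out.append(n | 0x80)      # terminator byte: high bit set
    return out

def vbyte_decode_one(buf, pos):
    """Decode one integer starting at buf[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b & 0x80:          # high bit set: this was the last byte
            return value, pos
        shift += 7
```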
And, as expected, you'll write a simple boolean retrieval engine (i.e., search engine) that performs boolean retrieval using the compressed inverted index you've just built. The behavior of this retrieval engine should be identical as before, except that it operates over the compressed index.
We provide a stub boolean_retrieval_compressed.py for you to get started.
To guide you through the implementation and to make sure you remain on the right track, you must use this stub code.
Please do not change anything other than the parts specified in the stub code.
Use the vint_decode_one function in the stub Python script to perform vbyte decoding.
The following is the desired I/O behavior of the program you'll write. We'll issue the following command:
python boolean_retrieval_compressed.py -corpus corpus-enwiki-sample.txt -index index-enwiki-compressed -query "waterloo AND canada"
This command will execute your program. The output should be identical to the uncompressed version above.
Procedure is the same as previous assignments. Use this link to create an assignment repo for submission.
Answer the following questions about the sizes of different index variants:
Use du -h to compute index sizes.
Questions (1) and (2) above are exactly the sizes of the indexes created by build_index_uncompressed.py and build_index_compressed.py, respectively.
For question (3), you'll have to tweak code in build_index_compressed.py and rerun.
Note: Your submission should include all compression techniques specified above; for question (3), we just want an index size measurement from a one-off run.
Encode the answers as follows in README.md:
size (1): XX
size (2): XX
size (3): XX
Where XX is the output of du -h, e.g., "124MB".
Run the following queries using boolean_retrieval_uncompressed.py:
"waterloo AND canada"
"waterloo AND belgium"
"waterloo OR stanford AND cheriton"
"big AND data AND hadoop"
Store the results in four separate files, r{1,2,3,4}-uncompressed.txt, respectively.
Run the same queries using boolean_retrieval_compressed.py.
Store the results in four separate files, r{1,2,3,4}-compressed.txt, respectively.
Your assignment repo should have the following files:
README.md (with the index sizes, per above).
build_index_uncompressed.py
boolean_retrieval_uncompressed.py
build_index_compressed.py
boolean_retrieval_compressed.py
r{1,2,3,4}-uncompressed.txt, per above.
r{1,2,3,4}-compressed.txt, per above.
Submit the assignment by committing your edits and pushing your repo back to origin.
build_index_uncompressed.py: 10 points
boolean_retrieval_uncompressed.py: 10 points
build_index_compressed.py: 10 points
boolean_retrieval_compressed.py: 8 points
README.md
What does "following instructions" mean? These are "free points" if you follow the instructions provided in this assignment. These points are to handle the scenario where all your answers are correct, but you did not follow the instructions and that caused us to go out of our way to fix your submission so that it conforms to the instructions. (For example, you removed the ids that we used for tracking, which would make it much more difficult to grade.) In these and other related cases, we will dock points from this category.
Total: 75 points