├── .gitignore ├── README.md ├── documents ├── lotr.txt ├── the_hobbit.txt ├── rainbows_end.txt └── silmarillion.txt └── vsm.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *~ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | The vector space model for documents 2 | ==================================== 3 | 4 | The program vsm.py implements a toy search engine to illustrate the 5 | vector space model for documents. 6 | 7 | The program asks you to enter a search query, and then returns all 8 | documents matching the query, in decreasing order of cosine 9 | similarity, according to the vector space model. 10 | 11 | The document corpus consists of just four documents, which are product 12 | descriptions of popular books, taken from Amazon.com. -------------------------------------------------------------------------------- /documents/lotr.txt: -------------------------------------------------------------------------------- 1 | One Ring to rule them all, 2 | One Ring to find them, 3 | One Ring to bring them all 4 | and in the darkness bind them. 5 | 6 | In ancient times the Rings of Power were crafted by the Elven-smiths, 7 | and Sauron, The Dark Lord, forged the One Ring, filling it with his 8 | own power so that he could rule all others. But the One Ring was taken 9 | from him, and though he sought it throughout Middle-earth still it 10 | remained lost to him. After many ages it fell, by chance, into the 11 | hands of the hobbit, Bilbo Baggins. 12 | 13 | From his fastness in the Dark Tower of Mordor, Sauron's power spread 14 | far and wide. He gathered all the Great Rings to him, but ever he 15 | searched far and wide for the One Ring that would complete his 16 | dominion. 
17 | 18 | On his eleventy-first birthday, Bilbo dissapeared bequeathing to his 19 | young cousin, Frodo, the Ruling Ring, and a perilous quest: to journey 20 | across Middle-earth, deep into the shadow of the Dark Lord and destroy 21 | the Ring by casting it into the Cracks of Doom. 22 | 23 | The Lord of the Rings tells of the great quest undertaken by Frodo and 24 | the Fellowship of the Ring: Gandalf the wizard, the hobbits Merry, 25 | Pippin and Sam, Gimli the Dwarf, Legolas the Elf, Boromir of Gondor, 26 | and a tall, mysterious stranger called Strider. 27 | -------------------------------------------------------------------------------- /documents/the_hobbit.txt: -------------------------------------------------------------------------------- 1 | "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet 2 | hole, filled with the ends of worms and an oozy smell, nor yet a dry, 3 | bare, sandy hole with nothing in it to sit down on or to eat: it was a 4 | hobbit-hole, and that means comfort." 5 | 6 | The hobbit-hole in question belongs to one Bilbo Baggins, an 7 | upstanding member of a "little people, about half our height, and 8 | smaller than the bearded dwarves." He is, like most of his kind, well 9 | off, well fed, and best pleased when sitting by his own fire with a 10 | pipe, a glass of good beer, and a meal to look forward to. Certainly 11 | this particular hobbit is the last person one would expect to see set 12 | off on a hazardous journey; indeed, when Gandalf the Grey stops by one 13 | morning, "looking for someone to share in an adventure," Baggins 14 | fervently wishes the wizard elsewhere. No such luck, however; soon 13 15 | fortune-seeking dwarves have arrived on the hobbit's doorstep in 16 | search of a burglar, and before he can even grab his hat or an 17 | umbrella, Bilbo Baggins is swept out his door and into a dangerous 18 | adventure. 
19 | 20 | The dwarves' goal is to return to their ancestral home in the Lonely 21 | Mountains and reclaim a stolen fortune from the dragon Smaug. Along 22 | the way, they and their reluctant companion meet giant spiders, 23 | hostile elves, ravening wolves--and, most perilous of all, a 24 | subterranean creature named Gollum from whom Bilbo wins a magical ring 25 | in a riddling contest. It is from this life-or-death game in the dark 26 | that J.R.R. Tolkien's masterwork, The Lord of the Rings, would 27 | eventually spring. Though The Hobbit is lighter in tone than the 28 | trilogy that follows, it has, like Bilbo Baggins himself, unexpected 29 | iron at its core. Don't be fooled by its fairy-tale demeanor; this is 30 | very much a story for adults, though older children will enjoy it, 31 | too. By the time Bilbo returns to his comfortable hobbit-hole, he is a 32 | different person altogether, well primed for the bigger adventures to 33 | come--and so is the reader. 34 | -------------------------------------------------------------------------------- /documents/rainbows_end.txt: -------------------------------------------------------------------------------- 1 | Robert Gu is a recovering Alzheimer's patient. The world that he 2 | remembers was much as we know it today. Now, as he regains his 3 | faculties through a cure developed during the years of his near-fatal 4 | decline, he discovers that the world has changed and so has his place 5 | in it. He was a world-renowned poet. Now he is seventy-five years old, 6 | though by a medical miracle he looks much younger, and he's starting 7 | over, for the first time unsure of his poetic gifts. Living with his 8 | son's family, he has no choice but to learn how to cope with a new 9 | information age in which the virtual and the real are a seamless 10 | continuum, layers of reality built on digital views seen by a single 11 | person or millions, depending on your choice. 
But the consensus 12 | reality of the digital world is available only if, like his 13 | thirteen-year-old granddaughter Miri, you know how to wear your 14 | wireless access--through nodes designed into smart clothes--and to see 15 | the digital context--through smart contact lenses. 16 | 17 | With knowledge comes risk. When Robert begins to re-train at Fairmont 18 | High, learning with other older people what is second nature to Miri 19 | and other teens at school, he unwittingly becomes part of a 20 | wide-ranging conspiracy to use technology as a tool for world 21 | domination. 22 | 23 | In a world where every computer chip has Homeland Security built-in, 24 | this conspiracy is something that baffles even the most sophisticated 25 | security analysts, including Robert's son and daughter-in law, two top 26 | people in the U.S. military. And even Miri, in her attempts to protect 27 | her grandfather, may be entangled in the plot. 28 | 29 | As Robert becomes more deeply involved in conspiracy, he is shocked to 30 | learn of a radical change planned for the UCSD Geisel Library; all the 31 | books there, and worldwide, would cease to physically exist. He and 32 | his fellow re-trainees feel compelled to join protests against the 33 | change. With forces around the world converging on San Diego, both the 34 | conspiracy and the protest climax in a spectacular moment as unique 35 | and satisfying as it is unexpected. This is science fiction at its 36 | very best, by a master storyteller at his peak. 37 | -------------------------------------------------------------------------------- /documents/silmarillion.txt: -------------------------------------------------------------------------------- 1 | The Music of the Ainur 2 | 3 | There was Eru, the One, who in Arda is called Ilúvatar; and he made 4 | first the Ainur, the Holy Ones, that were the offspring of his 5 | thought, and they were with him before aught else was made. 
And he 6 | spoke to them, propounding to them themes of music; and they sang 7 | before him, and he was glad. But for a long while they sang only each 8 | alone, or but few together, while the rest hearkened; for each 9 | comprehended only that part of the mind of Ilúvatar from which he 10 | came, and in the understanding of their brethren they grew but 11 | slowly. Yet ever as they listened they came to deeper understanding, 12 | and increased in unison and harmony. 13 | 14 | And it came to pass that Ilúvatar called together all the Ainur and 15 | declared to them a mighty theme, unfolding to them things greater and 16 | more wonderful than he had yet revealed; and the glory of its 17 | beginning and the splendour of its end amazed the Ainur, so that they 18 | bowed before Ilúvatar and were silent. 19 | 20 | Then Ilúvatar said to them: ‘Of the theme that I have declared to you, 21 | I will now that ye make in harmony together a Great Music. And since I 22 | have kindled you with the Flame Imperishable, ye shall show forth your 23 | powers in adorning this theme, each with his own thoughts and devices, 24 | if he will. But I will sit and hearken, and be glad that through you 25 | great beauty has been wakened into song. 26 | 27 | Then the voices of the Ainur, like unto harps and lutes, and pipes and 28 | trumpets, and viols and organs, and like unto countless choirs singing 29 | with words, began to fashion the theme of Ilúvatar to a great music; 30 | and a sound arose of endless interchanging melodies woven in harmony 31 | that passed beyond hearing into the depths and into the heights, and 32 | the places of the dwelling of Ilúvatar were filled to overflowing, and 33 | the music and the echo of the music went out into the Void, and it was 34 | not void. 
Never since have the Ainur made any music like to this 35 | music, though it has been said that a greater still shall be made 36 | before Ilúvatar by the choirs of the Ainur and the Children of 37 | Ilúvatar after the end of days. Then the themes of Ilúvatar shall be 38 | played aright, and take Being in the moment of their utterance, for 39 | all shall then understand fully his intent in their part, and each 40 | shall know the comprehension of each, and Ilúvatar shall give to their 41 | thoughts the secret fire, being well pleased. 42 | 43 | But now Ilúvatar sat and hearkened, and for a great while it seemed 44 | good to him, for in the music there were no flaws. But as the theme 45 | progressed, it came into the heart of Melkor to interweave matters of 46 | his own imagining that were not in accord with the theme of Ilúvatar; 47 | for he sought therein to increase the power and glory of the part 48 | assigned to himself. To Melkor among the Ainur had been given the 49 | greatest gifts of power and knowledge, and he had a share in all the 50 | gifts of his brethren. He had gone often alone into the void places 51 | seeking the Imperishable Flame; for desire grew hot within him to 52 | bring into Being things of his own, and it seemed to him that Ilúvatar 53 | took no thought for the Void, and he was impatient of its 54 | emptiness. Yet he found not the Fire, for it is with Ilúvatar. But 55 | being alone he had begun to conceive thoughts of his own unlike those 56 | of his brethren. 57 | 58 | Some of these thoughts he now wove into his music, and straight-way 59 | discord arose about him, and many that sang nigh him grew despondent, 60 | and their thought was disturbed and their music faltered; but some 61 | began to attune their music to his rather than to the thought which 62 | they had at first. Then the discord of Melkor spread ever wider, and 63 | the melodies which had been heard before foundered in a sea of 64 | turbulent sound. 
But Ilúvatar sat and hearkened until it seemed that 65 | about his throne there was a raging storm, as of dark waters that made 66 | war one upon another in an endless wrath that would not be assuaged. 67 | 68 | Then Ilúvatar arose, and the Ainur perceived that he smiled; and he 69 | lifted up his left hand, and a new theme began amid the storm, like 70 | and yet unlike to the former theme, and it gathered power and had new 71 | beauty. But the discord of Melkor rose in uproar and contended with 72 | it, and again there was a war of sound more violent than before, until 73 | many of the Ainur were dismayed and sang no longer, and Melkor had the 74 | mastery. Then again Ilúvatar arose, and the Ainur perceived that his 75 | countenance was stern; and he lifted up his right hand, and behold! a 76 | third theme grew amid the confusion, and it was unlike the others. For 77 | it seemed at first soft and sweet, a mere rippling of gentle sounds in 78 | delicate melodies; but it could not be quenched, and it took to itself 79 | power and profundity. And it seemed at last that there were two musics 80 | progressing at one time before the seat of Ilúvatar, and they were 81 | utterly at variance. The one was deep and wide and beautiful, but slow 82 | and blended with an immeasurable sorrow, from which its beauty chiefly 83 | came. The other had now achieved a unity of its own; but it was loud, 84 | and vain, and endlessly repeated; and it had little harmony, but 85 | rather a clamorous unison as of many trumpets braying upon a few 86 | notes. And it essayed to drown the other music by the violence of its 87 | voice, but it seemed that its most triumphant notes were taken by the 88 | other and woven into its own solemn pattern. 89 | 90 | In the midst of this strife, whereat the halls of Ilúvatar shook and a 91 | tremor ran out into the silences yet unmoved, Ilúvatar arose a third 92 | time, and his face was terrible to behold. 
Then he raised up both his 93 | hands, and in one chord, deeper than the Abyss, higher than the 94 | Firmament, piercing as the light of the eye of Ilúvatar, the Music 95 | ceased. 96 | -------------------------------------------------------------------------------- /vsm.py: -------------------------------------------------------------------------------- 1 | """vsm.py implements a toy search engine to illustrate the vector 2 | space model for documents. 3 | 4 | It asks you to enter a search query, and then returns all documents 5 | matching the query, in decreasing order of cosine similarity, 6 | according to the vector space model.""" 7 | 8 | from collections import defaultdict 9 | import math 10 | import sys 11 | 12 | # We use a corpus of four documents. Each document has an id, and 13 | # these are the keys in the following dict. The values are the 14 | # corresponding filenames. 15 | document_filenames = {0 : "documents/lotr.txt", 16 | 1 : "documents/silmarillion.txt", 17 | 2 : "documents/rainbows_end.txt", 18 | 3 : "documents/the_hobbit.txt"} 19 | 20 | # The size of the corpus 21 | N = len(document_filenames) 22 | 23 | # dictionary: a set to contain all terms (i.e., words) in the document 24 | # corpus. 25 | dictionary = set() 26 | 27 | # postings: a defaultdict whose keys are terms, and whose 28 | # corresponding values are the so-called "postings list" for that 29 | # term, i.e., the list of documents the term appears in. 30 | # 31 | # The way we implement the postings list is actually not as a Python 32 | # list. Rather, it's as a dict whose keys are the document ids of 33 | # documents that the term appears in, with corresponding values equal 34 | # to the frequency with which the term occurs in the document. 35 | # 36 | # As a result, postings[term] is the postings list for term, and 37 | # postings[term][id] is the frequency with which term appears in 38 | # document id. 
39 | postings = defaultdict(dict) 40 | 41 | # document_frequency: a defaultdict whose keys are terms, with 42 | # corresponding values equal to the number of documents which contain 43 | # the key, i.e., the document frequency. 44 | document_frequency = defaultdict(int) 45 | 46 | # length: a defaultdict whose keys are document ids, with values equal 47 | # to the Euclidean length of the corresponding document vector. 48 | length = defaultdict(float) 49 | 50 | # The list of characters (mostly, punctuation) we want to strip out of 51 | # terms in the document. 52 | characters = " .,!#$%^&*();:\n\t\\\"?!{}[]<>" 53 | 54 | def main(): 55 | initialize_terms_and_postings() 56 | initialize_document_frequencies() 57 | initialize_lengths() 58 | while True: 59 | do_search() 60 | 61 | def initialize_terms_and_postings(): 62 | """Reads in each document in document_filenames, splits it into a 63 | list of terms (i.e., tokenizes it), adds new terms to the global 64 | dictionary, and adds the document to the posting list for each 65 | term, with value equal to the frequency of the term in the 66 | document.""" 67 | global dictionary, postings 68 | for id in document_filenames: 69 | f = open(document_filenames[id],'r') 70 | document = f.read() 71 | f.close() 72 | terms = tokenize(document) 73 | unique_terms = set(terms) 74 | dictionary = dictionary.union(unique_terms) 75 | for term in unique_terms: 76 | postings[term][id] = terms.count(term) # the value is the 77 | # frequency of the 78 | # term in the 79 | # document 80 | 81 | def tokenize(document): 82 | """Returns a list whose elements are the separate terms in 83 | document. Something of a hack, but for the simple documents we're 84 | using, it's okay. 
Note that we case-fold when we tokenize, i.e., 85 | we lowercase everything.""" 86 | terms = document.lower().split() 87 | return [term.strip(characters) for term in terms] 88 | 89 | def initialize_document_frequencies(): 90 | """For each term in the dictionary, count the number of documents 91 | it appears in, and store the value in document_frequency[term].""" 92 | global document_frequency 93 | for term in dictionary: 94 | document_frequency[term] = len(postings[term]) 95 | 96 | def initialize_lengths(): 97 | """Computes the length for each document.""" 98 | global length 99 | for id in document_filenames: 100 | l = 0 101 | for term in dictionary: 102 | l += imp(term,id)**2 103 | length[id] = math.sqrt(l) 104 | 105 | def imp(term,id): 106 | """Returns the importance of term in document id. If the term 107 | isn't in the document, then return 0.""" 108 | if id in postings[term]: 109 | return postings[term][id]*inverse_document_frequency(term) 110 | else: 111 | return 0.0 112 | 113 | def inverse_document_frequency(term): 114 | """Returns the inverse document frequency of term. Note that if 115 | term isn't in the dictionary then it returns 0, by convention.""" 116 | if term in dictionary: 117 | return math.log(float(N)/document_frequency[term],2) 118 | else: 119 | return 0.0 120 | 121 | def do_search(): 122 | """Asks the user what they would like to search for, and returns a 123 | list of relevant documents, in decreasing order of cosine 124 | similarity.""" 125 | query = tokenize(raw_input("Search query >> ")) 126 | if query == []: 127 | sys.exit() 128 | # find document ids containing all query terms. Works by 129 | # intersecting the posting lists for all query terms. 130 | relevant_document_ids = intersection( 131 | [set(postings[term].keys()) for term in query]) 132 | if not relevant_document_ids: 133 | print "No documents matched all query terms." 
134 | else: 135 | scores = sorted([(id,similarity(query,id)) 136 | for id in relevant_document_ids], 137 | key=lambda x: x[1], 138 | reverse=True) 139 | print "Score: filename" 140 | for (id,score) in scores: 141 | print str(score)+": "+document_filenames[id] 142 | 143 | def intersection(sets): 144 | """Returns the intersection of all sets in the list sets. Requires 145 | that the list sets contains at least one element, otherwise it 146 | raises an error.""" 147 | return reduce(set.intersection, [s for s in sets]) 148 | 149 | def similarity(query,id): 150 | """Returns the cosine similarity between query and document id. 151 | Note that we don't bother dividing by the length of the query 152 | vector, since this doesn't make any difference to the ordering of 153 | search results.""" 154 | similarity = 0.0 155 | for term in query: 156 | if term in dictionary: 157 | similarity += inverse_document_frequency(term)*imp(term,id) 158 | similarity = similarity / length[id] 159 | return similarity 160 | 161 | if __name__ == "__main__": 162 | main() 163 | --------------------------------------------------------------------------------
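The weighting and ranking that vsm.py performs can be tried out on a much smaller corpus. The following is a minimal Python 3 sketch and not the repository's code: the three-sentence corpus, the simplified whitespace tokenizer, and the names `weight`, `score`, and `search` are assumptions made for illustration. It keeps the same ideas as vsm.py: postings that map each term to per-document term frequencies, idf computed as log2(N / df), document vectors normalized by their Euclidean length, and a query matched only against documents that contain every query term.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, used only for illustration.
docs = {
    0: "one ring to rule them all",
    1: "the hobbit found a ring",
    2: "a poet learns to wear smart clothes",
}

def tokenize(text):
    """Lowercase and split on whitespace (simpler than vsm.py's tokenizer)."""
    return text.lower().split()

# postings[term][doc_id] = number of times term occurs in that document
postings = {}
for doc_id, text in docs.items():
    for term, tf in Counter(tokenize(text)).items():
        postings.setdefault(term, {})[doc_id] = tf

N = len(docs)

def idf(term):
    """log2(N / df); 0.0 for terms that appear in no document."""
    if term not in postings:
        return 0.0
    # "/" is true division in Python 3, so the ratio is not truncated.
    return math.log(N / len(postings[term]), 2)

def weight(term, doc_id):
    """tf * idf, or 0.0 when the term is absent from the document."""
    return postings.get(term, {}).get(doc_id, 0) * idf(term)

# Euclidean length of each document's tf-idf vector
length = {
    doc_id: math.sqrt(sum(weight(t, doc_id) ** 2 for t in postings))
    for doc_id in docs
}

def score(query_terms, doc_id):
    """Query-document dot product divided by the document length only.

    The query length is skipped because it is the same for every
    document and so does not change the ordering.
    """
    total = sum(idf(t) * weight(t, doc_id) for t in query_terms)
    return total / length[doc_id] if length[doc_id] else 0.0

def search(query):
    """Return (doc_id, score) pairs for documents containing all query terms."""
    terms = tokenize(query)
    if not terms:
        return []
    matching = set(docs)
    for t in terms:
        matching &= set(postings.get(t, {}))
    return sorted(((d, score(terms, d)) for d in matching),
                  key=lambda pair: pair[1], reverse=True)
```

Calling `search("ring")` on this toy corpus matches documents 0 and 1 and ranks document 1 first, because its vector is shorter and the single matching term therefore accounts for a larger share of it. A query containing a term that appears in no document returns an empty list, which mirrors the "all query terms" intersection in `do_search`.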