├── README.adoc ├── animals.arr ├── tests.arr └── cosine-similarity.arr /README.adoc: -------------------------------------------------------------------------------- 1 | = README 2 | 3 | Implements the cosine similarity algorithm as described in 4 | https://www.geeksforgeeks.org/cosine-similarity/. (Note: We have 5 | another, less-standard algorithm in our examplar assignment 6 | https://cs.brown.edu/courses/csci0190/2023/docdiff.html.) 7 | 8 | We'd like to store the input documents in Google Drive. Currently 9 | the only Drive documents that we can work with are spreadsheets, 10 | so we add a way to extract the relevant text from a spreadsheet. 11 | We will assume that the spreadsheets used for this purpose 12 | contain only one cell, which contains the entire text of the 13 | document. We may need to revisit this based on any size limitations 14 | imposed by Google Spreadsheets. The first step would then be to 15 | spread the text across a single column. 16 | 17 | The functions provided: 18 | 19 | - `cosine-similarity-lists()` which takes two lists of words 20 | (strings) and finds the cosine similarity between them. 21 | 22 | - `cosine-similarity-files()` which takes two Google Drive IDs, 23 | and finds the cosine similarity between their respective 24 | spreadsheet single-cell contents. 25 | 26 | - `cosine-similarity()` which takes two strings, and finds the 27 | cosine similarity of the list of words contained in the 28 | respective strings. 29 | 30 | Internally, a list of words associated with one document is 31 | uniquified, and a (non-mutable) string-dict is created associating each word 32 | with its count. Thus the string-dict associated with a document maps 33 | (only) the words in it to their counts. We don't need to keep 34 | track of any other words that may appear in comparable documents 35 | (unlike docdiff).
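A minimal sketch of the word-count representation described above. It assumes `cosine-similarity.arr` has been loaded, so that `list-of-words-to-sd()` (the helper in that file that builds the string-dict) is in scope; the check name and the sample words are only illustrative.

```pyret
check "word counts for one document":
  sd = list-of-words-to-sd([list: "apple", "banana", "banana"])
  # each distinct word maps to the number of times it occurs
  sd.get-value("apple") is 1
  sd.get-value("banana") is 2
  # a word that never occurs in the document has no entry at all
  sd.has-key("citrus") is false
end
```

Because absent words have no entry, a comparison between two documents only ever needs the keys that the two string-dicts actually contain.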
36 | 37 | The `dot-product()` of two such string-dicts goes over every key 38 | in the first dict and, if that key is also present in the second 39 | dict, multiplies the two counts. The sum of these products is the dot 40 | product. 41 | 42 | To normalize this dot-product (i.e., to bound it between 0 and 1), 43 | we divide by the product of the magnitudes of the two 44 | string-dicts. (The magnitude of a string-dict is the square-root of 45 | its dot-product with itself.) 46 | 47 | == Other types of comparison 48 | 49 | Three other forms of comparison are also provided (with the 50 | same signatures as for cosine similarity above): 51 | 52 | - `simple-equality-lists()`, `simple-equality-files()`, and 53 | `simple-equality()` 54 | check if the words are the same in the same order. Output is 55 | boolean. 56 | 57 | - `bag-equality-lists()`, `bag-equality-files()`, and 58 | `bag-equality()` check if 59 | the Bag Of Words are the same (i.e., order doesn't matter, but 60 | count does). Output is boolean. 61 | 62 | - `angle-difference-lists()`, `angle-difference-files()`, and 63 | `angle-difference()` return the arccos of what the corresponding 64 | `cosine-similarity*` function returns. Output is in degrees. 65 | 66 | == Debugging aid 67 | 68 | The function `string-to-bag()` takes a text (string) and after 69 | collapsing case and removing punctuation, returns a table of 70 | rows, where each row lists a word along with its frequency. 71 | -------------------------------------------------------------------------------- /animals.arr: -------------------------------------------------------------------------------- 1 | # from https://en.wikipedia.org/wiki/Elephants_in_Thailand 2 | elephant-article = "The elephant has been a contributor to Thai society and its icon for many centuries. The elephant has had a considerable impact on Thai culture. The Thai elephant is the official national animal of Thailand. 
The elephant found in Thailand is the Indian elephant, a subspecies of the Asian elephant." 3 | 4 | # from https://en.wikipedia.org/wiki/Polar_bear 5 | polarbear-article = "The polar bear is a large bear native to the Arctic and nearby areas. It is closely related to the brown bear, and the two species can interbreed. The polar bear is the largest extant species of bear and land carnivore, with adult males weighing 300–800 kg. The polar bear is white- or yellowish-furred with black skin and a thick layer of fat." 6 | 7 | # from https://en.wikipedia.org/wiki/Rhinoceros 8 | rhino-article = "Rhinoceroses are some of the largest remaining megafauna: all weigh over half a tonne in adulthood. They have a herbivorous diet, small brains 400–600 g for mammals of their size, one or two horns, and a thick 1.5–5 cm, protective skin formed from layers of collagen positioned in a lattice structure. They generally eat leafy material." 9 | 10 | # from https://en.wikipedia.org/wiki/Blue_whale 11 | bluewhale-article = "The blue whale is a marine mammal and a baleen whale. Reaching a maximum confirmed length of 29.9 m and weighing up to 199 tons, it is the largest animal known ever to have existed. The blue whale's long and slender body can be of various shades of greyish-blue on its upper surface and somewhat lighter underneath." 12 | 13 | # from https://en.wikipedia.org/wiki/Snow_leopard 14 | snowleopard-article = "The snow leopard is a species of large cat in the genus Panthera of the family Felidae. The species is native to the mountain ranges of Central and South Asia. It is listed as Vulnerable on the IUCN Red List because the global population is estimated to number fewer than 10,000 mature individuals and is expected to decline about 10% by 2040." 15 | 16 | # from https://en.wikipedia.org/wiki/Manatee 17 | manatee-article = "Manatees are herbivores and eat over 60 different freshwater and saltwater plants. 
Manatees inhabit the shallow, marshy coastal areas and rivers of the Caribbean Sea, the Gulf of Mexico, the Amazon basin, and West Africa. The main causes of death for manatees are human-related issues, such as habitat destruction and human objects." 18 | 19 | # from https://en.wikipedia.org/wiki/Chimpanzee 20 | chimpanzee-article = "The chimpanzee lives in groups that range in size from 15 to 150 members, although individuals travel and forage in much smaller groups during the day. The species lives in a strict male-dominated hierarchy, where disputes are generally settled without the need for violence. Nearly all chimpanzee populations have been recorded using tools, modifying sticks, rocks, grass and leaves and using them for hunting and acquiring honey, termites, ants, nuts and water." 21 | 22 | # from https://en.wikipedia.org/wiki/American_badger 23 | badger-article = "The American badger is a North American badger similar in appearance to the European badger, although not closely related. It is found in the western, central, and northeastern United States, northern Mexico, and south-central Canada to certain areas of southwestern British Columbia. The American badger's habitat is typified by open grasslands with available prey (such as mice, squirrels, and groundhogs)." 24 | 25 | # from https://en.wikipedia.org/wiki/Snail 26 | snail-article = "Snails can be found in a very wide range of environments, including ditches, deserts, and the abyssal depths of the sea. Although land snails may be more familiar to laymen, marine snails constitute the majority of snail species, and have much greater diversity and a greater biomass. Numerous kinds of snail can also be found in fresh water." 27 | 28 | # from https://en.wikipedia.org/wiki/Hamster 29 | hamster-article = "Hamsters feed primarily on seeds, fruits, vegetation, and occasionally burrowing insects. In the wild, they are crepuscular: they forage during the twilight hours. 
In captivity, however, they are known to live a conventionally nocturnal lifestyle, waking around sundown to feed and exercise. Physically, they are stout-bodied with distinguishing features that include elongated cheek pouches extending to their shoulders, which they use to carry food back to their burrows, as well as a short tail and fur-covered feet." 30 | 31 | # from https://en.wikipedia.org/wiki/Giraffe 32 | giraffe-article = "The giraffe's distinguishing characteristics are its extremely long neck and legs, horn-like ossicones, and spotted coat patterns. It is classified under the family Giraffidae, along with its closest extant relative, the okapi. Its scattered range extends from Chad in the north to South Africa in the south and from Niger in the west to Somalia in the east." 33 | 34 | # from https://en.wikipedia.org/wiki/Hippopotamus 35 | hippo-article = "Hippos inhabit rivers, lakes, and mangrove swamps. Territorial bulls each preside over a stretch of water and a group of five to thirty cows and calves. Mating and birth both occur in the water. During the day, hippos remain cool by staying in water or mud, emerging at dusk to graze on grasses. While hippos rest near each other in the water, grazing is a solitary activity and hippos typically do not display territorial behaviour on land. Hippos are among the most dangerous animals in the world due to their aggressive and unpredictable nature. 
" 36 | 37 | standard-named-articles = [list: 38 | [list: "elephant", elephant-article], 39 | [list: "polarbear", polarbear-article], 40 | [list: "rhino", rhino-article], 41 | [list: "bluewhale", bluewhale-article], 42 | [list: "snowleopard", snowleopard-article], 43 | [list: "manatee", manatee-article], 44 | [list: "chimpanzee", chimpanzee-article], 45 | [list: "badger", badger-article], 46 | [list: "snail", snail-article], 47 | [list: "hamster", hamster-article], 48 | [list: "giraffe", giraffe-article], 49 | [list: "hippo", hippo-article], 50 | ] 51 | 52 | student-article = elephant-article 53 | 54 | fun distance-to-helper(candidate-article :: String, corpus :: List, ignore-stop-words :: Boolean) -> Table block: 55 | var candidate-words = string-to-list-of-natlang-words(candidate-article) 56 | if ignore-stop-words: 57 | candidate-words := remove-stop-words(candidate-words) 58 | else: false 59 | end 60 | var tbl = table: article :: String, difference :: Number end 61 | for each(named-article from corpus) block: 62 | article-name = named-article.get(0) 63 | var article-words = string-to-list-of-natlang-words(named-article.get(1)) 64 | if ignore-stop-words: 65 | article-words := remove-stop-words(article-words) 66 | else: false 67 | end 68 | new-row = tbl.row(article-name, angle-difference-lists(candidate-words, article-words)) 69 | tbl := tbl.add-row(new-row) 70 | end 71 | tbl 72 | end 73 | 74 | fun distance-to(candidate-article): 75 | distance-to-helper(candidate-article, standard-named-articles, false) 76 | end 77 | 78 | fun distance-to-cleaned(candidate-article): 79 | distance-to-helper(candidate-article, standard-named-articles, true) 80 | end 81 | 82 | # try 83 | # 84 | # distance-to(student-article) 85 | # -- this doesn't ignore stop words 86 | 87 | # distance-to-cleaned(student-article) 88 | # -- this does ignore stop words 89 | 90 | -------------------------------------------------------------------------------- /tests.arr: 
-------------------------------------------------------------------------------- 1 | # load cosine-similarity.arr and animals.arr before this file 2 | 3 | check "dot-product": 4 | x-sd = list-of-words-to-sd(string-to-list-of-natlang-words("apple banana citrus")) 5 | y-sd = list-of-words-to-sd(string-to-list-of-natlang-words("apple banana banana citrus citrus citrus")) 6 | dot-product(x-sd, x-sd) is 3 7 | dot-product(y-sd, y-sd) is 14 8 | dot-product(x-sd, y-sd) is 6 9 | dot-product(y-sd, x-sd) is 6 10 | end 11 | 12 | # two sample Google IDs for spreadsheets containing texts 13 | var sheet_id1 = "1CnAGrIMW7W1Qrxtm8ZmJXYcQvkoMbSmzL7Ixw6d4FYQ" 14 | var sheet_id2 = "10ngDjr6ahZICKrSVb6zFnOKRDmqMNeqqJCVEDvxONWs" 15 | 16 | check "simple equality": 17 | 18 | # comparing file to itself shd always yield true 19 | simple-equality-files(sheet_id1, sheet_id1) is true 20 | simple-equality-files(sheet_id2, sheet_id2) is true 21 | 22 | # comparing file to a different file shd always yield false 23 | simple-equality-files(sheet_id1, sheet_id2) is false 24 | simple-equality-files(sheet_id2, sheet_id1) is false 25 | 26 | # comparing a text to itself yields true 27 | simple-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "apple", "orange"]) is true 28 | 29 | # comparing a text to a permuted version of itself yields false 30 | simple-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "apple"]) is false 31 | 32 | # comparing obviously dissimilar texts yields false 33 | simple-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "orange", "orange"]) is false 34 | simple-equality-lists([list: "a", "a", "a", "b", "b", "d", "d", "d", "d", "d"], [list: "a"]) is false 35 | 36 | # comparing exactly similar texts yields true 37 | simple-equality-lists([list: "a", "b", "c", "d"], [list: "a", "b", "c", "d"]) is true 38 | 39 | # same as above, but using single strings instead of lists of words 40 | simple-equality("apple apple 
orange", "apple apple orange") is true 41 | simple-equality("apple apple orange", "apple orange apple") is false 42 | simple-equality("apple apple orange", "apple orange orange orange") is false 43 | simple-equality("a a a b b d d d d d", "a") is false 44 | simple-equality("a b c d", "a b c d") is true 45 | end 46 | 47 | check "bag equality": 48 | 49 | # comparing file to itself shd always yield true 50 | bag-equality-files(sheet_id1, sheet_id1) is true 51 | bag-equality-files(sheet_id2, sheet_id2) is true 52 | 53 | # comparing file to a different file shd always yield false 54 | bag-equality-files(sheet_id1, sheet_id2) is false 55 | bag-equality-files(sheet_id2, sheet_id1) is false 56 | 57 | # comparing a text to itself yields true 58 | bag-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "apple", "orange"]) is true 59 | 60 | # comparing a text to a permuted version of itself yields true 61 | bag-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "apple"]) is true 62 | 63 | # comparing obviously dissimilar texts yields false 64 | bag-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "orange", "orange"]) is false 65 | bag-equality-lists([list: "a", "a", "a", "b", "b", "d", "d", "d", "d", "d"], [list: "a"]) is false 66 | 67 | # comparing exactly similar texts yields true 68 | bag-equality-lists([list: "a", "b", "c", "d"], [list: "a", "b", "c", "d"]) is true 69 | 70 | # same as above, but using single strings instead of lists of words 71 | bag-equality("apple apple orange", "apple apple orange") is true 72 | bag-equality("apple apple orange", "apple orange apple") is true 73 | bag-equality("apple apple orange", "apple orange orange orange") is false 74 | bag-equality("a a a b b d d d d d", "a") is false 75 | bag-equality("a b c d", "a b c d") is true 76 | end 77 | 78 | check "cosine similarity": 79 | 80 | # comparing file to itself shd always yield 1 81 | cosine-similarity-files(sheet_id1, sheet_id1) 
is-roughly 1 82 | cosine-similarity-files(sheet_id2, sheet_id2) is-roughly 1 83 | 84 | # comparing file to a different file shd always yield < 1 85 | cosine-similarity-files(sheet_id1, sheet_id2) satisfies lam(x): x < 1 end 86 | cosine-similarity-files(sheet_id2, sheet_id1) satisfies lam(x): x < 1 end 87 | 88 | # comparing a text to itself yields 1 89 | cosine-similarity-lists([list: "apple", "apple", "orange"], [list: "apple", "apple", "orange"]) is-roughly 1 90 | 91 | # comparing a text to a permuted version of itself also yields 1 92 | cosine-similarity-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "apple"]) is-roughly 1 93 | 94 | # comparing obviously dissimilar texts yields <1 95 | cosine-similarity-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "orange", "orange"]) is-roughly (1 / sqrt(2)) 96 | 97 | cosine-similarity-lists([list: "a", "a", "a", "b", "b", "d", "d", "d", "d", "d"], [list: "a"]) is%(within-rel(0.01)) ~0.49 98 | 99 | # comparing exactly similar texts yields 1 100 | cosine-similarity-lists([list: "doo", "doo", "be", "doo", "be"], [list: "doo", "be", "doo", "be", "doo"]) is-roughly 1 101 | 102 | # same as above, but with single strings rather than lists of words 103 | cosine-similarity("apple apple orange", "apple apple orange") is-roughly 1 104 | cosine-similarity("apple apple orange", "apple orange apple") is-roughly 1 105 | cosine-similarity("apple apple orange", "apple orange orange orange") is-roughly (1 / sqrt(2)) 106 | cosine-similarity("a a a b b d d d d d", "a") is%(within-rel(0.01)) ~0.49 107 | cosine-similarity("doo doo be doo be", "doo be doo be doo") is-roughly 1 108 | end 109 | 110 | check "angle difference": 111 | 112 | # comparing file to itself shd always yield 0 113 | angle-difference-files(sheet_id1, sheet_id1) is-roughly 0 114 | angle-difference-files(sheet_id2, sheet_id2) is-roughly 0 115 | 116 | # comparing file to a different file shd always yield >0 117 | 
angle-difference-files(sheet_id1, sheet_id2) satisfies lam(x): x > 0 end 118 | angle-difference-files(sheet_id2, sheet_id1) satisfies lam(x): x > 0 end 119 | 120 | # comparing a text to itself yields 0 121 | angle-difference-lists([list: "apple", "apple", "orange"], [list: "apple", "apple", "orange"]) is-roughly 0 122 | 123 | # comparing a text to a permuted version of itself also yields 0 124 | angle-difference-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "apple"]) is-roughly 0 125 | 126 | # comparing obviously dissimilar texts yields >0 127 | angle-difference-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "orange", "orange"]) is-roughly ~45 128 | angle-difference-lists([list: "a", "a", "a", "b", "b", "d", "d", "d", "d", "d"], [list: "a"]) is%(within-rel(0.01)) ~60.878 129 | 130 | # comparing exactly similar texts yields 0 131 | angle-difference-lists([list: "doo", "doo", "be", "doo", "be"], [list: "doo", "be", "doo", "be", "doo"]) is%(within-rel(0.01)) 0 132 | 133 | # same as above, but using single strings instead of lists of words 134 | angle-difference("apple apple orange", "apple apple orange") is-roughly 0 135 | angle-difference("apple apple orange", "apple orange apple") is-roughly 0 136 | angle-difference("apple apple orange", "apple orange orange orange") is-roughly ~45 137 | angle-difference("a a a b b d d d d d", "a") is%(within-rel(0.01)) ~60.878 138 | angle-difference("doo doo be doo be", "doo be doo be doo") is-roughly 0 139 | 140 | # 141 | angle-difference-cleaned("apple apple orange", "apple apple orange") is-roughly 0 142 | angle-difference-cleaned("apple apple orange", "apple orange apple") is-roughly 0 143 | angle-difference-cleaned("apple apple orange", "apple orange orange orange") is-roughly ~45 144 | # angle-difference-cleaned("a a a b b d d d d d", "a") is%(within-rel(0.01)) ~90 145 | angle-difference-cleaned("a a a b b d d d d d", "a") raises "cosine-similarity" 146 | 
angle-difference-cleaned("doo doo be doo be", "doo be doo be doo") is-roughly 0 147 | end 148 | 149 | check "string-to-bag": 150 | # the returned bag has columns "word" and "frequency" 151 | S.list-to-list-set(string-to-bag("doo be doo be doo").get-column("word")) is [S.list-set: "be", "doo"] 152 | S.list-to-list-set(string-to-bag("doo be doo be doo").get-column("frequency")) is [S.list-set: 2, 3] 153 | S.list-to-list-set(string-to-bag-cleaned("the whale").get-column("word")) is [S.list-set: "whale"] 154 | S.list-to-list-set(string-to-bag-cleaned("the whale").get-column("frequency")) is [S.list-set: 1] 155 | end 156 | 157 | fun distance-table-get-article-difference(tbl :: Table, art :: String) block: 158 | # this is used only for testing. 159 | # takes the table resulting from a distance-to call, and an article name `art`, 160 | # and returns the similarity associated with `art` 161 | table-rows = tbl.all-rows() 162 | var answer-found = false 163 | var simty = 0 164 | for each(table-row from table-rows) block: 165 | if not(answer-found): 166 | if table-row.get-value('article') == art block: 167 | simty := table-row.get-value('difference') 168 | answer-found := true 169 | else: false 170 | end 171 | else: false 172 | end 173 | end 174 | simty 175 | end 176 | 177 | check "distance-to": 178 | tbl1 = distance-to(elephant-article) # stopwords present 179 | tbl2 = distance-to-cleaned(elephant-article) # stopwords ignored 180 | # 181 | # following checks that the distance between an article and itself is 0 182 | # whether or not stopwords are removed 183 | distance-table-get-article-difference(tbl1, 'elephant') is 0 184 | distance-table-get-article-difference(tbl2, 'elephant') is 0 185 | end 186 | -------------------------------------------------------------------------------- /cosine-similarity.arr: -------------------------------------------------------------------------------- 1 | provide * 2 | 3 | import string-dict as SD 4 | 5 | import gdrive-sheets as GDS 6 | 7 | 
import data-source as DS 8 | 9 | import tables as T 10 | 11 | import sets as S 12 | 13 | fun list-of-words-to-sd(xx :: List) -> SD.StringDict block: 14 | msd = [SD.mutable-string-dict:] 15 | for each(x from xx): 16 | old-value = cases(Option) (msd.get-now(x)): 17 | | none => 0 18 | | some(v) => v 19 | end 20 | msd.set-now(x, old-value + 1) 21 | end 22 | msd.freeze() 23 | end 24 | 25 | lower-case-a-cp = string-to-code-point('a') 26 | lower-case-z-cp = string-to-code-point('z') 27 | 28 | fun is-non-punct(c :: String) -> Boolean: 29 | if (c == ' ') or (c == '\n'): true 30 | else: 31 | c-cp = string-to-code-point(c) 32 | (c-cp >= lower-case-a-cp) and (c-cp <= lower-case-z-cp) 33 | end 34 | end 35 | 36 | fun is-non-empty-string(s :: String) -> Boolean: 37 | s <> '' 38 | end 39 | 40 | fun massage-string(w :: String) -> String: 41 | fold(lam(string-a, string-b): string-a + string-b end, '', string-explode(string-to-lower(w)).filter(is-non-punct)) 42 | end 43 | 44 | fun string-to-list-of-natlang-words(s :: String) -> List: 45 | string-split-all(massage-string(string-to-lower(s)), ' ').filter(is-non-empty-string) 46 | end 47 | 48 | # stop words from https://dl.acm.org/doi/pdf/10.1145/378881.378888, Appendix A, Christopher Fox 49 | 50 | standard-stop-words = [list: "the", "and", "a", "that", "was", "for", "with", "not", "on", "at", "i", "had", "are", "or", "an", "they", "one", "would", "all", "there", "their", "him", "has", "when", "if", "out", "what", "up", "about", "into", "can", "other", "some", "time", "two", "then", "do", "now", "such", "man", "our", "even", "made", "after", "many", "must", "years", "much", "your", "down", "should", "of", "to", "in", "is", "he", "it", "as", "his", "be", "by", "this", "but", "from", "have", "you", "which", "were", "her", "she", "will", "we", "been", "who", "more", "no", "so", "said", "its", "than", "them", "only", "new", "could", "these", "may", "first", "any", "my", "like", "over", "me", "most", "also", "did", "before", "through", 
"where", "back", "way", "well", "because", "each", "people", "state", "mr", "how", "make", "still", "own", "work", "long", "both", "under", "never", "same", "while", "last", "might", "day", "since", "come", "great", "three", "go", "few", "use", "without", "place", "old", "small", "home", "went", "once", "school", "every", "united", "number", "does", "away", "water", "fact", "though", "enough", "almost", "took", "night", "system", "general", "better", "why", "end", "find", "asked", "going", "knew", "toward", "just", "those", "too", "world", "very", "good", "see", "men", "here", "get", "between", "year", "another", "being", "life", "know", "us", "off", "against", "came", "right", "states", "take", "himself", "during", "again", "around", "however", "mrs", "thought", "part", "high", "upon", "say", "used", "war", "until", "always", "something", "public", "put", "think", "head", "far", "hand", "set", "nothing", "point", "house", "later", "eyes", "next", "program", "give", "white", "room", "social", "young", "present", "order", "second", "possible", "light", "face", "important", "among", "early", "need", "within", "business", "felt", "best", "ever", "least", "got", "mind", "want", "others", "although", "open", "area", "done", "certain", "door", "different", "sense", "help", "perhaps", "group", "side", "several", "let", "national", "given", "rather", "per", "often", "god", "things", "large", "big", "become", "case", "along", "four", "power", "saw", "less", "thing", "today", "interest", "turned", "members", "family", "problem", "kind", "began", "thus", "seemed", "whole", "itself"] 51 | 52 | 53 | fun remove-stop-words(list-of-words :: List): 54 | list-of-words.filter(lam(w): not(standard-stop-words.member(w)) end) 55 | end 56 | 57 | fun string-to-bag-helper(str :: String, ignore-stop-words :: Boolean) -> Table block: 58 | var candidate-words = string-to-list-of-natlang-words(str) 59 | if ignore-stop-words: 60 | candidate-words := remove-stop-words(candidate-words) 61 | else: 
false 62 | end 63 | sd = list-of-words-to-sd(candidate-words) 64 | var tbl = table: word :: String, frequency :: Number end 65 | words = sd.keys().to-list() 66 | for each(word from words): 67 | new-row = tbl.row(word, sd.get-value(word)) 68 | tbl := tbl.add-row(new-row) 69 | end 70 | tbl 71 | end 72 | 73 | fun string-to-bag(str :: String) -> Table: 74 | string-to-bag-helper(str, false) 75 | end 76 | 77 | fun string-to-bag-cleaned(str :: String) -> Table: 78 | string-to-bag-helper(str, true) 79 | end 80 | 81 | fun dot-product(sd1 :: SD.StringDict, sd2 :: SD.StringDict) -> Number block: 82 | var n = 0 83 | sd1-key-list = sd1.keys-list() 84 | for each(key from sd1-key-list) block: 85 | if sd2.has-key(key): 86 | n := n + (sd1.get-value(key) * sd2.get-value(key)) 87 | else: false 88 | end 89 | end 90 | n 91 | end 92 | 93 | fun get-spreadsheet-string(ss :: Any) -> String: 94 | ws = GDS.open-sheet-by-index(ss, 0, false) 95 | tbl = load-table: text :: String 96 | source: ws 97 | sanitize text using DS.string-sanitizer 98 | end 99 | entire-col = extract text from tbl end 100 | entire-col.get(0) 101 | end 102 | 103 | fun get-spreadsheet-words(ss :: Any) -> List: 104 | cell-string = get-spreadsheet-string(ss) 105 | string-to-list-of-natlang-words(cell-string) 106 | end 107 | 108 | # *-similarity-lists functions: These compare lists of strings 109 | 110 | fun simple-equality-lists(words1 :: List, words2 :: List) -> Boolean: 111 | words1 == words2 112 | end 113 | 114 | fun bag-equality-lists(words1 :: List, words2 :: List) -> Boolean: 115 | sd1 = list-of-words-to-sd(words1) 116 | sd2 = list-of-words-to-sd(words2) 117 | sd1 == sd2 118 | end 119 | 120 | fun cosine-similarity-lists(words1 :: List, words2 :: List) -> Number: 121 | sd1 = list-of-words-to-sd(words1) 122 | sd2 = list-of-words-to-sd(words2) 123 | # we are NOT using 124 | # cosine similarity as defined in standard Pyret assignment docdiff, which is 125 | # dot-product(sd1, sd2) / num-max(dot-product(sd1, sd1), 
dot-product(sd2, sd2)) 126 | 127 | # the usual cosine similarity, as described in 128 | # https://en.wikipedia.org/wiki/Cosine_similarity 129 | if sd1 == sd2: 1 130 | else if (sd1.count() == 0) or (sd2.count() == 0): 131 | raise('cosine-similarity is undefined when one arg is empty and the other isn\'t; given ' + to-string(words1) + ' and ' + to-string(words2)) 132 | else: 133 | dot-product(sd1, sd2) / (sqrt(dot-product(sd1, sd1)) * sqrt(dot-product(sd2, sd2))) 134 | end 135 | end 136 | 137 | fun angle-difference-lists(words1 :: List, words2 :: List) -> Number: 138 | cos-sim = cosine-similarity-lists(words1, words2) 139 | (num-acos(cos-sim) * 180) / 3.14159265 140 | end 141 | 142 | # *-similarity functions: These compare string inputs directly 143 | 144 | fun simple-equality(string1 :: String, string2 :: String) -> Boolean: 145 | # either use straight string comparison, or 146 | # massage the argument strings (converting to lower case, removing punctuation) before comparing 147 | # 148 | # string1 == string2 149 | words1 = string-to-list-of-natlang-words(string1) 150 | words2 = string-to-list-of-natlang-words(string2) 151 | simple-equality-lists(words1, words2) 152 | end 153 | 154 | fun bag-equality(string1 :: String, string2 :: String) -> Boolean: 155 | words1 = string-to-list-of-natlang-words(string1) 156 | words2 = string-to-list-of-natlang-words(string2) 157 | bag-equality-lists(words1, words2) 158 | end 159 | 160 | fun cosine-similarity(string1 :: String, string2 :: String) -> Number: 161 | words1 = string-to-list-of-natlang-words(string1) 162 | words2 = string-to-list-of-natlang-words(string2) 163 | cosine-similarity-lists(words1, words2) 164 | end 165 | 166 | fun angle-difference(string1 :: String, string2 :: String) -> Number: 167 | cos-sim = cosine-similarity(string1, string2) 168 | (num-acos(cos-sim) * 180) / 3.14159265 169 | end 170 | 171 | fun simple-equality-cleaned(string1 :: String, string2 :: String) -> Boolean: 172 | words1 = 
remove-stop-words(string-to-list-of-natlang-words(string1)) 173 | words2 = remove-stop-words(string-to-list-of-natlang-words(string2)) 174 | simple-equality-lists(words1, words2) 175 | end 176 | 177 | fun bag-equality-cleaned(string1 :: String, string2 :: String) -> Boolean: 178 | words1 = remove-stop-words(string-to-list-of-natlang-words(string1)) 179 | words2 = remove-stop-words(string-to-list-of-natlang-words(string2)) 180 | bag-equality-lists(words1, words2) 181 | end 182 | 183 | fun cosine-similarity-cleaned(string1 :: String, string2 :: String) -> Number: 184 | words1 = remove-stop-words(string-to-list-of-natlang-words(string1)) 185 | words2 = remove-stop-words(string-to-list-of-natlang-words(string2)) 186 | cosine-similarity-lists(words1, words2) 187 | end 188 | 189 | fun angle-difference-cleaned(string1 :: String, string2 :: String) -> Number: 190 | words1 = remove-stop-words(string-to-list-of-natlang-words(string1)) 191 | words2 = remove-stop-words(string-to-list-of-natlang-words(string2)) 192 | angle-difference-lists(words1, words2) 193 | end 194 | 195 | # *-similarity-files: These compare files (Google Drive IDs) containing the respective contents. 
196 | # format: headerless spreadsheet with just one cell containing a string 197 | 198 | fun simple-equality-files(file1 :: String, file2 :: String) -> Boolean: 199 | ss1 = GDS.load-spreadsheet(file1) 200 | ss2 = GDS.load-spreadsheet(file2) 201 | string1 = get-spreadsheet-string(ss1) 202 | string2 = get-spreadsheet-string(ss2) 203 | simple-equality(string1, string2) 204 | end 205 | 206 | fun bag-equality-files(file1 :: String, file2 :: String) -> Boolean: 207 | ss1 = GDS.load-spreadsheet(file1) 208 | ss2 = GDS.load-spreadsheet(file2) 209 | words1 = get-spreadsheet-words(ss1) 210 | words2 = get-spreadsheet-words(ss2) 211 | bag-equality-lists(words1, words2) 212 | end 213 | 214 | fun cosine-similarity-files(file1 :: String, file2 :: String) -> Number: 215 | ss1 = GDS.load-spreadsheet(file1) 216 | ss2 = GDS.load-spreadsheet(file2) 217 | words1 = get-spreadsheet-words(ss1) 218 | words2 = get-spreadsheet-words(ss2) 219 | cosine-similarity-lists(words1, words2) 220 | end 221 | 222 | fun angle-difference-files(file1 :: String, file2 :: String) -> Number: 223 | cos-sim = cosine-similarity-files(file1, file2) 224 | (num-acos(cos-sim) * 180) / 3.14159265 225 | end 226 | --------------------------------------------------------------------------------