├── README.adoc
├── animals.arr
├── tests.arr
└── cosine-similarity.arr


/README.adoc:
--------------------------------------------------------------------------------
 1 | = README
 2 | 
 3 | Implements the cosine similarity algorithm as described in
 4 | https://www.geeksforgeeks.org/cosine-similarity/. (Note: We have
 5 | another, less-standard algorithm in our examplar assignment
 6 | https://cs.brown.edu/courses/csci0190/2023/docdiff.html.)
 7 | 
 8 | We'd like to store the input documents in Google Drive. Currently
 9 | the only Drive documents that we can work with are spreadsheets,
10 | so we add a way to extract the relevant text from a spreadsheet.
11 | We will assume that the spreadsheets used for this purpose
12 | contain only one cell, which contains the entire text of the
13 | document. We may need to revisit based on any size limitations
14 | imposed by Google Spreadsheets. The first step then would be to
15 | spread the text across a single column.
16 | 
17 | The functions provided:
18 | 
19 | - `cosine-similarity-lists()` which takes two lists of words
20 |   (strings) and finds the cosine similarity between them.
21 | 
22 | - `cosine-similarity-files()` which takes two Google Drive IDs,
23 |   and finds the cosine similarity between their respective
24 |   spreadsheet single-cell contents.
25 | 
26 | - `cosine-similarity()` which takes two strings, and finds the
27 |   cosine similarity of the list of words contained in the
28 |   respective strings.
29 | 
30 | Internally, a list of words associated with one document is
31 | uniquified, and an immutable string-dict is created associating each word
32 | with its count. Thus the string-dict associated with a document maps
33 | (only) the words in it to their counts. We don't need to keep
34 | track of any other words that may appear in comparable documents
35 | (unlike docdiff).
36 | 
37 | The `dot-product()` of two such string-dicts goes over every key
38 | in the first dict, and if it is also present in the second
39 | dict, multiplies the two counts. The sum of these products is the dot
40 | product.
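
The counting and dot-product steps described above can be sketched in a few lines of Pyret. This is a minimal, self-contained sketch and not the code in `cosine-similarity.arr`: the names `word-counts` and `sd-dot` are hypothetical, and it builds the immutable string-dict with a fold instead of freezing a mutable one. The expected values are the ones used in the `dot-product` check in `tests.arr`.

```pyret
import string-dict as SD

# word-counts and sd-dot are hypothetical names used only in this sketch.

# Build an immutable string-dict mapping each word to its count.
fun word-counts(words :: List<String>) -> SD.StringDict<Number>:
  for fold(sd from [SD.string-dict: ], w from words):
    cases(Option) sd.get(w):
      | none => sd.set(w, 1)
      | some(n) => sd.set(w, n + 1)
    end
  end
end

# Sum the products of the counts of the words that both dicts share.
fun sd-dot(sd1 :: SD.StringDict<Number>, sd2 :: SD.StringDict<Number>) -> Number:
  for fold(acc from 0, key from sd1.keys-list()):
    cases(Option) sd2.get(key):
      | none => acc
      | some(n) => acc + (sd1.get-value(key) * n)
    end
  end
end

check "sd-dot sketch":
  x = word-counts([list: "apple", "banana", "citrus"])
  y = word-counts([list: "apple", "banana", "banana", "citrus", "citrus", "citrus"])
  sd-dot(x, x) is 3
  sd-dot(y, y) is 14
  sd-dot(x, y) is 6
  sd-dot(y, x) is 6
end
```

Words that appear in only one of the two dicts contribute nothing to the sum, which is why neither dict needs entries for the other document's vocabulary.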
41 | 
42 | To normalize this dot-product (i.e., to hem it between 0 and 1),
43 | we divide by the product of the magnitudes of the two
44 | string-dicts. (The magnitude of a string-dict is the square-root of
45 | its dot-product with itself.)
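
A small worked example may make the normalization concrete. The counts are taken from the `apple`/`orange` case in `tests.arr`; the symbols A, B and θ are notation introduced here only for illustration. With A = {apple: 2, orange: 1} and B = {apple: 1, orange: 3}:

[latexmath]
++++
\begin{aligned}
A \cdot B &= 2 \cdot 1 + 1 \cdot 3 = 5 \\
\lVert A \rVert &= \sqrt{2^2 + 1^2} = \sqrt{5}, \qquad \lVert B \rVert = \sqrt{1^2 + 3^2} = \sqrt{10} \\
\cos\theta &= \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{5}{\sqrt{50}} = \frac{1}{\sqrt{2}} \approx 0.707
\end{aligned}
++++

This agrees with the `1 / sqrt(2)` value that the tests expect for this pair of word lists.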
46 | 
47 | == Other types of comparison
48 | 
49 | Three other forms of comparison are also provided (with the
50 | same signatures as for cosine similarity above):
51 | 
52 | - `simple-equality-lists()`, `simple-equality-files()`, and
53 |   `simple-equality()`
54 |   check if the words are the same in the same order. Output is
55 |   boolean.
56 | 
57 | - `bag-equality-lists()`, `bag-equality-files()`, and
58 |   `bag-equality()` check if
59 |   the Bag Of Words are the same (i.e., order doesn't matter, but
60 |   count does). Output is boolean.
61 | 
62 | - `angle-difference-lists()`, `angle-difference-files()`, and
63 |   `angle-difference()` return the arccos of what the corresponding
64 |   `cosine-similarity*` function returns. Output is in degrees.
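
The conversion from a cosine similarity to an angle in degrees can be sketched as follows. `cos-to-degrees` is a hypothetical name used only here; the arithmetic mirrors what the `angle-difference*` functions do, including the hard-coded approximation of pi, so results are approximate and are best compared with `is-roughly`.

```pyret
# cos-to-degrees is a hypothetical name used only in this sketch.
fun cos-to-degrees(cos-sim :: Number) -> Number:
  # num-acos returns radians; scale by 180 / pi to get degrees
  (num-acos(cos-sim) * 180) / 3.14159265
end

check "cos-to-degrees sketch":
  # identical bags of words: similarity 1, angle 0
  cos-to-degrees(1) is-roughly 0
  # the apple/orange pair from tests.arr: similarity 1 / sqrt(2), angle 45
  cos-to-degrees(1 / num-sqrt(2)) is-roughly ~45
  # no words in common: similarity 0, angle 90
  cos-to-degrees(0) is-roughly ~90
end
```

Since word counts are never negative, the similarity lies between 0 and 1, so the angle lies between 0 and 90 degrees; a smaller angle means more similar documents.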
65 | 
66 | == Debugging aid
67 | 
68 | The function `string-to-bag()` takes a text (string) and after
69 | collapsing case and removing punctuation, returns a table of
70 | rows, where each row lists a word along with its frequency.
71 | 


--------------------------------------------------------------------------------
/animals.arr:
--------------------------------------------------------------------------------
 1 | # from https://en.wikipedia.org/wiki/Elephants_in_Thailand
 2 | elephant-article = "The elephant has been a contributor to Thai society and its icon for many centuries. The elephant has had a considerable impact on Thai culture. The Thai elephant is the official national animal of Thailand. The elephant found in Thailand is the Indian elephant, a subspecies of the Asian elephant."
 3 | 
 4 | # from https://en.wikipedia.org/wiki/Polar_bear
 5 | polarbear-article = "The polar bear is a large bear native to the Arctic and nearby areas. It is closely related to the brown bear, and the two species can interbreed. The polar bear is the largest extant species of bear and land carnivore, with adult males weighing 300–800 kg. The polar bear is white- or yellowish-furred with black skin and a thick layer of fat."
 6 | 
 7 | # from https://en.wikipedia.org/wiki/Rhinoceros
 8 | rhino-article = "Rhinoceroses are some of the largest remaining megafauna: all weigh over half a tonne in adulthood. They have a herbivorous diet, small brains 400–600 g for mammals of their size, one or two horns, and a thick 1.5–5 cm, protective skin formed from layers of collagen positioned in a lattice structure. They generally eat leafy material."
 9 | 
10 | # from https://en.wikipedia.org/wiki/Blue_whale
11 | bluewhale-article = "The blue whale is a marine mammal and a baleen whale. Reaching a maximum confirmed length of 29.9 m and weighing up to 199 tons, it is the largest animal known ever to have existed. The blue whale's long and slender body can be of various shades of greyish-blue on its upper surface and somewhat lighter underneath."
12 | 
13 | # from https://en.wikipedia.org/wiki/Snow_leopard
14 | snowleopard-article = "The snow leopard is a species of large cat in the genus Panthera of the family Felidae. The species is native to the mountain ranges of Central and South Asia. It is listed as Vulnerable on the IUCN Red List because the global population is estimated to number fewer than 10,000 mature individuals and is expected to decline about 10% by 2040."
15 | 
16 | # from https://en.wikipedia.org/wiki/Manatee
17 | manatee-article = "Manatees are herbivores and eat over 60 different freshwater and saltwater plants. Manatees inhabit the shallow, marshy coastal areas and rivers of the Caribbean Sea, the Gulf of Mexico, the Amazon basin, and West Africa. The main causes of death for manatees are human-related issues, such as habitat destruction and human objects."
18 | 
19 | # from https://en.wikipedia.org/wiki/Chimpanzee
20 | chimpanzee-article = "The chimpanzee lives in groups that range in size from 15 to 150 members, although individuals travel and forage in much smaller groups during the day. The species lives in a strict male-dominated hierarchy, where disputes are generally settled without the need for violence. Nearly all chimpanzee populations have been recorded using tools, modifying sticks, rocks, grass and leaves and using them for hunting and acquiring honey, termites, ants, nuts and water."
21 | 
22 | # from https://en.wikipedia.org/wiki/American_badger
23 | badger-article = "The American badger is a North American badger similar in appearance to the European badger, although not closely related. It is found in the western, central, and northeastern United States, northern Mexico, and south-central Canada to certain areas of southwestern British Columbia. The American badger's habitat is typified by open grasslands with available prey (such as mice, squirrels, and groundhogs)."
24 | 
25 | # from https://en.wikipedia.org/wiki/Snail
26 | snail-article = "Snails can be found in a very wide range of environments, including ditches, deserts, and the abyssal depths of the sea. Although land snails may be more familiar to laymen, marine snails constitute the majority of snail species, and have much greater diversity and a greater biomass. Numerous kinds of snail can also be found in fresh water."
27 | 
28 | # from https://en.wikipedia.org/wiki/Hamster
29 | hamster-article = "Hamsters feed primarily on seeds, fruits, vegetation, and occasionally burrowing insects. In the wild, they are crepuscular: they forage during the twilight hours. In captivity, however, they are known to live a conventionally nocturnal lifestyle, waking around sundown to feed and exercise. Physically, they are stout-bodied with distinguishing features that include elongated cheek pouches extending to their shoulders, which they use to carry food back to their burrows, as well as a short tail and fur-covered feet."
30 | 
31 | # from https://en.wikipedia.org/wiki/Giraffe
32 | giraffe-article = "The giraffe's distinguishing characteristics are its extremely long neck and legs, horn-like ossicones, and spotted coat patterns. It is classified under the family Giraffidae, along with its closest extant relative, the okapi. Its scattered range extends from Chad in the north to South Africa in the south and from Niger in the west to Somalia in the east."
33 | 
34 | # from https://en.wikipedia.org/wiki/Hippopotamus
35 | hippo-article = "Hippos inhabit rivers, lakes, and mangrove swamps. Territorial bulls each preside over a stretch of water and a group of five to thirty cows and calves. Mating and birth both occur in the water. During the day, hippos remain cool by staying in water or mud, emerging at dusk to graze on grasses. While hippos rest near each other in the water, grazing is a solitary activity and hippos typically do not display territorial behaviour on land. Hippos are among the most dangerous animals in the world due to their aggressive and unpredictable nature. "
36 | 
37 | standard-named-articles = [list:
38 |   [list: "elephant", elephant-article],
39 |   [list: "polarbear", polarbear-article],
40 |   [list: "rhino", rhino-article],
41 |   [list: "bluewhale", bluewhale-article],
42 |   [list: "snowleopard", snowleopard-article],
43 |   [list: "manatee", manatee-article],
44 |   [list: "chimpanzee", chimpanzee-article],
45 |   [list: "badger", badger-article],
46 |   [list: "snail", snail-article],
47 |   [list: "hamster", hamster-article],
48 |   [list: "giraffe", giraffe-article],
49 |   [list: "hippo", hippo-article],
50 | ]
51 | 
52 | student-article = elephant-article
53 | 
54 | fun distance-to-helper(candidate-article :: String, corpus :: List<Any>, ignore-stop-words :: Boolean) -> Table block:
55 |   var candidate-words = string-to-list-of-natlang-words(candidate-article)
56 |   if ignore-stop-words:
57 |     candidate-words := remove-stop-words(candidate-words)
58 |   else: false
59 |   end
60 |   var tbl = table: article :: String, difference :: Number end
61 |   for each(named-article from corpus) block:
62 |     article-name = named-article.get(0)
63 |     var article-words = string-to-list-of-natlang-words(named-article.get(1))
64 |     if ignore-stop-words:
65 |       article-words := remove-stop-words(article-words)
66 |     else: false
67 |     end
68 |     new-row = tbl.row(article-name, angle-difference-lists(candidate-words, article-words))
69 |     tbl := tbl.add-row(new-row)
70 |   end
71 |   tbl
72 | end
73 | 
74 | fun distance-to(candidate-article):
75 |   distance-to-helper(candidate-article, standard-named-articles, false)
76 | end
77 | 
78 | fun distance-to-cleaned(candidate-article):
79 |   distance-to-helper(candidate-article, standard-named-articles, true)
80 | end
81 | 
82 | # try
83 | #
84 | # distance-to(student-article)
85 | # -- this doesn't ignore stop words
86 | 
87 | # distance-to-cleaned(student-article)
88 | # -- this does ignore stop words
89 | 
90 | 


--------------------------------------------------------------------------------
/tests.arr:
--------------------------------------------------------------------------------
  1 | # load cosine-similarity.arr and animals.arr before this file
  2 | 
  3 | check "dot-product":
  4 |   x-sd = list-of-words-to-sd(string-to-list-of-natlang-words("apple banana citrus"))
  5 |   y-sd = list-of-words-to-sd(string-to-list-of-natlang-words("apple banana banana citrus citrus citrus"))
  6 |   dot-product(x-sd, x-sd) is 3
  7 |   dot-product(y-sd, y-sd) is 14
  8 |   dot-product(x-sd, y-sd) is 6
  9 |   dot-product(y-sd, x-sd) is 6
 10 | end
 11 | 
 12 | # two sample Google IDs for spreadsheets containing texts
 13 | var sheet_id1 = "1CnAGrIMW7W1Qrxtm8ZmJXYcQvkoMbSmzL7Ixw6d4FYQ"
 14 | var sheet_id2 = "10ngDjr6ahZICKrSVb6zFnOKRDmqMNeqqJCVEDvxONWs"
 15 | 
 16 | check "simple equality":
 17 | 
 18 |   # comparing file to itself shd always yield true
 19 |   simple-equality-files(sheet_id1, sheet_id1) is true
 20 |   simple-equality-files(sheet_id2, sheet_id2) is true
 21 | 
 22 |   # comparing file to a different file shd always yield false
 23 |   simple-equality-files(sheet_id1, sheet_id2) is false
 24 |   simple-equality-files(sheet_id2, sheet_id1) is false
 25 | 
 26 |   # comparing a text to itself yields true
 27 |   simple-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "apple", "orange"]) is true
 28 | 
 29 |   # comparing a text to a permuted version of itself yields false
 30 |   simple-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "apple"]) is false
 31 | 
 32 |   # comparing obviously dissimilar texts yields false
 33 |   simple-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "orange", "orange"]) is false
 34 |   simple-equality-lists([list: "a", "a", "a", "b", "b", "d", "d", "d", "d", "d"], [list: "a"]) is false
 35 | 
 36 |   # comparing exactly similar texts yields true
 37 |   simple-equality-lists([list: "a", "b", "c", "d"], [list: "a", "b", "c", "d"]) is true
 38 | 
 39 |   # same as above, but using single strings instead of lists of words
 40 |   simple-equality("apple apple orange", "apple apple orange") is true
 41 |   simple-equality("apple apple orange", "apple orange apple") is false
 42 |   simple-equality("apple apple orange", "apple orange orange orange") is false
 43 |   simple-equality("a a a b b d d d d d", "a") is false
 44 |   simple-equality("a b c d", "a b c d") is true
 45 | end
 46 | 
 47 | check "bag equality":
 48 | 
 49 |   # comparing file to itself shd always yield true
 50 |   bag-equality-files(sheet_id1, sheet_id1) is true
 51 |   bag-equality-files(sheet_id2, sheet_id2) is true
 52 | 
 53 |   # comparing file to a different file shd always yield false
 54 |   bag-equality-files(sheet_id1, sheet_id2) is false
 55 |   bag-equality-files(sheet_id2, sheet_id1) is false
 56 | 
 57 |   # comparing a text to itself yields true
 58 |   bag-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "apple", "orange"]) is true
 59 | 
 60 |   # comparing a text to a permuted version of itself yields true
 61 |   bag-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "apple"]) is true
 62 | 
 63 |   # comparing obviously dissimilar texts yields false
 64 |   bag-equality-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "orange", "orange"]) is false
 65 |   bag-equality-lists([list: "a", "a", "a", "b", "b", "d", "d", "d", "d", "d"], [list: "a"]) is false
 66 | 
 67 |   # comparing exactly similar texts yields true
 68 |   bag-equality-lists([list: "a", "b", "c", "d"], [list: "a", "b", "c", "d"]) is true
 69 | 
 70 |   # same as above, but using single strings instead of lists of words
 71 |   bag-equality("apple apple orange", "apple apple orange") is true
 72 |   bag-equality("apple apple orange", "apple orange apple") is true
 73 |   bag-equality("apple apple orange", "apple orange orange orange") is false
 74 |   bag-equality("a a a b b d d d d d", "a") is false
 75 |   bag-equality("a b c d", "a b c d") is true
 76 | end
 77 | 
 78 | check "cosine similarity":
 79 | 
 80 |   # comparing file to itself shd always yield 1
 81 |   cosine-similarity-files(sheet_id1, sheet_id1) is-roughly 1
 82 |   cosine-similarity-files(sheet_id2, sheet_id2) is-roughly 1
 83 | 
 84 |   # comparing file to a different file shd always yield < 1
 85 |   cosine-similarity-files(sheet_id1, sheet_id2) satisfies lam(x): x < 1 end
 86 |   cosine-similarity-files(sheet_id2, sheet_id1) satisfies lam(x): x < 1 end
 87 | 
 88 |   # comparing a text to itself yields 1
 89 |   cosine-similarity-lists([list: "apple", "apple", "orange"], [list: "apple", "apple", "orange"]) is-roughly 1
 90 | 
 91 |   # comparing a text to a permuted version of itself also yields 1
 92 |   cosine-similarity-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "apple"]) is-roughly 1
 93 | 
 94 |   # comparing obviously dissimilar texts yields <1
 95 |   cosine-similarity-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "orange", "orange"]) is-roughly (1 / sqrt(2))
 96 | 
 97 |   cosine-similarity-lists([list: "a", "a", "a", "b", "b", "d", "d", "d", "d", "d"], [list: "a"]) is%(within-rel(0.01)) ~0.49
 98 | 
 99 |   # comparing exactly similar texts yields 1
100 |   cosine-similarity-lists([list: "doo", "doo", "be", "doo", "be"], [list: "doo", "be", "doo", "be", "doo"]) is-roughly 1
101 | 
102 |   # same as above, but with single strings rather than lists of words
103 |   cosine-similarity("apple apple orange", "apple apple orange") is-roughly 1
104 |   cosine-similarity("apple apple orange", "apple orange apple") is-roughly 1
105 |   cosine-similarity("apple apple orange", "apple orange orange orange") is-roughly (1 / sqrt(2))
106 |   cosine-similarity("a a a b b d d d d d", "a") is%(within-rel(0.01)) ~0.49
107 |   cosine-similarity("doo doo be doo be", "doo be doo be doo") is-roughly 1
108 | end
109 | 
110 | check "angle difference":
111 | 
112 |   # comparing file to itself shd always yield 0
113 |   angle-difference-files(sheet_id1, sheet_id1) is-roughly 0
114 |   angle-difference-files(sheet_id2, sheet_id2) is-roughly 0
115 | 
116 |   # comparing file to a different file shd always yield >0
117 |   angle-difference-files(sheet_id1, sheet_id2) satisfies lam(x): x > 0 end
118 |   angle-difference-files(sheet_id2, sheet_id1) satisfies lam(x): x > 0 end
119 | 
120 |   # comparing a text to itself yields 0
121 |   angle-difference-lists([list: "apple", "apple", "orange"], [list: "apple", "apple", "orange"]) is-roughly 0
122 | 
123 |   # comparing a text to a permuted version of itself also yields 0
124 |   angle-difference-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "apple"]) is-roughly 0
125 | 
126 |   # comparing obviously dissimilar texts yields >0
127 |   angle-difference-lists([list: "apple", "apple", "orange"], [list: "apple", "orange", "orange", "orange"]) is-roughly ~45
128 |   angle-difference-lists([list: "a", "a", "a", "b", "b", "d", "d", "d", "d", "d"], [list: "a"]) is%(within-rel(0.01)) ~60.878
129 | 
130 |   # comparing exactly similar texts yields 0
131 |   angle-difference-lists([list: "doo", "doo", "be", "doo", "be"], [list: "doo", "be", "doo", "be", "doo"]) is%(within-rel(0.01)) 0
132 | 
133 |   # same as above, but using single strings instead of lists of words
134 |   angle-difference("apple apple orange", "apple apple orange") is-roughly 0
135 |   angle-difference("apple apple orange", "apple orange apple") is-roughly 0
136 |   angle-difference("apple apple orange", "apple orange orange orange") is-roughly ~45
137 |   angle-difference("a a a b b d d d d d", "a") is%(within-rel(0.01)) ~60.878
138 |   angle-difference("doo doo be doo be", "doo be doo be doo") is-roughly 0
139 | 
140 |   # same as above, but with stop words removed before comparing
141 |   angle-difference-cleaned("apple apple orange", "apple apple orange") is-roughly 0
142 |   angle-difference-cleaned("apple apple orange", "apple orange apple") is-roughly 0
143 |   angle-difference-cleaned("apple apple orange", "apple orange orange orange") is-roughly ~45
144 |   # angle-difference-cleaned("a a a b b d d d d d", "a") is%(within-rel(0.01)) ~90
145 |   angle-difference-cleaned("a a a b b d d d d d", "a") raises "cosine-similarity"
146 |   angle-difference-cleaned("doo doo be doo be", "doo be doo be doo") is-roughly 0
147 | end
148 | 
149 | check "string-to-bag":
150 |   # the returned bag has columns "word" and "frequency"
151 |   S.list-to-list-set(string-to-bag("doo be doo be doo").get-column("word")) is [S.list-set: "be", "doo"]
152 |   S.list-to-list-set(string-to-bag("doo be doo be doo").get-column("frequency")) is [S.list-set: 2, 3]
153 |   S.list-to-list-set(string-to-bag-cleaned("the whale").get-column("word")) is [S.list-set: "whale"]
154 |   S.list-to-list-set(string-to-bag-cleaned("the whale").get-column("frequency")) is [S.list-set: 1]
155 | end
156 | 
157 | fun distance-table-get-article-difference(tbl :: Table, art :: String) block:
158 |   # this is used only for testing.
159 |   # takes the table resulting from a distance-to call, and an article name `art`,
160 |   # and returns the difference (an angle in degrees) associated with `art`
161 |   table-rows = tbl.all-rows()
162 |   var answer-found = false
163 |   var simty = 0
164 |   for each(table-row from table-rows) block:
165 |     if not(answer-found):
166 |       if table-row.get-value('article') == art block:
167 |         simty := table-row.get-value('difference')
168 |         answer-found := true
169 |       else: false
170 |       end
171 |     else: false
172 |     end
173 |   end
174 |   simty
175 | end
176 | 
177 | check "distance-to":
178 |   tbl1 = distance-to(elephant-article) # stopwords present
179 |   tbl2 = distance-to-cleaned(elephant-article) # stopwords ignored
180 |   #
181 |   # following checks that the distance between an article and itself is 0
182 |   # whether or not stopwords are removed
183 |   distance-table-get-article-difference(tbl1, 'elephant') is 0
184 |   distance-table-get-article-difference(tbl2, 'elephant') is 0
185 | end
186 | 


--------------------------------------------------------------------------------
/cosine-similarity.arr:
--------------------------------------------------------------------------------
  1 | provide *
  2 | 
  3 | import string-dict as SD
  4 | 
  5 | import gdrive-sheets as GDS
  6 | 
  7 | import data-source as DS
  8 | 
  9 | import tables as T
 10 | 
 11 | import sets as S
 12 | 
 13 | fun list-of-words-to-sd(xx :: List<String>) -> SD.StringDict<Number> block:
 14 |   msd = [SD.mutable-string-dict:]
 15 |   for each(x from xx):
 16 |     old-value = cases(Option) (msd.get-now(x)):
 17 |         | none => 0
 18 |         | some(v) => v
 19 |         end
 20 |     msd.set-now(x, old-value + 1)
 21 |   end
 22 |   msd.freeze()
 23 | end
 24 | 
 25 | lower-case-a-cp = string-to-code-point('a')
 26 | lower-case-z-cp = string-to-code-point('z')
 27 | 
 28 | fun is-non-punct(c :: String) -> Boolean:
 29 |   if (c == ' ') or (c == '\n'): true
 30 |   else:
 31 |     c-cp = string-to-code-point(c)
 32 |     (c-cp >= lower-case-a-cp) and (c-cp <= lower-case-z-cp)
 33 |   end
 34 | end
 35 | 
 36 | fun is-non-empty-string(s :: String) -> Boolean:
 37 |   s <> ''
 38 | end
 39 | 
 40 | fun massage-string(w :: String) -> String:
 41 |   fold(lam(string-a, string-b): string-a + string-b end, '', string-explode(string-to-lower(w)).filter(is-non-punct))
 42 | end
 43 | 
 44 | fun string-to-list-of-natlang-words(s :: String) -> List<String>:
 45 |   string-split-all(string-replace(massage-string(s), '\n', ' '), ' ').filter(is-non-empty-string)
 46 | end
 47 | 
 48 | # stop words from https://dl.acm.org/doi/pdf/10.1145/378881.378888, Appendix A, Christopher Fox
 49 | 
 50 | standard-stop-words = [list: "the", "and", "a", "that", "was", "for", "with", "not", "on", "at", "i", "had", "are", "or", "an", "they", "one", "would", "all", "there", "their", "him", "has", "when", "if", "out", "what", "up", "about", "into", "can", "other", "some", "time", "two", "then", "do", "now", "such", "man", "our", "even", "made", "after", "many", "must", "years", "much", "your", "down", "should", "of", "to", "in", "is", "he", "it", "as", "his", "be", "by", "this", "but", "from", "have", "you", "which", "were", "her", "she", "will", "we", "been", "who", "more", "no", "so", "said", "its", "than", "them", "only", "new", "could", "these", "may", "first", "any", "my", "like", "over", "me", "most", "also", "did", "before", "through", "where", "back", "way", "well", "because", "each", "people", "state", "mr", "how", "make", "still", "own", "work", "long", "both", "under", "never", "same", "while", "last", "might", "day", "since", "come", "great", "three", "go", "few", "use", "without", "place", "old", "small", "home", "went", "once", "school", "every", "united", "number", "does", "away", "water", "fact", "though", "enough", "almost", "took", "night", "system", "general", "better", "why", "end", "find", "asked", "going", "knew", "toward", "just", "those", "too", "world", "very", "good", "see", "men", "here", "get", "between", "year", "another", "being", "life", "know", "us", "off", "against", "came", "right", "states", "take", "himself", "during", "again", "around", "however", "mrs", "thought", "part", "high", "upon", "say", "used", "war", "until", "always", "something", "public", "put", "think", "head", "far", "hand", "set", "nothing", "point", "house", "later", "eyes", "next", "program", "give", "white", "room", "social", "young", "present", "order", "second", "possible", "light", "face", "important", "among", "early", "need", "within", "business", "felt", "best", "ever", "least", "got", "mind", "want", "others", "although", "open", "area", "done", 
"certain", "door", "different", "sense", "help", "perhaps", "group", "side", "several", "let", "national", "given", "rather", "per", "often", "god", "things", "large", "big", "become", "case", "along", "four", "power", "saw", "less", "thing", "today", "interest", "turned", "members", "family", "problem", "kind", "began", "thus", "seemed", "whole", "itself"]
 51 | 
 52 | 
 53 | fun remove-stop-words(list-of-words :: List<String>):
 54 |   list-of-words.filter(lam(w): not(standard-stop-words.member(w)) end)
 55 | end
 56 | 
 57 | fun string-to-bag-helper(str :: String, ignore-stop-words :: Boolean) -> Table block:
 58 |   var candidate-words = string-to-list-of-natlang-words(str)
 59 |   if ignore-stop-words:
 60 |     candidate-words := remove-stop-words(candidate-words)
 61 |   else: false
 62 |   end
 63 |   sd = list-of-words-to-sd(candidate-words)
 64 |   var tbl = table: word :: String, frequency :: Number end
 65 |   words = sd.keys().to-list()
 66 |   for each(word from words):
 67 |     new-row = tbl.row(word, sd.get-value(word))
 68 |     tbl := tbl.add-row(new-row)
 69 |   end
 70 |   tbl
 71 | end
 72 | 
 73 | fun string-to-bag(str :: String) -> Table:
 74 |   string-to-bag-helper(str, false)
 75 | end
 76 | 
 77 | fun string-to-bag-cleaned(str :: String) -> Table:
 78 |   string-to-bag-helper(str, true)
 79 | end
 80 | 
 81 | fun dot-product(sd1 :: SD.StringDict<Number>, sd2 :: SD.StringDict<Number>) -> Number block:
 82 |   var n = 0
 83 |   sd1-key-list = sd1.keys-list()
 84 |   for each(key from sd1-key-list) block:
 85 |     if sd2.has-key(key):
 86 |       n := n + (sd1.get-value(key) * sd2.get-value(key))
 87 |     else: false
 88 |     end
 89 |   end
 90 |   n
 91 | end
 92 | 
 93 | fun get-spreadsheet-string(ss :: Any) -> String:
 94 |   ws = GDS.open-sheet-by-index(ss, 0, false)
 95 |   tbl = load-table: text :: String
 96 |     source: ws
 97 |     sanitize text using DS.string-sanitizer
 98 |   end
 99 |   entire-col = extract text from tbl end
100 |   entire-col.get(0)
101 | end
102 | 
103 | fun get-spreadsheet-words(ss :: Any) -> List<String>:
104 |   cell-string = get-spreadsheet-string(ss)
105 |   string-to-list-of-natlang-words(cell-string)
106 | end
107 | 
108 | #  *-similarity-lists functions: These compare lists of strings
109 | 
110 | fun simple-equality-lists(words1 :: List<String>, words2 :: List<String>) -> Boolean:
111 |   words1 == words2
112 | end
113 | 
114 | fun bag-equality-lists(words1 :: List<String>, words2 :: List<String>) -> Boolean:
115 |   sd1 = list-of-words-to-sd(words1)
116 |   sd2 = list-of-words-to-sd(words2)
117 |   sd1 == sd2
118 | end
119 | 
120 | fun cosine-similarity-lists(words1 :: List<String>, words2 :: List<String>) -> Number:
121 |   sd1 = list-of-words-to-sd(words1)
122 |   sd2 = list-of-words-to-sd(words2)
123 |   # we are NOT using
124 |   # cosine similarity as defined in standard Pyret assignment docdiff, which is
125 |   # dot-product(sd1, sd2) / num-max(dot-product(sd1, sd1), dot-product(sd2, sd2))
126 | 
127 |   # the usual cosine similarity, as described in
128 |   # https://en.wikipedia.org/wiki/Cosine_similarity
129 |   if sd1 == sd2: 1
130 |   else if (sd1.count() == 0) or (sd2.count() == 0):
131 |     raise('cosine-similarity is undefined when one arg is empty and the other isn\'t; given ' + to-string(words1) + ' and ' + to-string(words2))
132 |   else:
133 |     dot-product(sd1, sd2) / (sqrt(dot-product(sd1, sd1)) * sqrt(dot-product(sd2, sd2)))
134 |   end
135 | end
136 | 
137 | fun angle-difference-lists(words1 :: List<String>, words2 :: List<String>) -> Number:
138 |   cos-sim = cosine-similarity-lists(words1, words2)
139 |   (num-acos(cos-sim) * 180) / 3.14159265
140 | end
141 | 
142 | # *-similarity functions: These compare string inputs directly
143 | 
144 | fun simple-equality(string1 :: String, string2 :: String) -> Boolean:
145 |   # either use straight string comparison, or
146 |   # massage the argument strings (converting to lower case, removing punctuation) before comparing
147 |   #
148 |   # string1 == string2
149 |   words1 = string-to-list-of-natlang-words(string1)
150 |   words2 = string-to-list-of-natlang-words(string2)
151 |   simple-equality-lists(words1, words2)
152 | end
153 | 
154 | fun bag-equality(string1 :: String, string2 :: String) -> Boolean:
155 |   words1 = string-to-list-of-natlang-words(string1)
156 |   words2 = string-to-list-of-natlang-words(string2)
157 |   bag-equality-lists(words1, words2)
158 | end
159 | 
160 | fun cosine-similarity(string1 :: String, string2 :: String) -> Number:
161 |   words1 = string-to-list-of-natlang-words(string1)
162 |   words2 = string-to-list-of-natlang-words(string2)
163 |   cosine-similarity-lists(words1, words2)
164 | end
165 | 
166 | fun angle-difference(string1 :: String, string2 :: String) -> Number:
167 |   cos-sim = cosine-similarity(string1, string2)
168 |   (num-acos(cos-sim) * 180) / 3.14159265
169 | end
170 | 
171 | fun simple-equality-cleaned(string1 :: String, string2 :: String) -> Boolean:
172 |   words1 = remove-stop-words(string-to-list-of-natlang-words(string1))
173 |   words2 = remove-stop-words(string-to-list-of-natlang-words(string2))
174 |   simple-equality-lists(words1, words2)
175 | end
176 | 
177 | fun bag-equality-cleaned(string1 :: String, string2 :: String) -> Boolean:
178 |   words1 = remove-stop-words(string-to-list-of-natlang-words(string1))
179 |   words2 = remove-stop-words(string-to-list-of-natlang-words(string2))
180 |   bag-equality-lists(words1, words2)
181 | end
182 | 
183 | fun cosine-similarity-cleaned(string1 :: String, string2 :: String) -> Number:
184 |   words1 = remove-stop-words(string-to-list-of-natlang-words(string1))
185 |   words2 = remove-stop-words(string-to-list-of-natlang-words(string2))
186 |   cosine-similarity-lists(words1, words2)
187 | end
188 | 
189 | fun angle-difference-cleaned(string1 :: String, string2 :: String) -> Number:
190 |   words1 = remove-stop-words(string-to-list-of-natlang-words(string1))
191 |   words2 = remove-stop-words(string-to-list-of-natlang-words(string2))
192 |   angle-difference-lists(words1, words2)
193 | end
194 | 
195 | # *-similarity-files: These compare files (Google IDs) containing the respective contents.
196 | # format: headerless spreadsheet with just one cell containing a string
197 | 
198 | fun simple-equality-files(file1 :: String, file2 :: String) -> Boolean:
199 |   ss1 = GDS.load-spreadsheet(file1)
200 |   ss2 = GDS.load-spreadsheet(file2)
201 |   string1 = get-spreadsheet-string(ss1)
202 |   string2 = get-spreadsheet-string(ss2)
203 |   simple-equality(string1, string2)
204 | end
205 | 
206 | fun bag-equality-files(file1 :: String, file2 :: String) -> Boolean:
207 |   ss1 = GDS.load-spreadsheet(file1)
208 |   ss2 = GDS.load-spreadsheet(file2)
209 |   words1 = get-spreadsheet-words(ss1)
210 |   words2 = get-spreadsheet-words(ss2)
211 |   bag-equality-lists(words1, words2)
212 | end
213 | 
214 | fun cosine-similarity-files(file1 :: String, file2 :: String) -> Number:
215 |   ss1 = GDS.load-spreadsheet(file1)
216 |   ss2 = GDS.load-spreadsheet(file2)
217 |   words1 = get-spreadsheet-words(ss1)
218 |   words2 = get-spreadsheet-words(ss2)
219 |   cosine-similarity-lists(words1, words2)
220 | end
221 | 
222 | fun angle-difference-files(file1 :: String, file2 :: String) -> Number:
223 |   cos-sim = cosine-similarity-files(file1, file2)
224 |   (num-acos(cos-sim) * 180) / 3.14159265
225 | end
226 | 


--------------------------------------------------------------------------------