├── README.md
├── java
    ├── Readme.md
    └── TextSimilaritySimpleWrapper
├── minimal-stop-word-list.txt
├── terrier-stop-word-list.txt
└── text-similarity-rest-reference.md


/README.md:
--------------------------------------------------------------------------------
 1 | 
 2 | # Text Mining and NLP APIs
 3 | 
 4 | RxNLP's Text Mining and NLP APIs allow for quick analysis of highly messy unstructured text. You can cluster sentences, extract topics, summarize opinions, generate lexical similarities between two pieces of text, and more. Together, these let you build powerful data-driven applications, such as a Tweet analysis app that groups tweets into logical topics (using Sentence Clustering), detects similar tweets (using Text Similarity) and summarizes the sentiments in tweets for users (using Opinosis Summarization).
 5 | 
 6 | [PyRxNLP](https://github.com/RxNLP/pyrxnlp) is our Python SDK for these APIs. 
 7 | 
 8 | ## List of APIs
 9 | - [HTML2Text API](http://www.rxnlp.com/api-reference/html2text-api/)
10 | - [Topics Extraction API](http://www.rxnlp.com/api-reference/topics-and-themes-api-reference/)
11 | - [Text Similarity API](http://www.rxnlp.com/api-reference/text-similarity-api-reference/)
12 | - [Sentence Clustering API](http://www.rxnlp.com/sentence-clustering-api/)
13 | - [N-Gram and Word Counting API](http://www.rxnlp.com/api-reference/n-gram-and-word-counter-api-reference/)
14 | - [Opinosis Opinion Summarization](https://market.mashape.com/RxNLP/text-mining-and-nlp#opinosis-summaries) [[see paper](https://dl.acm.org/citation.cfm?id=1873820)]
15 | 
16 | 
17 | ## Quick Start
18 | - Register for a <b>[RapidAPI account](https://rapidapi.com/)</b>
19 | - Subscribe to an [API plan](https://rapidapi.com/RxNLP/api/text-mining-and-nlp).
20 | - Use the Basic plan if you only need a limited number of daily calls
21 | - [Get your Key](https://rapidapi.com/RxNLP/api/text-mining-and-nlp)
22 | - Build your API client or start with the [PyRxNLP](https://github.com/RxNLP/pyrxnlp) SDK
23 | 
24 | ## Get in touch
25 | 
26 | For questions, use our [contact form](http://www.rxnlp.com/contact/) or send a message to contact@rxnlp.com.  
27 | 
28 | 
29 | 


--------------------------------------------------------------------------------
/java/Readme.md:
--------------------------------------------------------------------------------
 1 | 
 2 | ## TextSimilaritySimpleWrapper.java
 3 | 
 4 | This is a simple Java wrapper for [RxNLP's Text Similarity API](http://www.rxnlp.com/api-reference/text-similarity-api-reference/), which computes the similarity between two pieces of text of arbitrary length. The API computes the Dice, Jaccard and cosine similarity between the texts and also returns the average of the three scores. Note that the API can also clean the text prior to computing the similarity scores.
 5 | 
 6 | 
 7 | 
 8 | ## Explanation of Code
 9 | 
10 | 
11 | ### Making connection to the API
12 | 
13 | Note that using a plain vanilla HTTP request (`HttpURLConnection`) is much faster than using the Unirest library. The `X-Mashape-Key` header carries the API key needed to access the API endpoint. You will need to register with Mashape and subscribe to the [Text Mining and NLP API](https://market.mashape.com/rxnlp/text-mining-and-nlp) in order to obtain the API key.
14 | 
15 | ```java
16 | 
17 | 		//endpoint
18 | 		URL targetUrl = new URL("https://rxnlp-core.p.mashape.com/computeSimilarity");
19 | 	
20 | 		//First set the headers
21 | 		HttpURLConnection httpConnection = (HttpURLConnection) targetUrl.openConnection();
22 | 		httpConnection.setDoOutput(true);
23 | 		httpConnection.setRequestMethod("POST");
24 | 		httpConnection.setRequestProperty("Content-Type", "application/json");	
25 | 		httpConnection.setRequestProperty("X-Mashape-Key", "<GET_YOUR_MASHAPE_KEY>");
26 | ```
27 | 
28 | 
29 | ### Send JSON request
30 | 
31 | Here we create a JSON request from the two strings to compare and then send it to the server. Note that the strings can be quite long, as long as the request does not exceed 1MB. For details on the parameters, [refer to this documentation](http://www.rxnlp.com/api-reference/text-similarity-api-reference/#request).
32 | 
33 | ```java
34 | 		String str1="This is the first string.  It can be quite long.";
35 | 		String str2="This is the second string.  It can be quite long.";
36 | 		
37 | 		//Then set input
38 | 		String input = "{\"text1\":\""+str1
39 | 			 +"\",\"text2\":\""+str2
40 | 			 +"\",\"clean\":\"true\"}"; 
41 | 
42 | 		//Next, send the request body, encoded as UTF-8
43 | 		OutputStream outputStream = httpConnection.getOutputStream();
44 | 		outputStream.write(input.getBytes("UTF-8"));
45 | 		outputStream.flush();
46 | ```
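
The string concatenation above yields invalid JSON as soon as either text contains a double quote, a backslash or a line break. A minimal sketch of escaping the values first, using only the JDK; the `JsonRequestSketch` class and its `escapeJson` and `buildRequest` helpers are hypothetical names for illustration and are not part of the wrapper or the API:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of the RxNLP wrapper): builds the request body safely.
final class JsonRequestSketch {

    // Escape the characters that JSON does not allow unescaped inside a string.
    static String escapeJson(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become a six-character unicode escape.
                        sb.append('\\').append('u').append(String.format("%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Assemble the three documented parameters (text1, text2, clean) into one JSON object.
    static String buildRequest(String text1, String text2, boolean clean) {
        return "{\"text1\":\"" + escapeJson(text1)
             + "\",\"text2\":\"" + escapeJson(text2)
             + "\",\"clean\":\"" + clean + "\"}";
    }

    public static void main(String[] args) {
        String body = buildRequest("He said \"hi\"", "line one\nline two", true);
        System.out.println(body);
        // Encode explicitly as UTF-8 before writing to the connection's output stream.
        byte[] payload = body.getBytes(StandardCharsets.UTF_8);
        System.out.println(payload.length + " bytes");
    }
}
```

The result of `buildRequest` can be written to the connection in place of the hand-built `input` string. A JSON library would do the same job with less code if adding a dependency is acceptable.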
47 | 
48 | 
49 | 
50 | ### Read JSON response from server
51 | 
52 | Here we read the JSON response and print the raw JSON line by line. You can parse the raw JSON to obtain the cosine, jaccard and dice similarities.
53 | 
54 | ```java
55 | 			BufferedReader responseBuffer = new BufferedReader(new InputStreamReader((httpConnection.getInputStream())));
56 | 			
57 | 			//Printing output from server (you can use a json parser here instead)
58 | 			String output;
59 | 			System.out.println("Output from Server:\n");
60 | 			while ((output = responseBuffer.readLine()) != null) {
61 | 				System.out.println(output);
62 | 			}
63 | 			
64 | ```			
65 | 
66 | 
67 | ### Example JSON Request
68 | 
69 | ```json
70 |  {
71 |  "text1": "iphone 4s black new", 
72 |  "text2": "iphone 4s black old",
73 |  "clean":"true"
74 |  }
75 | ``` 
76 | 
77 | 
78 | ### Example JSON Response
79 | ```json
80 | {
81 |  "cosine": "0.750",
82 |  "jaccard": "0.600",
83 |  "dice": "0.750",
84 |  "average":"0.700"
85 | }
86 | ```
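
To turn a response like the one above into numbers, a small parser is enough. The sketch below assumes the response is a flat JSON object whose scores arrive as quoted decimals, as in the example; `SimilarityResponseSketch` and `parseScores` are hypothetical names, and a real application would more likely use a JSON library:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the scores out of the flat JSON shown in the example response.
final class SimilarityResponseSketch {

    // Matches pairs such as "cosine": "0.750"; assumes quoted decimal values.
    private static final Pattern SCORE =
            Pattern.compile("\"(\\w+)\"\\s*:\\s*\"([0-9.]+)\"");

    static Map<String, Double> parseScores(String json) {
        Map<String, Double> scores = new LinkedHashMap<>();
        Matcher m = SCORE.matcher(json);
        while (m.find()) {
            scores.put(m.group(1), Double.parseDouble(m.group(2)));
        }
        return scores;
    }

    public static void main(String[] args) {
        String json = "{ \"cosine\": \"0.750\", \"jaccard\": \"0.600\","
                    + " \"dice\": \"0.750\", \"average\":\"0.700\" }";
        Map<String, Double> scores = parseScores(json);
        System.out.println("jaccard = " + scores.get("jaccard"));
        System.out.println("average = " + scores.get("average"));
    }
}
```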
87 | 


--------------------------------------------------------------------------------
/java/TextSimilaritySimpleWrapper:
--------------------------------------------------------------------------------
 1 | package com.rxnlp.library;
 2 | 
 3 | import java.io.BufferedReader;
 4 | import java.io.InputStreamReader;
 5 | import java.io.OutputStream;
 6 | import java.net.HttpURLConnection;
 7 | import java.net.URL;
 8 | 
 9 | public class TextSimilaritySimpleWrapper {
10 | 	
11 | 	public static void main(String args[]){
12 | 		
13 | 		//This is the target URL
14 | 		URL targetUrl;
15 | 		try {
16 | 			targetUrl = new URL("https://rxnlp-core.p.mashape.com/computeSimilarity");
17 | 			
18 | 			//First set the headers
19 | 			HttpURLConnection httpConnection = (HttpURLConnection) targetUrl.openConnection();
20 | 			httpConnection.setDoOutput(true);
21 | 			httpConnection.setRequestMethod("POST");
22 | 			httpConnection.setRequestProperty("Content-Type", "application/json");
23 | 			httpConnection.setRequestProperty("X-Mashape-Key", "<GET_YOUR_MASHAPE_KEY>");
24 | 			
25 | 			String str1="كتب عربية قصيرة : قصص قصيرة، مختصرات ، الخ";
26 | 			String str2="عربية قصيرة : قصص قصيرة، مختصرات ، الخ";
27 | 
28 | 			//Then set input
29 | 			String input = "{\"text1\":\""+str1
30 | 							 +"\",\"text2\":\""+str2
31 | 							 +"\",\"clean\":\"true\"}"; 
32 | 
33 | 			//Next, send the request body, encoded as UTF-8
34 | 			OutputStream outputStream = httpConnection.getOutputStream();
35 | 			outputStream.write(input.getBytes("UTF-8"));
36 | 			outputStream.flush();
37 | 
38 | 			//Throw exception on error
39 | 			if (httpConnection.getResponseCode() != 200) {
40 | 				throw new RuntimeException("Failed : HTTP error code : "
41 | 						+ httpConnection.getResponseCode());
42 | 			}
43 | 
44 | 			BufferedReader responseBuffer = new BufferedReader(new InputStreamReader((httpConnection.getInputStream())));
45 | 
46 | 			//Printing output from server (you can use a json parser here instead)
47 | 			String output;
48 | 			System.out.println("Output from Server:\n");
49 | 			while ((output = responseBuffer.readLine()) != null) {
50 | 				System.out.println(output);
51 | 			}
52 | 
53 | 			//disconnect from server
54 | 			httpConnection.disconnect();
55 | 		} catch (Exception e) {
56 | 			// Report any connection, I/O or HTTP error
57 | 			e.printStackTrace();
58 | 		}
59 | 
60 | 		
61 | 	}
62 | 
63 | }
64 | 


--------------------------------------------------------------------------------
/minimal-stop-word-list.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RxNLP/nlp-cloud-apis/4871cef31996f56cc2f8e03bd955f56b1a1361dc/minimal-stop-word-list.txt


--------------------------------------------------------------------------------
/terrier-stop-word-list.txt:
--------------------------------------------------------------------------------
  1 | x
  2 | y
  3 | your
  4 | yours
  5 | yourself
  6 | yourselves
  7 | you
  8 | yond
  9 | yonder
 10 | yon
 11 | ye
 12 | yet
 13 | z
 14 | zillion
 15 | j
 16 | u
 17 | umpteen
 18 | usually
 19 | us
 20 | username
 21 | uponed
 22 | upons
 23 | uponing
 24 | upon
 25 | ups
 26 | upping
 27 | upped
 28 | up
 29 | unto
 30 | until
 31 | unless
 32 | unlike
 33 | unliker
 34 | unlikest
 35 | under
 36 | underneath
 37 | use
 38 | used
 39 | usedest
 40 | r
 41 | rath
 42 | rather
 43 | rathest
 44 | rathe
 45 | re
 46 | relate
 47 | related
 48 | relatively
 49 | regarding
 50 | really
 51 | res
 52 | respecting
 53 | respectively
 54 | q
 55 | quite
 56 | que
 57 | qua
 58 | n
 59 | neither
 60 | neaths
 61 | neath
 62 | nethe
 63 | nethermost
 64 | necessary
 65 | necessariest
 66 | necessarier
 67 | never
 68 | nevertheless
 69 | nigh
 70 | nighest
 71 | nigher
 72 | nine
 73 | noone
 74 | nobody
 75 | nobodies
 76 | nowhere
 77 | nowheres
 78 | no
 79 | noes
 80 | nor
 81 | nos
 82 | no-one
 83 | none
 84 | not
 85 | notwithstanding
 86 | nothings
 87 | nothing
 88 | nathless
 89 | natheless
 90 | t
 91 | ten
 92 | tills
 93 | till
 94 | tilled
 95 | tilling
 96 | to
 97 | towards
 98 | toward
 99 | towardest
100 | towarder
101 | together
102 | too
103 | thy
104 | thyself
105 | thus
106 | than
107 | that
108 | those
109 | thou
110 | though
111 | thous
112 | thouses
113 | thoroughest
114 | thorougher
115 | thorough
116 | thoroughly
117 | thru
118 | thruer
119 | thruest
120 | thro
121 | through
122 | throughout
123 | throughest
124 | througher
125 | thine
126 | this
127 | thises
128 | they
129 | thee
130 | the
131 | then
132 | thence
133 | thenest
134 | thener
135 | them
136 | themselves
137 | these
138 | therer
139 | there
140 | thereby
141 | therest
142 | thereafter
143 | therein
144 | thereupon
145 | therefore
146 | their
147 | theirs
148 | thing
149 | things
150 | three
151 | two
152 | o
153 | oh
154 | owt
155 | owning
156 | owned
157 | own
158 | owns
159 | others
160 | other
161 | otherwise
162 | otherwisest
163 | otherwiser
164 | of
165 | often
166 | oftener
167 | oftenest
168 | off
169 | offs
170 | offest
171 | one
172 | ought
173 | oughts
174 | our
175 | ours
176 | ourselves
177 | ourself
178 | out
179 | outest
180 | outed
181 | outwith
182 | outs
183 | outside
184 | over
185 | overallest
186 | overaller
187 | overalls
188 | overall
189 | overs
190 | or
191 | orer
192 | orest
193 | on
194 | oneself
195 | onest
196 | ons
197 | onto
198 | a
199 | atween
200 | at
201 | athwart
202 | atop
203 | afore
204 | afterward
205 | afterwards
206 | after
207 | afterest
208 | afterer
209 | ain
210 | an
211 | any
212 | anything
213 | anybody
214 | anyone
215 | anyhow
216 | anywhere
217 | anent
218 | anear
219 | and
220 | andor
221 | another
222 | around
223 | ares
224 | are
225 | aest
226 | aer
227 | against
228 | again
229 | accordingly
230 | abaft
231 | abafter
232 | abaftest
233 | abovest
234 | above
235 | abover
236 | abouter
237 | aboutest
238 | about
239 | aid
240 | amidst
241 | amid
242 | among
243 | amongst
244 | apartest
245 | aparter
246 | apart
247 | appeared
248 | appears
249 | appear
250 | appearing
251 | appropriating
252 | appropriate
253 | appropriatest
254 | appropriates
255 | appropriater
256 | appropriated
257 | already
258 | always
259 | also
260 | along
261 | alongside
262 | although
263 | almost
264 | all
265 | allest
266 | aller
267 | allyou
268 | alls
269 | albeit
270 | awfully
271 | as
272 | aside
273 | asides
274 | aslant
275 | ases
276 | astrider
277 | astride
278 | astridest
279 | astraddlest
280 | astraddler
281 | astraddle
282 | availablest
283 | availabler
284 | available
285 | aughts
286 | aught
287 | vs
288 | v
289 | variousest
290 | variouser
291 | various
292 | via
293 | vis-a-vis
294 | vis-a-viser
295 | vis-a-visest
296 | viz
297 | very
298 | veriest
299 | verier
300 | versus
301 | k
302 | g
303 | go
304 | gone
305 | good
306 | got
307 | gotta
308 | gotten
309 | get
310 | gets
311 | getting
312 | b
313 | by
314 | byandby
315 | by-and-by
316 | bist
317 | both
318 | but
319 | buts
320 | be
321 | beyond
322 | because
323 | became
324 | becomes
325 | become
326 | becoming
327 | becomings
328 | becominger
329 | becomingest
330 | behind
331 | behinds
332 | before
333 | beforehand
334 | beforehandest
335 | beforehander
336 | bettered
337 | betters
338 | better
339 | bettering
340 | betwixt
341 | between
342 | beneath
343 | been
344 | below
345 | besides
346 | beside
347 | m
348 | my
349 | myself
350 | mucher
351 | muchest
352 | much
353 | must
354 | musts
355 | musths
356 | musth
357 | main
358 | make
359 | mayest
360 | many
361 | mauger
362 | maugre
363 | me
364 | meanwhiles
365 | meanwhile
366 | mostly
367 | most
368 | moreover
369 | more
370 | might
371 | mights
372 | midst
373 | midsts
374 | h
375 | huh
376 | humph
377 | he
378 | hers
379 | herself
380 | her
381 | hereby
382 | herein
383 | hereafters
384 | hereafter
385 | hereupon
386 | hence
387 | hadst
388 | had
389 | having
390 | haves
391 | have
392 | has
393 | hast
394 | hardly
395 | hae
396 | hath
397 | him
398 | himself
399 | hither
400 | hitherest
401 | hitherer
402 | his
403 | how-do-you-do
404 | however
405 | how
406 | howbeit
407 | howdoyoudo
408 | hoos
409 | hoo
410 | w
411 | woulded
412 | woulding
413 | would
414 | woulds
415 | was
416 | wast
417 | we
418 | wert
419 | were
420 | with
421 | withal
422 | without
423 | within
424 | why
425 | what
426 | whatever
427 | whateverer
428 | whateverest
429 | whatsoeverer
430 | whatsoeverest
431 | whatsoever
432 | whence
433 | whencesoever
434 | whenever
435 | whensoever
436 | when
437 | whenas
438 | whether
439 | wheen
440 | whereto
441 | whereupon
442 | wherever
443 | whereon
444 | whereof
445 | where
446 | whereby
447 | wherewithal
448 | wherewith
449 | whereinto
450 | wherein
451 | whereafter
452 | whereas
453 | wheresoever
454 | wherefrom
455 | which
456 | whichever
457 | whichsoever
458 | whilst
459 | while
460 | whiles
461 | whithersoever
462 | whither
463 | whoever
464 | whosoever
465 | whoso
466 | whose
467 | whomever
468 | s
469 | syne
470 | syn
471 | shalling
472 | shall
473 | shalled
474 | shalls
475 | shoulding
476 | should
477 | shoulded
478 | shoulds
479 | she
480 | sayyid
481 | sayid
482 | said
483 | saider
484 | saidest
485 | same
486 | samest
487 | sames
488 | samer
489 | saved
490 | sans
491 | sanses
492 | sanserifs
493 | sanserif
494 | so
495 | soer
496 | soest
497 | sobeit
498 | someone
499 | somebody
500 | somehow
501 | some
502 | somewhere
503 | somewhat
504 | something
505 | sometimest
506 | sometimes
507 | sometimer
508 | sometime
509 | several
510 | severaler
511 | severalest
512 | serious
513 | seriousest
514 | seriouser
515 | senza
516 | send
517 | sent
518 | seem
519 | seems
520 | seemed
521 | seemingest
522 | seeminger
523 | seemings
524 | seven
525 | summat
526 | sups
527 | sup
528 | supping
529 | supped
530 | such
531 | since
532 | sine
533 | sines
534 | sith
535 | six
536 | stop
537 | stopped
538 | p
539 | plaintiff
540 | plenty
541 | plenties
542 | please
543 | pleased
544 | pleases
545 | per
546 | perhaps
547 | particulars
548 | particularly
549 | particular
550 | particularest
551 | particularer
552 | pro
553 | providing
554 | provides
555 | provided
556 | provide
557 | probably
558 | l
559 | layabout
560 | layabouts
561 | latter
562 | latterest
563 | latterer
564 | latterly
565 | latters
566 | lots
567 | lotting
568 | lotted
569 | lot
570 | lest
571 | less
572 | ie
573 | ifs
574 | if
575 | i
576 | info
577 | information
578 | itself
579 | its
580 | it
581 | is
582 | idem
583 | idemer
584 | idemest
585 | immediate
586 | immediately
587 | immediatest
588 | immediater
589 | in
590 | inwards
591 | inwardest
592 | inwarder
593 | inward
594 | inasmuch
595 | into
596 | instead
597 | insofar
598 | indicates
599 | indicated
600 | indicate
601 | indicating
602 | indeed
603 | inc
604 | f
605 | fact
606 | facts
607 | fs
608 | figupon
609 | figupons
610 | figuponing
611 | figuponed
612 | few
613 | fewer
614 | fewest
615 | frae
616 | from
617 | failing
618 | failings
619 | five
620 | furthers
621 | furtherer
622 | furthered
623 | furtherest
624 | further
625 | furthering
626 | furthermore
627 | fourscore
628 | followthrough
629 | for
630 | forwhy
631 | fornenst
632 | formerly
633 | former
634 | formerer
635 | formerest
636 | formers
637 | forbye
638 | forby
639 | fore
640 | forever
641 | forer
642 | fores
643 | four
644 | d
645 | ddays
646 | dday
647 | do
648 | doing
649 | doings
650 | doe
651 | does
652 | doth
653 | downwarder
654 | downwardest
655 | downward
656 | downwards
657 | downs
658 | done
659 | doner
660 | dones
661 | donest
662 | dos
663 | dost
664 | did
665 | differentest
666 | differenter
667 | different
668 | describing
669 | describe
670 | describes
671 | described
672 | despiting
673 | despites
674 | despited
675 | despite
676 | during
677 | c
678 | cum
679 | circa
680 | chez
681 | cer
682 | certain
683 | certainest
684 | certainer
685 | cest
686 | canst
687 | cannot
688 | cant
689 | cants
690 | canting
691 | cantest
692 | canted
693 | co
694 | could
695 | couldst
696 | comeon
697 | comeons
698 | come-ons
699 | come-on
700 | concerning
701 | concerninger
702 | concerningest
703 | consequently
704 | considering
705 | e
706 | eg
707 | eight
708 | either
709 | even
710 | evens
711 | evenser
712 | evensest
713 | evened
714 | evenest
715 | ever
716 | everyone
717 | everything
718 | everybody
719 | everywhere
720 | every
721 | ere
722 | each
723 | et
724 | etc
725 | elsewhere
726 | else
727 | ex
728 | excepted
729 | excepts
730 | except
731 | excepting
732 | exes
733 | enough


--------------------------------------------------------------------------------
/text-similarity-rest-reference.md:
--------------------------------------------------------------------------------
  1 | # RxNLP Text Similarity API REST Reference
  2 | 
  3 | ## What is the Text Similarity API?
  4 | 
  5 | [<img src="https://d1g84eaw0qjo7s.cloudfront.net/images/badges/badge-icon-light-9e8eba63.png" alt="Mashape" width="143" height="38" />][1]
  6 | 
  7 | <p style="text-align: justify;">
  8 |   The Text Similarity API computes surface similarity between two pieces of text (long or short) using three well-known measures: Jaccard, Dice and Cosine. Determining similarity between texts is crucial to many applications, such as <em>clustering, duplicate removal, merging similar topics or themes, and text retrieval</em>. Let's say we have the following two product listings on eBay:
  9 | </p>
 10 | 
 11 | <pre style="text-align: justify;">
 12 | "text1": "iphone 4s black new",
 13 | "text2": "iphone 4s black old"
 14 | </pre>
 15 | 
 16 | <p style="text-align: justify;">
 17 |   How can you tell that these two listings are almost the same? You can use text similarity measures for this. The results from the Text Similarity API show how close these two texts are according to different measures:
 18 | </p>
 19 | 
 20 | <pre>{
 21 |  "cosine": "0.750",
 22 |  "jaccard": "0.600",
 23 |  "dice": "0.750",
 24 |  "average":"0.700"
 25 | }
 26 | </pre>
 27 | 
 28 | <p style="text-align: justify;">
 29 |   In text mining applications, you can heuristically set a similarity threshold: if the similarity score between two pieces of text is greater than a chosen value, say 0.5, you can consider the two units similar. The right threshold depends on the needs of the application. Here are some recommendations:
 30 | </p>
 31 | 
 32 | - For strict similarity, use a threshold of 0.5 or above
 33 | - For more liberal similarity, use a threshold lower than 0.5
 34 | - In some cases, you can avoid thresholds by ranking texts by similarity scores and using only the top N most similar texts.
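
Both strategies, a fixed threshold and top-N ranking, can be sketched as follows. The sketch assumes the scores were already obtained from the API; the `ThresholdSketch` class, its `Scored` type and the scores in `main` are hypothetical placeholders rather than real API output:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical post-processing of scores that were already obtained from the API.
final class ThresholdSketch {

    // A candidate text paired with its similarity score against some query text.
    static final class Scored {
        final String text;
        final double score;
        Scored(String text, double score) { this.text = text; this.score = score; }
    }

    // Keep only the candidates whose score reaches the threshold.
    static List<Scored> aboveThreshold(List<Scored> candidates, double threshold) {
        List<Scored> kept = new ArrayList<>();
        for (Scored s : candidates) {
            if (s.score >= threshold) kept.add(s);
        }
        return kept;
    }

    // Alternative without a threshold: rank by score, highest first, and keep the top N.
    static List<Scored> topN(List<Scored> candidates, int n) {
        List<Scored> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        // Placeholder scores chosen for illustration only.
        List<Scored> candidates = new ArrayList<>();
        candidates.add(new Scored("listing A", 0.70));
        candidates.add(new Scored("listing B", 0.10));
        candidates.add(new Scored("listing C", 0.55));
        for (Scored s : aboveThreshold(candidates, 0.5)) System.out.println("similar: " + s.text);
        for (Scored s : topN(candidates, 1)) System.out.println("top: " + s.text);
    }
}
```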
 35 | 
 36 | * * *
 37 | 
 38 | # Integrate Text Similarity with Code
 39 | 
 40 | To use this API, you need to set three parameters:
 41 | 
 42 | *   **text1**: your first unit of text or text tokens
 43 | *   **text2**: your second unit of text or text tokens
 44 | *   **clean**: whether to clean your text before the similarity computation
 45 | 
 46 | <div style="box-sizing: border-box;">
 47 | <p class="rtejustify" style="text-align: justify;">
 48 | You can send fairly lengthy units of text (e.g. two plain text documents), but the maximum payload size is 1MB per request. The text that you provide can be plain words, words with Part of Speech (POS) annotations (e.g. the/dt cow/nn jumps/vb) or combined tokens such as n-grams (e.g. this_cat cat_is is_cute).
 49 |   </p>
 50 | </div>
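
The n-gram form mentioned above can be produced on the client before sending the request. A minimal sketch that turns plain text into underscore-joined word bigrams; `BigramSketch` and `toBigrams` are hypothetical names, and it is an assumption that the underscores survive the API's cleaning step:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-processing helper: turns plain text into underscore-joined word bigrams.
final class BigramSketch {

    static String toBigrams(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return "";
        String[] words = trimmed.split("\\s+");
        List<String> bigrams = new ArrayList<>();
        // Each pair of adjacent words becomes one token.
        for (int i = 0; i + 1 < words.length; i++) {
            bigrams.add(words[i] + "_" + words[i + 1]);
        }
        return String.join(" ", bigrams);
    }

    public static void main(String[] args) {
        // prints: this_cat cat_is is_cute
        System.out.println(toBigrams("this cat is cute"));
    }
}
```

The returned string can then be sent as `text1` or `text2` like any other text.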
 51 | 
 52 | * * *
 53 | 
 54 | ## First Steps: Get your API Key
 55 | 
 56 | <p class="rtejustify">
 57 |   Before you start, please ensure that you have a <a href="http://www.rxnlp.com/api-key">valid API key</a>.
 58 | </p>
 59 | 
 60 | * * *
 61 | 
 62 | ## Request
 63 | 
 64 | The TextSimilarity endpoint accepts a **JSON request via POST.** It takes in 3 parameters:
 65 | 
 66 | <table style="width: 1201px; height: 88px;" border="0" cellspacing="1" cellpadding="1" align="left">
 67 |   <thead>
 68 |     <tr>
 69 |       <th scope="col">
 70 |         <strong>Parameter name</strong>
 71 |       </th>
 72 |       
 73 |       <th class="rtecenter" scope="col">
 74 |         <strong>Type</strong>
 75 |       </th>
 76 |       
 77 |       <th class="rtecenter" scope="col">
 78 |         <strong>Required?</strong>
 79 |       </th>
 80 |       
 81 |       <th class="rtecenter" scope="col">
 82 |         Description
 83 |       </th>
 84 |     </tr>
 85 |   </thead>
 86 |   
 87 |   <tbody>
 88 |     <tr>
 89 |       <td>
 90 |         text1
 91 |       </td>
 92 |       
 93 |       <td class="rtecenter">
 94 |         text
 95 |       </td>
 96 |       
 97 |       <td class="rtecenter">
 98 |         Yes
 99 |       </td>
100 |       
101 |       <td class="rtecenter">
102 |         first text
103 |       </td>
104 |     </tr>
105 |     
106 |     <tr>
107 |       <td>
108 |         text2
109 |       </td>
110 |       
111 |       <td class="rtecenter">
112 |         text
113 |       </td>
114 |       
115 |       <td class="rtecenter">
116 |         Yes
117 |       </td>
118 |       
119 |       <td class="rtecenter">
120 |         second text
121 |       </td>
122 |     </tr>
123 |     
124 |     <tr>
125 |       <td>
126 |         clean
127 |       </td>
128 |       
129 |       <td class="rtecenter">
130 |         text
131 |       </td>
132 |       
133 |       <td class="rtecenter">
134 |         No (Default=true)
135 |       </td>
136 |       
137 |       <td class="rtecenter">
138 |         lowercase, remove punctuation and numbers?
139 |       </td>
140 |     </tr>
141 |   </tbody>
142 | </table>
143 | 
144 | <span style="color: #ff0000;"><strong>Points to note:</strong></span> 
145 | 
146 | - There is **no maximum length** for the text, but there is a 1MB maximum payload per request. 
147 | - The text can be in **any language**, and it can be: 
148 |   - plain text (e.g. *the cow jumps over the moon*) 
149 |   - text with POS annotations (e.g. *the/dt cow/nn jumps/vb*) 
150 |   - manipulated text such as n-grams (e.g. *this_cat cat_is is_cute*) 
151 | - Since this is a JSON request, your text has to be properly escaped and encoded in UTF-8
152 |   
153 | Requests can be sent from any programming language as long as they follow the expected JSON format. The [unirest][2] library handles HTTP requests and responses in several languages, including Java, Python, Ruby, Node.js and PHP. Here is an example using the Java Unirest library:
154 | 
155 |     // This snippet uses an open-source library: http://unirest.io/java
156 |     HttpResponse<JsonNode> response = Unirest.post("https://rxnlp-core.p.mashape.com/computeSimilarity")
157 |     .header("X-Mashape-Key", "<your_api_key>")
158 |     .header("Content-Type", "application/json")
159 |     .header("Accept", "application/json")
160 |     .body("{\"text1\":\"this is test 1\",\"text2\":\"this is test 2!\",\"clean\":\"false\"}")
161 |     .asJson();
162 | 
163 | *   'text1' and 'text2' are the two texts that you want to compute similarity over and are both **mandatory**.
164 | *   'clean' indicates if you want your text to be cleaned up prior to computing text similarity and this is **optional**
165 | *   Content type with application/json is **mandatory** to indicate the type of request being sent
166 | *   X-Mashape-Key is **mandatory**; it is the key that gives you access to the API. A simple <a href="https://github.com/RxNLP/sdk-and-resources/blob/master/java/TextSimilaritySimpleWrapper" target="_blank">wrapper</a> for the Text Similarity API in Java, using HttpURLConnection, is also available.
167 | 
168 | * * *
169 | 
170 | ## Response
171 | 
172 | <p style="text-align: justify;">
173 |   The Text Similarity API returns a <strong>JSON response</strong> containing the <strong>Cosine</strong>, <strong>Jaccard</strong> and <strong>Dice</strong> similarity scores, along with the <strong>average</strong> of these three scores. Here are two example requests and their responses:
174 | </p>
175 | 
176 | **Request:**
177 | 
178 | <pre>{
179 |  "text1":"this is test 2",
180 |  "text2":"this is test 2!", 
181 |  "clean":"true"
182 | }
183 | </pre>
184 | 
185 | **Response:**
186 | 
187 | <pre>{
188 |  "cosine": "1.000",
189 |  "jaccard": "1.000",
190 |  "dice": "1.000",
191 |  "average": "1.000"
192 |  }</pre>
193 | 
194 | **Request:**
195 | 
196 | <pre>{
197 | "text1":"this is test 2",
198 | "text2":"this is test 2!", 
199 | "clean":"false"
200 | }</pre>
201 | 
202 | **Response:** 
203 | 
204 | <pre><code>
205 | {
206 |  "cosine": "0.750",
207 |  "jaccard": "0.600",
208 |  "dice": "0.750",
209 |  "average": "0.700"
210 | }
211 | </code></pre>
212 | 
213 | <p style="text-align: justify;">
214 |   Since you have access to several similarity measures, you can consistently use a single measure, use all of them together, or rely on the average score.
215 | </p>
216 | 
217 | * * *
218 | 
219 | # Which Similarity Measure to Use?
220 | 
221 | <p class="rtejustify" style="text-align: justify;">
222 |   If you have very short texts and want a strict measure that gives high scores only to phrases that are very similar, then <strong>Jaccard</strong> would be ideal. However, if your text is more than 5 words long, Cosine or Dice may be more appropriate, since these measures tend not to over-penalize non-overlapping terms. You can also average all three scores. In any case, please experiment before you decide which measure(s) to use.
223 | </p>
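
To see why Jaccard is the stricter measure, it helps to compute all three by hand. The sketch below uses the textbook set-based definitions over whitespace-separated tokens. This is an assumption about how the scores are derived, not a copy of the API's implementation, although these definitions do reproduce the example scores for the two iPhone listings: 3 shared tokens out of 5 distinct ones gives Jaccard 0.600, while Dice and Cosine both give 0.750. `SetSimilaritySketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Textbook set-based definitions; an assumption about, not a copy of, the API's implementation.
final class SetSimilaritySketch {

    // Split on whitespace and keep each distinct token once.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
    }

    // Number of tokens the two sets share.
    static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Jaccard: intersection size divided by union size.
    static double jaccard(Set<String> a, Set<String> b) {
        int common = overlap(a, b);
        return (double) common / (a.size() + b.size() - common);
    }

    // Dice: twice the intersection size divided by the sum of both set sizes.
    static double dice(Set<String> a, Set<String> b) {
        return 2.0 * overlap(a, b) / (a.size() + b.size());
    }

    // Cosine over binary term vectors: intersection size divided by sqrt(|A| * |B|).
    static double cosine(Set<String> a, Set<String> b) {
        return overlap(a, b) / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        Set<String> a = tokens("iphone 4s black new");
        Set<String> b = tokens("iphone 4s black old");
        System.out.println(String.format(Locale.ROOT, "cosine=%.3f jaccard=%.3f dice=%.3f",
                cosine(a, b), jaccard(a, b), dice(a, b)));
    }
}
```

Jaccard divides by the full union, so every non-overlapping token lowers the score, whereas Dice and Cosine divide by a quantity based on the individual set sizes and therefore penalize mismatches less.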
224 | 
225 | * * *
226 | 
227 | # Improving Similarity Measures
228 | 
229 | <p class="rtejustify" style="text-align: justify;">
230 |   There are several ways to improve the similarity scores (that is, to find more overlaps). Here are some ideas for making the similarity measures more reliable:
231 | </p>
232 | 
233 | *   Stem the text units before computing similarity
234 | *   Remove determiners (e.g. the, an, a) [<a href="http://dictionary.cambridge.org/us/grammar/british-grammar/determiners-the-my-some-this" target="_blank">see list</a>]
235 | *   Remove stop words [<a href="https://github.com/RxNLP/text-mining/blob/master/terrier-stop-word-list.txt" target="_blank">full stop word list</a>] [<a href="https://github.com/RxNLP/text-mining/blob/master/minimal-stop-word-list.txt" target="_blank">minimal stop list</a>] [<a href="http://www.ranks.nl/stopwords" target="_blank">stop words in other languages</a>]
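
As an illustration of the stop word idea, the sketch below filters a text against a small in-memory stop word set before it would be sent to the API. `StopWordSketch` and `removeStopWords` are hypothetical names; in practice the set would be loaded from one of the stop word lists linked above, one word per line:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-processing helper: drops stop words before computing similarity.
final class StopWordSketch {

    static String removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            // Compare in lower case so that "The" and "the" are treated alike.
            if (!word.isEmpty() && !stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // A tiny sample; all four words appear in the Terrier stop word list.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "an", "over"));
        // prints: cow jumps moon
        System.out.println(removeStopWords("The cow jumps over the moon", stopWords));
    }
}
```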
236 | 
237 | * * *
238 | 
239 | # Languages Supported
240 | 
241 | Text Similarity is <span style="text-decoration: underline; color: #ff0000;">language-neutral</span> and thus works for all languages.
242 | 
243 |  [1]: https://www.mashape.com/rxnlp/text-mining-and-nlp?&utm_campaign=mashape5-embed&utm_medium=button&utm_source=text-mining-and-nlp&utm_content=anchorlink&utm_term=icon-light
244 |  [2]: http://unirest.io/
245 | 


--------------------------------------------------------------------------------