├── README.md ├── java ├── Readme.md └── TextSimilaritySimpleWrapper ├── minimal-stop-word-list.txt ├── terrier-stop-word-list.txt └── text-similarity-rest-reference.md /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Text Mining and NLP APIs 3 | 4 | RxNLP's Text Mining and NLP APIs allow for quick analysis of highly messy unstructured text. You can cluster sentences, extract topics, summarize opinions, generate lexical similarities between two pieces of text and more. All of this would allow you to build powerful data-driven applications such as a Tweet Analysis app with tweets clustered into logical topics (using Sentence Clustering), similar Tweets detected using Text Similarity and sentiments in Tweets nicely summarized for users using Opinosis Summarization. 5 | 6 | [PyRxNLP](https://github.com/RxNLP/pyrxnlp) is our Python SDK for these APIs. 7 | 8 | ## List of APIs 9 | - [HTML2Text API](http://www.rxnlp.com/api-reference/html2text-api/) 10 | - [Topics Extraction API](http://www.rxnlp.com/api-reference/topics-and-themes-api-reference/) 11 | - [Text Similarity API](http://www.rxnlp.com/api-reference/text-similarity-api-reference/) 12 | - [Sentence Clustering API](http://www.rxnlp.com/sentence-clustering-api/) 13 | - [N-Gram and Word Counting API](http://www.rxnlp.com/api-reference/n-gram-and-word-counter-api-reference/) 14 | - [Opinosis Opinion Summarization](https://market.mashape.com/RxNLP/text-mining-and-nlp#opinosis-summaries) [[see paper](https://dl.acm.org/citation.cfm?id=1873820)] 15 | 16 | 17 | ## Quick Start 18 | - Register for a [RapidApi account](https://rapidapi.com/) 19 | - Subscribe to an [API plan](https://rapidapi.com/RxNLP/api/text-mining-and-nlp). 
20 | - Use the Basic plan if you have limited daily calls 21 | - [Get your Key](https://rapidapi.com/RxNLP/api/text-mining-and-nlp) 22 | - Build your API client or start with the [PyRxNLP](https://github.com/RxNLP/pyrxnlp) SDK 23 | 24 | ## Get in touch 25 | 26 | For questions, use our [contact form](http://www.rxnlp.com/contact/) or send a message to contact@rxnlp.com. 27 | 28 | 29 | -------------------------------------------------------------------------------- /java/Readme.md: -------------------------------------------------------------------------------- 1 | 2 | ## TextSimilaritySimpleWrapper.java 3 | 4 | This is a simple Java Wrapper for [RxNLP's Text Similarity API](http://www.rxnlp.com/api-reference/text-similarity-api-reference/) which computes the similarity between two pieces of text (text can be arbitrarily long). The API computes dice, jaccard and cosine similarity between texts and also produces the average scores. Note that the API can also clean the text prior to computing the similarity scores. 5 | 6 | 7 | 8 | ## Explanation of Code 9 | 10 | 11 | ### Making connection to the API 12 | 13 | Note that using plain vanilla HTTP Request is much faster than using the Unirest library. The X-Mashape-Key is the API key needed to access the API endpoint. You will need to register with Mashape and subscribe to the [Text Mining and NLP API](https://market.mashape.com/rxnlp/text-mining-and-nlp) in order to obtain the API Key. 
14 |
15 | ```java
16 |
17 | //endpoint
18 | URL targetUrl = new URL("https://rxnlp-core.p.mashape.com/computeSimilarity");
19 |
20 | //First set the headers
21 | HttpURLConnection httpConnection = (HttpURLConnection) targetUrl.openConnection();
22 | httpConnection.setDoOutput(true);
23 | httpConnection.setRequestMethod("POST");
24 | httpConnection.setRequestProperty("Content-Type", "application/json");
25 | httpConnection.setRequestProperty("X-Mashape-Key", "");
26 | ```
27 |
28 |
29 | ### Send JSON request
30 |
31 | Here we create a JSON request from the two strings to compare and send it to the server. The strings can be long, provided the request does not exceed 1MB. For details on the parameters, [refer to this documentation](http://www.rxnlp.com/api-reference/text-similarity-api-reference/#request).
32 |
33 | ```java
34 | String str1="This is the first string. It can be quite long.";
35 | String str2="This is the second string. It can be quite long.";
36 |
37 | //Then set input (str1 and str2 must not contain unescaped quotes or backslashes)
38 | String input = "{\"text1\":\""+str1
39 | +"\",\"text2\":\""+str2
40 | +"\",\"clean\":\"true\"}";
41 |
42 | //Next, send the request body encoded as UTF-8
43 | OutputStream outputStream = httpConnection.getOutputStream();
44 | outputStream.write(input.getBytes("UTF-8"));
45 | outputStream.flush();
46 | ```
47 |
48 |
49 |
50 | ### Read JSON response from server
51 |
52 | Here we read the JSON response and print the raw JSON line by line. You can parse the raw JSON to obtain the cosine, jaccard and dice similarities.
53 | 54 | ```java 55 | BufferedReader responseBuffer = new BufferedReader(new InputStreamReader((httpConnection.getInputStream()))); 56 | 57 | //Printing output from server (you can use a json parser here instead) 58 | String output; 59 | System.out.println("Output from Server:\n"); 60 | while ((output = responseBuffer.readLine()) != null) { 61 | System.out.println(output); 62 | } 63 | 64 | ``` 65 | 66 | 67 | ### Example JSON Request 68 | 69 | ```json 70 | { 71 | "text1": "iphone 4s black new", 72 | "text2": "iphone 4s black old", 73 | "clean":"true" 74 | } 75 | ``` 76 | 77 | 78 | ### Example JSON Response 79 | ```json 80 | { 81 | "cosine": "0.750", 82 | "jaccard": "0.600", 83 | "dice": "0.750", 84 | "average":"0.700" 85 | } 86 | ``` 87 | -------------------------------------------------------------------------------- /java/TextSimilaritySimpleWrapper: -------------------------------------------------------------------------------- 1 | package com.rxnlp.library; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.InputStreamReader; 5 | import java.io.OutputStream; 6 | import java.net.HttpURLConnection; 7 | import java.net.URL; 8 | 9 | public class TextSimilaritySimpleWrapper { 10 | 11 | public static void main(String args[]){ 12 | 13 | //This is the target URL 14 | URL targetUrl; 15 | try { 16 | targetUrl = new URL("https://rxnlp-core.p.mashape.com/computeSimilarity"); 17 | 18 | //First set the headers 19 | HttpURLConnection httpConnection = (HttpURLConnection) targetUrl.openConnection(); 20 | httpConnection.setDoOutput(true); 21 | httpConnection.setRequestMethod("POST"); 22 | httpConnection.setRequestProperty("Content-Type", "application/json"); 23 | httpConnection.setRequestProperty("X-Mashape-Key", ""); 24 | 25 | String str1="كتب عربية قصيرة : قصص قصيرة، مختصرات ، الخ"; 26 | String str2="عربية قصيرة : قصص قصيرة، مختصرات ، الخ"; 27 | 28 | //Then set input 29 | String input = "{\"text1\":\""+str1 30 | +"\",\"text2\":\""+str2 31 | +"\",\"clean\":\"true\"}"; 
32 | 33 | //Next, process output 34 | OutputStream outputStream = httpConnection.getOutputStream(); 35 | outputStream.write(input.getBytes()); 36 | outputStream.flush(); 37 | 38 | //Throw exception on error 39 | if (httpConnection.getResponseCode() != 200) { 40 | throw new RuntimeException("Failed : HTTP error code : " 41 | + httpConnection.getResponseCode()); 42 | } 43 | 44 | BufferedReader responseBuffer = new BufferedReader(new InputStreamReader((httpConnection.getInputStream()))); 45 | 46 | //Printing output from server (you can use a json parser here instead) 47 | String output; 48 | System.out.println("Output from Server:\n"); 49 | while ((output = responseBuffer.readLine()) != null) { 50 | System.out.println(output); 51 | } 52 | 53 | //disconnect from server 54 | httpConnection.disconnect(); 55 | } catch (Exception e) { 56 | // TODO Auto-generated catch block 57 | e.printStackTrace(); 58 | } 59 | 60 | 61 | } 62 | 63 | } 64 | -------------------------------------------------------------------------------- /minimal-stop-word-list.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RxNLP/nlp-cloud-apis/4871cef31996f56cc2f8e03bd955f56b1a1361dc/minimal-stop-word-list.txt -------------------------------------------------------------------------------- /terrier-stop-word-list.txt: -------------------------------------------------------------------------------- 1 | x 2 | y 3 | your 4 | yours 5 | yourself 6 | yourselves 7 | you 8 | yond 9 | yonder 10 | yon 11 | ye 12 | yet 13 | z 14 | zillion 15 | j 16 | u 17 | umpteen 18 | usually 19 | us 20 | username 21 | uponed 22 | upons 23 | uponing 24 | upon 25 | ups 26 | upping 27 | upped 28 | up 29 | unto 30 | until 31 | unless 32 | unlike 33 | unliker 34 | unlikest 35 | under 36 | underneath 37 | use 38 | used 39 | usedest 40 | r 41 | rath 42 | rather 43 | rathest 44 | rathe 45 | re 46 | relate 47 | related 48 | relatively 49 | regarding 50 | really 51 | 
res 52 | respecting 53 | respectively 54 | q 55 | quite 56 | que 57 | qua 58 | n 59 | neither 60 | neaths 61 | neath 62 | nethe 63 | nethermost 64 | necessary 65 | necessariest 66 | necessarier 67 | never 68 | nevertheless 69 | nigh 70 | nighest 71 | nigher 72 | nine 73 | noone 74 | nobody 75 | nobodies 76 | nowhere 77 | nowheres 78 | no 79 | noes 80 | nor 81 | nos 82 | no-one 83 | none 84 | not 85 | notwithstanding 86 | nothings 87 | nothing 88 | nathless 89 | natheless 90 | t 91 | ten 92 | tills 93 | till 94 | tilled 95 | tilling 96 | to 97 | towards 98 | toward 99 | towardest 100 | towarder 101 | together 102 | too 103 | thy 104 | thyself 105 | thus 106 | than 107 | that 108 | those 109 | thou 110 | though 111 | thous 112 | thouses 113 | thoroughest 114 | thorougher 115 | thorough 116 | thoroughly 117 | thru 118 | thruer 119 | thruest 120 | thro 121 | through 122 | throughout 123 | throughest 124 | througher 125 | thine 126 | this 127 | thises 128 | they 129 | thee 130 | the 131 | then 132 | thence 133 | thenest 134 | thener 135 | them 136 | themselves 137 | these 138 | therer 139 | there 140 | thereby 141 | therest 142 | thereafter 143 | therein 144 | thereupon 145 | therefore 146 | their 147 | theirs 148 | thing 149 | things 150 | three 151 | two 152 | o 153 | oh 154 | owt 155 | owning 156 | owned 157 | own 158 | owns 159 | others 160 | other 161 | otherwise 162 | otherwisest 163 | otherwiser 164 | of 165 | often 166 | oftener 167 | oftenest 168 | off 169 | offs 170 | offest 171 | one 172 | ought 173 | oughts 174 | our 175 | ours 176 | ourselves 177 | ourself 178 | out 179 | outest 180 | outed 181 | outwith 182 | outs 183 | outside 184 | over 185 | overallest 186 | overaller 187 | overalls 188 | overall 189 | overs 190 | or 191 | orer 192 | orest 193 | on 194 | oneself 195 | onest 196 | ons 197 | onto 198 | a 199 | atween 200 | at 201 | athwart 202 | atop 203 | afore 204 | afterward 205 | afterwards 206 | after 207 | afterest 208 | afterer 209 | ain 210 | an 
211 | any 212 | anything 213 | anybody 214 | anyone 215 | anyhow 216 | anywhere 217 | anent 218 | anear 219 | and 220 | andor 221 | another 222 | around 223 | ares 224 | are 225 | aest 226 | aer 227 | against 228 | again 229 | accordingly 230 | abaft 231 | abafter 232 | abaftest 233 | abovest 234 | above 235 | abover 236 | abouter 237 | aboutest 238 | about 239 | aid 240 | amidst 241 | amid 242 | among 243 | amongst 244 | apartest 245 | aparter 246 | apart 247 | appeared 248 | appears 249 | appear 250 | appearing 251 | appropriating 252 | appropriate 253 | appropriatest 254 | appropriates 255 | appropriater 256 | appropriated 257 | already 258 | always 259 | also 260 | along 261 | alongside 262 | although 263 | almost 264 | all 265 | allest 266 | aller 267 | allyou 268 | alls 269 | albeit 270 | awfully 271 | as 272 | aside 273 | asides 274 | aslant 275 | ases 276 | astrider 277 | astride 278 | astridest 279 | astraddlest 280 | astraddler 281 | astraddle 282 | availablest 283 | availabler 284 | available 285 | aughts 286 | aught 287 | vs 288 | v 289 | variousest 290 | variouser 291 | various 292 | via 293 | vis-a-vis 294 | vis-a-viser 295 | vis-a-visest 296 | viz 297 | very 298 | veriest 299 | verier 300 | versus 301 | k 302 | g 303 | go 304 | gone 305 | good 306 | got 307 | gotta 308 | gotten 309 | get 310 | gets 311 | getting 312 | b 313 | by 314 | byandby 315 | by-and-by 316 | bist 317 | both 318 | but 319 | buts 320 | be 321 | beyond 322 | because 323 | became 324 | becomes 325 | become 326 | becoming 327 | becomings 328 | becominger 329 | becomingest 330 | behind 331 | behinds 332 | before 333 | beforehand 334 | beforehandest 335 | beforehander 336 | bettered 337 | betters 338 | better 339 | bettering 340 | betwixt 341 | between 342 | beneath 343 | been 344 | below 345 | besides 346 | beside 347 | m 348 | my 349 | myself 350 | mucher 351 | muchest 352 | much 353 | must 354 | musts 355 | musths 356 | musth 357 | main 358 | make 359 | mayest 360 | many 361 | 
mauger 362 | maugre 363 | me 364 | meanwhiles 365 | meanwhile 366 | mostly 367 | most 368 | moreover 369 | more 370 | might 371 | mights 372 | midst 373 | midsts 374 | h 375 | huh 376 | humph 377 | he 378 | hers 379 | herself 380 | her 381 | hereby 382 | herein 383 | hereafters 384 | hereafter 385 | hereupon 386 | hence 387 | hadst 388 | had 389 | having 390 | haves 391 | have 392 | has 393 | hast 394 | hardly 395 | hae 396 | hath 397 | him 398 | himself 399 | hither 400 | hitherest 401 | hitherer 402 | his 403 | how-do-you-do 404 | however 405 | how 406 | howbeit 407 | howdoyoudo 408 | hoos 409 | hoo 410 | w 411 | woulded 412 | woulding 413 | would 414 | woulds 415 | was 416 | wast 417 | we 418 | wert 419 | were 420 | with 421 | withal 422 | without 423 | within 424 | why 425 | what 426 | whatever 427 | whateverer 428 | whateverest 429 | whatsoeverer 430 | whatsoeverest 431 | whatsoever 432 | whence 433 | whencesoever 434 | whenever 435 | whensoever 436 | when 437 | whenas 438 | whether 439 | wheen 440 | whereto 441 | whereupon 442 | wherever 443 | whereon 444 | whereof 445 | where 446 | whereby 447 | wherewithal 448 | wherewith 449 | whereinto 450 | wherein 451 | whereafter 452 | whereas 453 | wheresoever 454 | wherefrom 455 | which 456 | whichever 457 | whichsoever 458 | whilst 459 | while 460 | whiles 461 | whithersoever 462 | whither 463 | whoever 464 | whosoever 465 | whoso 466 | whose 467 | whomever 468 | s 469 | syne 470 | syn 471 | shalling 472 | shall 473 | shalled 474 | shalls 475 | shoulding 476 | should 477 | shoulded 478 | shoulds 479 | she 480 | sayyid 481 | sayid 482 | said 483 | saider 484 | saidest 485 | same 486 | samest 487 | sames 488 | samer 489 | saved 490 | sans 491 | sanses 492 | sanserifs 493 | sanserif 494 | so 495 | soer 496 | soest 497 | sobeit 498 | someone 499 | somebody 500 | somehow 501 | some 502 | somewhere 503 | somewhat 504 | something 505 | sometimest 506 | sometimes 507 | sometimer 508 | sometime 509 | several 510 | severaler 
511 | severalest 512 | serious 513 | seriousest 514 | seriouser 515 | senza 516 | send 517 | sent 518 | seem 519 | seems 520 | seemed 521 | seemingest 522 | seeminger 523 | seemings 524 | seven 525 | summat 526 | sups 527 | sup 528 | supping 529 | supped 530 | such 531 | since 532 | sine 533 | sines 534 | sith 535 | six 536 | stop 537 | stopped 538 | p 539 | plaintiff 540 | plenty 541 | plenties 542 | please 543 | pleased 544 | pleases 545 | per 546 | perhaps 547 | particulars 548 | particularly 549 | particular 550 | particularest 551 | particularer 552 | pro 553 | providing 554 | provides 555 | provided 556 | provide 557 | probably 558 | l 559 | layabout 560 | layabouts 561 | latter 562 | latterest 563 | latterer 564 | latterly 565 | latters 566 | lots 567 | lotting 568 | lotted 569 | lot 570 | lest 571 | less 572 | ie 573 | ifs 574 | if 575 | i 576 | info 577 | information 578 | itself 579 | its 580 | it 581 | is 582 | idem 583 | idemer 584 | idemest 585 | immediate 586 | immediately 587 | immediatest 588 | immediater 589 | in 590 | inwards 591 | inwardest 592 | inwarder 593 | inward 594 | inasmuch 595 | into 596 | instead 597 | insofar 598 | indicates 599 | indicated 600 | indicate 601 | indicating 602 | indeed 603 | inc 604 | f 605 | fact 606 | facts 607 | fs 608 | figupon 609 | figupons 610 | figuponing 611 | figuponed 612 | few 613 | fewer 614 | fewest 615 | frae 616 | from 617 | failing 618 | failings 619 | five 620 | furthers 621 | furtherer 622 | furthered 623 | furtherest 624 | further 625 | furthering 626 | furthermore 627 | fourscore 628 | followthrough 629 | for 630 | forwhy 631 | fornenst 632 | formerly 633 | former 634 | formerer 635 | formerest 636 | formers 637 | forbye 638 | forby 639 | fore 640 | forever 641 | forer 642 | fores 643 | four 644 | d 645 | ddays 646 | dday 647 | do 648 | doing 649 | doings 650 | doe 651 | does 652 | doth 653 | downwarder 654 | downwardest 655 | downward 656 | downwards 657 | downs 658 | done 659 | doner 660 | dones 
661 | donest 662 | dos 663 | dost 664 | did 665 | differentest 666 | differenter 667 | different 668 | describing 669 | describe 670 | describes 671 | described 672 | despiting 673 | despites 674 | despited 675 | despite 676 | during 677 | c 678 | cum 679 | circa 680 | chez 681 | cer 682 | certain 683 | certainest 684 | certainer 685 | cest 686 | canst 687 | cannot 688 | cant 689 | cants 690 | canting 691 | cantest 692 | canted 693 | co 694 | could 695 | couldst 696 | comeon 697 | comeons 698 | come-ons 699 | come-on 700 | concerning 701 | concerninger 702 | concerningest 703 | consequently 704 | considering 705 | e 706 | eg 707 | eight 708 | either 709 | even 710 | evens 711 | evenser 712 | evensest 713 | evened 714 | evenest 715 | ever 716 | everyone 717 | everything 718 | everybody 719 | everywhere 720 | every 721 | ere 722 | each 723 | et 724 | etc 725 | elsewhere 726 | else 727 | ex 728 | excepted 729 | excepts 730 | except 731 | excepting 732 | exes 733 | enough -------------------------------------------------------------------------------- /text-similarity-rest-reference.md: -------------------------------------------------------------------------------- 1 | # RxNLP Text Similarity API REST Reference 2 | 3 | ## What is Text Similarity API? 4 | 5 | [Mashape][1] 6 | 7 |

8 | The Text Similarity API computes surface similarity between two pieces of text (long or short) using the well-known Jaccard, Dice and Cosine measures. Determining the similarity between texts is crucial to many applications, such as clustering, duplicate removal, merging similar topics or themes, and text retrieval. Let's say we have the following two product listings on eBay: 9 |

10 | 11 |
 12 | "text1": "iphone 4s black new",
 13 | "text2": "iphone 4s black old"
 14 | 
15 | 16 |

17 | How can you tell that these two listings are almost the same? You can use text similarity measures for this. The results from the Text Similarity API show how close these two texts are according to different measures: 18 |

19 | 20 |
{
 21 |  "cosine": "0.750",
 22 |  "jaccard": "0.600",
 23 |  "dice": "0.750",
 24 |  "average":"0.700"
 25 | }
 26 | 
27 | 28 |
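To make the three scores concrete, here is a minimal Java sketch of set-based Jaccard, Dice and cosine similarity over whitespace-separated tokens. It is an illustration, not the API's implementation: the class name `SurfaceSimilaritySketch` and the tokenization are assumptions. Under these assumptions, the two iPhone listings share 3 of their 4 tokens each, which gives the same values as the response above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: set-based surface similarity over whitespace tokens.
// The tokenization is an assumption, not the API's documented behavior.
final class SurfaceSimilaritySketch {

    // Split on whitespace and keep the distinct tokens.
    static Set<String> tokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Number of tokens the two sets have in common.
    static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // |A ∩ B| / |A ∪ B|
    static double jaccard(Set<String> a, Set<String> b) {
        int inter = overlap(a, b);
        int union = a.size() + b.size() - inter;
        return union == 0 ? 0.0 : (double) inter / union;
    }

    // 2 |A ∩ B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        int total = a.size() + b.size();
        return total == 0 ? 0.0 : 2.0 * overlap(a, b) / total;
    }

    // |A ∩ B| / sqrt(|A| * |B|), i.e. cosine over binary term vectors
    static double cosine(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        return overlap(a, b) / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        Set<String> a = tokens("iphone 4s black new");
        Set<String> b = tokens("iphone 4s black old");
        double c = cosine(a, b);
        double j = jaccard(a, b);
        double d = dice(a, b);
        // prints: cosine=0.750 jaccard=0.600 dice=0.750 average=0.700
        System.out.printf(Locale.ROOT, "cosine=%.3f jaccard=%.3f dice=%.3f average=%.3f%n",
                c, j, d, (c + j + d) / 3.0);
    }
}
```

Jaccard divides the overlap by the union, so every non-shared token lowers it more than it lowers Dice or cosine.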

29 | In text mining applications, you can set a similarity threshold heuristically: if the similarity score between two pieces of text is greater than some value, say 0.5, you can consider the two units similar. The right threshold depends on the needs of the application. Here are some recommendations: 30 |

31 |
32 | - For strict similarity, use a threshold of 0.5 or above
33 | - For a more liberal similarity, use a threshold below 0.5
34 | - In some cases, you can avoid thresholds by ranking texts by similarity score and using only the top N most similar texts.
35 |
36 | * * *
37 |
38 | # Integrate Text Similarity with Code
39 |
40 | To use this API, you need to set three parameters:
41 |
42 | * **text1**: your first unit of text or text tokens
43 | * **text2**: your second unit of text or text tokens
44 | * **clean**: whether to clean your text before the similarity computation
45 |
46 |
47 |

48 | You can submit fairly lengthy units of text (e.g. two plain text documents), but the maximum payload size is 1MB per request. The text that you provide can be plain words, words with part-of-speech (POS) annotations (e.g. the/dt cow/nn jumps/vb) or combined tokens such as n-grams (e.g. this_cat cat_is is_cute). 49 |
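As an illustration of the combined-token format, the sketch below turns plain text into underscore-joined n-grams before it is sent as `text1` or `text2`. The class and method names are hypothetical, and splitting on whitespace is an assumption; any tokenizer that produces one token per n-gram would work the same way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds n-gram tokens such as "this_cat cat_is is_cute".
final class NGramTokens {

    // Join every run of n consecutive words with '_' so each n-gram is one token.
    static String toNGramText(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String[] words = text.trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join("_", Arrays.copyOfRange(words, i, i + n)));
        }
        return String.join(" ", grams);
    }

    public static void main(String[] args) {
        // prints: this_cat cat_is is_cute
        System.out.println(toNGramText("this cat is cute", 2));
    }
}
```

Comparing bigrams instead of single words makes the similarity scores sensitive to word order, at the cost of fewer overlapping tokens.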

50 |
51 | 52 | * * * 53 | 54 | ## First Steps: Get your API Key 55 | 56 |

57 | Before you start, please ensure that you have a valid API key. 58 |

59 | 60 | * * * 61 | 62 | ## Request 63 | 64 | The TextSimilarity endpoint accepts a **JSON request via POST.** It takes in 3 parameters: 65 | 66 | 67 | 68 | 69 | 72 | 73 | 76 | 77 | 80 | 81 | 84 | 85 | 86 | 87 | 88 | 89 | 92 | 93 | 96 | 97 | 100 | 101 | 104 | 105 | 106 | 107 | 110 | 111 | 114 | 115 | 118 | 119 | 122 | 123 | 124 | 125 | 128 | 129 | 132 | 133 | 136 | 137 | 140 | 141 | 142 |
| Parameter name | Type | Required? | Description |
| --- | --- | --- | --- |
| text1 | text | Yes | first text |
| text2 | text | Yes | second text |
| clean | text | No (Default=true) | lowercase, remove punctuation and numbers? |
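The three parameters can be assembled into a request body as sketched below. Because the request is JSON, quotes, backslashes and control characters inside the texts need escaping, and the bytes should be written as UTF-8. The class `SimilarityRequestBody` and its methods are hypothetical names for this sketch; a JSON library would do the same job more robustly.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the JSON body for the computeSimilarity endpoint.
final class SimilarityRequestBody {

    // Escape a value for use inside a JSON string literal.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // "clean" is sent as the string "true" or "false", as in the examples in this document.
    static String build(String text1, String text2, boolean clean) {
        return "{\"text1\":\"" + escapeJson(text1)
                + "\",\"text2\":\"" + escapeJson(text2)
                + "\",\"clean\":\"" + clean + "\"}";
    }

    // Bytes to write to the connection's output stream.
    static byte[] toUtf8(String body) {
        return body.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints: {"text1":"say \"hi\"","text2":"line1\nline2","clean":"true"}
        System.out.println(build("say \"hi\"", "line1\nline2", true));
    }
}
```

Passing an explicit charset avoids depending on the platform default, which matters for non-ASCII input such as Arabic text.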
143 |
144 | Points to note:
145 |
146 | - There is **no maximum length** for the text, but there is a 1MB maximum payload per request.
147 | - The text can be in **any language** and can be:
148 |   - plain text (e.g. the cow jumps over the moon)
149 |   - text with POS annotations (e.g. *the/dt cow/nn jumps/vb*)
150 |   - manipulated text such as n-grams (e.g. this_cat cat_is is_cute).
151 | - Since this is a JSON request, your text has to be properly escaped and encoded in UTF-8
152 |
153 | Requests can be sent from any programming language as long as they follow the expected JSON format. The [unirest][2] library handles HTTP requests and responses in several languages, including Java, Python, Ruby, Node.js, PHP and more. Here is an example using the Java Unirest library:
154 |
155 |     // These code snippets use an open-source library. http://unirest.io/java
156 |     HttpResponse<JsonNode> response = Unirest.post("https://rxnlp-core.p.mashape.com/computeSimilarity")
157 |     .header("X-Mashape-Key", "")
158 |     .header("Content-Type", "application/json")
159 |     .header("Accept", "application/json")
160 |     .body("{\"text1\":\"this is test 1\",\"text2\":\"this is test 2!\",\"clean\":\"false\"}") .asJson();
161 |
162 |
163 | * 'text1' and 'text2' are the two texts that you want to compute similarity over; both are **mandatory**.
164 | * 'clean' indicates whether you want your text to be cleaned up prior to computing text similarity; it is **optional**.
165 | * A Content-Type of application/json is **mandatory** to indicate the type of request being sent.
166 | * X-Mashape-Key is **mandatory**; it is the key that gives you access to the API. A simple wrapper for the Text Similarity API in Java using HttpURLConnection is available in this repository (java/TextSimilaritySimpleWrapper).
167 |
168 | * * *
169 |
170 | ## Response
171 |
172 |

173 | Text Similarity returns a JSON response containing the Cosine, Jaccard and Dice similarity scores along with the average of these three scores. Here are two example requests and their responses: 174 |

175 | 176 | **Request:** 177 | 178 |
{
179 |  "text1":"this is test 2",
180 |  "text2":"this is test 2!", 
181 |  "clean":"true"
182 | }
183 | 
184 | 185 | **Response:** 186 | 187 |
{
188 |  "cosine": "1.000",
189 |  "jaccard": "1.000",
190 |  "dice": "1.000",
191 |  "average": "1.000"
192 |  }
193 | 194 | **Request:** 195 | 196 |
{
197 | "text1":"this is test 2",
198 | "text2":"this is test 2!", 
199 | "clean":"false"
200 | }
201 | 202 | **Response:** 203 | 204 |

205 | {
206 |  "cosine": "0.750",
207 |  "jaccard": "0.600",
208 |  "dice": "0.750",
209 |  "average": "0.700"
210 | }
211 | 
212 | 213 |

214 | Since you have access to several similarity measures, you can use a single measure consistently or all of them together. You can also use the average score. 215 |
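A minimal sketch of reading the scores out of a response and applying a threshold to one of them is shown below. It assumes the flat response shape shown in this document and uses a regular expression only to stay within the Java standard library; the class and method names are hypothetical, and a real JSON parser is the safer choice in production code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the four scores from a flat similarity response.
final class SimilarityResponseSketch {

    // Matches name/value pairs such as "cosine": "0.750"; the quotes are optional.
    private static final Pattern SCORE = Pattern.compile(
            "\"?(cosine|jaccard|dice|average)\"?\\s*:\\s*\"?([0-9]*\\.?[0-9]+)\"?");

    // Returns the scores found in the response, keyed by measure name.
    static Map<String, Double> parseScores(String json) {
        Map<String, Double> scores = new LinkedHashMap<>();
        Matcher m = SCORE.matcher(json);
        while (m.find()) {
            scores.put(m.group(1), Double.parseDouble(m.group(2)));
        }
        return scores;
    }

    // True when the chosen measure is present and reaches the threshold.
    static boolean isSimilar(Map<String, Double> scores, String measure, double threshold) {
        Double score = scores.get(measure);
        return score != null && score >= threshold;
    }

    public static void main(String[] args) {
        String json = "{\"cosine\": \"0.750\", \"jaccard\": \"0.600\", "
                + "\"dice\": \"0.750\", \"average\":\"0.700\"}";
        Map<String, Double> scores = parseScores(json);
        // prints: true
        System.out.println(isSimilar(scores, "average", 0.5));
    }
}
```

Keeping the measure name as a parameter makes it easy to compare how each measure behaves on your own data before settling on one.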

216 | 217 | * * * 218 | 219 | # Which Similarity Measure to Use? 220 | 221 |

222 | If you have very short texts and want a strict measure that gives high scores only to phrases that are very similar, Jaccard is a good choice. However, if your text is more than 5 words long, Cosine or Dice may be more appropriate, since these measures tend not to over-penalize non-overlapping terms. You can also average all three scores. In any case, please experiment before you decide which measure(s) to use. 223 |

224 | 225 | * * * 226 | 227 | # Improving Similarity Measures 228 | 229 |

230 | There are several ways to improve similarity scores, that is, to find more overlap between texts that are in fact similar. Here are some ideas for making the similarity measures more reliable: 231 |

232 | 233 | * Stem the text units before computing similarity 234 | * Remove determiners (e.g. the, an, a) [see list] 235 | * Remove stop words [full stop word list] [minimal stop list] [stop words in other languages] 236 | 237 | * * * 238 | 239 | # Languages Supported 240 | 241 | Text Similarity is language-neutral and would thus work for all languages. 242 | 243 | [1]: https://www.mashape.com/rxnlp/text-mining-and-nlp?&utm_campaign=mashape5-embed&utm_medium=button&utm_source=text-mining-and-nlp&utm_content=anchorlink&utm_term=icon-light 244 | [2]: http://unirest.io/ 245 | --------------------------------------------------------------------------------