");
24 |
25 | String str1="كتب عربية قصيرة : قصص قصيرة، مختصرات ، الخ";
26 | String str2="عربية قصيرة : قصص قصيرة، مختصرات ، الخ";
27 |
28 | //Then set input
29 | String input = "{\"text1\":\""+str1
30 | +"\",\"text2\":\""+str2
31 | +"\",\"clean\":\"true\"}";
32 |
33 | //Next, send the JSON request body (UTF-8 encoded, as the API requires)
34 | OutputStream outputStream = httpConnection.getOutputStream();
35 | outputStream.write(input.getBytes("UTF-8"));
36 | outputStream.flush();
37 |
38 | //Throw exception on error
39 | if (httpConnection.getResponseCode() != 200) {
40 | throw new RuntimeException("Failed : HTTP error code : "
41 | + httpConnection.getResponseCode());
42 | }
43 |
44 | BufferedReader responseBuffer = new BufferedReader(new InputStreamReader(httpConnection.getInputStream(), "UTF-8"));
45 |
46 | //Print the server response (you can use a JSON parser here instead)
47 | String output;
48 | System.out.println("Output from Server:\n");
49 | while ((output = responseBuffer.readLine()) != null) {
50 | System.out.println(output);
51 | }
52 |
53 | //disconnect from server
54 | httpConnection.disconnect();
55 | } catch (Exception e) {
56 | // TODO Auto-generated catch block
57 | e.printStackTrace();
58 | }
59 |
60 |
61 | }
62 |
63 | }
64 |
--------------------------------------------------------------------------------
/minimal-stop-word-list.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RxNLP/nlp-cloud-apis/4871cef31996f56cc2f8e03bd955f56b1a1361dc/minimal-stop-word-list.txt
--------------------------------------------------------------------------------
/terrier-stop-word-list.txt:
--------------------------------------------------------------------------------
1 | x
2 | y
3 | your
4 | yours
5 | yourself
6 | yourselves
7 | you
8 | yond
9 | yonder
10 | yon
11 | ye
12 | yet
13 | z
14 | zillion
15 | j
16 | u
17 | umpteen
18 | usually
19 | us
20 | username
21 | uponed
22 | upons
23 | uponing
24 | upon
25 | ups
26 | upping
27 | upped
28 | up
29 | unto
30 | until
31 | unless
32 | unlike
33 | unliker
34 | unlikest
35 | under
36 | underneath
37 | use
38 | used
39 | usedest
40 | r
41 | rath
42 | rather
43 | rathest
44 | rathe
45 | re
46 | relate
47 | related
48 | relatively
49 | regarding
50 | really
51 | res
52 | respecting
53 | respectively
54 | q
55 | quite
56 | que
57 | qua
58 | n
59 | neither
60 | neaths
61 | neath
62 | nethe
63 | nethermost
64 | necessary
65 | necessariest
66 | necessarier
67 | never
68 | nevertheless
69 | nigh
70 | nighest
71 | nigher
72 | nine
73 | noone
74 | nobody
75 | nobodies
76 | nowhere
77 | nowheres
78 | no
79 | noes
80 | nor
81 | nos
82 | no-one
83 | none
84 | not
85 | notwithstanding
86 | nothings
87 | nothing
88 | nathless
89 | natheless
90 | t
91 | ten
92 | tills
93 | till
94 | tilled
95 | tilling
96 | to
97 | towards
98 | toward
99 | towardest
100 | towarder
101 | together
102 | too
103 | thy
104 | thyself
105 | thus
106 | than
107 | that
108 | those
109 | thou
110 | though
111 | thous
112 | thouses
113 | thoroughest
114 | thorougher
115 | thorough
116 | thoroughly
117 | thru
118 | thruer
119 | thruest
120 | thro
121 | through
122 | throughout
123 | throughest
124 | througher
125 | thine
126 | this
127 | thises
128 | they
129 | thee
130 | the
131 | then
132 | thence
133 | thenest
134 | thener
135 | them
136 | themselves
137 | these
138 | therer
139 | there
140 | thereby
141 | therest
142 | thereafter
143 | therein
144 | thereupon
145 | therefore
146 | their
147 | theirs
148 | thing
149 | things
150 | three
151 | two
152 | o
153 | oh
154 | owt
155 | owning
156 | owned
157 | own
158 | owns
159 | others
160 | other
161 | otherwise
162 | otherwisest
163 | otherwiser
164 | of
165 | often
166 | oftener
167 | oftenest
168 | off
169 | offs
170 | offest
171 | one
172 | ought
173 | oughts
174 | our
175 | ours
176 | ourselves
177 | ourself
178 | out
179 | outest
180 | outed
181 | outwith
182 | outs
183 | outside
184 | over
185 | overallest
186 | overaller
187 | overalls
188 | overall
189 | overs
190 | or
191 | orer
192 | orest
193 | on
194 | oneself
195 | onest
196 | ons
197 | onto
198 | a
199 | atween
200 | at
201 | athwart
202 | atop
203 | afore
204 | afterward
205 | afterwards
206 | after
207 | afterest
208 | afterer
209 | ain
210 | an
211 | any
212 | anything
213 | anybody
214 | anyone
215 | anyhow
216 | anywhere
217 | anent
218 | anear
219 | and
220 | andor
221 | another
222 | around
223 | ares
224 | are
225 | aest
226 | aer
227 | against
228 | again
229 | accordingly
230 | abaft
231 | abafter
232 | abaftest
233 | abovest
234 | above
235 | abover
236 | abouter
237 | aboutest
238 | about
239 | aid
240 | amidst
241 | amid
242 | among
243 | amongst
244 | apartest
245 | aparter
246 | apart
247 | appeared
248 | appears
249 | appear
250 | appearing
251 | appropriating
252 | appropriate
253 | appropriatest
254 | appropriates
255 | appropriater
256 | appropriated
257 | already
258 | always
259 | also
260 | along
261 | alongside
262 | although
263 | almost
264 | all
265 | allest
266 | aller
267 | allyou
268 | alls
269 | albeit
270 | awfully
271 | as
272 | aside
273 | asides
274 | aslant
275 | ases
276 | astrider
277 | astride
278 | astridest
279 | astraddlest
280 | astraddler
281 | astraddle
282 | availablest
283 | availabler
284 | available
285 | aughts
286 | aught
287 | vs
288 | v
289 | variousest
290 | variouser
291 | various
292 | via
293 | vis-a-vis
294 | vis-a-viser
295 | vis-a-visest
296 | viz
297 | very
298 | veriest
299 | verier
300 | versus
301 | k
302 | g
303 | go
304 | gone
305 | good
306 | got
307 | gotta
308 | gotten
309 | get
310 | gets
311 | getting
312 | b
313 | by
314 | byandby
315 | by-and-by
316 | bist
317 | both
318 | but
319 | buts
320 | be
321 | beyond
322 | because
323 | became
324 | becomes
325 | become
326 | becoming
327 | becomings
328 | becominger
329 | becomingest
330 | behind
331 | behinds
332 | before
333 | beforehand
334 | beforehandest
335 | beforehander
336 | bettered
337 | betters
338 | better
339 | bettering
340 | betwixt
341 | between
342 | beneath
343 | been
344 | below
345 | besides
346 | beside
347 | m
348 | my
349 | myself
350 | mucher
351 | muchest
352 | much
353 | must
354 | musts
355 | musths
356 | musth
357 | main
358 | make
359 | mayest
360 | many
361 | mauger
362 | maugre
363 | me
364 | meanwhiles
365 | meanwhile
366 | mostly
367 | most
368 | moreover
369 | more
370 | might
371 | mights
372 | midst
373 | midsts
374 | h
375 | huh
376 | humph
377 | he
378 | hers
379 | herself
380 | her
381 | hereby
382 | herein
383 | hereafters
384 | hereafter
385 | hereupon
386 | hence
387 | hadst
388 | had
389 | having
390 | haves
391 | have
392 | has
393 | hast
394 | hardly
395 | hae
396 | hath
397 | him
398 | himself
399 | hither
400 | hitherest
401 | hitherer
402 | his
403 | how-do-you-do
404 | however
405 | how
406 | howbeit
407 | howdoyoudo
408 | hoos
409 | hoo
410 | w
411 | woulded
412 | woulding
413 | would
414 | woulds
415 | was
416 | wast
417 | we
418 | wert
419 | were
420 | with
421 | withal
422 | without
423 | within
424 | why
425 | what
426 | whatever
427 | whateverer
428 | whateverest
429 | whatsoeverer
430 | whatsoeverest
431 | whatsoever
432 | whence
433 | whencesoever
434 | whenever
435 | whensoever
436 | when
437 | whenas
438 | whether
439 | wheen
440 | whereto
441 | whereupon
442 | wherever
443 | whereon
444 | whereof
445 | where
446 | whereby
447 | wherewithal
448 | wherewith
449 | whereinto
450 | wherein
451 | whereafter
452 | whereas
453 | wheresoever
454 | wherefrom
455 | which
456 | whichever
457 | whichsoever
458 | whilst
459 | while
460 | whiles
461 | whithersoever
462 | whither
463 | whoever
464 | whosoever
465 | whoso
466 | whose
467 | whomever
468 | s
469 | syne
470 | syn
471 | shalling
472 | shall
473 | shalled
474 | shalls
475 | shoulding
476 | should
477 | shoulded
478 | shoulds
479 | she
480 | sayyid
481 | sayid
482 | said
483 | saider
484 | saidest
485 | same
486 | samest
487 | sames
488 | samer
489 | saved
490 | sans
491 | sanses
492 | sanserifs
493 | sanserif
494 | so
495 | soer
496 | soest
497 | sobeit
498 | someone
499 | somebody
500 | somehow
501 | some
502 | somewhere
503 | somewhat
504 | something
505 | sometimest
506 | sometimes
507 | sometimer
508 | sometime
509 | several
510 | severaler
511 | severalest
512 | serious
513 | seriousest
514 | seriouser
515 | senza
516 | send
517 | sent
518 | seem
519 | seems
520 | seemed
521 | seemingest
522 | seeminger
523 | seemings
524 | seven
525 | summat
526 | sups
527 | sup
528 | supping
529 | supped
530 | such
531 | since
532 | sine
533 | sines
534 | sith
535 | six
536 | stop
537 | stopped
538 | p
539 | plaintiff
540 | plenty
541 | plenties
542 | please
543 | pleased
544 | pleases
545 | per
546 | perhaps
547 | particulars
548 | particularly
549 | particular
550 | particularest
551 | particularer
552 | pro
553 | providing
554 | provides
555 | provided
556 | provide
557 | probably
558 | l
559 | layabout
560 | layabouts
561 | latter
562 | latterest
563 | latterer
564 | latterly
565 | latters
566 | lots
567 | lotting
568 | lotted
569 | lot
570 | lest
571 | less
572 | ie
573 | ifs
574 | if
575 | i
576 | info
577 | information
578 | itself
579 | its
580 | it
581 | is
582 | idem
583 | idemer
584 | idemest
585 | immediate
586 | immediately
587 | immediatest
588 | immediater
589 | in
590 | inwards
591 | inwardest
592 | inwarder
593 | inward
594 | inasmuch
595 | into
596 | instead
597 | insofar
598 | indicates
599 | indicated
600 | indicate
601 | indicating
602 | indeed
603 | inc
604 | f
605 | fact
606 | facts
607 | fs
608 | figupon
609 | figupons
610 | figuponing
611 | figuponed
612 | few
613 | fewer
614 | fewest
615 | frae
616 | from
617 | failing
618 | failings
619 | five
620 | furthers
621 | furtherer
622 | furthered
623 | furtherest
624 | further
625 | furthering
626 | furthermore
627 | fourscore
628 | followthrough
629 | for
630 | forwhy
631 | fornenst
632 | formerly
633 | former
634 | formerer
635 | formerest
636 | formers
637 | forbye
638 | forby
639 | fore
640 | forever
641 | forer
642 | fores
643 | four
644 | d
645 | ddays
646 | dday
647 | do
648 | doing
649 | doings
650 | doe
651 | does
652 | doth
653 | downwarder
654 | downwardest
655 | downward
656 | downwards
657 | downs
658 | done
659 | doner
660 | dones
661 | donest
662 | dos
663 | dost
664 | did
665 | differentest
666 | differenter
667 | different
668 | describing
669 | describe
670 | describes
671 | described
672 | despiting
673 | despites
674 | despited
675 | despite
676 | during
677 | c
678 | cum
679 | circa
680 | chez
681 | cer
682 | certain
683 | certainest
684 | certainer
685 | cest
686 | canst
687 | cannot
688 | cant
689 | cants
690 | canting
691 | cantest
692 | canted
693 | co
694 | could
695 | couldst
696 | comeon
697 | comeons
698 | come-ons
699 | come-on
700 | concerning
701 | concerninger
702 | concerningest
703 | consequently
704 | considering
705 | e
706 | eg
707 | eight
708 | either
709 | even
710 | evens
711 | evenser
712 | evensest
713 | evened
714 | evenest
715 | ever
716 | everyone
717 | everything
718 | everybody
719 | everywhere
720 | every
721 | ere
722 | each
723 | et
724 | etc
725 | elsewhere
726 | else
727 | ex
728 | excepted
729 | excepts
730 | except
731 | excepting
732 | exes
733 | enough
--------------------------------------------------------------------------------
/text-similarity-rest-reference.md:
--------------------------------------------------------------------------------
1 | # RxNLP Text Similarity API REST Reference
2 |
3 | ## What is Text Similarity API?
4 |
5 |
6 |
7 |
8 | The Text Similarity API computes surface similarity between two pieces of text (long or short) using three well-known measures: Jaccard, Dice and Cosine. Determining the similarity between texts is crucial to many applications, such as clustering, duplicate removal, merging similar topics or themes, and text retrieval. Let's say we have the following two product listings on eBay:
9 |
10 |
11 |
12 | "text1": "iphone 4s black new",
13 | "text2": "iphone 4s black old"
14 |
15 |
16 |
17 | How can you tell that these two listings are almost the same? You can use text similarity measures for this. The results from the Text Similarity API show how close these two texts are according to each measure:
18 |
19 |
20 | {
21 | "cosine": "0.750",
22 | "jaccard": "0.600",
23 | "dice": "0.750",
24 | "average":"0.700"
25 | }
26 |
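The scores above can be reproduced by hand. The sketch below is not the API's implementation; it assumes each text is split on whitespace into a set of distinct tokens and applies the usual set-based forms of the three measures. The class name `SetSimilaritySketch` and its methods are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the RxNLP API: set-based similarity measures.
class SetSimilaritySketch {

    // Split on whitespace and keep the distinct tokens.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
    }

    // Number of tokens the two sets share.
    static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // |A and B| / |A or B|
    static double jaccard(Set<String> a, Set<String> b) {
        int common = overlap(a, b);
        return (double) common / (a.size() + b.size() - common);
    }

    // 2 * |A and B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        return 2.0 * overlap(a, b) / (a.size() + b.size());
    }

    // |A and B| / sqrt(|A| * |B|), i.e. cosine over binary term vectors
    static double cosine(Set<String> a, Set<String> b) {
        return overlap(a, b) / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        Set<String> a = tokens("iphone 4s black new");
        Set<String> b = tokens("iphone 4s black old");
        double c = cosine(a, b);
        double j = jaccard(a, b);
        double d = dice(a, b);
        // prints: cosine=0.750 jaccard=0.600 dice=0.750 average=0.700
        System.out.printf(Locale.ROOT,
                "cosine=%.3f jaccard=%.3f dice=%.3f average=%.3f%n",
                c, j, d, (c + j + d) / 3);
    }
}
```

The two listings share three of their four tokens, which gives 3/5 for Jaccard and 3/4 for both Dice and Cosine.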
27 |
28 |
29 | In text mining applications, you can set a similarity threshold heuristically: if the similarity score between two pieces of text is greater than some value, say 0.5, you treat the two units as similar. The right threshold depends on the needs of the application. Here are some recommendations:
30 |
31 |
32 | - For strict similarity, use a threshold of 0.5 or above
33 | - For more liberal similarity, use a threshold lower than 0.5
34 | - In some cases, you can avoid thresholds by ranking texts by similarity scores and using only the top N most similar texts.
35 |
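The two strategies in the list above, a fixed threshold and a top-N ranking, can be sketched as follows. This is a minimal illustration, assuming the similarity scores have already been collected into a map from candidate text to score; the class `ThresholdSketch`, its methods and the scores in `main` are made up for the example and do not come from the API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: choose similar texts by threshold or by rank.
class ThresholdSketch {

    // Keep the candidates whose score is at least the threshold.
    static List<String> aboveThreshold(Map<String, Double> scores, double threshold) {
        return scores.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Keep the n highest-scoring candidates, best first.
    static List<String> topN(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative scores only; in practice these would come from the API.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("listing-a", 0.70);
        scores.put("listing-b", 0.35);
        scores.put("listing-c", 0.55);

        System.out.println(aboveThreshold(scores, 0.5)); // [listing-a, listing-c]
        System.out.println(topN(scores, 1));             // [listing-a]
    }
}
```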
36 | * * *
37 |
38 | # Integrate Text Similarity with Code
39 |
40 | To use this API, you essentially set three parameters:
41 |
42 | * **text1**: your first unit of text or text tokens
43 | * **text2**: your second unit of text or text tokens
44 | * **clean**: whether to clean your text before the similarity computation
45 |
46 |
47 |
48 | You can submit fairly lengthy units of text (e.g. two plain text documents), but the maximum payload size is 1MB per request. The text that you provide can be plain words, words with part-of-speech (POS) annotations (e.g. the/dt cow/nn jumps/vb) or combined tokens such as n-grams (e.g. this_cat cat_is is_cute).
49 |
50 |
51 |
52 | * * *
53 |
54 | ## First Steps: Get your API Key
55 |
56 |
57 | Before you start, please ensure that you have a valid API key.
58 |
59 |
60 | * * *
61 |
62 | ## Request
63 |
64 | The TextSimilarity endpoint accepts a **JSON request via POST.** It takes in 3 parameters:
65 |
66 |
67 |
68 |
69 |
70 | | Parameter name | Type | Required? | Description |
71 | |---|---|---|---|
72 | | text1 | text | Yes | First text |
73 | | text2 | text | Yes | Second text |
74 | | clean | text | No (Default=true) | Lowercase, remove punctuation and numbers? |
140 |
141 |
142 |
143 |
144 | Points to note:
145 |
146 | - There is **no maximum length** for the text, but the payload is limited to 1MB per request.
147 | - The text can be in **any language**, in any of these forms:
148 |   - plain text (e.g. *the cow jumps over the moon*)
149 |   - text with POS annotations (e.g. *the/dt cow/nn jumps/vb*)
150 |   - manipulated text such as n-grams (e.g. *this_cat cat_is is_cute*)
151 | - Since this is a JSON request, your text has to be properly escaped and encoded in UTF-8.
152 |
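The escaping and encoding requirement in the last point can be met without a JSON library. The sketch below is one way to do it, assuming the three request fields described above; `JsonBodySketch`, `escapeJson` and `buildBody` are hypothetical names, and a real JSON library would be a sturdier choice in production code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: build a properly escaped, UTF-8 encoded request body.
class JsonBodySketch {

    // Escape the characters that JSON does not allow unescaped inside a string.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become four-digit hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Assemble the request JSON and encode it as UTF-8 bytes.
    static byte[] buildBody(String text1, String text2, boolean clean) {
        String json = "{\"text1\":\"" + escapeJson(text1)
                + "\",\"text2\":\"" + escapeJson(text2)
                + "\",\"clean\":\"" + clean + "\"}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] body = buildBody("say \"hi\"", "قصص قصيرة", true);
        System.out.println(new String(body, StandardCharsets.UTF_8));
    }
}
```

Non-ASCII text such as Arabic passes through unchanged; only quotes, backslashes and control characters are rewritten.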
153 | Requests can be sent from any programming language, as long as they follow the expected JSON format. The [unirest][2] library handles HTTP requests and responses in several languages, including Java, Python, Ruby, Node.js, PHP and more. Here is an example using the Java Unirest library:
154 |
155 | // These code snippets use an open-source library. http://unirest.io/java
156 | HttpResponse<JsonNode> response = Unirest.post("https://rxnlp-core.p.mashape.com/computeSimilarity")
157 | .header("X-Mashape-Key", "")
158 | .header("Content-Type", "application/json")
159 | .header("Accept", "application/json")
160 | .body("{\"text1\":\"this is test 1\",\"text2\":\"this is test 2!\",\"clean\":\"false\"}").asJson();
161 |
162 |
163 | * 'text1' and 'text2' are the two texts that you want to compare; both are **mandatory**.
164 | * 'clean' indicates whether your text should be cleaned up before the similarity is computed; it is **optional**.
165 | * The Content-Type header with application/json is **mandatory**; it indicates the type of request being sent.
166 | * X-Mashape-Key is **mandatory**; it is the key that gives you access to the API.
167 |
168 | A simple wrapper for the Text Similarity API in Java, using HttpURLConnection, is also available.
169 | * * *
170 | ## Response
171 |
172 |
173 | Text Similarity returns a JSON response. It returns the Cosine, Jaccard and Dice similarity scores along with the average based on these 3 scores. Here is an example request and response output:
174 |
175 |
176 | **Request:**
177 |
178 | {
179 | "text1":"this is test 2",
180 | "text2":"this is test 2!",
181 | "clean":"true"
182 | }
183 |
184 |
185 | **Response:**
186 |
187 | {
188 | "cosine": "1.000",
189 | "jaccard": "1.000",
190 | "dice": "1.000",
191 | "average": "1.000"
192 | }
193 |
194 | **Request:**
195 |
196 | {
197 | "text1":"this is test 2",
198 | "text2":"this is test 2!",
199 | "clean":"false"
200 | }
201 |
202 | **Response:**
203 |
204 |
205 | {
206 | "cosine": "0.750",
207 | "jaccard": "0.600",
208 | "dice": "0.750",
209 | "average": "0.700"
210 | }
211 |
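The difference between the two responses comes from the cleaning step. The sketch below approximates it locally, assuming that cleaning means lowercasing and then removing punctuation and digits, as the parameter table describes; the exact server-side rules are not documented here, so treat `CleanSketch` and its methods as hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: shows how cleaning changes token overlap.
class CleanSketch {

    // Approximation of clean=true: lowercase, then drop punctuation and digits.
    static String clean(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[\\p{P}\\p{N}]", "").trim();
    }

    // Split on whitespace and keep the distinct tokens.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
    }

    // Jaccard similarity, with or without the cleaning step.
    static double jaccard(String text1, String text2, boolean applyClean) {
        Set<String> a = tokens(applyClean ? clean(text1) : text1);
        Set<String> b = tokens(applyClean ? clean(text2) : text2);
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / (a.size() + b.size() - common.size());
    }

    public static void main(String[] args) {
        String t1 = "this is test 2";
        String t2 = "this is test 2!";
        // prints: clean=true  jaccard=1.000
        System.out.printf(Locale.ROOT, "clean=true  jaccard=%.3f%n", jaccard(t1, t2, true));
        // prints: clean=false jaccard=0.600
        System.out.printf(Locale.ROOT, "clean=false jaccard=%.3f%n", jaccard(t1, t2, false));
    }
}
```

With cleaning, both texts reduce to the tokens this, is and test. Without it, "2" and "2!" count as different tokens, so only three of five distinct tokens overlap.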
212 |
213 |
214 | Since you have access to several similarity measures, you can rely on a single measure throughout, combine all of them, or use the average score.
215 |
216 |
217 | * * *
218 |
219 | # Which Similarity Measure to Use?
220 |
221 |
222 | If you have very short texts and want a strict measure that gives high scores only to very similar phrases, Jaccard would be ideal. However, if your texts are more than 5 words long, Cosine or Dice may be more appropriate, since these measures tend not to over-penalize non-overlapping terms. You can also average all three scores. In any case, please experiment before deciding which measure(s) to use.
223 |
224 |
225 | * * *
226 |
227 | # Improving Similarity Measures
228 |
229 |
230 | There are several ways to improve similarity scores (that is, to find more overlap between texts). Here are some ideas for making the similarity measures more reliable:
231 |
232 |
233 | * Stem the text units before computing similarity
234 | * Remove determiners (e.g. the, an, a) [see list]
235 | * Remove stop words [full stop word list] [minimal stop list] [stop words in other languages]
236 |
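As an illustration of the stop-word idea above, the sketch below filters tokens against a small stop-word set before they would be sent for comparison. The set is a handful of entries that appear in terrier-stop-word-list.txt; the class `StopWordSketch` and its method are hypothetical, and a real application would more likely load one of the full lists from file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: remove stop words before computing similarity.
class StopWordSketch {

    // A few entries for illustration; a full list would normally be loaded from a file.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "on", "of"));

    // Lowercase, split on whitespace and drop every token found in the stop list.
    static List<String> removeStopWords(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).trim().split("\\s+"))
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("The cat is on the mat")); // [cat, mat]
        System.out.println(removeStopWords("A cat on a mat"));        // [cat, mat]
    }
}
```

After filtering, the two example sentences reduce to the same tokens, so any of the three measures would rate them as identical.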
237 | * * *
238 |
239 | # Languages Supported
240 |
241 | Text Similarity is language-neutral and would thus work for all languages.
242 |
243 | [1]: https://www.mashape.com/rxnlp/text-mining-and-nlp?&utm_campaign=mashape5-embed&utm_medium=button&utm_source=text-mining-and-nlp&utm_content=anchorlink&utm_term=icon-light
244 | [2]: http://unirest.io/
245 |
--------------------------------------------------------------------------------