├── .gitignore ├── LICENSE.rst ├── MANIFEST.in ├── README.md ├── docs ├── algorithm.md └── api.md ├── setup.cfg ├── setup.py └── sounder ├── __init__.py └── sounder.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | 91 | #pycharm 92 | .idea/ -------------------------------------------------------------------------------- /LICENSE.rst: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) [2017] [Ujjwal Gupta] 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
-------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include LICENSE.rst README.md -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Sounder API 2 | 3 | This section is dedicated to the Sounder Library API, which is an abstraction of the [Sounder Algorithm](https://slapbot.github.io/documentation/resources/algorithm). To read the full paper explaining how Sounder works, how it can be incorporated into a project, and where it can be used, refer to [Sounder Explained](https://slapbot.github.io/documentation/resources/algorithm) or the [PDF version](https://slapbot.github.io/documentation/resources/algorithm/sounder.pdf). 4 | 5 | - [Installation](#installation) 6 | - [Instantiate Class](#instantiate) 7 | - [Search Method](#search) 8 | - [Probability Method](#probability) 9 | - [Filter Method](#filter) 10 | - [Practical Usage](#practical-usage) 11 | 12 |
13 | 14 | 15 | 16 | ## Installation 17 | 18 | Installing the Sounder library into your application is as easy as pie with the `pip` package manager. Run the following command from your favorite command line: 19 | 20 | pip install sounder 21 | 22 |
23 | 24 |
25 | 26 | ## Instantiate Class 27 | 28 | The first and foremost thing to do is to import the class, like so. 29 | 30 | from sounder import Sounder 31 | 32 | And then simply instantiate the class. 33 | 34 | sounder = Sounder([['facebook', 'notifications'], ['twitter', 'notifications']]) 35 | 36 | You can pass the dataset as an optional positional argument to the Sounder constructor, or set it later down the line using the 37 | `set_dataset()` method, which returns self. 38 | 39 | sounder.set_dataset([['facebook', 'notifications'], ['twitter', 'notifications']]) 40 | 41 | As you can already notice, in order to use the `search` method, the `dataset` needs to be a `2-dimensional list` containing string elements. 42 | 43 |
44 | 45 |
46 | 47 | ## Search Method 48 | 49 | The `search(query, dataset=None, metaphone=False)` method takes one compulsory positional argument: a query, which needs to be a list of strings to be searched through the dataset, like so. 50 | 51 | sounder = Sounder([['facebook', 'notifications'], ['twitter', 'notifications'], ['note', 'something']]) 52 | index = sounder.search(['trackbook', 'notifs']) 53 | 54 | The `search` method always returns the index of the entry it found most probable to be identical to your given query. In this case, index will equate to 0. 55 | 56 | This method takes other optional arguments as follows: 57 | 58 | - **dataset :** Simply the dataset. In case you didn't set it while instantiating the class, no problem, just pass it as another argument. Again, it needs to be a double-dimensional list. 59 | 60 | - **metaphone :** Defaults to False, meaning metaphones are not used in addition to the master algorithm. When True, the dataset and query are first transformed to metaphones and then fed to the algorithm, increasing efficiency in cases where the input data is quite randomized or uses generic terms (see the example below). 61 | 62 |
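For instance, a query with misspelled words can be matched against a dataset passed directly to the call. This is a minimal sketch assuming the package is installed and following the signatures documented above; the expected index is a guess based on similarity, not verified output:

    from sounder import Sounder

    sounder = Sounder()
    # pass the dataset at call time and enable the optional metaphone pass
    index = sounder.search(['twiter', 'notifs'],
                           dataset=[['facebook', 'notifications'], ['twitter', 'notifications']],
                           metaphone=True)
    print(index)  # expected to be 1, i.e. the ['twitter', 'notifications'] entry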
63 | 64 |
65 | 66 | ## Probability Method 67 | 68 | The `probability(query, dataset=None, metaphone=False, detailed=False, prediction=False)` method again takes a single positional argument: the query (a list of strings) that needs to be compared with the dataset, like so. 69 | 70 | sounder = Sounder([['facebook', 'notifications'], ['twitter', 'notifications'], ['note', 'something']]) 71 | chances = sounder.probability(['trackbook', 'notifs']) 72 | 73 | The `probability` method returns a result depending on the optional parameters, as follows (a sketch of the return shapes follows this list): 74 | 75 | - **No optional argument passed :** It returns a list the size of the dataset, composed of the probability that the query matches each entry of the dataset, expressed as a value between 0.0 and 100.0, where 0.0 means nothing matches and 100.0 means everything matches. 76 | 77 | - **detailed :** If set to True, it returns a list the size of the dataset in a nested format, where the first element is the probability that the query matches that entry, and the second element is another list the size of the ith entry, consisting of the probability that the jth word of that entry was found in the query (computed by solving the assignment problem), again as a value between 0.0 and 100.0, where 0.0 means nothing matches. 78 | 79 | - **prediction :** If set to True, it returns a dict with keys `chances` and `index`, where `index` is the index of the dataset entry most similar to the given query and `chances` is a value between 0.0 and 100.0, where 0.0 means nothing matches. 80 | 81 | Two other arguments that can be set are: 82 | 83 | - **dataset :** Again, in case you didn't set the dataset on instantiation, fear not, just pass it as an argument. One more thing: this time it doesn't necessarily need to be a double-dimensional list if you're just comparing two lists of string elements, like so. 84 | 85 | information = sounder.probability(['trackbook'], dataset=['facebook']) 86 | 87 | Sounder internally maps it into a double-dimensional list automatically, giving you the leverage to compare any two lists of words. 88 | 89 | - **metaphone :** Again, it's exactly the same as for the search method. 90 | 91 |
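To make the shapes concrete, here is a small sketch of the three return forms; the actual numbers depend on the similarity scores, so they are indicated in comments rather than shown as real output:

    from sounder import Sounder

    sounder = Sounder([['facebook', 'notifications'], ['twitter', 'notifications'], ['note', 'something']])

    # plain call: one score (0.0 - 100.0) per dataset entry, e.g. [score_0, score_1, score_2]
    chances = sounder.probability(['trackbook', 'notifs'])

    # detailed=True: one [entry_score, [per_word_scores]] pair per dataset entry
    detailed = sounder.probability(['trackbook', 'notifs'], detailed=True)

    # prediction=True: best guess only, as {'chances': score, 'index': best_index}
    prediction = sounder.probability(['trackbook', 'notifs'], prediction=True)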
92 | 93 |
94 | 95 | ## Filter Method 96 | 97 | `filter(query, reserved_sub_words=None)` is basically a utility provided to you to filter the stop words out of your string. For instance, `"Hey Stephanie, what is the time right now?"` would filter away `['hey', 'what', 'is', 'the']`, since they don't hold much meaning, leaving behind key_words like `['stephanie', 'time', 'right', 'now']`. 98 | 99 | This method is just a utility to help you do the entire intent recognition from a single library, but you're free to use any kind of system. It returns a dictionary with the keys `sub_words` and `key_words`, corresponding to the stop words found in the string and the keywords found in it, each in list form. 100 | 101 | - **reserved_sub_words :** The filter that is used to filter out the stop words. You can pass your own filter in the method itself or by using the `set_filter(reserved_sub_words)` method, which returns the self instance. **Note :** make sure the filter is a set of all the words that you consider as stop words. The default is as follows: 102 | 103 | { 104 | "what", "where", "which", "how", "when", "who", 105 | "is", "are", "makes", "made", "make", "did", "do", 106 | "to", "the", "of", "from", "against", "and", "or", 107 | "you", "me", "we", "us", "your", "my", "mine", 'yours', 108 | "could", "would", "may", "might", "let", "possibly", 109 | 'tell', "give", "told", "gave", "know", "knew", 110 | 'a', 'am', 'an', 'i', 'like', 'has', 'have', 'need', 111 | 'will', 'be', "this", 'that', "for" 112 | } 113 | 114 |
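As a quick illustration of the return shape (a sketch; the exact split depends on the stop-word filter in use):

    from sounder import Sounder

    sounder = Sounder()
    result = sounder.filter("what is the time right now")
    sub_words = result['sub_words']   # stop words found in the sentence: 'what', 'is', 'the'
    key_words = result['key_words']   # remaining keywords: 'time', 'right', 'now'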
115 | 116 |
116 | 117 | ## Practical Usage 118 | 119 | This algorithm is the brain of [Stephanie](https://slapbot.github.io), an open-source platform built specifically for voice-controlled applications as well as for automating daily tasks, imitating much of a virtual assistant's work. 120 | -------------------------------------------------------------------------------- /docs/algorithm.md: -------------------------------------------------------------------------------- 1 | # Sounder Algorithm 2 | 3 | - [Introduction](#introduction) 4 | - [Algorithm](#algorithm) 5 | - [Levenshtein Edit Distance](#levenshtein-edit-distance) 6 | - [Munkres Algorithm](#munkres-algorithm) 7 | - [Metaphones](#metaphones) 8 | - [Putting It All Together](#putting-it-all-together) 9 | - [Practical Application](#practical-application) 10 | - [Application Logic](#application-logic) 11 | - [Pseudo Code](#pseudo-code) 12 | - [In Practice](#in-practice) 13 | - [Potential Optimization](#potential-optimization) 14 | - [Library](#library) 15 | - [Conclusion](#conclusion) 16 | 17 |
18 | 19 |
20 | # INTRODUCTION 21 | 22 | While I was tackling a NLP (Natural Language Processing) problem for one of my project "Stephanie", an open-source platform imitating a voice-controlled virtual assistant, it required a specific algorithm to observe a sentence and allocate some 'meaning' to it, which then I created using some neat tricks and few principles such as sub filtering, string metric and maximum weight matching. 23 | 24 |
25 | 26 |
26 | 27 | # ALGORITHM 28 | 29 | The algorithm uses the Levenshtein Edit Distance and the Munkres Assignment Algorithm to tackle problems like string metrics and maximum weight matching. Let's get an overview of each of these algorithms: 30 |
32 | 33 |
33 | 34 | ## LEVENSHTEIN EDIT DISTANCE 35 | 36 | In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965. 37 | 38 | Levenshtein distance may also be referred to as edit distance, although that term may also denote a larger family of distance metrics. It is closely related to pairwise string alignments. 39 | In layman's terms, the Levenshtein algorithm (also called edit distance) calculates the least number of edit operations that are necessary to modify one string to obtain another string. 40 | 41 | 42 | 43 | "=" Match; "o" Substitution; "+" Insertion; "-" Deletion 44 | 45 | It is used to calculate a score denoting the chance that a given word is equal to some other word. 46 | 47 |
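To make the idea concrete, here is a small, self-contained sketch of the classic dynamic-programming edit distance. It is illustrative only; the Sounder library itself scores word similarity with Python's built-in `difflib.SequenceMatcher` ratio rather than this exact function:

    def levenshtein(a, b):
        # dp[i][j] = edits needed to turn a[:i] into b[:j]
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            dp[i][0] = i                                  # delete everything from a
        for j in range(len(b) + 1):
            dp[0][j] = j                                  # insert everything from b
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution or match
        return dp[-1][-1]

    print(levenshtein("notifs", "notifications"))  # number of single-character edits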
48 | 49 |
50 | ## MUNKRES ALGORITHM 51 | 52 | The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time and which anticipated later primal-dual methods. It was developed and published in 1955 by Harold Kuhn. 53 | James Munkres reviewed the algorithm in 1957 and observed that it is (strongly) polynomial. Since then the algorithm has also been known as the Kuhn–Munkres algorithm or the Munkres assignment algorithm. 54 | 55 | Let's use an example to explain it further. There are three workers: Armond, Francine, and Herbert. One of them has to clean the bathroom, another sweep the floors, and the third wash the windows, but they each demand different pay for the various tasks. The problem is to find the lowest-cost way to assign the jobs. The problem can be represented in a matrix of the costs of the workers doing the jobs. For example: 56 |
57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 |
| Name | Clean Bathrooms | Sweep Floors | Wash Windows |
|----------|-----------------|--------------|--------------|
| Armond | $2 | $3 | $3 |
| Francine | $3 | $2 | $3 |
| Herbert | $3 | $3 | $2 |
87 |
88 | 89 | The Hungarian method, when applied to the above table, would give the minimum cost: this is $6, achieved by having Armond clean the bathroom, Francine sweep the floors, and Herbert wash the windows. 90 | Similarly it can be used to compute the maximum cost by doing a small alteration to the cost matrix. The simplest way to do that is to subtract all elements from a large value. 91 | 92 |
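The same worker/job matrix can be handed to the `munkres` package (the assignment solver Sounder itself depends on). A minimal sketch, with the maximum-cost trick noted in a comment:

    from munkres import Munkres

    cost_matrix = [[2, 3, 3],   # Armond:   bathrooms, floors, windows
                   [3, 2, 3],   # Francine
                   [3, 3, 2]]   # Herbert

    m = Munkres()
    indexes = m.compute(cost_matrix)                   # list of (row, column) assignments
    total = sum(cost_matrix[r][c] for r, c in indexes)
    print(indexes, total)                              # optimal assignment, total cost 6

    # For a maximum-weight matching, first subtract every element from a large value,
    # e.g. cost[i][j] = LARGE - profit[i][j], and then call compute() as above.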
93 | 94 |
95 | ## METAPHONES 96 | 97 | Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. 98 | 99 | Similar to Soundex, Metaphone creates the same key for similar-sounding words. It's more accurate than Soundex as it knows the basic rules of English pronunciation. The keys Metaphone generates are of variable length. 100 | The use of Metaphone is to get a more generic form of a text, especially for names: "stephanie" ends up as "STFN", "stefanie" also ends up as "STFN", while "tiffany" would end up as "TFN". 101 | 102 | Metaphones can be used in addition to the algorithms above in case the input data contains a lot of generic or unrevised information. 103 | 104 | 105 | ## PUTTING IT ALL TOGETHER 106 | 107 | The algorithm takes two parameters, a dataset and a query. 108 | 109 | The dataset is a 2D list comprised of words; as a dummy example, this could be a dataset: 110 | 111 | [['twitter', 'notifications'], ['unread', 'emails'], ...] 112 | 113 | And the query will be a list of keywords that need to be searched, for instance: 114 | 115 | ['give', 'twitter', 'notifs'] 116 | 117 | Now we iterate through each row (list) of the 2D list (dataset) and compare it to the query list in the following way: 118 | 119 | - We start with two single-dimensional lists, like ['twitter', 'notifications'] and ['give', ...], for the very first iteration of the dataset in the above example. 120 | - We use a double loop to compare each word of the first list with each word of the other list: 'twitter' with 'give', then 'twitter' with 'twitter', then 'notifications' with 'give', 'notifications' with 'twitter', and so on. 121 | - Each comparison is scored using the Levenshtein distance; this results in a value ranging from 0.0 to 1.0, where 1.0 means both words are the same. 122 | - Now we have a one-to-one comparison of every word present in both lists, so we can create a matrix out of it whose dimensions are the size of the first list x the size of the second list. In this example, 2 x 3, with the Levenshtein-derived values embedded in it. 123 | - This matrix can be used to compute the maximum assignment using the Munkres algorithm, which takes the matrix and returns the indexes of the maximum-scoring assignment. 124 | - The returned result is a vector the size of the number of rows of the cost matrix (the matrix that was sent to the Munkres algorithm), denoting which pairings give the maximum output. 125 | - Now we initialize a list and loop from 0 to the length of the data list (the ith member of the dataset), assigning each word in the data list the value computed by the Munkres algorithm, which denotes the chance that that word of the data is present in the query (for example, the chance that the word 'twitter' is present in the query, then 'notifications', and so on). 126 | - We then take this list and compute its average by summing up all the values (the chances that each given word is present) and dividing by the number of words present in it (its length). 127 | - Finally, we calculate the maximum average across the entire dataset and return its index. (In case two indexes have the same average, we choose the one with the higher sum of values.)
128 | 129 | On a side note, metaphones can be used in addition to the concept above: the entire dataset and query are first converted into their respective metaphones before being sent to this master algorithm, increasing its efficiency quite drastically, especially with generic names ("Stephanie" = "STFN", "Tiffany" = "TFN"). This is actively used in Stephanie to check whether it is being addressed ("Stephanie, ...", or "Hey Stephanie, wake up"), since it works best with small input data. A short sketch of the full pipeline follows. 130 |
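The sketch below strings the pieces together roughly as described above: word-by-word similarity scores, an assignment step, per-entry averaging, and an optional metaphone pass. Like the library itself it leans on `difflib.SequenceMatcher`, the `munkres` package and `doublemetaphone`, but it is a simplified illustration rather than Sounder's exact internals:

    from difflib import SequenceMatcher
    from metaphone import doublemetaphone
    from munkres import Munkres

    def best_match(dataset, query, use_metaphone=False):
        if use_metaphone:
            # swap words for generic phonetic keys, e.g. "stephanie" -> "STFN"
            dataset = [[doublemetaphone(w)[0] for w in row] for row in dataset]
            query = [doublemetaphone(w)[0] for w in query]
        averages = []
        for row in dataset:
            # similarity matrix: rows = dataset words, columns = query words (0.0 - 1.0)
            matrix = [[SequenceMatcher(None, q, d).ratio() for q in query] for d in row]
            # Munkres minimizes cost, so invert similarities to obtain a maximum assignment
            cost = [[1.0 - value for value in r] for r in matrix]
            assignment = Munkres().compute(cost)
            scores = [matrix[r][c] for r, c in assignment]
            averages.append(sum(scores) / len(row))
        # index of the entry whose words best cover the query on average
        return max(range(len(averages)), key=lambda i: averages[i])

    dataset = [['twitter', 'notifications'], ['unread', 'emails']]
    print(best_match(dataset, ['give', 'twitter', 'notifs']))  # expected: 0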
132 | 133 |
133 | 134 | # PRACTICAL APPLICATION 135 | 136 | For starters, "Stephanie" is a virtual assistant inspired by various other products in the market such as "Siri", "Cortana" and "Watson", to name a few. It's written in Python and acts more like a framework of some sort than a core application, where 3rd-party modules can be injected into it using a really simple developer API. 137 | The way Stephanie "thinks" is as follows: when a user speaks to Stephanie through an audio device, it records that voice, saves it as a .wav file in memory, and sends the contents of that file to one of the leading voice recognition services, such as "Google Cloud Speech", "Bing Speech API" or "wit.ai". It then receives a response, usually a string containing the "meaning" of the sentence based on the learning algorithm that service uses, so the end product is a text which is usually what the speaker meant. The goal of Stephanie is then to somehow use that text to learn what the user meant and fulfill their potential needs. 138 |
140 | 141 |
141 | 142 | ## APPLICATION LOGIC 143 | 144 | The application logic of Stephanie works roughly this way: 145 | 146 | - The modules in Stephanie are defined as separate classes. 147 | - Each class inherits from a base class which handles all the dependencies and provides an interface layer to the core functions of the application, such as the .speak("Good evening, Sir"), .listen(), .decipher() and .search() methods, so that developers can interact with the main application quite easily by just calling the above methods. 148 | - Each module defines a set of keywords which, when found in the user's "spoken" text (retrieved from one of the 3rd-party services such as "Google speech" and so on), instantiates that specific module and calls a method on it. 149 | 150 | That method is also defined in the class. So what the algorithm does is take the full text as one of its inputs, while the other input is all the keywords of the different modules in some kind of array format; it then returns the index of the module that it decides is the best guess for the given text. 151 |
153 | 154 |
154 | 155 | ## PSEUDO CODE 156 | 157 | A really simple demo module, in this case a Twitter module which gets some Twitter notifications: 158 |
159 |     class TwitterModule(BaseModule):  # BaseModule handles all the dependencies
160 |         def __init__(self, *args):
161 |             super().__init__(*args)
162 |         def get_notifications(self):
163 |             # use some logic to get notifications by hooking into the Twitter API
164 |             response = "You have %s notifications, namely %s" % (data['count'], data['text'])
165 |             self.assistant.say(response)
166 | 167 | TwitterModule has its keywords assigned as ['twitter', 'notifications'], though to be a little more specific and to help developers write rules, it's written as: 168 | 169 | ["TwitterModule@GetNotifications", ['twitter', 'notifications']] 170 | 171 | - "TwitterModule" is the class name. 172 | - "GetNotifications" is the method used to handle that function; it's converted into snake_case and invoked dynamically (a sketch of this dispatch follows below). 173 | - ['twitter', 'notifications'] are the keywords which co-relate TwitterModule to the intended text. 174 |
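The rule format above pairs a "ClassName@MethodName" handler with its keywords. As an illustration of how such a rule could be dispatched (a hypothetical sketch, not Stephanie's actual implementation; the `modules` registry and the snake_case helper are assumptions), the handler name can be converted and invoked dynamically:

    import re

    def snake_case(name):
        # "GetNotifications" -> "get_notifications"
        return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

    def dispatch(rule, modules, *args):
        # rule looks like ["TwitterModule@GetNotifications", ['twitter', 'notifications']]
        class_name, method_name = rule[0].split('@')
        instance = modules[class_name](*args)   # e.g. modules = {'TwitterModule': TwitterModule}
        return getattr(instance, snake_case(method_name))()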
176 | 177 |
177 | 178 | ## IN PRACTICE 179 | 180 | So now, whenever a user speaks something to Stephanie, it takes the voice, gets the intended text in string format, cleans it a little to get rid of irregularities, and splits it into an array with the delimiter set to " ". This array is then sent to another text-processing step where all the "sub words" (words which hold somewhat lower precedence, such as "a", "an", "the", "I", "you", "what", "which", "has", etc., basically all the articles, tenses and some prepositions) are filtered into another array, which for now is left unused. 181 | So now we have two arrays: one with the keywords spoken by the user, and a 2D array with module information and the keywords co-relating to them. We take the keywords part of that 2D array and pass user_keywords and module_keywords (still a 2D array) to our algorithm, which returns the result, in this case an index value identifying the module whose keywords the user's keywords are most identical to. A compact sketch of this flow is shown below. 182 |
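A minimal sketch of that flow using the published Sounder API (the module table here is a made-up example, and "EmailModule@GetUnread" is hypothetical):

    from sounder import Sounder

    # [handler, keywords] rules, as described in the Pseudo Code section
    modules = [["TwitterModule@GetNotifications", ['twitter', 'notifications']],
               ["EmailModule@GetUnread", ['unread', 'emails']]]

    sounder = Sounder([keywords for _, keywords in modules])

    spoken_text = "give me my twitter notifs"
    user_keywords = sounder.filter(spoken_text)['key_words']   # drop the sub words
    index = sounder.search(user_keywords)                      # best-matching module
    handler = modules[index][0]                                # e.g. "TwitterModule@GetNotifications"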
184 | 185 |
185 | 186 | # POTENTIAL OPTIMIZATION 187 | 188 | - Sub words shouldn't be filtered out completely; instead they can be mapped onto priority scales. For instance, "a", "an", "the" could map to 1, as they don't hold much importance for the meaning of a sentence, while "I", "You", "We" could map to 2 since they are a bit more meaningful, and so on (see the sketch below). 189 | - Deep neural networks can be used alongside this algorithm to predict the intent of a given sentence, given the right datasets and modeling. 190 | - The actual code provided can be optimized with techniques like memoization, or by using numpy arrays instead of built-in lists. 191 |
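A tiny sketch of the first idea, weighting sub words instead of discarding them (the scale values here are arbitrary assumptions):

    # priority scale: higher means the word contributes more to the sentence's meaning
    SUB_WORD_PRIORITY = {'a': 1, 'an': 1, 'the': 1,
                         'i': 2, 'you': 2, 'we': 2}

    def word_weight(word):
        # unknown words are treated as full-weight keywords
        return SUB_WORD_PRIORITY.get(word.lower(), 3)

    weighted = [(w, word_weight(w)) for w in "give me the twitter notifications".split()]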
193 | 194 |
194 | 195 | # Library 196 | 197 | This very algorithm has been implemented in the Python programming language and is completely open source. You can embed it in any form in one of your applications, whether it be open source or commercial. Head to the [API section](/stephanie/documentation/resources/algorithm-api) to get a detailed overview of how to work with the library in numerous ways. 198 |
200 | 201 |
202 | # Conclusion 203 | 204 | The working efficiency of this algorithm is pretty good, and I would highly recommend checking the code provided on GitHub to gain more of the hidden insight and see it work in practice. This search algorithm acts as the primary core of Stephanie, helping it determine the 'meaning' of a sentence and trigger the correct response. I highly recommend checking out Stephanie as well, since it's open source and you can see this phonetic algorithm in a real-life application. 205 | 206 | **P.S.** I am an 18-year-old lad who didn't go to college, and most of my understanding of programming and computer science fundamentals came from active surfing, reading lots of books, and just plainly asking questions and finding their answers. So consider this more of a blog post than a research paper, and kindly go easy on me; if you find any mistake, want to improve the given document, want to talk about it in depth, or merely want to chat for fun, just contact me at ugupta41@gmail.com. 207 | -------------------------------------------------------------------------------- /docs/api.md: -------------------------------------------------------------------------------- 1 | # Sounder API 2 | 3 | This section is dedicated to the [Sounder Library's](github_link) API. 4 | 5 | - [Installation](#installation) 6 | - [Instantiate Class](#instantiate) 7 | - [Search Method](#search) 8 | - [Probability Method](#probability) 9 | - [Filter Method](#filter) 10 | 11 | 12 | ## Installation 13 | 14 | Installing the Sounder library into your application is as easy as pie with the `pip` package manager. Run the following command from your favorite command line: 15 | 16 | pip install sounder 17 | 18 | 19 | ## Instantiate Class 20 | 21 | The first and foremost thing to do is to import the class, like so. 22 | 23 | from sounder import Sounder 24 | 25 | And then simply instantiate the class. 26 | 27 | sounder = Sounder([['facebook', 'notifications'], ['twitter', 'notifications']]) 28 | 29 | You can pass the dataset as an optional positional argument to the Sounder constructor, or set it later down the line using the 30 | `set_dataset()` method, which returns self. 31 | 32 | sounder.set_dataset([['facebook', 'notifications'], ['twitter', 'notifications']]) 33 | 34 | As you can already notice, in order to use the search method, the dataset needs to be a 2-dimensional list containing string elements. 35 | 36 | 37 | ## Search Method 38 | 39 | The `search(query, dataset=None, metaphone=False)` method takes one compulsory positional argument: a query, which needs to be a list of strings to be searched through the dataset, like so. 40 | 41 | sounder = Sounder([['facebook', 'notifications'], ['twitter', 'notifications'], ['note', 'something']]) 42 | index = sounder.search(['trackbook', 'notifs']) 43 | 44 | The `search` method always returns the index of the entry it found most probable to be identical to your given query. In this case index will equate to 0. 45 | 46 | This method takes other optional arguments as follows: 47 | 48 | - **dataset :** Simply the dataset. In case you didn't set it while instantiating the class, no problem, just pass it as another argument. Again, it needs to be a double-dimensional list. 49 | 50 | - **metaphone :** Defaults to False, meaning metaphones are not used in addition to the master algorithm.
When True, the dataset and query are first transformed to metaphones and then fed to the algorithm, increasing efficiency in cases where the input data is quite randomized or uses generic terms. 51 | 52 | ## Probability Method 53 | 54 | The `probability(query, dataset=None, metaphone=False, detailed=False, prediction=False)` method again takes a single positional argument: the query (a list of strings) that needs to be compared with the dataset, like so. 55 | 56 | sounder = Sounder([['facebook', 'notifications'], ['twitter', 'notifications'], ['note', 'something']]) 57 | chances = sounder.probability(['trackbook', 'notifs']) 58 | 59 | The `probability` method returns a result depending on the optional parameters, as follows: 60 | 61 | - **No optional argument passed :** It returns a list the size of the dataset, composed of the probability that the query matches each entry of the dataset, expressed as a value between 0.0 and 100.0, where 0.0 means nothing matches and 100.0 means everything matches. 62 | 63 | - **detailed :** If set to True, it returns a list the size of the dataset in a nested format, where the first element is the probability that the query matches that entry, and the second element is another list the size of the ith entry, consisting of the probability that the jth word of that entry was found in the query (computed by solving the assignment problem), again as a value between 0.0 and 100.0, where 0.0 means nothing matches. 64 | 65 | - **prediction :** If set to True, it returns a dict with keys `chances` and `index`, where `index` is the index of the dataset entry most similar to the given query and `chances` is a value between 0.0 and 100.0, where 0.0 means nothing matches. 66 | 67 | Two other arguments that can be set are: 68 | 69 | - **dataset :** Again, in case you didn't set the dataset on instantiation, fear not, just pass it as an argument. One more thing: this time it doesn't necessarily need to be a double-dimensional list if you're just comparing two lists of string elements, like so. 70 | 71 | information = sounder.probability(['trackbook'], dataset=['facebook']) 72 | 73 | Sounder internally maps it into a double-dimensional list automatically, giving you the leverage to compare any two lists of words. 74 | 75 | - **metaphone :** Again, it's exactly the same as for the search method. 76 | 77 | 78 | ## Filter Method 79 | 80 | `filter(query, reserved_sub_words=None)` is basically a utility provided to you to filter the stop words out of your string. For instance, `"Hey Stephanie, what is the time right now?"` would filter away `['hey', 'what', 'is', 'the']`, since they don't hold much meaning, leaving behind key_words like `['stephanie', 'time', 'right', 'now']`. 81 | 82 | This method is just a utility to help you do the entire intent recognition from a single library, but you're free to use any kind of system. It returns a dictionary with the keys `sub_words` and `key_words`, corresponding to the stop words found in the string and the keywords found in it, each in list form. 83 | 84 | - **reserved_sub_words :** The filter that is used to filter out the stop words. You can pass your own filter in the method itself or by using the `set_filter(reserved_sub_words)` method, which returns the self instance. **Note :** make sure the filter is a set of all the words that you consider as stop words.
Default is as follows: 85 | 86 | { 87 | "what", "where", "which", "how", "when", "who", 88 | "is", "are", "makes", "made", "make", "did", "do", 89 | "to", "the", "of", "from", "against", "and", "or", 90 | "you", "me", "we", "us", "your", "my", "mine", 'yours', 91 | "could", "would", "may", "might", "let", "possibly", 92 | 'tell', "give", "told", "gave", "know", "knew", 93 | 'a', 'am', 'an', 'i', 'like', 'has', 'have', 'need', 94 | 'will', 'be', "this", 'that', "for" 95 | } 96 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [bdist_wheel] 2 | # This flag says that the code is written to work on both Python 2 and Python 3 | # 3. If at all possible, it is good practice to do this. If you cannot, you 4 | # will need to generate wheels for each Python version that you support. 5 | universal=1 -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | from codecs import open 3 | from os import path 4 | 5 | file_path = path.abspath(path.dirname(__file__)) 6 | 7 | # Get the long description from the README file 8 | with open(path.join(file_path, 'README.md'), encoding='utf-8') as f: 9 | long_description = f.read() 10 | 11 | setup( 12 | name='sounder', 13 | 14 | version='0.2.0', 15 | 16 | description='An intent recognition algorithm.', 17 | long_description=long_description, 18 | 19 | url='https://github.com/slapbot/sounder', 20 | 21 | author='Ujjwal Gupta', 22 | author_email='ugupta41@gmail.com', 23 | 24 | license='MIT', 25 | 26 | classifiers=[ 27 | 'Development Status :: 3 - Alpha', 28 | 29 | 'Environment :: Web Environment', 30 | 31 | 'Operating System :: OS Independent', 32 | 33 | 'Intended Audience :: Developers', 34 | 'Topic :: Software Development :: Build Tools', 35 | 36 | 'License :: OSI Approved :: MIT License', 37 | 38 | 'Programming Language :: Python', 39 | 40 | 'Programming Language :: Python :: 3', 41 | 'Programming Language :: Python :: 3.3', 42 | 'Programming Language :: Python :: 3.4', 43 | 'Programming Language :: Python :: 3.5', 44 | ], 45 | 46 | keywords='intent recognition munkres levenshtein edit-distance algorithm speech pattern sentiment analysis guess text', 47 | 48 | packages=find_packages(exclude=['sample']), 49 | 50 | install_requires=['munkres', 'metaphone'], 51 | ) 52 | -------------------------------------------------------------------------------- /sounder/__init__.py: -------------------------------------------------------------------------------- 1 | from .sounder import Sounder -------------------------------------------------------------------------------- /sounder/sounder.py: -------------------------------------------------------------------------------- 1 | # noinspection PyPep8Naming 2 | import sys 3 | from munkres import Munkres 4 | from difflib import SequenceMatcher as sm 5 | from metaphone import doublemetaphone as dm 6 | 7 | 8 | # noinspection PyShadowingNames 9 | class Sounder: 10 | def __init__(self, dataset=[]): 11 | self.dataset = dataset 12 | self.reserved_sub_words = self.get_reserved_sub_words() 13 | 14 | def set_dataset(self, dataset): 15 | self.dataset = dataset 16 | return self 17 | 18 | def set_filter(self, reserved_sub_words): 19 | self.reserved_sub_words = reserved_sub_words 20 | 21 | def get_metaphones(self, query, dataset): 22 | new_query = 
[dm(given_keyword)[0] for given_keyword in query]
23 |         new_dataset = []
24 |         for data in dataset:
25 |             user_keywords = [dm(user_keyword)[0] for user_keyword in data]
26 |             new_dataset.append(user_keywords)
27 |         return new_query, new_dataset
28 | 
29 |     @staticmethod
30 |     def get_reserved_sub_words():
31 |         return {
32 |             "what", "where", "which", "how", "when", "who",
33 |             "is", "are", "makes", "made", "make", "did", "do",
34 |             "to", "the", "of", "from", "against", "and", "or",
35 |             "you", "me", "we", "us", "your", "my", "mine", 'yours',
36 |             "could", "would", "may", "might", "let", "possibly",
37 |             'tell', "give", "told", "gave", "know", "knew",
38 |             'a', 'am', 'an', 'i', 'like', 'has', 'have', 'need',
39 |             'will', 'be', "this", 'that', "for"
40 |         }
41 | 
42 |     def filter(self, query, reserved_sub_words=None):
43 |         if reserved_sub_words:
44 |             self.reserved_sub_words = reserved_sub_words
45 |         sub_words = []
46 |         # classify each word of the query against the configured stop-word filter
47 |         raw_text_array = query.lower().split()
48 |         key_words = raw_text_array.copy()
49 |         for raw_text in raw_text_array:
50 |             if raw_text in self.reserved_sub_words:
51 |                 sub_words.append(raw_text)
52 |                 key_words.remove(raw_text)
53 |         return {'sub_words': sub_words, 'key_words': key_words}
54 | 
55 |     def search(self, query, dataset=None, metaphone=False):
56 |         if dataset:
57 |             self.dataset = dataset
58 |         if self.dataset:
59 |             if metaphone:
60 |                 query, self.dataset = self.get_metaphones(query, self.dataset)  # use the stored dataset, not the raw argument
61 |             index = self.process(self.dataset, query)
62 |         else:
63 |             raise TypeError("Missing dataset parameter since it's not been initialized either.")
64 |         return index
65 | 
66 |     def probability(self, query, dataset=None, detailed=False, prediction=False, metaphone=False):
67 |         if dataset:
68 |             if any(isinstance(i, str) for i in dataset):
69 |                 dataset = [dataset]
70 |             self.dataset = dataset
71 |         if self.dataset:
72 |             if metaphone:
73 |                 query, self.dataset = self.get_metaphones(query, self.dataset)
74 |             chances = self.process_chances(self.dataset, query)
75 |         else:
76 |             raise TypeError("Missing dataset parameter since it's not been initialized either.")
77 |         if prediction:
78 |             index = self.pick(chances)
79 |             if detailed:
80 |                 return {
81 |                     'chances': chances[index],
82 |                     'index': index
83 |                 }
84 |             return {
85 |                 'chances': chances[index][0],
86 |                 'index': index
87 |             }
88 |         if detailed:
89 |             return chances
90 |         return [chance[0] for chance in chances]
91 | 
92 |     def process_chances(self, dataset, query):
93 |         scores = []
94 |         for data in dataset:
95 |             temp_scores = self.process_words(data, query)
96 |             word_score = self.pick_most_probable_word(temp_scores, len(data))
97 |             avg_score = sum(word_score) / len(word_score)
98 |             scores.append([avg_score, word_score])
99 |         return scores
100 | 
101 |     def process(self, dataset, query):
102 |         scores = []
103 |         for data in dataset:
104 |             temp_scores = self.process_words(data, query)
105 |             word_score = self.pick_most_probable_word(temp_scores, len(data))
106 |             avg_score = sum(word_score) / len(word_score)
107 |             scores.append([avg_score, word_score])
108 |         return self.pick(scores)
109 | 
110 |     def process_words(self, data, query):
111 |         # build a len(data) x len(query) similarity matrix, then solve the assignment problem
112 |         avg_scores_list = []
113 |         for s_word in data:
114 |             avg_scores = [0 for _ in range(0, len(query))]
115 |             for index, k_word in enumerate(query):
116 |                 avg_scores[index] = self.loop2(k_word, s_word)
117 |             avg_scores_list.append(avg_scores)
118 |         temp_scores = self.hungarian_algorithm(avg_scores_list)
119 |         return temp_scores
120 | 121 | @staticmethod 122 | def pick_most_probable_word(temp_scores, length): 123 | word_score = [0 for _ in range(0, length)] 124 | for index, temp_score in enumerate(temp_scores): 125 | word_score[index] = temp_score[0] 126 | return word_score 127 | 128 | @staticmethod 129 | def hungarian_algorithm(matrix): 130 | temp_scores = [] 131 | cost_matrix = [] 132 | for row in matrix: 133 | cost_row = [] 134 | for col in row: 135 | cost_row += [sys.maxsize - col] 136 | cost_matrix += [cost_row] 137 | m = Munkres() 138 | indexes = m.compute(cost_matrix) 139 | for row, column in indexes: 140 | score = matrix[row][column] 141 | index = column 142 | temp_scores.append([score, index]) 143 | return temp_scores 144 | 145 | @staticmethod 146 | def loop2(k_word, s_word): 147 | word_score = sm(None, k_word, s_word) 148 | return round(word_score.ratio() * 100) 149 | 150 | @staticmethod 151 | def pick(scores): 152 | max_score = 0 153 | max_index = 0 154 | for index, item in enumerate(scores): 155 | if item[0] > max_score: 156 | max_score = item[0] 157 | max_index = index 158 | picked = scores[max_index][1] 159 | perm_sum = sum(picked) 160 | perm_avg = perm_sum / len(picked) 161 | for index, item in enumerate(scores): 162 | if item[0] == max_score and index != max_index: 163 | temp_sum = sum(item[1]) 164 | temp_avg = temp_sum / len(item[1]) 165 | if temp_avg > perm_avg: 166 | max_index = index 167 | perm_sum = temp_sum 168 | perm_avg = temp_avg 169 | elif temp_avg == perm_avg: 170 | if temp_sum > perm_sum: 171 | max_index = index 172 | perm_sum = temp_sum 173 | perm_avg = temp_avg 174 | return max_index 175 | --------------------------------------------------------------------------------