├── sample_raw_unames.csv
├── LICENSE.md
└── README.md


/sample_raw_unames.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellohaptik/conversation_data/master/sample_raw_unames.csv


--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | LICENSE
2 | ==============
3 | Haptik's conversation_data is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. 
4 | Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/
5 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # conversation_data
 2 | This repository contains the sample data used for training the **"Neural Conversational Model"** for the *Reminders* domain, in the context of [Haptik](http://www.haptik.ai).
 3 | 
 4 | ## Overview
 5 | The corpus comprises dyadic conversations from the Reminders domain. It is a collection of messages exchanged between Haptik users and chat assistants (responses from humans or the Rule Based ChatBot). The original corpus contains data collected over the past two years; this repository contains a sample of it. The messages include casual queries, out-of-domain requests, and push notifications sent to users to start a new conversation or to keep them engaged. Most messages are in English, but a significant proportion are code-mixed or code-switched, primarily English mixed with Hindi. In the original corpus, each message is accompanied by rich metadata – user info (age, gender, location, device, etc.), timestamps, and the type of message (e.g. simple text, UI element, form, etc.). UI elements are represented in the text corpus in a Haptik-specific custom format.
 6 | 
 7 | ## Preprocessing
 8 | Before training the neural model, we perform several preprocessing steps to make training more efficient. Preprocessing normalizes the data, reduces the vocabulary size, converts raw text into actionables, and removes unnecessary data.
 9 | The [sample_raw_unames.csv](sample_raw_unames.csv) file contains the data in raw format (without any processing). To maintain privacy, we do not reveal user names and have replaced them with the *\_name\_* tag. The preprocessing steps are as follows:
10 | 
11 | 1. **Removal of out of domain conversations**
12 | 
13 | Since Haptik is a personal assistant app, users tend to ask queries from other domains within the Reminders service. The first step removes conversations that do not belong to the Reminders domain, using the in-house domain identifier. Even after this step, some out-of-domain conversations are not filtered out because of domain shifts that happen midway through a conversation.
14 | 
15 | 2. **Merge Reminder Notifications**
16 | 
17 | Users usually set multiple reminders on Haptik. For example, the drinking-water reminder asks how frequently the user should be reminded about the task. We often see a continuous series of reminder notifications in a conversation. In this step, we delete all such notifications except the last one in the series. This significantly reduces the number of outbound messages in the data.
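
As an illustration of the merge described above, here is a minimal sketch. It assumes each message is a dict with a `msg_type` field and uses the value *22* for reminder notifications, as listed under Terminology. The `body` field and the function name are hypothetical, and this is not the actual Haptik pipeline.

```python
from typing import Dict, List

# msg_type value for reminder notifications, as listed under Terminology.
REMINDER_NOTIFICATION = 22


def merge_reminder_notifications(messages: List[Dict]) -> List[Dict]:
    """Keep only the last message of each consecutive run of reminder notifications.

    Assumes `messages` is one conversation in chronological order.
    """
    merged = []
    for i, msg in enumerate(messages):
        is_notification = msg["msg_type"] == REMINDER_NOTIFICATION
        next_is_notification = (
            i + 1 < len(messages)
            and messages[i + 1]["msg_type"] == REMINDER_NOTIFICATION
        )
        # Skip a notification that is immediately followed by another one.
        if is_notification and next_is_notification:
            continue
        merged.append(msg)
    return merged
```

Non-notification messages pass through untouched, so a notification that stands alone between two user messages is kept.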
18 | 
19 | The [sample_pruned_chats_unames.csv](sample_pruned_chats_unames.csv) file contains the data after steps 1 and 2. You will see a large difference in the size of the data for the reasons explained above.
20 | 
21 | 3. **Replace Entities**
22 | 
23 | In this step we replace the original text of the named entities date, time, phone number, user name and assistant name with the tags *“\_date\_”, “\_time\_”, “\_phone\_”, “\_name\_”, “\_name\_”* respectively. This reduces the vocabulary size significantly. In our scenario, only the presence of an entity matters, not its value. A separate object maintains the entities identified (explained in hybrid). To identify the entities, we use the in-house [NER](https://github.com/hellohaptik/chatbot_ner).
24 | 
25 | *Ex. Raw query:* Remind me to go to gym tomorrow at 5pm
26 | 
27 | *Preprocessed query:* Remind me to go to gym \_date\_ at \_time\_ 
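
The tagging idea can be sketched with a few regular expressions. This is only a toy stand-in for the NER system linked above: the patterns and the `replace_entities` name are assumptions made for illustration, and they cover just a handful of surface forms.

```python
import re

# Toy patterns standing in for a real NER system (assumption, not the Haptik NER).
ENTITY_PATTERNS = [
    ("_phone_", re.compile(r"\b\d{10}\b")),
    ("_time_", re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE)),
    ("_date_", re.compile(r"\b(?:today|tomorrow|\d{1,2}-\d{1,2}-\d{4})\b", re.IGNORECASE)),
]


def replace_entities(text: str) -> str:
    """Replace each matched entity with its tag; the entity value is discarded."""
    for tag, pattern in ENTITY_PATTERNS:
        text = pattern.sub(tag, text)
    return text


print(replace_entities("Remind me to go to gym tomorrow at 5pm"))
# -> Remind me to go to gym _date_ at _time_
```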
28 | 
29 | 4. **Structured messages**
30 | 
31 | This step handles UI elements. For a UI element, we simply replace the text with the corresponding ID (every structured outbound UI element has an associated ID). For filled UI elements sent by a user, we replace the element values with element keys for all the filled element fields.
32 | 
33 | *Ex. Query:* 
34 | “Wake up call
35 | 
36 | Date: 22-10-2016
37 | 
38 | Time: 6:05 AM”
39 | 
40 | *Preprocessed query:* “wake up call date \_date\_ time \_time\_” 
41 | 
42 | 5. **Extracting actions from message**
43 | 
44 | This is a crucial part of preprocessing. The corpus consists of raw text messages exchanged between the user and the Haptik assistant. But since Haptik is a utility product, a text response alone is not enough to cater to user queries; we need a mechanism to identify the action that needs to be performed. When a reminder is set up or cancelled, an acknowledgement message is sent to the user. In this step, we use these messages and tag them as actions. Whenever the neural model predicts an action tag as a response, we perform that action by calling the corresponding APIs.
45 | 
46 | *Ex. Assistant Message:* Okay, done. We will remind you to take your medicine, via a call at 2:00 PM on Tue, 18 April. Take care :)
47 | 
48 | *Preprocessed message:* \_api\_call\_reminder\_medicine\_
49 | 
50 | *Action:* set a medicine reminder 
51 | 
52 | 6. **Orthography**
53 | 
54 | In this step we first convert the string to lowercase, remove all punctuation marks, and replace all numeric values with the “\_numeral\_” tag. At the end of this step, the data can contain only characters from the set (a-z) and the special character (“\_”). This reduces the vocabulary size by a significant fraction. While this helps when training on smaller data, the downside is that we have to introduce a post-processing step to make the response look appropriate.
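
A minimal sketch of this normalization is shown below. The function name is hypothetical, and keeping single spaces as word separators is an assumption; the description above only names a-z and “\_” as allowed characters.

```python
import re


def normalize_orthography(text: str) -> str:
    """Lowercase, tag numeric values, and strip punctuation."""
    text = text.lower()
    # Replace every run of digits with the numeral tag, padded so it stays a separate token.
    text = re.sub(r"\d+", " _numeral_ ", text)
    # Drop every character except a-z, underscore, and whitespace.
    text = re.sub(r"[^a-z_\s]", "", text)
    # Collapse whitespace so words are separated by single spaces (assumption).
    return " ".join(text.split())


print(normalize_orthography("Okay, done! Call me at 5 on Room 12."))
# -> okay done call me at _numeral_ on room _numeral_
```

Tags introduced in earlier steps, such as \_date\_ or \_time\_, survive this step because the underscore is kept.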
55 | 
56 | The [sample_preprocessed.csv](sample_preprocessed.csv) file contains the data after performing steps 3, 4, 5 and 6.
57 | 
58 | ## Training Data (Context - Response pairs)
59 | 
60 | To train a Neural Conversation Model, we need context-response pairs (treating it as an SMT problem in which the input and output are the same sentence in two different languages).
61 | 
62 | For an example conversation like [U1 A1 U2 A2 A3 U3 A4] where Ui refers to a user message and Ai refers to an assistant message, Context-Response pairs are generated as follows:
63 | 
64 | Context [U1] Response [A1]
65 | 
66 | Context [U1 A1 U2] Response [A2]
67 | 
68 | Context [U1 A1 U2 A2] Response [A3]
69 | 
70 | Context [U1 A1 U2 A2 A3 U3] Response [A4]
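
The pairing above can be sketched as follows: every assistant message becomes a response, and everything before it in the conversation becomes its context. The `(speaker, text)` tuple layout and the function name are assumptions made for illustration. Skipping an assistant message that opens a conversation (and so has no context) is also an assumption.

```python
from typing import List, Tuple


def context_response_pairs(
    conversation: List[Tuple[str, str]],
) -> List[Tuple[List[str], str]]:
    """Build (context, response) pairs from a list of (speaker, text) messages."""
    pairs = []
    for i, (speaker, text) in enumerate(conversation):
        # Assumption: an assistant message with no preceding context is skipped.
        if speaker == "assistant" and i > 0:
            context = [earlier_text for _, earlier_text in conversation[:i]]
            pairs.append((context, text))
    return pairs


conversation = [
    ("user", "U1"), ("assistant", "A1"), ("user", "U2"), ("assistant", "A2"),
    ("assistant", "A3"), ("user", "U3"), ("assistant", "A4"),
]
for context, response in context_response_pairs(conversation):
    print("Context", context, "Response", response)
```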
71 | 
72 | - The file [sample_human_context_response.tsv](sample_human_context_response.tsv) contains such context-response pairs where the response is sent by a human assistant, while in the file [sample_bot_context_response.tsv](sample_bot_context_response.tsv) the responses are sent by the bot.
73 | 
74 | - Different messages from the same speaker are separated by a particular token, while the end of a turn, i.e. the switch of speaker between user and agent, is represented by a separate token.
75 | 
76 | - The context is limited to a maximum of 160 words.
77 | 
78 | ## Terminology
79 | 
80 | - **coll_id:** User ID
81 | 
82 | - **conv_no:** User's session number (one conversation -> one chat session)
83 | 
84 | - **Direction:** True -> Message is sent by user; False -> message is sent by Assistant (human or bot)
85 | 
86 | - **msg_type:** Various Haptik-specific message types, which vary in format. Some important msg types include *22* -> reminder notification; *0* -> sender is a human; *17* -> form
87 | 


--------------------------------------------------------------------------------