├── README.md
├── sim-GEN
    ├── README.md
    └── data.zip
├── sim-M
    ├── README.md
    ├── dev.json
    ├── test.json
    └── train.json
└── sim-R
    ├── README.md
    ├── dev.json
    ├── test.json
    └── train.json


/README.md:
--------------------------------------------------------------------------------
# Simulated Dialogue

## Machines Talking To Machines (M2M)

We present datasets of conversations between an agent and a simulated user.
These conversations are collected using our M2M framework, which combines
dialogue self-play and crowd-sourcing to exhaustively generate dialogues. The
dialogue self-play step generates dialogue outlines consisting of the semantic
frames for each turn of the dialogue. The crowd-sourcing step provides natural
language realizations for each dialogue turn. More details are available in
[this paper](https://arxiv.org/abs/1801.04871). Please cite the paper if you use
or discuss these datasets in your work:

```bibtex
@article{shah2018building,
  title={Building a Conversational Agent Overnight with Dialogue Self-Play},
  author={Shah, Pararth and Hakkani-T{\"u}r, Dilek and T{\"u}r, Gokhan and Rastogi, Abhinav and Bapna, Ankur and Nayak, Neha and Heck, Larry},
  journal={arXiv preprint arXiv:1801.04871},
  year={2018}
}
```

## Datasets

We are releasing three datasets containing dialogues for booking a restaurant
table and buying a movie ticket. The number of dialogues in each dataset is
listed below. The README file within each directory contains further details
about the dataset.

| Dataset            | Slots                                                                          | Train | Dev | Test |
| ------------------ | ------------------------------------------------------------------------------ | ----- | --- | ---- |
| Sim-R (Restaurant) | price\_range, location, restaurant\_name, category, num\_people, date, time    | 1116  | 349 | 775  |
| Sim-M (Movie)      | theatre\_name, movie, date, time, num\_people                                  | 384   | 120 | 264  |
| Sim-GEN (Movie)    | theatre\_name, movie, date, time, num\_people                                  | 100K  | 10K | 10K  |

**The datasets are provided "AS IS" without any warranty, express or implied.
Google disclaims all liability for any damages, direct or indirect, resulting
from the use of these datasets.**

**Please email {abhirast, dilekh}@google.com with questions.**


## Dialogue State Tracking

Our publication *Scalable Multi-Domain Dialogue State Tracking (IEEE ASRU 2017)*
reports joint goal accuracy on the Sim-R and Sim-M datasets. The released
version of the datasets includes fixes for some errors in the dialogue state and
action annotations. This [updated version](https://arxiv.org/abs/1712.10224) of
the paper reports the results on the corrected datasets.


## End-to-End Trainable Task Oriented Dialogue

Our publication [*Dialogue Learning with Human Teaching and Feedback in
End-to-End Trainable Task-Oriented Dialogue
Systems*](https://arxiv.org/pdf/1804.06512.pdf) uses Sim-GEN.

--------------------------------------------------------------------------------
/sim-GEN/README.md:
--------------------------------------------------------------------------------
## Simulator Generated Dataset (sim-GEN)

This directory contains an expanded set of dialogues generated via dialogue
self-play between a user simulator and a system agent, as follows:

-   The dialogues collected using the M2M framework for the movie ticket booking
    task (sim-M) are used as a seed set to form a crowd-sourced corpus of
    natural language utterances for the user and the system agents.
-   Subsequently, many more dialogue outlines are generated using self-play
    between the simulated user and system agent.
-   The dialogue outlines are converted to natural language dialogues by
    replacing each dialogue act in the outline with an utterance sampled from
    the set of crowd-sourced utterances collected with M2M.
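
The last step above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not the code used to build sim-GEN: the act names, the `UTTERANCE_POOL` contents, the `{movie}` placeholder convention, and the `realize` helper are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical pool of utterances per dialogue act. In sim-GEN the pool comes
# from the crowd-sourced sim-M utterances; these strings are made up.
UTTERANCE_POOL: Dict[str, List[str]] = {
    "greeting": ["Hello, how can I help?", "Hi there!"],
    "confirm_movie": ["You want to see {movie}, right?", "Just to confirm, {movie}?"],
    "request_time": ["What time would you like?", "Which showtime works for you?"],
}

# An outline is a sequence of (act name, slot values) pairs (assumed shape).
Outline = List[Tuple[str, Dict[str, str]]]


def realize(outline: Outline, pool: Dict[str, List[str]], rng: random.Random) -> List[str]:
    """Replace each dialogue act with an utterance sampled from the pool."""
    utterances = []
    for act, slot_values in outline:
        template = rng.choice(pool[act])  # sample one utterance for this act
        utterances.append(template.format(**slot_values))  # fill slot placeholders
    return utterances


outline: Outline = [
    ("greeting", {}),
    ("confirm_movie", {"movie": "Example Movie"}),
    ("request_time", {}),
]
for line in realize(outline, UTTERANCE_POOL, random.Random(0)):
    print(line)
```

Because the pool is fixed, generating more outlines adds new act sequences and dialogue states but no new wording.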

In this manner, we can generate an arbitrarily large number of dialogue outlines
and convert them automatically to natural language dialogues without any
additional crowd-sourcing step. Although the diversity of natural language in
the dataset does not increase, the number of unique dialogue states present in
the dataset will increase, since a larger variety of dialogue outlines is
available in the expanded dataset.

This dataset was used for experiments reported in [this
paper](https://arxiv.org/abs/1804.06512). Please cite the paper if you use or
discuss sim-GEN in your work:

```bibtex
@article{liu2018dialogue,
  title={Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems},
  author={Liu, Bing and Tur, Gokhan and Hakkani-Tur, Dilek and Shah, Pararth and Heck, Larry},
  journal={NAACL},
  year={2018}
}
```

## Data format

The data splits are made available as a .zip file containing dialogues in JSON
format. Each dialogue object contains the following fields:

*   **dialogue\_id** - *string* unique identifier for each dialogue.
*   **turns** - *list* of turn objects:
    *   **system\_acts** - *list* of system dialogue acts for this system turn:
        *   **name** - *string* system act name
        *   **slot\_values** - *optional dictionary* mapping slot names to
            values
    *   **system\_utterance** - *string* natural language utterance
        corresponding to the system acts for this turn
    *   **user\_utterance** - *string* natural language user utterance following
        the system utterance in this turn
    *   **dialogue\_state** - *dictionary* ground truth slot-value mapping after
        the user utterance
    *   **database\_state** - database results based on the current dialogue
        state:
        *   **scores** - *list* of scores, between 0.0 and 1.0, of the top 5
            database results. A score of 1.0 means the result matches all
            constraints; 0.0 means no match
        *   **has\_more\_results** - *boolean* whether the backend has more
            matching results
        *   **has\_no\_results** - *boolean* whether the backend has no matching
            results

An additional file **db.json** is provided, which contains the set of values for
each slot.

Note: The date values in the dataset are normalized to constants of the form
`base_date_plus_X`, for X from 0 to 6. X=0 corresponds to the current date (i.e.
'today'), X=1 is 'tomorrow', etc. This is done to allow handling of relative
references to dates (e.g. 'this weekend', 'next Wednesday', etc.). The parsing
of such phrases should be done as a separate pre-processing step.

--------------------------------------------------------------------------------
/sim-GEN/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google-research-datasets/simulated-dialogue/9c5752ffa60d00c58efe52ce2ec6a29f2c2bb47b/sim-GEN/data.zip

--------------------------------------------------------------------------------
/sim-M/README.md:
--------------------------------------------------------------------------------
## Structure of the Data

Each dialogue is represented as a JSON object with the following fields:

*   **dialogue\_id** - A unique identifier for a dialogue.
*   **turns** - A list of annotated agent and user utterance pairs having the
    following fields:
    *   **system\_acts** - A list of system actions. An action consists of an
        action type, and optional slot and value arguments. Each action has the
        following fields:
        *   **type** - An action type. Possible values are listed below.
        *   **slot** - Optional slot argument.
        *   **value** - Optional value argument. If value is present, slot must
            be present.
    *   **system\_utterance** - The system utterance, having the following
        fields:
        *   **text** - The text of the utterance.
        *   **tokens** - A list containing the tokenized version of the text.
        *   **slots** - A list containing the locations of mentions of values
            corresponding to slots in the utterance, having the following
            fields:
            *   **slot** - The name of the slot.
            *   **start** - The index of the first token corresponding to a slot
                value in the tokens list.
            *   **exclusive\_end** - The index of the token succeeding the last
                token corresponding to the slot value in the tokens list. In
                Python, `tokens[start:exclusive_end]` gives the tokens for the
                slot value.
    *   **user\_acts** - A list of user actions. Has the same structure as
        system\_acts.
    *   **user\_utterance** - The user utterance. It has the same three fields
        as system\_utterance.
    *   **user\_intents** - A list of user intents specified in the current
        turn. Possible values are listed below.
    *   **dialogue\_state** - Contains the preferences for the different slots
        as specified by the user up to the current turn of the dialogue.
        Represented as a list containing:
        *   **slot** - The name of the slot.
        *   **value** - The value assigned to the slot.

The list of action types is inspired by the Cambridge dialogue act schema
([DSTC2 Handbook](http://camdial.org/~mh521/dstc/downloads/handbook.pdf), p. 19).
The possible values are:

*   AFFIRM
*   CANT\_UNDERSTAND
*   CONFIRM
*   INFORM
*   GOOD\_BYE
*   GREETING
*   NEGATE
*   OTHER
*   NOTIFY\_FAILURE
*   NOTIFY\_SUCCESS
*   OFFER
*   REQUEST
*   REQUEST\_ALTS
*   SELECT
*   THANK\_YOU

The possible values of user intents are:

*   BUY\_MOVIE\_TICKETS

--------------------------------------------------------------------------------
/sim-R/README.md:
--------------------------------------------------------------------------------
## Structure of the Data

Each dialogue is represented as a JSON object with the following fields:

*   **dialogue\_id** - A unique identifier for a dialogue.
*   **turns** - A list of annotated agent and user utterance pairs having the
    following fields:
    *   **system\_acts** - A list of system actions. An action consists of an
        action type, and optional slot and value arguments. Each action has the
        following fields:
        *   **type** - An action type. Possible values are listed below.
        *   **slot** - Optional slot argument.
        *   **value** - Optional value argument. If value is present, slot must
            be present.
    *   **system\_utterance** - The system utterance, having the following
        fields:
        *   **text** - The text of the utterance.
        *   **tokens** - A list containing the tokenized version of the text.
        *   **slots** - A list containing the locations of mentions of values
            corresponding to slots in the utterance, having the following
            fields:
            *   **slot** - The name of the slot.
            *   **start** - The index of the first token corresponding to a slot
                value in the tokens list.
            *   **exclusive\_end** - The index of the token succeeding the last
                token corresponding to the slot value in the tokens list. In
                Python, `tokens[start:exclusive_end]` gives the tokens for the
                slot value.
    *   **user\_acts** - A list of user actions. Has the same structure as
        system\_acts.
    *   **user\_utterance** - The user utterance. It has the same three fields
        as system\_utterance.
    *   **user\_intents** - A list of user intents specified in the current
        turn. Possible values are listed below.
    *   **dialogue\_state** - Contains the preferences for the different slots
        as specified by the user up to the current turn of the dialogue.
        Represented as a list containing:
        *   **slot** - The name of the slot.
        *   **value** - The value assigned to the slot.

The list of action types is inspired by the Cambridge dialogue act schema
([DSTC2 Handbook](http://camdial.org/~mh521/dstc/downloads/handbook.pdf), p. 19).
The possible values are:

*   AFFIRM
*   CANT\_UNDERSTAND
*   CONFIRM
*   INFORM
*   GOOD\_BYE
*   GREETING
*   NEGATE
*   OTHER
*   NOTIFY\_FAILURE
*   NOTIFY\_SUCCESS
*   OFFER
*   REQUEST
*   REQUEST\_ALTS
*   SELECT
*   THANK\_YOU

The possible values of user intents are:

*   FIND\_RESTAURANT
*   RESERVE\_RESTAURANT

--------------------------------------------------------------------------------
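
As an illustration of the sim-M and sim-R schema described above, the following Python sketch reads slot mentions out of an utterance and flattens a dialogue state. It is a minimal sketch, not code shipped with the datasets: `slot_mentions` and `state_as_dict` are hypothetical helper names, and `example_turn` is hand-written to follow the documented fields rather than copied from the data.

```python
from typing import Any, Dict, List, Tuple


def slot_mentions(utterance: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (slot name, mentioned text) pairs for each annotated span."""
    tokens = utterance["tokens"]
    mentions = []
    for span in utterance.get("slots", []):
        # exclusive_end follows Python slice semantics, as the README notes.
        words = tokens[span["start"]:span["exclusive_end"]]
        mentions.append((span["slot"], " ".join(words)))
    return mentions


def state_as_dict(dialogue_state: List[Dict[str, str]]) -> Dict[str, str]:
    """Convert the list of {slot, value} entries into a plain mapping."""
    return {entry["slot"]: entry["value"] for entry in dialogue_state}


# Hand-written turn that follows the documented schema; NOT taken from the data.
example_turn = {
    "user_utterance": {
        "text": "a table for 4 at 7 pm",
        "tokens": ["a", "table", "for", "4", "at", "7", "pm"],
        "slots": [
            {"slot": "num_people", "start": 3, "exclusive_end": 4},
            {"slot": "time", "start": 5, "exclusive_end": 7},
        ],
    },
    "dialogue_state": [
        {"slot": "num_people", "value": "4"},
        {"slot": "time", "value": "7 pm"},
    ],
}

print(slot_mentions(example_turn["user_utterance"]))
print(state_as_dict(example_turn["dialogue_state"]))
```

To apply the helpers to real data, load a split with `json.load` and iterate over each dialogue's `turns`. This assumes each split file holds a list of dialogue objects.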