├── README.md
├── sim-GEN
│   ├── README.md
│   └── data.zip
├── sim-M
│   ├── README.md
│   ├── dev.json
│   ├── test.json
│   └── train.json
└── sim-R
    ├── README.md
    ├── dev.json
    ├── test.json
    └── train.json
/README.md:
--------------------------------------------------------------------------------
1 | # Simulated Dialogue
2 |
3 | ## Machines Talking To Machines (M2M)
4 |
5 | We present datasets of conversations between an agent and a simulated user.
6 | These conversations are collected using our M2M framework that combines dialogue
7 | self-play and crowd sourcing to exhaustively generate dialogues. The dialogue
8 | self-play step generates dialogue outlines consisting of the semantic frames for
9 | each turn of the dialogue. The crowd sourcing step provides natural language
10 | realizations for each dialogue turn. More details are available in [this
11 | paper](https://arxiv.org/abs/1801.04871). Please cite the paper if you use or
12 | discuss these datasets in your work:
13 |
14 | ```bibtex
15 | @article{shah2018building,
16 | title={Building a Conversational Agent Overnight with Dialogue Self-Play},
17 | author={Shah, Pararth and Hakkani-T{\"u}r, Dilek and T{\"u}r, Gokhan and Rastogi, Abhinav and Bapna, Ankur and Nayak, Neha and Heck, Larry},
18 | journal={arXiv preprint arXiv:1801.04871},
19 | year={2018}
20 | }
21 | ```
22 |
23 | ## Datasets
24 |
25 | We are releasing datasets containing dialogues for two tasks: booking a
26 | restaurant table and buying a movie ticket. The number of dialogues in each
27 | dataset is listed below. The README file within each directory contains
28 | further details about the dataset.
29 |
30 | | Dataset | Slots | Train | Dev | Test |
31 | | ------------------ | ------------------------------------------------------------------------------ | ----- | --- | ---- |
32 | | Sim-R (Restaurant) | price\_range, location, restaurant\_name, category, num\_people, date, time | 1116 | 349 | 775 |
33 | | Sim-M (Movie) | theatre\_name, movie, date, time, num\_people | 384 | 120 | 264 |
34 | | Sim-GEN (Movie) | theatre\_name, movie, date, time, num\_people | 100K | 10K | 10K |
35 |
36 | **The datasets are provided "AS IS" without any warranty, express or implied.
37 | Google disclaims all liability for any damages, direct or indirect, resulting
38 | from the use of these datasets.**
39 |
40 | **Please email {abhirast, dilekh}@google.com with questions.**
41 |
42 |
43 | ## Dialogue State Tracking
44 |
45 | Our publication *Scalable Multi-Domain Dialogue State Tracking (IEEE ASRU 2017)*
46 | reports joint goal accuracy on the Sim-R and Sim-M datasets. The released version
47 | of the datasets includes fixes for some errors in dialogue state and action
48 | annotations. This [updated version](https://arxiv.org/abs/1712.10224) of the
49 | paper reports the results on the corrected datasets.
50 |
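Joint goal accuracy, as the metric is commonly defined for dialogue state tracking, counts a turn as correct only when the predicted state agrees with the ground truth on every slot. The following is a minimal sketch under that definition; the function name and the toy states are illustrative assumptions and are not taken from the paper's code.

```python
from typing import Dict, List

State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Toy example with made-up states for three turns (assumption, not real data).
gold_states = [
    {"movie": "a"},
    {"movie": "a", "num_people": "2"},
    {"movie": "a", "num_people": "2", "time": "7 pm"},
]
predicted_states = [
    {"movie": "a"},
    {"movie": "a", "num_people": "3"},  # one wrong slot makes the whole turn wrong
    {"movie": "a", "num_people": "2", "time": "7 pm"},
]
accuracy = joint_goal_accuracy(predicted_states, gold_states)
```

Because a single wrong slot invalidates the turn, this metric is stricter than per-slot accuracy.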
51 |
52 | ## End-to-End Trainable Task Oriented Dialogue
53 |
54 | Our publication [*Dialogue Learning with Human Teaching and Feedback in
55 | End-to-End Trainable Task-Oriented Dialogue
56 | Systems*](https://arxiv.org/pdf/1804.06512.pdf) uses Sim-GEN.
57 |
--------------------------------------------------------------------------------
/sim-GEN/README.md:
--------------------------------------------------------------------------------
1 | ## Simulator Generated Dataset (sim-GEN)
2 |
3 | This directory contains an expanded set of dialogues generated via dialogue
4 | self-play between a user simulator and a system agent, as follows:
5 |
6 | - The dialogues collected using the M2M framework for the movie ticket booking
7 | task (sim-M) are used as a seed set to form a crowd-sourced corpus of
8 | natural language utterances for the user and the system agents.
9 | - Subsequently, many more dialogue outlines are generated using self-play
10 | between the simulated user and system agent.
11 | - The dialogue outlines are converted to natural language dialogues by
12 | replacing each dialogue act in the outline with an utterance sampled from
13 | the set of crowd-sourced utterances collected with M2M.
14 |
15 | In this manner, we can generate an arbitrarily large number of dialogue outlines
16 | and convert them automatically to natural language dialogues without any
17 | additional crowd-sourcing step. Although the diversity of natural language in
18 | the dataset does not increase, the number of unique dialogue states present in
19 | the dataset will increase since a larger variety of dialogue outlines will be
20 | available in the expanded dataset.
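
The conversion step described above can be sketched roughly as follows. This is only an illustration: the template pool, its keying by act name and slot names, the act names, and the `realize` helper are all hypothetical and do not reflect the actual generation code.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical pool of crowd-sourced utterances, keyed by (act name, slot names).
TEMPLATES: Dict[Tuple[str, Tuple[str, ...]], List[str]] = {
    ("greeting", ()): ["Hi, how can I help?", "Hello!"],
    ("inform", ("movie",)): ["I want to see {movie}.", "Tickets for {movie}, please."],
}


def realize(outline: List[dict], rng: random.Random) -> List[str]:
    """Replace each dialogue act in the outline with a sampled utterance."""
    utterances = []
    for act in outline:
        slot_values = act.get("slot_values", {})
        key = (act["name"], tuple(sorted(slot_values)))
        template = rng.choice(TEMPLATES[key])
        # Fill the slot values of this act into the sampled template.
        utterances.append(template.format(**slot_values))
    return utterances


# Made-up outline with two dialogue acts.
outline = [
    {"name": "greeting"},
    {"name": "inform", "slot_values": {"movie": "some movie"}},
]
dialogue = realize(outline, random.Random(0))
```

Sampling from a fixed pool is why the language diversity stays constant while the number of distinct outlines can grow.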
21 |
22 | This dataset was used for experiments reported in [this
23 | paper](https://arxiv.org/abs/1804.06512). Please cite the paper if you use or
24 | discuss sim-GEN in your work:
25 |
26 | ```bibtex
27 | @article{liu2018dialogue,
28 | title={Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems},
29 | author={Liu, Bing and Tur, Gokhan and Hakkani-Tur, Dilek and Shah, Pararth and Heck, Larry},
30 | journal={NAACL},
31 | year={2018}
32 | }
33 | ```
34 |
35 | ## Data format
36 |
37 | The data splits are made available as a .zip file containing dialogues in JSON
38 | format. Each dialogue object contains the following fields:
39 |
40 | * **dialogue\_id** - *string* unique identifier for each dialogue.
41 | * **turns** - *list* of turn objects:
42 |     * **system\_acts** - *list* of system dialogue acts for this system turn:
43 |         * **name** - *string* system act name
44 |         * **slot\_values** - *optional dictionary* mapping slot names to
45 |           values
46 |     * **system\_utterance** - *string* natural language utterance
47 |       corresponding to the system acts for this turn
48 |     * **user\_utterance** - *string* natural language user utterance following
49 |       the system utterance in this turn
50 |     * **dialogue\_state** - *dictionary* ground truth slot-value mapping after
51 |       the user utterance
52 |     * **database\_state** - database results based on current dialogue state:
53 |         * **scores** - *list* of scores, between 0.0 and 1.0, of top 5
54 |           database results. 1.0 means matches all constraints and 0.0 means no
55 |           match
56 |         * **has\_more\_results** - *boolean* whether backend has more matching
57 |           results
58 |         * **has\_no\_results** - *boolean* whether backend has no matching
59 |           results
61 | An additional file **db.json** is provided which contains the set of values for
62 | each slot.
63 |
64 | Note: The date values in the dataset are normalized to constants of the form
65 | "base_date_plus_X", for X from 0 to 6. X=0 corresponds to the current date (i.e.
66 | 'today'), X=1 is 'tomorrow', etc. This is done to allow handling of relative
67 | references to dates (e.g. 'this weekend', 'next Wednesday', etc). The parsing of
68 | such phrases should be done as a separate pre-processing step.
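
A minimal sketch of working with these constants, assuming the caller supplies the base date. Both helper names are hypothetical, and mapping a weekday to its nearest upcoming occurrence (counting today) is only one possible reading of phrases such as 'next Wednesday'.

```python
import datetime
import re

_PATTERN = re.compile(r"^base_date_plus_([0-6])$")


def resolve_date(value: str, base_date: datetime.date) -> datetime.date:
    """Map 'base_date_plus_X' to the calendar date X days after base_date."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a normalized date constant: {value!r}")
    return base_date + datetime.timedelta(days=int(match.group(1)))


def weekday_constant(weekday: int, base_date: datetime.date) -> str:
    """Constant for the nearest upcoming weekday (Monday=0), counting today."""
    offset = (weekday - base_date.weekday()) % 7
    return f"base_date_plus_{offset}"


base = datetime.date(2018, 1, 1)  # arbitrary base date chosen for the example
tomorrow = resolve_date("base_date_plus_1", base)
same_weekday = weekday_constant(base.weekday(), base)
```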
69 |
--------------------------------------------------------------------------------
/sim-GEN/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google-research-datasets/simulated-dialogue/9c5752ffa60d00c58efe52ce2ec6a29f2c2bb47b/sim-GEN/data.zip
--------------------------------------------------------------------------------
/sim-M/README.md:
--------------------------------------------------------------------------------
1 | ## Structure of the Data
2 |
3 | Each dialogue is represented as a JSON object with the following fields:
4 |
5 | * **dialogue\_id** - A unique identifier for a dialogue.
6 | * **turns** - A list of annotated agent and user utterance pairs having the
7 |   following fields:
8 |     * **system\_acts** - A list of system actions. An action consists of an
9 |       action type, and optional slot and value arguments. Each action has the
10 |       following fields:
11 |         * **type** - An action type. Possible values are listed below.
12 |         * **slot** - Optional slot argument.
13 |         * **value** - Optional value argument. If value is present, slot must
14 |           be present.
15 |     * **system\_utterance** - The system utterance, having the following
16 |       fields:
17 |         * **text** - The text of the utterance.
18 |         * **tokens** - A list containing the tokenized version of text.
19 |         * **slots** - A list containing locations of mentions of values
20 |           corresponding to slots in the utterance, having the following
21 |           fields:
22 |             * **slot** - The name of the slot.
23 |             * **start** - The index of the first token corresponding to a slot
24 |               value in the tokens list.
25 |             * **exclusive\_end** - The index of the token succeeding the last
26 |               token corresponding to the slot value in the tokens list. In
27 |               Python, `tokens[start:exclusive_end]` gives the tokens for the
28 |               slot value.
29 |     * **user\_acts** - A list of user actions. Has the same structure as
30 |       system\_acts.
31 |     * **user\_utterance** - The user utterance. It has three fields, similar
32 |       to system\_utterance.
33 |     * **user\_intents** - A list of user intents specified in the current turn.
34 |       Possible values are listed below.
35 |     * **dialogue\_state** - Contains the preferences for the different slots
36 |       as specified by the user up to the current turn of the dialogue.
37 |       Represented as a list containing:
38 |         * **slot** - The name of the slot.
39 |         * **value** - The value assigned to the slot.
40 |
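The span fields described above can be used to recover the text of each slot mention. The sketch below is a minimal example; the `slot_mentions` helper and the sample utterance are made up for illustration and follow the documented layout.

```python
from typing import Dict, List


def slot_mentions(utterance: dict) -> Dict[str, str]:
    """Map each annotated slot to the text of its mention in the utterance."""
    tokens: List[str] = utterance["tokens"]
    mentions = {}
    for span in utterance.get("slots", []):
        # exclusive_end follows Python slice semantics, so no +1 is needed.
        mentions[span["slot"]] = " ".join(tokens[span["start"]:span["exclusive_end"]])
    return mentions


# Hand-written utterance in the documented layout; values are made up.
user_utterance = {
    "text": "2 tickets for friday at 6 pm",
    "tokens": ["2", "tickets", "for", "friday", "at", "6", "pm"],
    "slots": [
        {"slot": "num_people", "start": 0, "exclusive_end": 1},
        {"slot": "date", "start": 3, "exclusive_end": 4},
        {"slot": "time", "start": 5, "exclusive_end": 7},
    ],
}
mentions = slot_mentions(user_utterance)
```
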
41 | The list of action types is inspired by the Cambridge dialogue act schema
42 | ([DSTC2 Handbook](http://camdial.org/~mh521/dstc/downloads/handbook.pdf), p. 19).
43 | The possible values are:
44 |
45 | * AFFIRM
46 | * CANT\_UNDERSTAND
47 | * CONFIRM
48 | * INFORM
49 | * GOOD\_BYE
50 | * GREETING
51 | * NEGATE
52 | * OTHER
53 | * NOTIFY\_FAILURE
54 | * NOTIFY\_SUCCESS
55 | * OFFER
56 | * REQUEST
57 | * REQUEST\_ALTS
58 | * SELECT
59 | * THANK\_YOU
60 |
61 | The possible values of user intents are:
62 |
63 | * BUY\_MOVIE\_TICKETS
64 |
--------------------------------------------------------------------------------
/sim-R/README.md:
--------------------------------------------------------------------------------
1 | ## Structure of the Data
2 |
3 | Each dialogue is represented as a JSON object with the following fields:
4 |
5 | * **dialogue\_id** - A unique identifier for a dialogue.
6 | * **turns** - A list of annotated agent and user utterance pairs having the
7 |   following fields:
8 |     * **system\_acts** - A list of system actions. An action consists of an
9 |       action type, and optional slot and value arguments. Each action has the
10 |       following fields:
11 |         * **type** - An action type. Possible values are listed below.
12 |         * **slot** - Optional slot argument.
13 |         * **value** - Optional value argument. If value is present, slot must
14 |           be present.
15 |     * **system\_utterance** - The system utterance, having the following
16 |       fields:
17 |         * **text** - The text of the utterance.
18 |         * **tokens** - A list containing the tokenized version of text.
19 |         * **slots** - A list containing locations of mentions of values
20 |           corresponding to slots in the utterance, having the following
21 |           fields:
22 |             * **slot** - The name of the slot.
23 |             * **start** - The index of the first token corresponding to a slot
24 |               value in the tokens list.
25 |             * **exclusive\_end** - The index of the token succeeding the last
26 |               token corresponding to the slot value in the tokens list. In
27 |               Python, `tokens[start:exclusive_end]` gives the tokens for the
28 |               slot value.
29 |     * **user\_acts** - A list of user actions. Has the same structure as
30 |       system\_acts.
31 |     * **user\_utterance** - The user utterance. It has three fields, similar
32 |       to system\_utterance.
33 |     * **user\_intents** - A list of user intents specified in the current turn.
34 |       Possible values are listed below.
35 |     * **dialogue\_state** - Contains the preferences for the different slots
36 |       as specified by the user up to the current turn of the dialogue.
37 |       Represented as a list containing:
38 |         * **slot** - The name of the slot.
39 |         * **value** - The value assigned to the slot.
40 |
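For sequence tagging or state tracking, the annotations above often need to be reshaped. The sketch below shows one way to derive per-token BIO tags from the slot spans and to turn the dialogue\_state list into a dictionary. Both helpers and the sample values are illustrative assumptions, and the BIO scheme is a common convention rather than something prescribed by the dataset.

```python
from typing import Dict, List


def bio_tags(utterance: dict) -> List[str]:
    """Label each token as B-<slot>, I-<slot>, or O using the slot spans."""
    tags = ["O"] * len(utterance["tokens"])
    for span in utterance.get("slots", []):
        start, end = span["start"], span["exclusive_end"]
        tags[start] = "B-" + span["slot"]
        for i in range(start + 1, end):
            tags[i] = "I-" + span["slot"]
    return tags


def state_as_dict(dialogue_state: List[dict]) -> Dict[str, str]:
    """Convert the list of slot/value pairs into a slot -> value mapping."""
    return {item["slot"]: item["value"] for item in dialogue_state}


# Hand-written turn fragments in the documented layout; values are made up.
utterance = {
    "text": "table for 4 at 7 pm",
    "tokens": ["table", "for", "4", "at", "7", "pm"],
    "slots": [
        {"slot": "num_people", "start": 2, "exclusive_end": 3},
        {"slot": "time", "start": 4, "exclusive_end": 6},
    ],
}
state = [{"slot": "num_people", "value": "4"}, {"slot": "time", "value": "7 pm"}]

tags = bio_tags(utterance)
state_dict = state_as_dict(state)
```
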
41 | The list of action types is inspired by the Cambridge dialogue act schema
42 | ([DSTC2 Handbook](http://camdial.org/~mh521/dstc/downloads/handbook.pdf), p. 19).
43 | The possible values are:
44 |
45 | * AFFIRM
46 | * CANT\_UNDERSTAND
47 | * CONFIRM
48 | * INFORM
49 | * GOOD\_BYE
50 | * GREETING
51 | * NEGATE
52 | * OTHER
53 | * NOTIFY\_FAILURE
54 | * NOTIFY\_SUCCESS
55 | * OFFER
56 | * REQUEST
57 | * REQUEST\_ALTS
58 | * SELECT
59 | * THANK\_YOU
60 |
61 | The possible values of user intents are:
62 |
63 | * FIND\_RESTAURANT
64 | * RESERVE\_RESTAURANT
65 |
--------------------------------------------------------------------------------