├── README.md
├── sim-GEN
    ├── README.md
    └── data.zip
├── sim-M
    ├── README.md
    ├── dev.json
    ├── test.json
    └── train.json
└── sim-R
    ├── README.md
    ├── dev.json
    ├── test.json
    └── train.json


/README.md:
--------------------------------------------------------------------------------
 1 | # Simulated Dialogue
 2 | 
 3 | ## Machines Talking To Machines (M2M)
 4 | 
 5 | We present datasets of conversations between an agent and a simulated user.
 6 | These conversations are collected using our M2M framework that combines dialogue
 7 | self-play and crowd sourcing to exhaustively generate dialogues. The dialogue
 8 | self-play step generates dialogue outlines consisting of the semantic frames for
 9 | each turn of the dialogue. The crowd sourcing step provides natural language
10 | realizations for each dialogue turn. More details are available in [this
11 | paper](https://arxiv.org/abs/1801.04871). Please cite the paper if you use or
12 | discuss these datasets in your work:
13 | 
 14 | ```bibtex
15 | @article{shah2018building,
16 |   title={Building a Conversational Agent Overnight with Dialogue Self-Play},
17 |   author={Shah, Pararth and Hakkani-T{\"u}r, Dilek and T{\"u}r, Gokhan and Rastogi, Abhinav and Bapna, Ankur and Nayak, Neha and Heck, Larry},
18 |   journal={arXiv preprint arXiv:1801.04871},
19 |   year={2018}
20 | }
21 | ```
22 | 
23 | ## Datasets
24 | 
 25 | We are releasing datasets containing dialogues for two tasks: booking a
 26 | restaurant table and buying a movie ticket. The number of dialogues in each
 27 | dataset is listed below. The README file within each directory contains further
 28 | details about the dataset.
29 | 
30 | | Dataset            | Slots                                                                          | Train | Dev | Test |
31 | | ------------------ | ------------------------------------------------------------------------------ | ----- | --- | ---- |
32 | | Sim-R (Restaurant) | price\_range, location, restaurant\_name,<br>category, num\_people, date, time | 1116  | 349 | 775  |
33 | | Sim-M (Movie)      | theatre\_name, movie, date, time,<br>num\_people                               | 384   | 120 | 264  |
34 | | Sim-GEN (Movie)    | theatre\_name, movie, date, time,<br>num\_people                               | 100K  | 10K | 10K  |
35 | 
36 | **The datasets are provided "AS IS" without any warranty, express or implied.
37 | Google disclaims all liability for any damages, direct or indirect, resulting
38 | from the use of these datasets.**
39 | 
40 | **Please email {abhirast, dilekh}@google.com with questions.**
41 | 
42 | 
43 | ## Dialogue State Tracking
44 | 
 45 | Our publication *Scalable Multi-Domain Dialogue State Tracking (IEEE ASRU 2017)*
 46 | reports joint goal accuracy on the Sim-R and Sim-M datasets. The released version
 47 | of the datasets includes fixes for some errors in dialogue state and action
 48 | annotations. This [updated version](https://arxiv.org/abs/1712.10224) of the
 49 | paper reports the results on the corrected datasets.
50 | 
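As a rough illustration of the metric, joint goal accuracy is commonly computed as the fraction of turns whose predicted dialogue state matches the ground truth state exactly, over all slots at once. The sketch below assumes that definition and assumes each state has already been converted to a slot-to-value dictionary; the function name `joint_goal_accuracy` is hypothetical, and the exact evaluation protocol used in the paper may differ.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns where the predicted state equals the gold state.

    Both arguments are lists of dicts mapping slot names to values,
    one dict per turn (an assumed representation, not a dataset format).
    """
    if len(predicted_states) != len(gold_states):
        raise ValueError("need exactly one predicted state per gold state")
    if not gold_states:
        return 0.0
    # A turn counts as correct only if every slot and value matches.
    correct = sum(
        1 for pred, gold in zip(predicted_states, gold_states) if pred == gold
    )
    return correct / len(gold_states)


# Hand-made example states; the values are illustrative only.
gold = [{"num_people": "2"}, {"num_people": "2", "time": "7 pm"}]
pred = [{"num_people": "2"}, {"num_people": "2", "time": "8 pm"}]
accuracy = joint_goal_accuracy(pred, gold)
```

Here the second turn has one wrong slot value, so only one of the two turns counts as correct.
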
51 | 
52 | ## End-to-End Trainable Task Oriented Dialogue
53 | 
54 | Our publication [*Dialogue Learning with Human Teaching and Feedback in
55 | End-to-End Trainable Task-Oriented Dialogue
56 | Systems*](https://arxiv.org/pdf/1804.06512.pdf) uses Sim-GEN.
57 | 


--------------------------------------------------------------------------------
/sim-GEN/README.md:
--------------------------------------------------------------------------------
 1 | ## Simulator Generated Dataset (sim-GEN)
 2 | 
 3 | This directory contains an expanded set of dialogues generated via dialogue
 4 | self-play between a user simulator and a system agent, as follows:
 5 | 
 6 | -   The dialogues collected using the M2M framework for the movie ticket booking
 7 |     task (sim-M) are used as a seed set to form a crowd-sourced corpus of
 8 |     natural language utterances for the user and the system agents.
 9 | -   Subsequently, many more dialogue outlines are generated using self-play
10 |     between the simulated user and system agent.
11 | -   The dialogue outlines are converted to natural language dialogues by
12 |     replacing each dialogue act in the outline with an utterance sampled from
13 |     the set of crowd-sourced utterances collected with M2M.
14 | 
15 | In this manner, we can generate an arbitrarily large number of dialogue outlines
16 | and convert them automatically to natural language dialogues without any
17 | additional crowd-sourcing step. Although the diversity of natural language in
18 | the dataset does not increase, the number of unique dialogue states present in
19 | the dataset will increase since a larger variety of dialogue outlines will be
20 | available in the expanded dataset.
21 | 
22 | This dataset was used for experiments reported in [this
23 | paper](https://arxiv.org/abs/1804.06512). Please cite the paper if you use or
24 | discuss sim-GEN in your work:
25 | 
 26 | ```bibtex
27 | @article{liu2018dialogue,
28 |   title={Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems},
29 |   author={Liu, Bing and Tur, Gokhan and Hakkani-Tur, Dilek and Shah, Pararth and Heck, Larry},
30 |   journal={NAACL},
31 |   year={2018}
32 | }
33 | ```
34 | 
35 | ## Data format
36 | 
37 | The data splits are made available as a .zip file containing dialogues in JSON
38 | format. Each dialogue object contains the following fields:
39 | 
40 | *   **dialogue\_id** - *string* unique identifier for each dialogue.
41 | *   **turns** - *list* of turn objects:
42 |     *   **system\_acts** - *list* of system dialogue acts for this system turn:
43 |         *   **name** - *string* system act name
44 |         *   **slot\_values** - *optional dictionary* mapping slot names to
45 |             values
46 |     *   **system\_utterance** - *string* natural language utterance
47 |         corresponding to the system acts for this turn
48 |     *   **user\_utterance** - *string* natural language user utterance following
49 |         the system utterance in this turn
50 |     *   **dialogue\_state** - *dictionary* ground truth slot-value mapping after
51 |         the user utterance
52 |     *   **database\_state** - database results based on current dialogue state:
53 |         *   **scores** - *list* of scores, between 0.0 and 1.0, of top 5
54 |             database results. 1.0 means matches all constraints and 0.0 means no
55 |             match
56 |         *   **has\_more\_results** - *boolean* whether backend has more matching
57 |             results
58 |         *   **has\_no\_results** - *boolean* whether backend has no matching
59 |             results
60 | 
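The field layout above can be read with plain dictionary access. The following is a minimal sketch, assuming a turn has already been parsed from JSON into a Python dict; the helper name `summarize_turn`, the act names, and all values in `sample_turn` are hypothetical and are not taken from the dataset.

```python
def summarize_turn(turn):
    """Collect the main fields of one sim-GEN turn into a flat summary."""
    acts = []
    for act in turn.get("system_acts", []):
        # slot_values is optional, so fall back to an empty mapping.
        slot_values = act.get("slot_values") or {}
        args = ", ".join(f"{k}={v}" for k, v in sorted(slot_values.items()))
        acts.append(f"{act['name']}({args})")
    return {
        "acts": acts,
        "system": turn.get("system_utterance", ""),
        "user": turn.get("user_utterance", ""),
        "state": dict(turn.get("dialogue_state", {})),
    }


# Hand-made turn that follows the documented field names.
sample_turn = {
    "system_acts": [
        {"name": "request"},
        {"name": "inform", "slot_values": {"movie": "Example Movie"}},
    ],
    "system_utterance": "Which day works for Example Movie?",
    "user_utterance": "Tomorrow, please.",
    "dialogue_state": {"movie": "Example Movie", "date": "base_date_plus_1"},
    "database_state": {
        "scores": [1.0, 0.5],
        "has_more_results": False,
        "has_no_results": False,
    },
}
summary = summarize_turn(sample_turn)
```
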
61 | An additional file **db.json** is provided which contains the set of values for
62 | each slot.
63 | 
 64 | Note: The date values in the dataset are normalized to the constants
65 | "base_date_plus_X", for X from 0 to 6. X=0 corresponds to the current date (i.e.
66 | 'today'), X=1 is 'tomorrow', etc. This is done to allow handling of relative
67 | references to dates (e.g. 'this weekend', 'next Wednesday', etc). The parsing of
68 | such phrases should be done as a separate pre-processing step.
69 | 
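The normalized date constants described above can be mapped back to calendar dates once a reference date is chosen. The sketch below assumes the constant is spelled exactly `base_date_plus_X` with a non-negative integer `X`; the function name `resolve_base_date` and the reference date in the example are hypothetical.

```python
import datetime
import re

# Matches the normalized constants, e.g. "base_date_plus_3".
_BASE_DATE_RE = re.compile(r"base_date_plus_(\d+)")


def resolve_base_date(value, base_date):
    """Return the concrete date for a normalized constant.

    Returns None when the value is not a "base_date_plus_X" constant,
    so callers can leave other slot values untouched.
    """
    match = _BASE_DATE_RE.fullmatch(value)
    if match is None:
        return None
    offset_days = int(match.group(1))
    return base_date + datetime.timedelta(days=offset_days)


# Example with an arbitrary reference date standing in for "today".
today = datetime.date(2018, 1, 1)
tomorrow = resolve_base_date("base_date_plus_1", today)
```

Going the other way (turning phrases such as 'this weekend' into a constant) is the separate pre-processing step mentioned in the note and is not covered by this sketch.
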


--------------------------------------------------------------------------------
/sim-GEN/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/google-research-datasets/simulated-dialogue/9c5752ffa60d00c58efe52ce2ec6a29f2c2bb47b/sim-GEN/data.zip


--------------------------------------------------------------------------------
/sim-M/README.md:
--------------------------------------------------------------------------------
 1 | ## Structure of the Data
 2 | 
  3 | Each dialogue is represented as a JSON object with the following fields:
 4 | 
 5 | *   **dialogue\_id** - A unique identifier for a dialogue.
 6 | *   **turns** - A list of annotated agent and user utterance pairs having the
 7 |     following fields:
 8 |     *   **system\_acts** - A list of system actions. An action consists of an
 9 |         action type, and optional slot and value arguments. Each action has the
10 |         following fields:
11 |         *   **type** - An action type. Possible values are listed below.
12 |         *   **slot** - Optional slot argument.
13 |         *   **value** - Optional value argument. If value is present, slot must
14 |             be present.
15 |     *   **system\_utterance** - The system utterance having the following
16 |         fields.
17 |         *   **text** - The text of the utterance.
18 |         *   **tokens** - A list containing tokenized version of text.
19 |         *   **slots** - A list containing locations of mentions of values
20 |             corresponding to slots in the utterance, having the following
21 |             fields:
22 |             *   **slot** - The name of the slot
23 |             *   **start** - The index of the first token corresponding to a slot
24 |                 value in the tokens list.
 25 |             *   **exclusive\_end** - The index of the token succeeding the last
 26 |                 token corresponding to the slot value in the tokens list. In
 27 |                 Python, `tokens[start:exclusive_end]` gives the tokens for the
 28 |                 slot value.
29 |     *   **user\_acts** - A list of user actions. Has the same structure as
30 |         system\_acts.
31 |     *   **user\_utterance** - The user utterance. It has three fields, similar
32 |         to system\_utterance.
 33 |     *   **user\_intents** - A list of user intents specified in the current turn.
 34 |         Possible values are listed below.
 35 |     *   **dialogue\_state** - Contains the preferences for the different slots
 36 |         as specified by the user up to the current turn of the dialogue.
37 |         Represented as a list containing:
38 |         *   **slot** - The name of the slot.
39 |         *   **value** - The value assigned to the slot.
40 | 
 41 | The list of action types is inspired by the Cambridge dialogue act schema
 42 | ([DSTC2 Handbook](http://camdial.org/~mh521/dstc/downloads/handbook.pdf), p. 19).
 43 | The possible values are:
44 | 
45 | *   AFFIRM
46 | *   CANT\_UNDERSTAND
47 | *   CONFIRM
48 | *   INFORM
49 | *   GOOD\_BYE
50 | *   GREETING
51 | *   NEGATE
52 | *   OTHER
53 | *   NOTIFY\_FAILURE
54 | *   NOTIFY\_SUCCESS
55 | *   OFFER
56 | *   REQUEST
57 | *   REQUEST\_ALTS
58 | *   SELECT
59 | *   THANK\_YOU
60 | 
61 | The possible values of user intents are:
62 | 
63 | *   BUY\_MOVIE\_TICKETS
64 | 
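The `start` and `exclusive_end` fields described above can be used to recover the text of each slot mention from an utterance. This is a minimal sketch, assuming the utterance has been parsed into a Python dict; the helper name `slot_mentions` and the example utterance are hypothetical and not taken from the dataset.

```python
def slot_mentions(utterance):
    """Return (slot name, mention text) pairs for one utterance object."""
    tokens = utterance["tokens"]
    mentions = []
    for span in utterance.get("slots", []):
        # exclusive_end points one past the last token of the mention.
        words = tokens[span["start"]:span["exclusive_end"]]
        mentions.append((span["slot"], " ".join(words)))
    return mentions


# Hand-made utterance that follows the documented field names.
example_utterance = {
    "text": "i want 2 tickets for some movie",
    "tokens": ["i", "want", "2", "tickets", "for", "some", "movie"],
    "slots": [
        {"slot": "num_people", "start": 2, "exclusive_end": 3},
        {"slot": "movie", "start": 5, "exclusive_end": 7},
    ],
}
mentions = slot_mentions(example_utterance)
```

Joining tokens with single spaces is a simplification; it may not reproduce the original spacing of `text` when the tokenizer splits punctuation.
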


--------------------------------------------------------------------------------
/sim-R/README.md:
--------------------------------------------------------------------------------
 1 | ## Structure of the Data
 2 | 
  3 | Each dialogue is represented as a JSON object with the following fields:
 4 | 
 5 | *   **dialogue\_id** - A unique identifier for a dialogue.
 6 | *   **turns** - A list of annotated agent and user utterance pairs having the
 7 |     following fields:
 8 |     *   **system\_acts** - A list of system actions. An action consists of an
 9 |         action type, and optional slot and value arguments. Each action has the
10 |         following fields:
11 |         *   **type** - An action type. Possible values are listed below.
12 |         *   **slot** - Optional slot argument.
13 |         *   **value** - Optional value argument. If value is present, slot must
14 |             be present.
15 |     *   **system\_utterance** - The system utterance having the following
16 |         fields.
17 |         *   **text** - The text of the utterance.
18 |         *   **tokens** - A list containing tokenized version of text.
19 |         *   **slots** - A list containing locations of mentions of values
20 |             corresponding to slots in the utterance, having the following
21 |             fields:
22 |             *   **slot** - The name of the slot
23 |             *   **start** - The index of the first token corresponding to a slot
24 |                 value in the tokens list.
 25 |             *   **exclusive\_end** - The index of the token succeeding the last
 26 |                 token corresponding to the slot value in the tokens list. In
 27 |                 Python, `tokens[start:exclusive_end]` gives the tokens for the
 28 |                 slot value.
29 |     *   **user\_acts** - A list of user actions. Has the same structure as
30 |         system\_acts.
31 |     *   **user\_utterance** - The user utterance. It has three fields, similar
32 |         to system\_utterance.
 33 |     *   **user\_intents** - A list of user intents specified in the current turn.
 34 |         Possible values are listed below.
 35 |     *   **dialogue\_state** - Contains the preferences for the different slots
 36 |         as specified by the user up to the current turn of the dialogue.
37 |         Represented as a list containing:
38 |         *   **slot** - The name of the slot.
39 |         *   **value** - The value assigned to the slot.
40 | 
 41 | The list of action types is inspired by the Cambridge dialogue act schema
 42 | ([DSTC2 Handbook](http://camdial.org/~mh521/dstc/downloads/handbook.pdf), p. 19).
 43 | The possible values are:
44 | 
45 | *   AFFIRM
46 | *   CANT\_UNDERSTAND
47 | *   CONFIRM
48 | *   INFORM
49 | *   GOOD\_BYE
50 | *   GREETING
51 | *   NEGATE
52 | *   OTHER
53 | *   NOTIFY\_FAILURE
54 | *   NOTIFY\_SUCCESS
55 | *   OFFER
56 | *   REQUEST
57 | *   REQUEST\_ALTS
58 | *   SELECT
59 | *   THANK\_YOU
60 | 
61 | The possible values of user intents are:
62 | 
63 | *   FIND\_RESTAURANT
64 | *   RESERVE\_RESTAURANT
65 | 
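Because `dialogue_state` is stored as a list of slot and value pairs, and acts carry an optional slot and value, a little reshaping often helps before further processing. The sketch below assumes a turn has been parsed into a Python dict; the helper names `state_to_dict` and `acts_of_type`, and every value in `example_turn`, are hypothetical.

```python
def state_to_dict(dialogue_state):
    """Convert the list-of-pairs dialogue state into a slot -> value dict."""
    return {item["slot"]: item["value"] for item in dialogue_state}


def acts_of_type(acts, act_type):
    """Keep only the actions whose type equals act_type (e.g. "REQUEST")."""
    return [act for act in acts if act["type"] == act_type]


# Hand-made turn fragment that follows the documented field names.
example_turn = {
    "system_acts": [
        {"type": "REQUEST", "slot": "time"},
        {"type": "CONFIRM", "slot": "num_people", "value": "2"},
    ],
    "user_intents": ["RESERVE_RESTAURANT"],
    "dialogue_state": [
        {"slot": "location", "value": "downtown"},
        {"slot": "num_people", "value": "2"},
    ],
}
state = state_to_dict(example_turn["dialogue_state"])
requests = acts_of_type(example_turn["system_acts"], "REQUEST")
```
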


--------------------------------------------------------------------------------