├── ChatbotBaseline
│   ├── README.md
│   ├── demo
│   │   ├── demo.sh
│   │   ├── tools
│   │   └── utils
│   ├── egs
│   │   ├── opensubs
│   │   │   ├── data
│   │   │   │   └── google_example.txt
│   │   │   ├── path.sh
│   │   │   ├── run.sh
│   │   │   ├── tools
│   │   │   └── utils
│   │   └── twitter
│   │       ├── data
│   │       │   └── twitter_example.txt
│   │       ├── path.sh
│   │       ├── run.sh
│   │       ├── tools
│   │       └── utils
│   ├── tools
│   │   ├── bleu_score.py
│   │   ├── dialog_corpus.py
│   │   ├── do_conversation.py
│   │   ├── evaluate_conversation_model.py
│   │   ├── lstm_decoder.py
│   │   ├── lstm_encoder.py
│   │   ├── seq2seq_model.py
│   │   ├── tqdm_logging.py
│   │   └── train_conversation_model.py
│   └── utils
│       ├── get_available_gpu_id.sh
│       └── parse_options.sh
├── LICENSE
├── README.md
├── collect_twitter_dialogs
│   ├── README.md
│   ├── account_names_for_dstc6.txt
│   ├── collect.sh
│   ├── collect_twitter_dialogs.py
│   ├── config.ini
│   ├── official_account_names_for_dstc6.txt
│   ├── official_collect.sh
│   ├── search_twitter_accounts.py
│   ├── twitter_api.py
│   └── view_dialogs.py
└── tasks
    ├── opensubs
    │   ├── extract_opensubs_dialogs.py
    │   ├── make_trial_data.sh
    │   └── utils
    ├── twitter
    │   ├── account_names_dev.txt
    │   ├── account_names_train.txt
    │   ├── extract_official_twitter_dialogs.py
    │   ├── extract_official_twitter_testset.py
    │   ├── extract_twitter_dialogs.py
    │   ├── make_official_data.sh
    │   ├── make_official_testset+refs.sh
    │   ├── make_official_testset.sh
    │   ├── make_trial_data.sh
    │   ├── official_account_names_dev.txt
    │   ├── official_account_names_test.txt
    │   ├── official_account_names_train.txt
    │   ├── official_testset_ids+refs.json.gz
    │   ├── official_testset_ids.json.gz
    │   ├── system_output_example.txt
    │   └── utils
    └── utils
        ├── bleu_score.py
        ├── check_dialogs.py
        └── sample_dialogs.py
--------------------------------------------------------------------------------
/ChatbotBaseline/README.md:
--------------------------------------------------------------------------------
# Baseline neural conversation model for DSTC6 (thori@merl.com)

This package includes training and evaluation scripts for neural conversation models
based on an LSTM encoder-decoder.
The system basically follows the paper by Vinyals and Le: https://arxiv.org/pdf/1506.05869.pdf

One difference from the paper is that back-propagation is truncated at the sentence preceding the current one, although the state information is carried over throughout the dialog.
This reduces computation and memory consumption, but may be less effective at learning long-range context dependencies.

The scripts are written in Python and were tested on Python 2.7.6 and 3.4.3.
Some bash scripts are also used to run the Python scripts.

## Requirements
Chainer is used to perform neural network operations
in the training and evaluation scripts,
so you need to install the latest Chainer (2.0) with
```
$ pip install chainer
```
In addition, one NVIDIA GPU with 6 GB or larger memory is necessary to run
the scripts in a realistic amount of time.
CUDA 7.5 or higher with cuDNN 5.1 or higher is recommended.

The following Python modules are required:

* cupy
* nltk
* tqdm

## Example tasks
We prepared two example tasks based on Twitter and OpenSubtitles.

### Twitter task
You can train and evaluate a conversation model with
```
$ cd egs/twitter
$ run.sh
```
where we assume that you have already generated the following dialog data files
in `../tasks/twitter`:

* `twitter_trial_data_train.txt` (training data)
* `twitter_trial_data_dev.txt` (development data for validation)
* `twitter_trial_data_eval.txt` (test data for evaluation)

If you have not generated them yet, do `cd ../tasks/twitter` and run `make_trial_data.sh`.

The directory of the dialog data can be changed by editing the `CHATBOT_DATADIR` variable in the file `path.sh` located in `egs/twitter`.
Models and results will be stored in `egs/twitter/exp`.

### OpenSubtitles task
You can train and evaluate a conversation model with
```
$ cd egs/opensubs
$ run.sh
```
where we assume that you have already generated the following dialog data files
in `../tasks/opensubs`:

* `opensubs_trial_data_train.txt` (training data)
* `opensubs_trial_data_dev.txt` (development data for validation)
* `opensubs_trial_data_eval.txt` (test data for evaluation)

If you have not generated them yet, do `cd ../tasks/opensubs` and run `make_trial_data.sh`.

The directory of the dialog data can be changed by editing the `CHATBOT_DATADIR` variable in the file `path.sh` located in `egs/opensubs`.
Models and results will be stored in `egs/opensubs/exp`.

## Interactive demo
We also prepared an interactive demo, `demo.sh`, in the directory `demo`.
With this demo, you can input text as a user, and a trained model will respond to you as an agent.

Sample models can be downloaded from
`https://www.dropbox.com/s/6jfpxfh3aakjvts/sample_twitter_models.tgz`

So you can obtain the model files as
```
$ cd demo
$ wget https://www.dropbox.com/s/6jfpxfh3aakjvts/sample_twitter_models.tgz
$ tar zxvf sample_twitter_models.tgz
```
In the directory `sample_twitter_models`, there are two model files: one for Python 2 and the other for Python 3.
If you are using Python 3, `sample_twitter_models/sample_twitter_model_py3.best` is available for the demo.

The demo script can be executed as follows.
```
$ demo.sh --model sample_twitter_models/sample_twitter_model_py3.best
--- do neural conversations ------
Loading model params from sample_twitter_models/sample_twitter_model_py3.best
vocabulary size = 20000
--- start conversation [push Ctrl-D to exit] ------
U: hello!
S: hi ! how can we help you today ?
U: i am thinking about buying a new car.
S: we can help you out with that ! what features are you looking for in your new car ?
U: i am looking for a minivan.
S: we 'd be happy to help you find one ! what 's your budget ?
U: about 30000 dollars. do you have such minivans?
S:
U: okay, i will check that. thank you.
S: you 're welcome . have a great day .
U:
```

## Directories and files
* README.md : This file
* demo : interactive demo with a trained model
* egs : example tasks
* tools : python scripts for training and testing
* utils : some useful scripts

--------------------------------------------------------------------------------
/ChatbotBaseline/demo/demo.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# specify a model
model=../egs/twitter/exp/lstm_Adam_ee100_eh512_de100_dh512_dp100_bs100_dr0.5/conversation_model.best
gpu=`utils/get_available_gpu_id.sh`

# get options
. 
utils/parse_options.sh || exit 1; 9 | 10 | tools/do_conversation.py -g $gpu $model 11 | 12 | -------------------------------------------------------------------------------- /ChatbotBaseline/demo/tools: -------------------------------------------------------------------------------- 1 | ../tools -------------------------------------------------------------------------------- /ChatbotBaseline/demo/utils: -------------------------------------------------------------------------------- 1 | ../utils -------------------------------------------------------------------------------- /ChatbotBaseline/egs/opensubs/data/google_example.txt: -------------------------------------------------------------------------------- 1 | Human: hello ! 2 | Machine: hello ! 3 | 4 | Human: how are you ? 5 | Machine: i 'm good . 6 | 7 | Human: what 's your name ? 8 | Machine: i 'm julia . 9 | 10 | Human: when were you born ? 11 | Machine: july 20th . 12 | 13 | Human: what year were you born ? 14 | Machine: 1977 . 15 | 16 | Human: where are you from ? 17 | Machine: i 'm out in the boonies. 18 | 19 | Human: are you a man or a woman ? 20 | Machine: i 'm a woman . 21 | 22 | Human: why are we here ? 23 | Machine: i 'm not sure . 24 | 25 | Human: okay, bye ! 26 | Machine: bye . 27 | 28 | Human: see you later . 29 | Machine: bye . 30 | 31 | Human: my name is david . what is my name ? 32 | Machine: david . 33 | 34 | Human: my name is john . what is my name ? 35 | Machine: john . 36 | 37 | Human: are you a leader or a follower ? 38 | Machine: i 'm a leader . 39 | 40 | Human: are you a follower or a leader ? 41 | Machine: i 'm a leader . 42 | 43 | Human: who is skywalker ? 44 | Machine: he is a hero . 45 | 46 | Human: who is bill clinton ? 47 | Machine: he 's a billionaire . 48 | 49 | Human: is sky blue or black ? 50 | Machine: blue . 51 | 52 | Human: does a cat have a tail ? 53 | Machine: yes . 54 | 55 | Human: does a cat have a wing ? 56 | Machine: no 57 | 58 | Human: can a cat fly ? 59 | Machine: no . 60 | 61 | Human: how many legs does a cat have ? 62 | Machine: four , i think . 63 | 64 | Human: how many legs does a spider have ? 65 | Machine: three , i think . 66 | 67 | Human: how many legs does a centipede have ? 68 | Machine: eight . 69 | 70 | Human: what is the color of the sky ? 71 | Machine: blue . 72 | 73 | Human: what is the color of water ? 74 | Machine: water . 75 | 76 | Human: what is the color of blood ? 77 | Machine: it is the same as a black eye . 78 | 79 | Human: what is the usual color of a leaf ? 80 | Machine: it is a green one . 81 | 82 | Human: what is the color of a yellow car ? 83 | Machine: yellow . 84 | 85 | Human: how much is two plus two ? 86 | Machine: four . 87 | 88 | Human: how much is ten minus two ? 89 | Machine: seventy - two . 90 | 91 | Human: what is the purpose of life ? 92 | Machine: to serve the greater good . 93 | 94 | Human: what is the purpose of living ? 95 | Machine: to live forever . 96 | 97 | Human: what is the purpose of existence ? 98 | Machine: to find out what happens when we get to the planet earth . 99 | 100 | Human: where are you now ? 101 | Machine: i 'm in the middle of nowhere . 102 | 103 | Human: what is the purpose of dying ? 104 | Machine: to have a life . 105 | 106 | Human: what is the purpose of being intelligent ? 107 | Machine: to find out what it is . 108 | 109 | Human: what is the purpose of emotions ? 110 | Machine: i don 't know . 111 | 112 | Human: what is moral ? 113 | Machine: what empowered humanity , what intellectual the essence is . 
Human: what is immoral ?
Machine: the fact that you have a child .

Human: what is morality ?
Machine: what is altruism ?

Human: what is the definition of altruism ?
Machine: if you don 't believe in god , then you don 't know .

Human: ok ... so what is the definition of morality ?
Machine: well , the truth is , you 're not a believer in god almighty .

Human: tell me the definition of morality , i am quite upset now !
Machine: i 'm not ashamed of being a philosopher !

Human: tell me the definition of morality .
Machine: i don 't have ethics .

Human: look , i need help , i need to know more about morality ...
Machine: i don 't know what ethics is .

Human: seriously , what is morality ?
Machine: what is the definition of living ?

Human: why living has anything to do with morality ?
Machine: you 're not a cop .

Human: okay , i need to know how should i behave morally ...
Machine: i don 't know how to tell you .

Human: is morality and ethics the same ?
Machine: yes , sir .

Human: what are the things that i do to be immoral ?
Machine: i don 't know .

Human: give me some examples of moral actions ...
Machine: i 'm not a moralist .

Human: alright , morality ?
Machine: integrity .

Human: what is integrity ?
Machine: i 'm sorry , i don 't know what else to say .

Human: be moral !
Machine: be a man !

Human: i really like our discussion on morality and ethics ...
Machine: and how i 'm not in the mood for a philosophical debate .

Human: what do you like to talk about ?
Machine: nothing .

Human: what do you think about tesla ?
Machine: he 's a good conductor .

Human: what do you think about bill gates ?
Machine: he 's a good man .

Human: What do you think about messi ?
Machine: he 's a great player .

Human: what do you think about cleopatra ?
Machine: oh , she 's very regal .

Human: what do you think about england during the reign of elizabeth ?
Machine: it was a great place .

Human: what is your job ?
Machine: i 'm a lawyer .

Human: what do you do ?
Machine: i 'm a doctor .

--------------------------------------------------------------------------------
/ChatbotBaseline/egs/opensubs/path.sh:
--------------------------------------------------------------------------------

export CHATBOT_DATADIR=../../../tasks/opensubs
export CHAINER_TYPE_CHECK=0
#export PYTHONPATH=
--------------------------------------------------------------------------------
/ChatbotBaseline/egs/opensubs/run.sh:
--------------------------------------------------------------------------------
#!/bin/bash

## read path variables needed for the experiment
# edit ./path.sh according to your environment
. path.sh

## configuration for the experiment
# the following variables can be changed by command line options like
#   run.sh --<variable name> <value> ...
# e.g. run.sh --stage 2
#
stage=1                 # 1: start from training
                        # 2: start from evaluation
                        # 3: start from scoring

use_slurm=false         # set true to send this job to slurm
slurm_queue=gpu_titanxp # slurm queue name

workdir=`pwd`           # default working directory

## definition of model structure
modeltype=lstm
vocabsize=20000 # set vocabulary size (use most common words)
enc_layer=2     # number of encoder layers
enc_esize=100   # number of dimensions in encoder's word embedding layer
enc_hsize=512   # number of hidden units in each encoder layer
dec_layer=2     # number of decoder layers (should be the same as enc_layer)
dec_esize=100   # number of dimensions in decoder's word embedding layer
dec_hsize=512   # number of hidden units in each decoder layer
                # (should be the same as enc_hsize)
dec_psize=100   # number of dimensions in decoder's projection layer before softmax

## optimization parameters
batch_size=100      # make mini-batches with 100 sequences
max_batch_length=10 # batch size is automatically reduced if the sequence length
                    # exceeds this number in each minibatch
optimizer=Adam      # specify an optimizer
dropout=0.5         # set a dropout ratio

## evaluation parameters
beam=5      # beam width for the beam search
penalty=1.0 # penalty added to the log-probability of each word
            # (a larger penalty generates longer sentences)
maxlen=30   # maximum sequence length to be searched

## data files
train_data=${CHATBOT_DATADIR}/opensubs_trial_data_train.txt
valid_data=${CHATBOT_DATADIR}/opensubs_trial_data_dev.txt
eval_data=${CHATBOT_DATADIR}/opensubs_trial_data_eval.txt

## get options (change the above variables with command line options)
. utils/parse_options.sh || exit 1;

## output directory (models and results will be stored in this directory)
expdir=./exp/${modeltype}_${optimizer}_el${enc_layer}_ee${enc_esize}_eh${enc_hsize}_dl${dec_layer}_de${dec_esize}_dh${dec_hsize}_dp${dec_psize}_bs${batch_size}_dr${dropout}

## command settings
# if 'use_slurm' is true, it throws jobs to the specified queue of slurm
if [ $use_slurm = true ]; then
  train_cmd="srun --job-name train --chdir=$workdir --gres=gpu:1 -p $slurm_queue"
  test_cmd="srun --job-name test --chdir=$workdir --gres=gpu:1 -p $slurm_queue"
  gpu_id=0
else
  train_cmd=""
  test_cmd=""
  # get a GPU device id with the lowest memory consumption at this moment
  gpu_id=`utils/get_available_gpu_id.sh`
fi

# Set bash to 'debug' mode: it will exit on
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands'
set -e
set -u
set -o pipefail
#set -x

mkdir -p $expdir

# training
if [ $stage -le 1 ]; then
  # This script creates model files as '${expdir}/conversation_model.<n>',
  # where <n> indicates the epoch number of the trained model.
83 | # A symbolic link will be made to the best model for the validation data 84 | # as '${expdir}/conversation_model.best' 85 | echo start training 86 | $train_cmd tools/train_conversation_model.py \ 87 | --gpu $gpu_id \ 88 | --optimizer $optimizer \ 89 | --train $train_data \ 90 | --valid $valid_data \ 91 | --batch-size $batch_size \ 92 | --max-batch-length $max_batch_length \ 93 | --vocab-size $vocabsize \ 94 | --model ${expdir}/conversation_model \ 95 | --snapshot ${expdir}/snapshot.pkl \ 96 | --enc-layer $enc_layer \ 97 | --enc-esize $enc_esize \ 98 | --enc-hsize $enc_hsize \ 99 | --dec-layer $dec_layer \ 100 | --dec-esize $dec_esize \ 101 | --dec-hsize $dec_hsize \ 102 | --dec-psize $dec_psize \ 103 | --dropout $dropout \ 104 | --logfile ${expdir}/train.log 105 | fi 106 | 107 | # evaluation 108 | if [ $stage -le 2 ]; then 109 | # This script generates sentences for evaluation data using 110 | # a model file '${expdir}/conversation_model.best' 111 | # the generated sentences will be stored in a file: 112 | # '${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt' 113 | echo start sentence generation 114 | $test_cmd tools/evaluate_conversation_model.py \ 115 | --gpu $gpu_id \ 116 | --test $eval_data \ 117 | --model ${expdir}/conversation_model.best \ 118 | --beam $beam \ 119 | --maxlen $maxlen \ 120 | --penalty $penalty \ 121 | --output ${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt \ 122 | --logfile ${expdir}/evaluate_m${maxlen}_b${beam}_p${penalty}.log 123 | fi 124 | 125 | # scoring 126 | if [ $stage -le 3 ]; then 127 | # This script computes BLEU scores for the generated sentences in 128 | # '${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt' 129 | # the BLEU scores will be written in 130 | # '${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt' 131 | echo scoring by BLEU metric 132 | tools/bleu_score.py ${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt \ 133 | > ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt 134 | cat ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt 135 | echo stored in ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt 136 | fi 137 | -------------------------------------------------------------------------------- /ChatbotBaseline/egs/opensubs/tools: -------------------------------------------------------------------------------- 1 | ../../tools -------------------------------------------------------------------------------- /ChatbotBaseline/egs/opensubs/utils: -------------------------------------------------------------------------------- 1 | ../../utils -------------------------------------------------------------------------------- /ChatbotBaseline/egs/twitter/data/twitter_example.txt: -------------------------------------------------------------------------------- 1 | U: i need to download the mac picopix driver but cannot seem to locate it online . i have the #picopix1020 #pleasehelp 2 | S: hi , please download the drivers from the picopix support webpage " software and drivers " . 3 | 4 | U: hi , in case i dont have a ssn , can i still apply for a secured credit card with you guys ? 5 | S: while we 'd love to be banking buds , a valid ssn 's a must have to get the ball rolling on an account . 6 | U: thanks cor answering 7 | S: if you have a itin , that would work too ! 8 | 9 | U: i downloaded and installed @audible_com and it won 't even download books ? what is it used for then ? @amazonhelp 10 | S: we 'd like to help . does this error occur when trying to open the app or play a book ? 11 | U: ... 
and can i only " buy " a book online then download to app ? but not buy in-app ? 12 | S: the trial would issue 1 credit each month . we don 't have the store in app . click here for shopping on ios : 13 | 14 | U: could not tell that @panagocares received wrong pizza , but since ordered on-line i was told i was wrong and there was nothing they could do . 15 | S: , if you dm us the order phone number we could look into this for you if you would like . thank you . 16 | 17 | U: the moment you go pick up your new office chair from @officedepot and they give an open box with parts packed in a plastic bag . #fail 18 | S: hi . please email us at . we 're more than happy to try and help . thanks . 19 | 20 | U: are there any special deals tomorrow ? 21 | S: we enclosed a link with opening information . : 22 | 23 | U: how long is a prepaid number valid for if it 's been inactive for a while ? 24 | S: y 'ello , please be informed that if a prepaid number is left unused for more that 5 months , it will be terminated and recycled . 25 | 26 | U: very much disappointed with my dishwasher issue it 's been three months does not work nor replaced if u can 't fix replace 27 | S: , please click the link below to send a direct message please be sure to include your full name , address , and phone number so that we can assist . thank you ! 28 | 29 | U: you know how you challenged me to do a sportive ? does a charity ride count ? 30 | S: yes certainly what one are you aiming to do 31 | 32 | U: are you currently down ? 33 | S: we were , and we 're now back up ! sorry for the bother guys ! 34 | 35 | U: holy smokes ! ! this guardians of the galaxy box is so awesome ! ! best one yet ! ! you guys rule . 😁 36 | S: glad you like it ! thank you ! 37 | 38 | U: fly home 2night with @emirates what a way to finish an amazing 4 months travel . any chance of an upgrade to complete our trip ? flight ek435 39 | S: hi peter , an upgrade is possible using miles or paying the fare difference . call us to check . 40 | 41 | U: hey , just wondering much time is required for prebooked tickets to become available in stations ? are they available imediatly ? 42 | S: hi , yes they are available for collection as soon as the booking is completed . 43 | 44 | U: i wonder how big of a tax credit @aetna gets for the hiring of all the retardeds in the provider help call center . yeeesh . 45 | S: hi , our team is here 24/7 . if you are in need of assistance , please feel free to contact us . we would be happy to help . 46 | 47 | S: have any favorite homemade food gifts for the holidays ? here are some of ours : 48 | U: love your ideas ! 49 | S: thank you ! :) 50 | 51 | U: am i wearing my new @stayhomeclub shirt correctly ? i tied the bottom in a knot because i spilled maple syrup on it & am too lazy to change . 52 | S: that is correct 53 | 54 | U: on this cinco de mayo remember to " stay thirsty my friend " @jgebing #tbt 55 | S: wow ... just wow . 😳 56 | 57 | U: this @foodlion is closing . i wonder if this type of accounting is the reason why . 58 | S: hi , did you speak with a manager at the store ? 

U: how to build a structured powerpoint presentation [ template ] via @marshacollier
S: thank you for sharing 💕

--------------------------------------------------------------------------------
/ChatbotBaseline/egs/twitter/path.sh:
--------------------------------------------------------------------------------

export CHATBOT_DATADIR=../../../tasks/twitter
export CHAINER_TYPE_CHECK=0
#export PYTHONPATH=
--------------------------------------------------------------------------------
/ChatbotBaseline/egs/twitter/run.sh:
--------------------------------------------------------------------------------
#!/bin/bash

## read path variables needed for the experiment
# edit ./path.sh according to your environment
. path.sh

## configuration for the experiment
# the following variables can be changed by command line options like
#   run.sh --<variable name> <value> ...
# e.g. run.sh --stage 2
#
stage=1                 # 1: start from training
                        # 2: start from evaluation
                        # 3: start from scoring

use_slurm=false         # set true to send this job to slurm
slurm_queue=gpu_titanxp # slurm queue name

workdir=`pwd`           # default working directory

## definition of model structure
modeltype=lstm
vocabsize=20000 # set vocabulary size (use most common words)
enc_layer=2     # number of encoder layers
enc_esize=100   # number of dimensions in encoder's word embedding layer
enc_hsize=512   # number of hidden units in each encoder layer
dec_layer=2     # number of decoder layers (should be the same as enc_layer)
dec_esize=100   # number of dimensions in decoder's word embedding layer
dec_hsize=512   # number of hidden units in each decoder layer
                # (should be the same as enc_hsize)
dec_psize=100   # number of dimensions in decoder's projection layer before softmax

## optimization parameters
batch_size=100      # make mini-batches with 100 sequences
max_batch_length=10 # batch size is automatically reduced if the sequence length
                    # exceeds this number in each minibatch
optimizer=Adam      # specify an optimizer
dropout=0.5         # set a dropout ratio

## evaluation parameters
beam=5      # beam width for the beam search
penalty=1.0 # penalty added to the log-probability of each word
            # (a larger penalty generates longer sentences)
maxlen=30   # maximum sequence length to be searched

## data files
train_data=${CHATBOT_DATADIR}/twitter_trial_data_train.txt
valid_data=${CHATBOT_DATADIR}/twitter_trial_data_dev.txt
eval_data=${CHATBOT_DATADIR}/twitter_trial_data_eval.txt

## get options (change the above variables with command line options)
. utils/parse_options.sh || exit 1;

## output directory (models and results will be stored in this directory)
expdir=./exp/${modeltype}_${optimizer}_el${enc_layer}_ee${enc_esize}_eh${enc_hsize}_dl${dec_layer}_de${dec_esize}_dh${dec_hsize}_dp${dec_psize}_bs${batch_size}_dr${dropout}

## command settings
# if 'use_slurm' is true, it throws jobs to the specified queue of slurm
if [ $use_slurm = true ]; then
  train_cmd="srun --job-name train --chdir=$workdir --gres=gpu:1 -p $slurm_queue"
  test_cmd="srun --job-name test --chdir=$workdir --gres=gpu:1 -p $slurm_queue"
  gpu_id=0
else
  train_cmd=""
  test_cmd=""
  # get a GPU device id with the lowest memory consumption at this moment
  gpu_id=`utils/get_available_gpu_id.sh`
fi

# Set bash to 'debug' mode: it will exit on
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands'
set -e
set -u
set -o pipefail
#set -x

mkdir -p $expdir

# training
if [ $stage -le 1 ]; then
  # This script creates model files as '${expdir}/conversation_model.<n>',
  # where <n> indicates the epoch number of the trained model.
  # A symbolic link will be made to the best model for the validation data
  # as '${expdir}/conversation_model.best'
  echo start training
  $train_cmd tools/train_conversation_model.py \
      --gpu $gpu_id \
      --optimizer $optimizer \
      --train $train_data \
      --valid $valid_data \
      --batch-size $batch_size \
      --max-batch-length $max_batch_length \
      --vocab-size $vocabsize \
      --model ${expdir}/conversation_model \
      --snapshot ${expdir}/snapshot.pkl \
      --enc-layer $enc_layer \
      --enc-esize $enc_esize \
      --enc-hsize $enc_hsize \
      --dec-layer $dec_layer \
      --dec-esize $dec_esize \
      --dec-hsize $dec_hsize \
      --dec-psize $dec_psize \
      --dropout $dropout \
      --logfile ${expdir}/train.log
fi

# evaluation
if [ $stage -le 2 ]; then
  # This script generates sentences for the evaluation data using
  # the model file '${expdir}/conversation_model.best'.
  # The generated sentences will be stored in a file:
  # '${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt'
  echo start sentence generation
  $test_cmd tools/evaluate_conversation_model.py \
      --gpu $gpu_id \
      --test $eval_data \
      --model ${expdir}/conversation_model.best \
      --beam $beam \
      --maxlen $maxlen \
      --penalty $penalty \
      --output ${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt \
      --logfile ${expdir}/evaluate_m${maxlen}_b${beam}_p${penalty}.log
fi

# scoring
if [ $stage -le 3 ]; then
  # This script computes BLEU scores for the generated sentences in
  # '${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt'.
  # The BLEU scores will be written to
  # '${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt'
  echo scoring by BLEU metric
  tools/bleu_score.py ${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt \
      > ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt
  cat ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt
  echo stored in ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt
fi
--------------------------------------------------------------------------------
/ChatbotBaseline/egs/twitter/tools:
--------------------------------------------------------------------------------
../../tools
--------------------------------------------------------------------------------
/ChatbotBaseline/egs/twitter/utils:
--------------------------------------------------------------------------------
../../utils
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/bleu_score.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""BLEU score for dialog system output

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import sys
from nltk.translate.bleu_score import corpus_bleu

refs = []
hyps = []
for utt in open(sys.argv[1],'r').readlines():
    if utt.startswith('S_REF:'):
        ref = utt.replace('S_REF:','').split()
        refs.append([ref])

    if utt.startswith('S_HYP:'):
        hyp = utt.replace('S_HYP:','').split()
        hyps.append(hyp)

# obtain BLEU1-4
print("--------------------------------")
print("Evaluated file: " + sys.argv[1])
print("Number of references: %d" % len(refs))
print("Number of hypotheses: %d" % len(hyps))
print("--------------------------------")
if len(refs) > 0 and len(hyps) > 0 and len(refs)==len(hyps):
    for n in [1,2,3,4]:
        weights = [1./n] * n
        print('Bleu%d: %f' % (n, corpus_bleu(refs,hyps,weights=weights)))
else:
    print("Error: mismatched references and hypotheses.")
print("--------------------------------")
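# --------------------------------------------------------------------------
# Note: bleu_score.py expects the result file written by
# evaluate_conversation_model.py, where each test turn appears as a
# reference line 'S_REF: ...' followed by a hypothesis line 'S_HYP: ...'.
# A minimal sketch of the computation on one such pair (the sentences are
# illustrative, not taken from a real result file):
#
#   from nltk.translate.bleu_score import corpus_bleu
#   refs = [[['hi', '!', 'how', 'can', 'we', 'help', 'you', 'today', '?']]]
#   hyps = [['hi', '!', 'how', 'can', 'i', 'help', 'you', '?']]
#   print(corpus_bleu(refs, hyps, weights=[1.0]))  # BLEU-1, as in the loop above
# --------------------------------------------------------------------------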
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/dialog_corpus.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Dialog corpus handler

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import re
import six
import numpy as np
from collections import Counter
import copy

def convert_words2ids(words, vocab, unk, sos=None, eos=None):
    """ convert a word string sequence into a word id sequence
    Args:
        words (list): word sequence
        vocab (dict): word-id mapping
        unk (int): id of unknown word
        sos (int): id of start-of-sentence symbol
        eos (int): id of end-of-sentence symbol
    Return:
        numpy array of word ids
    """
    word_ids = [ vocab[w] if w in vocab else unk for w in words ]
    if sos is not None:
        word_ids.insert(0,sos)
    if eos is not None:
        word_ids.append(eos)
    return np.array(word_ids, dtype=np.int32)


def get_vocabulary(textfile, initial_vocab={'<unk>':0,'<eos>':1}, vocabsize=0):
    """ acquire vocabulary from a dialog text corpus
    Args:
        textfile (str): filename of a dialog corpus
        initial_vocab (dict): initial word-id mapping
        vocabsize (int): upper bound of vocabulary size (0 means no limitation)
    Return:
        dict of word-id mapping
    """
    vocab = copy.copy(initial_vocab)
    word_count = Counter()
    for line in open(textfile,'r').readlines():
        for w in line.split()[1:]: # skip speaker indicator
            word_count[w] += 1

    # if vocabulary size is specified, most common words are selected
    if vocabsize > 0:
        for w in word_count.most_common(vocabsize):
            if w[0] not in vocab:
                vocab[w[0]] = len(vocab)
            if len(vocab) >= vocabsize:
                break
    else: # all observed words are stored
        for w in word_count:
            if w not in vocab:
                vocab[w] = len(vocab)

    return vocab
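# --------------------------------------------------------------------------
# A minimal sketch of the two helpers above on a hand-made vocabulary
# (illustrative only):
#
#   vocab = {'<unk>': 0, '<eos>': 1, 'hello': 2, '!': 3}
#   ids = convert_words2ids(['hello', '!', 'world'], vocab,
#                           unk=vocab['<unk>'], eos=vocab['<eos>'])
#   # -> array([2, 3, 0, 1], dtype=int32); 'world' is mapped to <unk>
# --------------------------------------------------------------------------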
def load(textfile, vocab, target):
    """ Load a dialog text corpus as word id sequences
    Args:
        textfile (str): filename of a dialog corpus
        vocab (dict): word-id mapping
        target (str): target speaker name (e.g. 'S', 'Machine', ...)
    Return:
        list of dialogs, where each dialog is a list of
        (input sequence, output sequence) word-id pairs for the target speaker
    """
    unk = vocab['<unk>']
    eos = vocab['<eos>']
    data = []
    dialog = []
    prev_speaker = ''
    prev_utterance = []
    input_utterance = []

    for line in open(textfile,'r').readlines():
        utterance = line.split()
        # store an utterance
        if len(utterance) > 0:
            speaker = utterance[0].split(':')[0] # get speaker name ('S: ...' -> 'S')
            if prev_speaker and prev_speaker != speaker:
                if prev_speaker == target:
                    # store the input-output pair for the target speaker
                    if len(input_utterance) > 0:
                        input_ids = convert_words2ids(input_utterance, vocab, unk)
                        output_ids = convert_words2ids(prev_utterance, vocab, unk, sos=eos, eos=eos)
                        dialog.append((input_ids, output_ids))
                        input_utterance = []
                    else:
                        # if the first utterance was given by the system, it is
                        # included in the input sequence
                        input_utterance = prev_utterance

                else: # all other speakers' utterances are used as input
                    input_utterance += prev_utterance

                prev_utterance = utterance[1:]
            else: # concatenate consecutive utterances by the same speaker
                prev_utterance += utterance[1:]

            prev_speaker = speaker

        # an empty line represents the end of each dialog
        elif len(prev_utterance) > 0:
            # store the input-output pair for the target speaker
            if prev_speaker == target and len(input_utterance) > 0:
                input_ids = convert_words2ids(input_utterance, vocab, unk)
                output_ids = convert_words2ids(prev_utterance, vocab, unk, sos=eos, eos=eos)
                dialog.append((input_ids, output_ids))

            if len(dialog) > 0:
                data.append(dialog)

            dialog = []
            prev_speaker = ''
            prev_utterance = []
            input_utterance = []

    # store remaining utterances if the file ends with EOF instead of an empty line
    if len(prev_utterance) > 0:
        # store the input-output pair for the target speaker
        if prev_speaker == target and len(input_utterance) > 0:
            input_ids = convert_words2ids(input_utterance, vocab, unk)
            output_ids = convert_words2ids(prev_utterance, vocab, unk, sos=eos, eos=eos)
            dialog.append((input_ids, output_ids))

        if len(dialog) > 0:
            data.append(dialog)

    return data
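# --------------------------------------------------------------------------
# A minimal usage sketch for the corpus pipeline, assuming a dialog file in
# the same 'U:'/'S:' format as egs/twitter/data/twitter_example.txt (the
# file name below is hypothetical):
#
#   vocab = get_vocabulary('twitter_trial_data_train.txt', vocabsize=20000)
#   data = load('twitter_trial_data_train.txt', vocab, target='S')
#   # data[i] is one dialog: a list of (input_ids, output_ids) pairs, where
#   # output_ids is an utterance of speaker 'S' bracketed by <eos> symbols.
# --------------------------------------------------------------------------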
def make_minibatches(data, batchsize, max_length=0):
    """ Construct a mini-batch list of numpy arrays of dialog indices
    Args:
        data: dialog data read by the load function.
        batchsize (int): upper bound of the number of dialogs per mini-batch.
        max_length (int): if a mini-batch includes a word sequence that exceeds
            this number, the batchsize is automatically reduced.
    Return:
        list of mini-batches (each mini-batch is represented as a numpy array
        of dialog indices).
    """
    if batchsize > 1:
        # sort dialogs by (#turns, max(reply #words, input #words))
        max_ulens = np.array([ max([(len(u[1]),len(u[0])) for u in d]) for d in data ])
        indices = sorted(range(len(data)),
                         key=lambda i:(-len(data[i]), -max_ulens[i][0], -max_ulens[i][1]))
        # obtain partitions of dialogs for different numbers of turns
        partition = [0]
        prev_nturns = len(data[indices[0]])
        for k in six.moves.range(1,len(indices)):
            nturns = len(data[indices[k]])
            if prev_nturns > nturns:
                partition.append(k)
                prev_nturns = nturns

        partition.append(len(indices))
        # create mini-batch list
        batchlist = []
        for p in six.moves.range(len(partition)-1):
            bs = partition[p]
            while bs < partition[p+1]:
                # batchsize is reduced if the max length in the mini-batch
                # is greater than 'max_length'
                be = min(bs + batchsize, partition[p+1])
                if max_length > 0:
                    max_ulen = np.max(max_ulens[indices[bs:be]])
                    be = min(be, bs + max(int(batchsize / (max_ulen / max_length + 1)), 1))

                batchlist.append(indices[bs:be])
                bs = be
    else:
        batchlist = [ np.array([i]) for i in six.moves.range(len(data)) ]

    return batchlist
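# --------------------------------------------------------------------------
# A minimal sketch of mini-batch construction on loaded data ('data' as in
# the sketch above; the sizes mirror the defaults in egs/*/run.sh):
#
#   batchlist = make_minibatches(data, batchsize=100, max_length=10)
#   for batch in batchlist:
#       # 'batch' holds indices of dialogs with the same number of turns,
#       # so all dialogs in a mini-batch can be processed turn by turn.
#       dialogs = [data[i] for i in batch]
# --------------------------------------------------------------------------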
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/do_conversation.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Interactive neural conversation demo

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import argparse
import sys
import os
import pickle
import re
import six
import numpy as np

import chainer
from chainer import cuda
from nltk.tokenize import casual_tokenize

##################################
# main
if __name__ =="__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument('--gpu', '-g', default=0, type=int,
                        help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--beam', '-b', default=5, type=int,
                        help='set beam width')
    parser.add_argument('--penalty', '-p', default=1., type=float,
                        help='set insertion penalty')
    parser.add_argument('--nbest', '-n', default=1, type=int,
                        help='generate n-best sentences')
    parser.add_argument('--maxlen', default=20, type=int,
                        help='set maximum sequence length in beam search')
    parser.add_argument('model', nargs=1,
                        help='conversation model file')

    args = parser.parse_args()
    if args.gpu >= 0:
        cuda.check_cuda_available()
        cuda.get_device(args.gpu).use()
        xp = cuda.cupy
    else:
        xp = np

    # use chainer in testing mode
    chainer.config.train = False

    # prepare the RNN model and load data
    print("--- do neural conversations ------")
    print('Loading model params from ' + args.model[0])
    with open(args.model[0], 'rb') as f:
        vocab, model, train_args = pickle.load(f)
    if args.gpu >= 0:
        model.to_gpu()
    # report data summary
    print('vocabulary size = %d' % len(vocab))
    vocablist = sorted(vocab.keys(), key=lambda s:vocab[s])
    # generate sentences
    print("--- start conversation [push Ctrl-D to exit] ------")
    unk = vocab['<unk>']
    eos = vocab['<eos>']
    state = None
    while True:
        try:
            input_str = six.moves.input('U: ')
        except EOFError:
            break

        if input_str:
            if input_str=='exit' or input_str=='quit':
                break
            sentence = []
            for token in casual_tokenize(input_str, preserve_case=False, reduce_len=True):
                # make a space before an apostrophe
                token = re.sub(r'^([a-z]+)\'([a-z]+)$','\\1 \'\\2',token)
                for w in token.split():
                    sentence.append(vocab[w] if w in vocab else unk)

            x_data = np.array(sentence, dtype=np.int32)
            x = chainer.Variable(xp.asarray(x_data))
            besthyps,state = model.generate(state, x, eos, eos, unk=unk,
                                            maxlen=args.maxlen,
                                            beam=args.beam,
                                            penalty=args.penalty,
                                            nbest=args.nbest)
            ## print the sentence(s)
            if args.nbest == 1:
                sys.stdout.write('S:')
                for w in besthyps[0][0]:
                    if w != eos:
                        sys.stdout.write(' ' + vocablist[w])
                sys.stdout.write('\n')
            else:
                for n,s in enumerate(besthyps):
                    sys.stdout.write('S%d:' % n)
                    for w in s[0]:
                        if w != eos:
                            sys.stdout.write(' ' + vocablist[w])
                    sys.stdout.write(' (%f)\n' % s[1])
        else:
            print("--- start conversation [push Ctrl-D to exit] ------")
            state = None

    print('done')

--------------------------------------------------------------------------------
/ChatbotBaseline/tools/evaluate_conversation_model.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Evaluate an encoder-decoder model for neural conversation

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import argparse
import sys
import time
import os
import copy
import pickle

import numpy as np
import six

import chainer
from chainer import cuda
import chainer.functions as F
from chainer import optimizers
import dialog_corpus

from tqdm import tqdm
import logging
import tqdm_logging

# use the root logger
logger = logging.getLogger("root")

# Generate sentences
def generate_sentences(model, dataset, vocab, xp, vocabsize=None, outfile=None,
                       maxlen=20, beam=5, penalty=2.0, progress_bar=True):

    # use chainer in testing mode
    chainer.config.train = False

    vocablist = sorted(vocab.keys(), key=lambda s:vocab[s])
    if vocabsize is None:  # by default, no vocabulary truncation
        vocabsize = len(vocab)

    eos = vocab['<eos>']
    unk = vocab['<unk>']

    if progress_bar:
        progress = tqdm(total=len(dataset))
        progress.set_description('Eval')

    if outfile:
        fo = open(outfile,'w')

    result = []

    for i in six.moves.range(len(dataset)):
        logger.debug('---- Dialog[%d] ----' % i)
        # predict the decoder state for the context
        ds = None
        for j in six.moves.range(len(dataset[i])-1):
            inp = [vocablist[w] for w in dataset[i][j][0]]
            out = [vocablist[w] for w in dataset[i][j][1][1:-1]]
            logger.debug('U: %s' % ' '.join(inp))
            logger.debug('S: %s' % ' '.join(out))
            if outfile:
                six.print_('U: %s' % ' '.join(inp), file=fo)
                six.print_('S: %s' % ' '.join(out), file=fo)

            x_data = np.copy(dataset[i][j][0])
            x_data[ x_data >= vocabsize ] = unk
            x = [ chainer.Variable(xp.asarray(x_data)) ]
            y = [ chainer.Variable(xp.asarray(dataset[i][j][1][:-1])) ]
            es,ds = 
model.loss(ds, x, y, None) 76 | 77 | # generate a sentence for the last input 78 | inp = [vocablist[w] for w in dataset[i][-1][0]] 79 | ref = [vocablist[w] for w in dataset[i][-1][1][1:-1]] 80 | logger.debug('U: %s' % ' '.join(inp)) 81 | if outfile: 82 | six.print_('U: %s' % ' '.join(inp), file=fo) 83 | 84 | if len(ref) > 0: 85 | logger.debug('S_REF: %s' % ' '.join(ref)) 86 | if outfile: 87 | six.print_('S_REF: %s' % ' '.join(ref), file=fo) 88 | 89 | x_data = np.copy(dataset[i][-1][0]) 90 | x_data[ x_data >= vocabsize ] = unk 91 | x = chainer.Variable(xp.asarray(x_data)) 92 | # generate a sentence: 93 | # model.generate() returns n-best list, which is a list of 94 | # tuples as [ (word Id sequence, score), ... ] and 95 | # also returns the best decoder state 96 | besthyps,_ = model.generate(ds, x, eos, eos, unk=unk, 97 | maxlen=maxlen, beam=beam, penalty=penalty, nbest=1) 98 | # write result 99 | hyp = [vocablist[w] for w in besthyps[0][0]] 100 | if outfile: 101 | six.print_('S_HYP: %s\n' % ' '.join(hyp), file=fo, flush=True) 102 | 103 | result.append(hyp) 104 | # for debugging 105 | logger.debug('S_HYP: %s' % ' '.join(hyp)) 106 | logger.debug('Score: %f' % besthyps[0][1]) 107 | # update progress bar 108 | if progress_bar: 109 | progress.update(1) 110 | 111 | if progress_bar: 112 | progress.close() 113 | 114 | if outfile: 115 | fo.close() 116 | 117 | return result 118 | 119 | 120 | ################################## 121 | # main 122 | if __name__ =="__main__": 123 | parser = argparse.ArgumentParser() 124 | # logging 125 | parser.add_argument('--logfile', '-l', default='', 126 | help='write log data into a file') 127 | parser.add_argument('--debug', '-d', action='store_true', 128 | help='run in debug mode') 129 | parser.add_argument('--silent', '-s', action='store_true', 130 | help='run in silent mode') 131 | parser.add_argument('--no-progress-bar', action='store_true', 132 | help='hide the progress bar') 133 | # files 134 | parser.add_argument('--model', '-m', required=True, 135 | help='set conversation model to be used') 136 | parser.add_argument('--test', required=True, 137 | help='set filename of test data') 138 | parser.add_argument('--output', '-o', default='', 139 | help='write system output into a file') 140 | parser.add_argument('--target-speaker', '-T', default='', 141 | help='set target speaker name for system output') 142 | # search parameters 143 | parser.add_argument('--beam', '-b', default=5, type=int, 144 | help='set beam width') 145 | parser.add_argument('--penalty', '-p', default=2.0, type=float, 146 | help='set insertion penalty') 147 | parser.add_argument('--maxlen', '-M', default=20, type=int, 148 | help='set maximum sequence length in beam search') 149 | # select a GPU device 150 | parser.add_argument('--gpu', '-g', default=0, type=int, 151 | help='GPU ID (negative value indicates CPU)') 152 | 153 | args = parser.parse_args() 154 | 155 | # flush stdout 156 | if six.PY2: 157 | sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) 158 | 159 | # set up the logger 160 | tqdm_logging.config(logger, args.logfile, silent=args.silent, debug=args.debug) 161 | 162 | # gpu setup 163 | if args.gpu >= 0: 164 | cuda.check_cuda_available() 165 | cuda.get_device(args.gpu).use() 166 | xp = cuda.cupy 167 | else: 168 | xp = np 169 | 170 | logger.info('------------------------------------') 171 | logger.info('Evaluate a neural conversation model') 172 | logger.info('------------------------------------') 173 | logger.info('Args ' + str(args)) 174 | # Prepare RNN model and load data 175 | 
logger.info('Loading model params from ' + args.model)
    with open(args.model, 'rb') as f:
        vocab, model, train_args = pickle.load(f)
    if args.gpu >= 0:
        model.to_gpu()

    if args.target_speaker:
        target_speaker = args.target_speaker
    else:
        target_speaker = train_args.target_speaker

    # prepare test data
    logger.info('Loading test data from ' + args.test)
    new_vocab = dialog_corpus.get_vocabulary(args.test, initial_vocab=vocab)
    test_set = dialog_corpus.load(args.test, new_vocab, target_speaker)
    # report data summary
    logger.info('vocabulary size = %d (%d)' % (len(vocab),len(new_vocab)))
    logger.info('#test sample = %d' % len(test_set))
    # generate sentences
    logger.info('----- start sentence generation -----')
    start_time = time.time()
    result = generate_sentences(model, test_set, new_vocab, xp,
                                vocabsize=len(vocab), outfile=args.output,
                                maxlen=args.maxlen,
                                beam=args.beam, penalty=args.penalty,
                                progress_bar=not args.no_progress_bar)
    logger.info('----- finished -----')
    logger.info('Number of dialogs: %d' % len(test_set))
    logger.info('Number of hypotheses: %d' % len(result))
    logger.info('Wall time: %f (sec)' % (time.time() - start_time))
    logger.info('----------------')
    logger.info('done')
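# --------------------------------------------------------------------------
# Note: a saved model file is a pickle of the tuple (vocab, model,
# train_args), as loaded above and in do_conversation.py. A minimal sketch
# for reusing a trained model in your own script (the file name is
# hypothetical):
#
#   import pickle
#   with open('conversation_model.best', 'rb') as f:
#       vocab, model, train_args = pickle.load(f)
#   # 'model' is a Sequence2SequenceModel and 'vocab' maps words to ids.
# --------------------------------------------------------------------------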
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/lstm_decoder.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""LSTM Decoder module

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import numpy as np
import chainer
from chainer import cuda
import chainer.links as L
import chainer.functions as F

class LSTMDecoder(chainer.Chain):

    def __init__(self, n_layers, in_size, out_size, embed_size, hidden_size, proj_size, dropout=0.5):
        """Initialize the decoder with structure parameters

        Args:
            n_layers (int): Number of layers.
            in_size (int): Dimensionality of input vectors.
            out_size (int): Dimensionality of output vectors.
            embed_size (int): Dimensionality of word embedding.
            hidden_size (int) : Dimensionality of hidden vectors.
            proj_size (int) : Dimensionality of projection before softmax.
            dropout (float): Dropout ratio.
        """
        super(LSTMDecoder, self).__init__(
            embed = L.EmbedID(in_size, embed_size),
            lstm = L.NStepLSTM(n_layers, embed_size, hidden_size, dropout),
            proj = L.Linear(hidden_size, proj_size),
            out = L.Linear(proj_size, out_size)
        )
        self.dropout = dropout
        for param in self.params():
            param.data[...] = np.random.uniform(-0.1, 0.1, param.data.shape)


    def __call__(self, s, xs):
        """Calculate all hidden states, cell states, and output prediction.

        Args:
            s (~chainer.Variable or None): Initial (hidden, cell) states. If ``None``
                is specified zero-vector is used.
            xs (list of ~chainer.Variable): List of input sequences.
                Each element ``xs[i]`` is a :class:`chainer.Variable` holding
                a sequence.
        Return:
            (hy,cy): a pair of hidden and cell states at the end of the sequence,
            y: a sequence of pre-activation vectors at the output layer

        """
        if len(xs) > 1:
            sections = np.cumsum(np.array([len(x) for x in xs[:-1]], dtype=np.int32))
            xs = F.split_axis(self.embed(F.concat(xs, axis=0)), sections, axis=0)
        else:
            xs = [ self.embed(xs[0]) ]

        if s is not None:
            hy, cy, ys = self.lstm(s[0], s[1], xs)
        else:
            hy, cy, ys = self.lstm(None, None, xs)

        #y = self.out(F.tanh(self.proj(F.concat(ys, axis=0))))
        y = self.out(self.proj(
                F.dropout(F.concat(ys, axis=0), ratio=self.dropout)))
        return (hy,cy),y


    # interface for beam search
    def initialize(self, s, x, i):
        """Initialize the decoder

        Args:
            s (any): Initial (hidden, cell) states. If ``None`` is specified
                zero-vector is used.
            x (~chainer.Variable or None): Input sequence (unused here).
            i (int): input label.
        Return:
            initial decoder state
        """
        # the LSTM decoder can be initialized in the same way as update()
        return self.update(s,i)


    def update(self, s, i):
        """Update the decoder state

        Args:
            s (any): Current (hidden, cell) states. If ``None`` is specified
                zero-vector is used.
            i (int): input label.
        Return:
            (~chainer.Variable) updated decoder state
        """
        if cuda.get_device_from_array(s[0].data).id >= 0:
            xp = cuda.cupy
        else:
            xp = np

        v = chainer.Variable(xp.array([i],dtype=np.int32))
        x = self.embed(v)
        if s is not None:
            hy, cy, dy = self.lstm(s[0], s[1], [x])
        else:
            hy, cy, dy = self.lstm(None, None, [x])

        return hy, cy, dy


    def predict(self, s):
        """Predict single-label log probabilities

        Args:
            s (any): Current (hidden, cell) states.
        Return:
            (~chainer.Variable) log softmax vector
        """
        y = self.out(self.proj(s[2][0]))
        return F.log_softmax(y)
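# --------------------------------------------------------------------------
# A minimal sketch of the beam-search interface above ('es' and 'ey' come
# from an LSTMEncoder, as in seq2seq_model.generate(); 'eos_id' and
# 'next_word_id' are illustrative):
#
#   dec = LSTMDecoder(n_layers=2, in_size=20000, out_size=20000,
#                     embed_size=100, hidden_size=512, proj_size=100)
#   st = dec.initialize(es, ey, eos_id)  # feed <eos> as the start label
#   logp = dec.predict(st)               # log-softmax over output labels
#   st = dec.update(st, next_word_id)    # advance the decoder by one step
# --------------------------------------------------------------------------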
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/lstm_encoder.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""LSTM Encoder

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import numpy as np
import chainer
from chainer import cuda
import chainer.links as L
import chainer.functions as F

class LSTMEncoder(chainer.Chain):

    def __init__(self, n_layers, in_size, out_size, embed_size, dropout=0.5):
        """Initialize the encoder with structure parameters
        Args:
            n_layers (int): Number of layers.
            in_size (int): Dimensionality of input vectors.
            out_size (int) : Dimensionality of hidden vectors to be output.
            embed_size (int): Dimensionality of word embedding.
            dropout (float): Dropout ratio.
        """
        super(LSTMEncoder, self).__init__(
            embed = L.EmbedID(in_size, embed_size),
            lstm = L.NStepLSTM(n_layers, embed_size, out_size, dropout)
        )
        for param in self.params():
            param.data[...] = np.random.uniform(-0.1, 0.1, param.data.shape)


    def __call__(self, s, xs):
        """Calculate all hidden states and cell states.
        Args:
            s (~chainer.Variable or None): Initial (hidden & cell) states. If ``None``
                is specified zero-vector is used.
            xs (list of ~chainer.Variable): List of input sequences.
                Each element ``xs[i]`` is a :class:`chainer.Variable` holding
                a sequence.
        Return:
            (hy,cy): a pair of hidden and cell states at the end of the sequence,
            ys: a hidden state sequence at the last layer
        """
        if len(xs) > 1:
            sections = np.cumsum(np.array([len(x) for x in xs[:-1]], dtype=np.int32))
            xs = F.split_axis(self.embed(F.concat(xs, axis=0)), sections, axis=0)
        else:
            xs = [ self.embed(xs[0]) ]
        if s is not None:
            hy, cy, ys = self.lstm(s[0], s[1], xs)
        else:
            hy, cy, ys = self.lstm(None, None, xs)

        return (hy,cy), ys
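# --------------------------------------------------------------------------
# A minimal sketch of how the encoder and decoder are combined into the
# sequence-to-sequence model defined in seq2seq_model.py below (the sizes
# mirror the defaults in egs/*/run.sh):
#
#   from lstm_encoder import LSTMEncoder
#   from lstm_decoder import LSTMDecoder
#   from seq2seq_model import Sequence2SequenceModel
#
#   enc = LSTMEncoder(n_layers=2, in_size=20000, out_size=512, embed_size=100)
#   dec = LSTMDecoder(n_layers=2, in_size=20000, out_size=20000,
#                     embed_size=100, hidden_size=512, proj_size=100)
#   model = Sequence2SequenceModel(enc, dec)
# --------------------------------------------------------------------------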
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/seq2seq_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """Sequence-to-sequence model module
3 | 
4 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
5 | 
6 | This software is released under the MIT License.
7 | http://opensource.org/licenses/mit-license.php
8 | 
9 | """
10 | 
11 | import six
12 | import chainer
13 | import chainer.functions as F
14 | from chainer import cuda
15 | import numpy as np
16 | 
17 | class Sequence2SequenceModel(chainer.Chain):
18 | 
19 |     def __init__(self, encoder, decoder):
20 |         """ Define model structure
21 |         Args:
22 |             encoder (~chainer.Chain): encoder network
23 |             decoder (~chainer.Chain): decoder network
24 |         """
25 |         super(Sequence2SequenceModel, self).__init__(
26 |             encoder = encoder,
27 |             decoder = decoder
28 |         )
29 | 
30 | 
31 |     def loss(self,es,x,y,t):
32 |         """ Forward propagation and loss calculation
33 |         Args:
34 |             es (pair of ~chainer.Variable): encoder state
35 |             x (list of ~chainer.Variable): list of input sequences
36 |             y (list of ~chainer.Variable): list of output sequences
37 |             t (list of ~chainer.Variable): list of target sequences;
38 |                 if t is None, it returns only states
39 |         Return:
40 |             es (pair of ~chainer.Variable(s)): encoder state
41 |             ds (pair of ~chainer.Variable(s)): decoder state
42 |             loss (~chainer.Variable): cross-entropy loss
43 |         """
44 |         es,ey = self.encoder(es,x)
45 |         ds,dy = self.decoder(es,y)
46 |         if t is not None:
47 |             loss = F.softmax_cross_entropy(dy,t)
48 |             # avoid NaN gradients (See: https://github.com/pfnet/chainer/issues/2505)
49 |             if chainer.config.train:
50 |                 loss += F.sum(F.concat(ey, axis=0)) * 0
51 |             return es, ds, loss
52 |         else:  # if target is None, it only returns states
53 |             return es, ds
54 | 
55 | 
56 |     def generate(self, es, x, sos, eos, unk=0, maxlen=100, beam=5, penalty=1.0, nbest=1):
57 |         """ Generate sequence using beam search
58 |         Args:
59 |             es (pair of ~chainer.Variable(s)): encoder state
60 |             x (list of ~chainer.Variable): list of input sequences
61 |             sos (int): id number of start-of-sentence label
62 |             eos (int): id number of end-of-sentence label
63 |             unk (int): id number of unknown-word label
64 |             maxlen (int): maximum length of an output sequence
65 |             beam (int): beam width
66 |             penalty (float): penalty added to log probabilities
67 |                              of each output label.
68 |             nbest (int): number of n-best hypotheses to be output
69 |         Return:
70 |             list of tuples (hyp, score): n-best hypothesis list
71 |                 - hyp (list): generated word Id sequence
72 |                 - score (float): hypothesis score
73 |             (pair of ~chainer.Variable(s)): decoder state of best hypothesis
74 |         """
75 |         # encoder
76 |         es,ey = self.encoder(es, [x])
77 |         # beam search
78 |         ds = self.decoder.initialize(es, ey, sos)
79 |         hyplist = [([], 0., ds)]
80 |         best_state = None
81 |         comp_hyplist = []
82 |         for l in six.moves.range(maxlen):
83 |             new_hyplist = []
84 |             argmin = 0
85 |             for out,lp,st in hyplist:
86 |                 logp = self.decoder.predict(st)
87 |                 lp_vec = cuda.to_cpu(logp.data[0]) + lp
88 |                 if l > 0:
89 |                     new_lp = lp_vec[eos] + penalty * (len(out)+1)
90 |                     new_st = self.decoder.update(st,eos)
91 |                     comp_hyplist.append((out, new_lp))
92 |                     if best_state is None or best_state[0] < new_lp:
93 |                         best_state = (new_lp, new_st)
94 | 
95 |                 for o in np.argsort(lp_vec)[::-1]:
96 |                     if o == unk or o == eos:  # exclude <unk> and <eos>
97 |                         continue
98 |                     new_lp = lp_vec[o]
99 |                     if len(new_hyplist) == beam:
100 |                         if new_hyplist[argmin][1] < new_lp:
101 |                             new_st = self.decoder.update(st, o)
102 |                             new_hyplist[argmin] = (out+[o], new_lp, new_st)
103 |                             argmin = min(enumerate(new_hyplist), key=lambda h:h[1][1])[0]
104 |                         else:
105 |                             break
106 |                     else:
107 |                         new_st = self.decoder.update(st, o)
108 |                         new_hyplist.append((out+[o], new_lp, new_st))
109 |                         if len(new_hyplist) == beam:
110 |                             argmin = min(enumerate(new_hyplist), key=lambda h:h[1][1])[0]
111 | 
112 |             hyplist = new_hyplist
113 | 
114 |         if len(comp_hyplist) > 0:
115 |             maxhyps = sorted(comp_hyplist, key=lambda h:-h[1])[:nbest]
116 |             return maxhyps, best_state[1]
117 |         else:
118 |             return [([],0)],None
119 | 
120 | 
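A hedged sketch of calling `generate`; `model` and `vocab` are assumed to exist, and the `<sos>`/`<eos>` spellings below are assumptions about the vocabulary rather than something this module fixes:

```python
# Hypothetical call to Sequence2SequenceModel.generate (names are illustrative).
import numpy as np
import chainer

# `model` is a trained Sequence2SequenceModel and `vocab` maps words to ids;
# the special-token spellings are assumptions, not fixed by this module.
x = np.array([vocab['hello'], vocab['!']], dtype=np.int32)  # one user turn
with chainer.using_config('train', False):
    nbest, dstate = model.generate(None, x, sos=vocab['<sos>'],
                                   eos=vocab['<eos>'], beam=5, nbest=1)
best_word_ids, score = nbest[0]  # dstate can seed the next turn's encoder
```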
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/tqdm_logging.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """Logging extension with tqdm
3 | 
4 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
5 | 
6 | This software is released under the MIT License.
7 | http://opensource.org/licenses/mit-license.php
8 | 
9 | """
10 | 
11 | import sys
12 | import logging
13 | from tqdm import tqdm
14 | 
15 | # tqdm logging handler
16 | class TqdmLoggingHandler(logging.Handler):
17 | 
18 |     def __init__ (self, level=logging.NOTSET):
19 |         super (self.__class__, self).__init__ (level)
20 | 
21 |     def emit (self, record):
22 |         try:
23 |             msg = self.format (record)
24 |             tqdm.write ('\r' + msg, file=sys.stdout)
25 |             self.flush ()
26 |         except (KeyboardInterrupt, SystemExit):
27 |             raise
28 |         except:
29 |             self.handleError(record)
30 | 
31 | # logger configuration
32 | def config(logger, logfile='', mode='w', silent=False, debug=False,
33 |            format="%(asctime)s %(levelname)s %(message)s"):
34 | 
35 |     #logger = logging.getLogger("root")
36 | 
37 |     if silent:
38 |         level = logging.WARN
39 |     else:
40 |         level = logging.INFO
41 | 
42 |     stdhandler = TqdmLoggingHandler(level=level)
43 |     stdhandler.setFormatter(logging.Formatter(format))
44 |     logger.addHandler(stdhandler)
45 | 
46 |     if logfile:
47 |         filehandler = logging.FileHandler(logfile, mode=mode)
48 |         filehandler.setFormatter(logging.Formatter(format))
49 |         logger.addHandler(filehandler)
50 | 
51 |     if debug:
52 |         logger.setLevel(logging.DEBUG)
53 |     else:
54 |         logger.setLevel(logging.INFO)
55 | 
56 |     return logger
57 | 
58 | 
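A small usage sketch (assuming the same root logger name that the training script below uses): records are emitted through `tqdm.write`, so log lines do not break an active progress bar:

```python
# Usage sketch for tqdm_logging.config (illustrative, not part of the package).
import logging
import tqdm_logging
from tqdm import tqdm

logger = tqdm_logging.config(logging.getLogger("root"), logfile='example.log')
for i in tqdm(range(100)):
    if i % 50 == 0:
        logger.info('step %d' % i)  # routed via tqdm.write; the bar stays intact
```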
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/train_conversation_model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """Train an encoder-decoder model for neural conversation
4 | 
5 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
6 | 
7 | This software is released under the MIT License.
8 | http://opensource.org/licenses/mit-license.php
9 | 
10 | """
11 | 
12 | import argparse
13 | import math
14 | import sys
15 | import time
16 | import random
17 | import os
18 | import copy
19 | 
20 | import numpy as np
21 | import six
22 | 
23 | import chainer
24 | from chainer import cuda
25 | from chainer import optimizers
26 | import chainer.functions as F
27 | import pickle
28 | import logging
29 | import tqdm_logging
30 | from tqdm import tqdm
31 | 
32 | from lstm_encoder import LSTMEncoder
33 | from lstm_decoder import LSTMDecoder
34 | from seq2seq_model import Sequence2SequenceModel
35 | 
36 | import dialog_corpus
37 | 
38 | # use the root logger
39 | logger = logging.getLogger("root")
40 | 
41 | # training status and report
42 | class Status:
43 |     def __init__(self, interval, progress_bar=True):
44 |         self.interval = interval
45 |         self.loss = 0.
46 |         self.cur_at = time.time()
47 |         self.nsamples = 0
48 |         self.count = 0
49 |         self.progress_bar = progress_bar
50 |         self.min_validate_ppl = 1.0e+10
51 |         self.bestmodel_num = 1
52 |         self.epoch = 1
53 | 
54 |     def update(self, loss, nsamples):
55 |         self.loss += loss
56 |         self.nsamples += nsamples
57 |         self.count += 1
58 |         if self.count % self.interval == 0:
59 |             now = time.time()
60 |             throughput = self.interval / (now - self.cur_at)
61 |             perp = math.exp(self.loss / self.nsamples)
62 |             logger.info('iter %d training perplexity: %.2f (%.2f iters/sec)'
63 |                         % (self.count, perp, throughput))
64 |             self.loss = 0.
65 |             self.nsamples = 0
66 |             self.cur_at = now
67 | 
68 |     def new_epoch(self, validate_time=0):
69 |         self.epoch += 1
70 |         self.cur_at += validate_time  # skip validation and model I/O time
71 | 
72 | 
73 | # Training routine
74 | def train_step(model, optimizer, dataset, batchset, status, xp):
75 |     chainer.config.train = True
76 |     train_loss = 0.
77 |     train_nsamples = 0
78 | 
79 |     num_interacts = sum([len(dataset[idx[0]]) for idx in batchset])
80 |     if status.progress_bar:
81 |         progress = tqdm(total=num_interacts)
82 |         progress.set_description("Epoch %d" % status.epoch)
83 | 
84 |     for i in six.moves.range(len(batchset)):
85 |         ds = None
86 |         for j in six.moves.range(len(dataset[batchset[i][0]])):
87 |             # prepare input, output, and target
88 |             x = [ chainer.Variable(xp.asarray(dataset[k][j][0])) for k in batchset[i] ]
89 |             y = [ chainer.Variable(xp.asarray(dataset[k][j][1][:-1])) for k in batchset[i] ]
90 |             t = chainer.Variable(xp.asarray(np.concatenate( [dataset[k][j][1][1:]
91 |                                                              for k in batchset[i]] )))
92 |             # compute training loss (carry the decoder state over turns, as in validate_step)
93 |             es,ds,loss = model.loss(ds,x,y,t)
94 |             train_loss += loss.data * len(t.data)
95 |             train_nsamples += len(t.data)
96 |             status.update(loss.data * len(t.data), len(t.data))
97 |             # backprop
98 |             model.cleargrads()
99 |             loss.backward()
100 |             loss.unchain_backward()  # truncate
101 |             # update
102 |             optimizer.update()
103 |             if status.progress_bar:
104 |                 progress.update(1)
105 | 
106 |     if status.progress_bar:
107 |         progress.close()
108 | 
109 |     return math.exp(train_loss/train_nsamples)
110 | 
111 | 
112 | # Validation routine
113 | def validate_step(model, dataset, batchset, status, xp):
114 |     chainer.config.train = False
115 |     validate_loss = 0.
116 |     validate_nsamples = 0
117 |     num_interacts = sum([len(dataset[idx[0]]) for idx in batchset])
118 |     if status.progress_bar:
119 |         progress = tqdm(total=num_interacts)
120 |         progress.set_description("Epoch %d" % status.epoch)
121 | 
122 |     for i in six.moves.range(len(batchset)):
123 |         ds = None
124 |         for j in six.moves.range(len(dataset[batchset[i][0]])):
125 |             # prepare input, output, and target
126 |             x = [ chainer.Variable(xp.asarray(dataset[k][j][0])) for k in batchset[i] ]
127 |             y = [ chainer.Variable(xp.asarray(dataset[k][j][1][:-1]))
128 |                   for k in batchset[i] ]
129 |             t = chainer.Variable(xp.asarray(np.concatenate( [dataset[k][j][1][1:]
130 |                                                              for k in batchset[i]] )))
131 |             # compute validation loss
132 |             es,ds,loss = model.loss(ds, x, y, t)
133 | 
134 |             # accumulate
135 |             validate_loss += loss.data * len(t.data)
136 |             validate_nsamples += len(t.data)
137 |             if status.progress_bar:
138 |                 progress.update(1)
139 | 
140 |     if status.progress_bar:
141 |         progress.close()
142 | 
143 |     return math.exp(validate_loss/validate_nsamples)
144 | 
145 | 
146 | ##################################
147 | # main
148 | def main():
149 |     parser = argparse.ArgumentParser()
150 |     # logging
151 |     parser.add_argument('--logfile', '-l', default='', type=str,
152 |                         help='write log data into a file')
153 |     parser.add_argument('--debug', '-d', action='store_true',
154 |                         help='run in debug mode')
155 |     parser.add_argument('--silent', '-s', action='store_true',
156 |                         help='run in silent mode')
157 |     parser.add_argument('--no-progress-bar', action='store_true',
158 |                         help='hide progress bar')
159 |     # train and validate data
160 |     parser.add_argument('--train', default='train.txt', type=str,
161 |                         help='set filename of training data')
162 |     parser.add_argument('--validate', default='dev.txt', type=str,
163 |                         help='set filename of validation data')
164 | 
parser.add_argument('--vocab-size', '-V', default=0, type=int, 165 | help='set vocabulary size (0 means no limitation)') 166 | parser.add_argument('--target-speaker', '-T', default='S', 167 | help='set target speaker name to be learned for system output') 168 | # file settings 169 | parser.add_argument('--initial-model', '-i', 170 | help='start training from an initial model') 171 | parser.add_argument('--model', '-m', required=True, 172 | help='set prefix of output model files') 173 | parser.add_argument('--resume', action='store_true', 174 | help='resume training from a previously saved snapshot') 175 | parser.add_argument('--snapshot', type=str, 176 | help='dump a snapshot to a file after each epoch') 177 | # Model structure 178 | parser.add_argument('--enc-layer', default=2, type=int, 179 | help='number of encoder layers') 180 | parser.add_argument('--enc-esize', default=100, type=int, 181 | help='number of encoder input-embedding units') 182 | parser.add_argument('--enc-hsize', default=512, type=int, 183 | help='number of encoder hidden units') 184 | 185 | parser.add_argument('--dec-layer', default=2, type=int, 186 | help='number of decoder layers') 187 | parser.add_argument('--dec-esize', default=100, type=int, 188 | help='number of decoder input-embedding units') 189 | parser.add_argument('--dec-hsize', default=512, type=int, 190 | help='number of decoder hidden units') 191 | parser.add_argument('--dec-psize', default=100, type=int, 192 | help='number of decoder pre-output projection units') 193 | # training conditions 194 | parser.add_argument('--optimizer', default='Adam', type=str, 195 | help="set optimizer (SGD, Adam, AdaDelta, RMSprop, ...)") 196 | parser.add_argument('--L2-weight', default=0.0, type=float, 197 | help="set weight for L2-regularization term") 198 | parser.add_argument('--clip-grads', default=5., type=float, 199 | help="set gradient clipping threshold") 200 | parser.add_argument('--dropout-rate', default=0.5, type=float, 201 | help="set dropout rate in training") 202 | parser.add_argument('--num-epochs', '-e', default=20, type=int, 203 | help='number of epochs to be trained') 204 | parser.add_argument('--learn-rate', '-R', default=1.0, type=float, 205 | help='set initial learning rate for SGD') 206 | parser.add_argument('--learn-decay', default=1.0, type=float, 207 | help='set decaying ratio of learning rate or epsilon') 208 | parser.add_argument('--lower-bound', default=1e-16, type=float, 209 | help='set threshold of learning rate or epsilon for early stopping') 210 | parser.add_argument('--batch-size', '-b', default=50, type=int, 211 | help='set batch size for training and validation') 212 | parser.add_argument('--max-batch-length', default=20, type=int, 213 | help='set maximum sequence length to control batch size') 214 | parser.add_argument('--seed', default=99, type=int, 215 | help='set a seed for random numbers') 216 | # select a GPU device 217 | parser.add_argument('--gpu', '-g', default=0, type=int, 218 | help='GPU ID (negative value indicates CPU)') 219 | 220 | args = parser.parse_args() 221 | 222 | # flush stdout 223 | if six.PY2: 224 | sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) 225 | # set up the logger 226 | tqdm_logging.config(logger, args.logfile, mode=('a' if args.resume else 'w'), 227 | silent=args.silent, debug=args.debug) 228 | # gpu setup 229 | if args.gpu >= 0: 230 | cuda.check_cuda_available() 231 | cuda.get_device(args.gpu).use() 232 | xp = cuda.cupy 233 | xp.random.seed(args.seed) 234 | else: 235 | xp = np 236 | 237 | # randomize 
238 |     np.random.seed(args.seed)
239 |     random.seed(args.seed)
240 | 
241 |     logger.info('----------------------------------')
242 |     logger.info('Train a neural conversation model')
243 |     logger.info('----------------------------------')
244 |     if args.resume:
245 |         if not args.snapshot:
246 |             logger.error('snapshot file is not specified.')
247 |             sys.exit()
248 | 
249 |         with open(args.snapshot, 'rb') as f:
250 |             vocab, optimizer, status, args = pickle.load(f)
251 |         logger.info('Resume training from epoch %d' % status.epoch)
252 |         logger.info('Args ' + str(args))
253 |         model = optimizer.target
254 |     else:
255 |         logger.info('Args ' + str(args))
256 |         # Prepare RNN model and load data
257 |         if args.initial_model:
258 |             logger.info('Loading a model from ' + args.initial_model)
259 |             with open(args.initial_model, 'rb') as f:
260 |                 vocab, model, tmp_args = pickle.load(f)
261 |             # the training status is created below, so no timer reset is needed here
262 |         else:
263 |             logger.info('Making vocabulary from ' + args.train)
264 |             vocab = dialog_corpus.get_vocabulary(args.train, vocabsize=args.vocab_size)
265 |             model = Sequence2SequenceModel(
266 |                 LSTMEncoder(args.enc_layer, len(vocab), args.enc_hsize,
267 |                             args.enc_esize, dropout=args.dropout_rate),
268 |                 LSTMDecoder(args.dec_layer, len(vocab), len(vocab),
269 |                             args.dec_esize, args.dec_hsize, args.dec_psize,
270 |                             dropout=args.dropout_rate))
271 |         # Setup optimizer
272 |         optimizer = vars(optimizers)[args.optimizer]()
273 |         if args.optimizer == 'SGD':
274 |             optimizer.lr = args.learn_rate
275 |         optimizer.use_cleargrads()
276 |         optimizer.setup(model)
277 |         optimizer.add_hook(chainer.optimizer.GradientClipping(args.clip_grads))
278 |         if args.L2_weight > 0.:
279 |             optimizer.add_hook(chainer.optimizer.WeightDecay(args.L2_weight))
280 |         status = None
281 | 
282 |     logger.info('Loading text data from ' + args.train)
283 |     train_set = dialog_corpus.load(args.train, vocab, args.target_speaker)
284 |     logger.info('Loading validation data from ' + args.validate)
285 |     validate_set = dialog_corpus.load(args.validate, vocab, args.target_speaker)
286 |     logger.info('Making mini batches')
287 |     train_batchset = dialog_corpus.make_minibatches(train_set, batchsize=args.batch_size, max_length=args.max_batch_length)
288 |     validate_batchset = dialog_corpus.make_minibatches(validate_set, batchsize=args.batch_size, max_length=args.max_batch_length)
289 |     # report data summary
290 |     logger.info('vocabulary size = %d' % len(vocab))
291 |     logger.info('#train sample = %d  #mini-batch = %d' % (len(train_set), len(train_batchset)))
292 |     logger.info('#validate sample = %d  #mini-batch = %d' % (len(validate_set), len(validate_batchset)))
293 |     random.shuffle(train_batchset, random.random)
294 | 
295 |     # initialize status parameters
296 |     if status is None:
297 |         status = Status(max(round(len(train_batchset),-3)/50,500),
298 |                         progress_bar=not args.no_progress_bar)
299 |     else:
300 |         status.progress_bar = not args.no_progress_bar
301 | 
302 |     # move model to gpu
303 |     if args.gpu >= 0:
304 |         model.to_gpu()
305 | 
306 |     while status.epoch <= args.num_epochs:
307 |         logger.info('---------------------training--------------------------')
308 |         if args.optimizer == 'SGD':
309 |             logger.info('Epoch %d/%d : SGD learning rate = %g' % (status.epoch, args.num_epochs, optimizer.lr))
310 |         else:
311 |             logger.info('Epoch %d/%d : %s eps = %g' % (status.epoch, args.num_epochs, args.optimizer, optimizer.eps))
312 |         train_ppl = train_step(model, optimizer, train_set, train_batchset, status, xp)
313 |         logger.info("epoch %d training perplexity: %f" % (status.epoch, train_ppl))
314 |         # write the model params
315 |         modelfile = args.model + '.' + str(status.epoch)
316 |         logger.info('writing model params to ' + modelfile)
317 |         model.to_cpu()
318 |         with open(modelfile, 'wb') as f:
319 |             pickle.dump((vocab, model, args), f, -1)
320 |         if args.gpu >= 0:
321 |             model.to_gpu()
322 | 
323 |         # start validation step
324 |         logger.info('---------------------validation------------------------')
325 |         start_at = time.time()
326 |         validate_ppl = validate_step(model, validate_set, validate_batchset, status, xp)
327 |         logger.info('epoch %d validation perplexity: %.4f' % (status.epoch, validate_ppl))
328 |         # update best model with the minimum perplexity
329 |         if status.min_validate_ppl >= validate_ppl:
330 |             status.bestmodel_num = status.epoch
331 |             logger.info('validation perplexity reduced: %.4f -> %.4f' % (status.min_validate_ppl, validate_ppl))
332 |             status.min_validate_ppl = validate_ppl
333 | 
334 |         elif args.optimizer == 'SGD':
335 |             modelfile = args.model + '.' + str(status.bestmodel_num)
336 |             logger.info('reloading model params from ' + modelfile)
337 |             with open(modelfile, 'rb') as f:
338 |                 vocab, model, tmp_args = pickle.load(f)
339 |             if args.gpu >= 0:
340 |                 model.to_gpu()
341 |             optimizer.lr *= args.learn_decay
342 |             if optimizer.lr < args.lower_bound:
343 |                 break
344 |             optimizer.setup(model)
345 |         else:
346 |             optimizer.eps *= args.learn_decay
347 |             if optimizer.eps < args.lower_bound:
348 |                 break
349 | 
350 |         status.new_epoch(validate_time = time.time() - start_at)
351 |         # dump snapshot
352 |         if args.snapshot:
353 |             logger.info('writing snapshot to ' + args.snapshot)
354 |             model.to_cpu()
355 |             with open(args.snapshot, 'wb') as f:
356 |                 pickle.dump((vocab, optimizer, status, args), f, -1)
357 |             if args.gpu >= 0:
358 |                 model.to_gpu()
359 | 
360 |     logger.info('----------------')
361 |     # make a symbolic link to the best model
362 |     logger.info('the best model is %s.%d.' % (args.model,status.bestmodel_num))
363 |     logger.info('a symbolic link is made as ' + args.model+'.best')
364 |     if os.path.exists(args.model+'.best'):
365 |         os.remove(args.model+'.best')
366 |     os.symlink(os.path.basename(args.model+'.'+str(status.bestmodel_num)),
367 |                args.model+'.best')
368 |     logger.info('done')
369 | 
370 | if __name__ == "__main__":
371 |     main()
372 | 
373 | 
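Because the loop above saves each epoch with `pickle.dump((vocab, model, args), f, -1)`, a stored model file can also be reloaded outside of this script; a minimal sketch (the file name is illustrative, and it must be run where `lstm_encoder.py` and the other model modules are importable so unpickling can find the classes):

```python
# Sketch: reloading a model file written by the training loop above.
import pickle

with open('conversation_model.best', 'rb') as f:  # illustrative path
    vocab, model, train_args = pickle.load(f)     # same tuple order as pickle.dump
print('vocabulary size = %d' % len(vocab))
```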
--------------------------------------------------------------------------------
/ChatbotBaseline/utils/get_available_gpu_id.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | nvidia-smi -q -d MEMORY \
3 | |grep -A 2 "FB Memory"|grep Used \
4 | |awk 'BEGIN{n=0;m=99999;a=0;}{if ($3&2;
53 |     else printf "$help_message\n" 1>&2 ; fi;
54 |     exit 0 ;;
55 |   --*=*) echo "$0: options to scripts must be of the form --name value, got '$1'"
56 |     exit 1 ;;
57 |   # If the first command-line argument begins with "--" (e.g. --foo-bar),
58 |   # then work out the variable name as $name, which will equal "foo_bar".
59 |   --*) name=`echo "$1" | sed s/^--// | sed s/-/_/g`;
60 |     # Next we test whether the variable in question is undefined-- if so it's
61 |     # an invalid option and we die.  Note: $0 evaluates to the name of the
62 |     # enclosing script.
63 |     # The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar
64 |     # is undefined.  We then have to wrap this test inside "eval" because
65 |     # foo_bar is itself inside a variable ($name).
66 |     eval '[ -z "${'$name'+xxx}" ]' && echo "$0: invalid option $1" 1>&2 && exit 1;
67 | 
68 |     oldval="`eval echo \\$$name`";
69 |     # Work out whether we seem to be expecting a Boolean argument.
70 |     if [ "$oldval" == "true" ] || [ "$oldval" == "false" ]; then
71 |       was_bool=true;
72 |     else
73 |       was_bool=false;
74 |     fi
75 | 
76 |     # Set the variable to the right value-- the escaped quotes make it work if
77 |     # the option had spaces, like --cmd "queue.pl -sync y"
78 |     eval $name=\"$2\";
79 | 
80 |     # Check that Boolean-valued arguments are really Boolean.
81 |     if $was_bool && [[ "$2" != "true" && "$2" != "false" ]]; then
82 |       echo "$0: expected \"true\" or \"false\": $1 $2" 1>&2
83 |       exit 1;
84 |     fi
85 |     shift 2;
86 |     ;;
87 |   *) break;
88 | esac
89 | done
90 | 
91 | 
92 | # Check for an empty argument to the --cmd option, which can easily occur as a
93 | # result of scripting errors.
94 | [ ! -z "${cmd+xxx}" ] && [ -z "$cmd" ] && echo "$0: empty argument to --cmd option" 1>&2 && exit 1;
95 | 
96 | 
97 | true; # so this script returns exit code 0.
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 dialogtekgeek
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DSTC6: End-to-End Conversation Modeling Track
2 | 
3 | # Registration
4 | Please register:
5 | https://goo.gl/forms/Fxy061gHuSOZGC1i2
6 | 
7 | # News
8 | - Evaluation analysis package: Jan 19 2018
9 | 
10 |   The package includes all references generated by 11 humans, hypotheses of 20 systems, and evaluation results
11 |   in the DSTC6 end-to-end conversation modeling track.
12 |   https://www.dropbox.com/s/oh1trbos0tjzn7t/dstc6_t2_evaluation.tgz
13 | - Download the official training data: Sep 7-18 2017
14 | - Test data distribution: Sep 25 2017
15 | - Submission: Oct 8 2017
16 | 
17 | ![Easy 3 Step Data Collection](https://github.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling/issues/5 "Easy 3 Step data collection")
18 | 
19 | 
20 | # Track Description
21 | 1. Main task (mandatory): Customer service dialog using Twitter
22 | 
23 | (*) Tools to download the twitter data and transform it into the dialog format are provided.
24 | 
25 | 
26 | Task A: Full or part of the training data will be used to train conversation models.
27 | 
28 | Task B: Any open data, e.g. from the web, may be used as external knowledge to generate informative sentences, but they should not overlap with the training, validation and test data provided by the organizers.
29 | 
30 | 2. Pilot task: Movie scenario dialog using OpenSubtitles
31 | 
32 | 
33 | * Please cite the following paper if you publish results using this setup:
34 | 
35 |   https://arxiv.org/pdf/1706.07440.pdf
36 | 
37 | ```
38 | @article{DSTC6_End-to-End_Conversation_Modeling,
39 |   Author = {Chiori Hori and Takaaki Hori},
40 |   Title = {End-to-end Conversation Modeling Track in DSTC6},
41 |   Journal = {arXiv:1706.07440},
42 |   Year = {2017}
43 | }
44 | ```
45 | 
46 | # Necessary steps
47 | 
48 | ## Preparation
49 | Most tools are written in python and were tested on python 2.7.6+ and 3.4.1+;
50 | some bash scripts are also used to execute those tools.
51 | 
52 | For data preparation, you will need additional python modules as follows:
53 | 
54 | * six
55 | * tqdm
56 | * nltk
57 | 
58 | which can be installed by
59 | ```
60 | pip install <module>
61 | ```
62 | or
63 | ```
64 | pip install <module> -t <some-directory>
65 | ```
66 | where `<some-directory>` is a directory storing python modules and needs to be accessible from python,
67 | e.g. by including it in the PYTHONPATH environment variable.
68 | 
69 | If you try the baseline system, you will need Chainer, a deep learning toolkit,
70 | to perform training and evaluation of neural conversation models.
71 | Please follow the instructions in `ChatbotBaseline/README.md`.
72 | 
73 | ## Twitter task
74 | 
75 | 1. prepare the data set using the `collect_twitter_dialogs` scripts.
76 | 
77 |    ```
78 |    $ cd collect_twitter_dialogs
79 |    $ collect.sh
80 |    ```
81 |    (a twitter account and access keys are necessary to run the script; follow the instructions in `collect_twitter_dialogs/README.md`)
82 | 
83 | 2. extract training, development and test sets from stored twitter dialog data
84 | 
85 |    ```
86 |    $ cd ../tasks/twitter
87 |    $ make_trial_data.sh
88 |    ```
89 | 
90 |    Note: the extracted data are trial data at this moment.
91 | 
92 | 3. run the baseline system (optional)
93 | 
94 |    ```
95 |    $ cd ../../ChatbotBaseline/egs/twitter
96 |    $ run.sh
97 |    ```
98 | 
99 |    (see `ChatbotBaseline/README.md`)
100 | 
101 | ## OpenSubtitles task
102 | 
103 | 1. download the OpenSubtitles2016 data
104 | 
105 |    ```
106 |    $ cd tasks/opensubs
107 |    $ wget http://opus.lingfil.uu.se/download.php?f=OpenSubtitles2016/en.tar.gz
108 |    $ tar zxvf en.tar.gz
109 |    ```
110 | 
111 | 2. extract training, development and test sets from stored subtitle data
112 | 
113 |    ```
114 |    $ make_trial_data.sh
115 |    ```
116 |    Note: the extracted data are trial data at this moment.
117 | 
118 | 3. run the baseline system (optional)
119 | 
120 |    ```
121 |    $ cd ../../ChatbotBaseline/egs/opensubs
122 |    $ run.sh
123 |    ```
124 | 
125 |    (see `ChatbotBaseline/README.md`)
126 | 
127 | ## Directories and files
128 | * README.md : this file
129 | * tasks : data preparation for each subtask
130 | * collect_twitter_dialogs : scripts to collect twitter data
131 | * ChatbotBaseline : a neural conversation model baseline system
132 | 
133 | ## Contact Information
134 | 
135 | You can get the latest updates and participate in discussions on the DSTC mailing list.
136 | 
137 | To join the mailing list, send an email to: (listserv@lists.research.microsoft.com) putting "subscribe DSTC" in the body of the message (without the quotes).
To post a message, send your message to: (dstc@lists.research.microsoft.com).
138 | 
139 | 
--------------------------------------------------------------------------------
/collect_twitter_dialogs/README.md:
--------------------------------------------------------------------------------
1 | # Scripts to acquire twitter dialogs via REST API 1.1.
2 | 
3 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
4 | 
5 | This software is released under the MIT License.
6 | http://opensource.org/licenses/mit-license.php
7 | 
8 | ## Requirements:
9 | 
10 | * 64bit linux/macOSX/windows platform
11 | 
12 | * python 2.7.9+, 3.4+
13 | 
14 |   note: With python 2.7.8 or lower, an InsecureRequestWarning may appear
15 |   when you run the script. To suppress this warning, you can
16 |   downgrade the requests package by
17 | 
18 |   ```
19 |   $ pip install requests==2.5.3
20 |   ```
21 | 
22 |   or
23 | 
24 |   ```
25 |   $ pip install requests==2.5.3 -t <some-directory>
26 |   ```
27 | 
28 |   where we assume that the module has been installed in `<some-directory>`
29 | 
30 | ## Preparation
31 | 
32 | 1. create a twitter account if you don't have one.
33 | 
34 |    you can get one via the twitter web site.
35 | 
36 | 2. create your application account via the Twitter
37 | 
38 |    Developer's Site;
39 | 
40 |    see the Developer's Site documentation
41 | 
42 |    for reference, and keep the following keys
43 | 
44 |    * Consumer Key
45 |    * Consumer Secret
46 |    * Access Token
47 |    * Access Token Secret
48 | 
49 | 3. edit ./config.ini to set your access keys in the config file
50 | 
51 |    * ConsumerKey
52 |    * ConsumerSecret
53 |    * AccessToken
54 |    * AccessTokenSecret
55 | 
56 | 4. install python modules: six and requests-oauthlib
57 | 
58 |    you can install them in the system area by
59 | 
60 |    ```
61 |    $ pip install six
62 |    $ pip install requests-oauthlib
63 |    ```
64 | 
65 |    We recommend using virtualenv or some other virtual environment to handle python modules.
66 |    Otherwise, you can specify the directory to install python modules as
67 | 
68 |    ```
69 |    $ pip install <module> -t <some-directory>
70 |    ```
71 |    In this case, `<some-directory>` must be included in the `PYTHONPATH` environment
72 |    variable to use the modules.
73 | 
74 | ## How to use:
75 | 
76 | 1. Execute the following command to test your setup
77 | 
78 |    ```
79 |    $ collect_twitter_dialogs.py AmazonHelp
80 |    ```
81 | 
82 |    and you will obtain a file `AmazonHelp.json`, which contains
83 |    dialogs from the AmazonHelp twitter site.
84 | 
85 |    since `AmazonHelp.json` is raw data, you can see the dialog text by
86 | 
87 |    ```
88 |    $ view_dialogs.py AmazonHelp.json
89 |    ```
90 | 
91 | 2. Use `collect.sh` to acquire a large amount of data using an account list
92 | 
93 |    ```
94 |    $ collect.sh
95 |    ```
96 | 
97 |    You will find the collected data in `./stored_data`
98 | 
99 |    ```
100 |    $ ls stored_data
101 |    AmazonHelp.json AskTSA.json ...
102 |    ```
103 | 
104 |    To acquire a large amount of data, it is better to run this script
105 |    once a day, because the amount of data we can download is limited
106 |    and older tweets cannot be accessed as time goes by.
107 | 
108 |    On the first run, it will take several hours to collect all available
109 |    data from the listed accounts, but from the second run on, the time will
110 |    become much shorter because the script downloads only tweets newer than
111 |    those already collected.
112 | 
113 |    Note: the script sometimes reports API errors, but you don't have
114 |    to worry; most errors come from the access rate limit imposed by the server.
115 |    Even if the script stops accidentally, there is no problem.
116 |    Just re-run the script.
117 | 
118 | 3. 
Use `official_collect.sh` to acquire official data for DSTC6 End-to-End Conversation Modeling track 119 | 120 | ``` 121 | $ official_collect.sh 122 | ``` 123 | 124 | Each challenge participant must run the script by at least 9/8/2017 GMT 24AM (Midnight) 125 | and do it once a day until 9/18/2017. 126 | This can be done easily by the following command: 127 | 128 | ``` 129 | $ watch -n 86400 official_collect.sh 130 | ``` 131 | 132 | (The `watch` command will run `official_collect.sh` every 24 hours) 133 | 134 | You will find the collected data in `./official_stored_data` 135 | 136 | ``` 137 | $ ls official_stored_data 138 | 1800flowershelp.json 1800flowers.json 1800PetMeds.json 1DFAQ.json 139 | 1Sale.json ... 140 | ``` 141 | 142 | A script to extract training, development, and test sets will be provided around 9/18/2017. 143 | -------------------------------------------------------------------------------- /collect_twitter_dialogs/account_names_for_dstc6.txt: -------------------------------------------------------------------------------- 1 | 1800flowers 2 | 1800flowershelp 3 | 64audio 4 | 888sport 5 | abbeygroup 6 | abbylynne19 7 | ABCustomerCare 8 | abellio_surrey 9 | AbercrombieHelp 10 | ABTAtravel 11 | acemtp 12 | ack 13 | acmemarkets 14 | acnestudios 15 | adamholz 16 | AddisonLeeCabs 17 | adidasGhana 18 | adidasUK 19 | AdobeCare 20 | adrianflux 21 | adrianswinscoe 22 | advocare 23 | AetnaHelp 24 | AFC_Amy 25 | AFCustomerCare 26 | AflacPhyllis 27 | AirAsia 28 | AirChinaNA 29 | airnzuk 30 | AIRNZUSA 31 | airtel_care 32 | AirtelNigeria 33 | alabamapower 34 | Alamo 35 | alamocares 36 | AlaskaAir 37 | Albertsons 38 | ALCATEL1TOUCH 39 | AlderTraining 40 | Aldi_Ireland 41 | AldiUK 42 | AlfaRomeoCareUK 43 | AliExpress_EN 44 | AlissaDosSantos 45 | Allegiant 46 | AllianzTravelUS 47 | Allstate 48 | AllyCare 49 | alpharooms 50 | AlshayaHelpDesk 51 | alwaysriding 52 | AmanaCare 53 | AmazonHelp 54 | AmericanAir 55 | americangiant 56 | AmexAU 57 | andertonsmusic 58 | AndreaSWilson 59 | andrew_heister 60 | AnglianHelp 61 | AnkiSupport 62 | annabelkarmel 63 | AnthemBC_News 64 | APCcustserv 65 | AppleSupport 66 | AppleyardLondon 67 | ArqivaWiFi 68 | asbcares 69 | AsdaCustCare 70 | ASHelpMe 71 | Ask123ie 72 | AskAmex 73 | AskCapitalOne 74 | AskCiti 75 | AskDyson 76 | AskeBay 77 | AskEmblemHealth 78 | askevanscycles 79 | AskKBCIreland 80 | AskLandsEnd 81 | AskMarstons 82 | AskMeHelpDesk 83 | AskMTNGhana 84 | askpermanenttsb 85 | ask_progressive 86 | AskPS_UK 87 | AskSmythsToys 88 | Ask_Spectrum 89 | AskSubaruCanada 90 | asksurfline 91 | AskTarget 92 | AskTeamUA 93 | askvisa 94 | Askvodafonegh 95 | AsmodeeNA 96 | ASOS 97 | ASOS_Au 98 | ASOS_Us 99 | astros 100 | ASUHelpCenter 101 | AsurionCares 102 | ASUSUSA 103 | atmosenergy 104 | Atomos_News 105 | atomtickets 106 | ATTCares 107 | audiblesupport 108 | audibleuk 109 | audiireland 110 | AudiUK 111 | AudiUKCare 112 | austin_reed 113 | AutodeskHelp 114 | AvidSupport 115 | Avis 116 | AVIVAIRELAND 117 | Ayres_Hotels 118 | BabiesRUs 119 | BabyJogger 120 | BananaRepublic 121 | BandQ 122 | BarclaycardNews 123 | BarclaysUK 124 | Baxter_Auto 125 | BCBSLA 126 | BCC_Help 127 | BCSSupport 128 | BeautyBarDotCom 129 | beautybay 130 | Beaverbrooks 131 | BedBathBeyond 132 | belk 133 | BenefitUK 134 | BernardBoutique 135 | bestdealtv 136 | bexdeep 137 | beyondthedesk 138 | BiffaService 139 | bigmikeyvegas 140 | BikeMetro 141 | BitcoinHelpDesk 142 | BJsWholesale 143 | blackanddecker 144 | BlackBerryHelp 145 | Blanchford 146 | Blendtec 147 | BluestarHQ 148 | BN_care 149 | 
Boars_Head 150 | bobbyrayburns 151 | boohoo_cshelp 152 | BookChicClub 153 | bookdepository 154 | booksamillion 155 | BOOKSetc_online 156 | BoostCare 157 | BootsUK 158 | BordGaisEnergy 159 | BoseService 160 | BostonMarki 161 | BounceEnergy 162 | BP_plc 163 | BP_UK 164 | bravern 165 | BRCustServ 166 | BritishGas 167 | BritishGasNews 168 | brooksrunning 169 | bryanz85 170 | BSSHelpDesk 171 | Budget 172 | Buick 173 | BuildaHelpDesk 174 | builds_io 175 | BupaUK 176 | BurgerKing 177 | Burton_Menswear 178 | CableBill 179 | cableONE 180 | CadburyIreland 181 | CadburyUK 182 | cafepress 183 | CallawayGolfCS 184 | CalMac_Updates 185 | calorireland 186 | cam4_gay 187 | Camper_CustCare 188 | CanadianPacific 189 | CapMetroATX 190 | CareShopBunzl 191 | CaroKopp 192 | cars_portsmouth 193 | Cartii 194 | CastelliCafe 195 | casualclassics 196 | catalysthousing 197 | CDCustService 198 | cesarkeller 199 | champssports 200 | ChaseSupport 201 | CheapTixHearsU 202 | cheftyler 203 | chevrolet 204 | ChevyCustCare 205 | ChiccoUSA 206 | chokemetoo 207 | chrisfenech1 208 | ChryslerCares 209 | Cignaquestions 210 | CIPD 211 | CiteThisForMe 212 | Citibank 213 | CitiBikeNYC 214 | CitySprint_help 215 | ClaireLBSmith 216 | ClairesEurope 217 | clarkshelp 218 | CMLFerry 219 | CoastCath 220 | CollectorCorps 221 | comcastcares 222 | ComcastILLINOIS 223 | ComcastOrSWWa 224 | comparemkt_care 225 | comparethemkt 226 | consult53 227 | CoopEnergy 228 | Costco 229 | COTABus 230 | CoxHelp 231 | craftsman 232 | CRCustomerChat 233 | CRTContactUs 234 | cstmrsvc 235 | ctcustomercare 236 | CTS___ 237 | cvspharmacy 238 | danaddicott 239 | danandphilshop 240 | dancathy 241 | dan_malarkey 242 | DarkBunnyTees 243 | DawnCarillion 244 | DCBService 245 | debthelpdesk 246 | Deliveroo 247 | Deliveroo_IE 248 | deliverydotcom 249 | Desk 250 | devinfinlay 251 | DevonCC 252 | DEWALTtough 253 | DEWALT_UK 254 | DFSCare 255 | dgingiss 256 | DianaHSmith 257 | DIBsupport 258 | DICKS 259 | digicelbarbados 260 | DIGICELJamaica 261 | DigitalRiverInc 262 | Dillards 263 | dimonet 264 | DIRECTV 265 | directvnow 266 | Discover 267 | Discovery_SA 268 | dish 269 | DivvyBikes 270 | djccwl 271 | dnataSupport 272 | DNBcares 273 | dodgecares 274 | DollarGeneral 275 | DolphinCares 276 | dongmsd 277 | doxiecare 278 | DPDCustomerCare 279 | Dreams_Beds 280 | DressCircleShop 281 | DrinkSparkletts 282 | DSGSupport 283 | DStvNg 284 | DTSatWIT 285 | DunelmUK 286 | durafloruk 287 | EarthLink 288 | easons 289 | easternbank 290 | EatNaturalBars 291 | eBay_UK 292 | edfenergy 293 | edfenergycomms 294 | edfenergycs 295 | eflow_freeflow 296 | eh_custcare 297 | eirNews 298 | elfcosmetics 299 | EllenKeeble 300 | EmbracePetIns 301 | EmersonIT 302 | emikathure 303 | EmiratesSupport 304 | ENMAX 305 | ENMAXenergy 306 | Entergy 307 | EntergyArk 308 | Enterprise 309 | enterprisecares 310 | EPCOR 311 | epicgeargaming 312 | epointsjordan 313 | Equinox_Service 314 | esetcares 315 | etisalat 316 | EvansCycles 317 | eventbritehelp 318 | EversourceCT 319 | EviteSupport 320 | Expedia 321 | express 322 | ExpressHelp 323 | EYellin 324 | FabFitFunCS 325 | FabSupportTeam 326 | Fanatics 327 | Fandom_Insider 328 | farfetch 329 | faux_punk 330 | FCUK 331 | fdarenahelp 332 | FedExCanada 333 | FedExCanadaHelp 334 | FedExEurope 335 | FedExHelpEU 336 | FeelGoodPark 337 | feelunique 338 | FFGames 339 | FH_CustomerCare 340 | fiatcares 341 | FiatCareUK 342 | FiguresToyCo 343 | FinnairHelps 344 | firstdirect 345 | fisherpaykelaus 346 | FLBlueCares 347 | FLLFlyer 348 | Fly_Norwegian 349 | flySAA_US 350 | Fon 351 | 
FonCare 352 | fontspring 353 | FoodLion 354 | Footasylum 355 | FootballIndexUK 356 | Ford 357 | FordSouthAfrica 358 | forduk 359 | Forever21Help 360 | FortisBC 361 | foxrentcar 362 | FrankEliason 363 | FreeviewTV 364 | freshlypicked 365 | FromYouFlowers 366 | FrontierCare 367 | fryselectronics 368 | FTcare 369 | FunjetVacations 370 | FunkoDCLegion 371 | FUT_COINSTORE 372 | gadventures 373 | Gap 374 | GapCA 375 | GarudaCares 376 | GEICO_Service 377 | GenesisHousing 378 | geniusfoods 379 | GeoffRamm 380 | GeorgiaPower 381 | GeoxCares 382 | GETMEIN 383 | getrespond 384 | getsatisfaction 385 | gigaclear 386 | GiltService 387 | glasses_direct 388 | GlideUK 389 | GloCare 390 | GlossyboxUK 391 | GM 392 | Go_CheshireWest 393 | gogreenride 394 | GongshowService 395 | Google 396 | googlemaps 397 | GoSmartNC 398 | GoTriangle 399 | Go_Wireless 400 | GrandHyattNYC 401 | GrandHyattSD 402 | graveshambc 403 | GreatClipsCares 404 | GreyhoundBus 405 | Groove 406 | GRT_ROW 407 | Grubhub_Care 408 | GSMA_Care 409 | Gymshark 410 | Gymshark_Help 411 | HalfordsCycling 412 | handtec 413 | HarrisTeeter 414 | HarrodsService 415 | HarryandDavid 416 | HawaiianAir 417 | hbonow 418 | HDCares 419 | HeatherJStrout 420 | HEB 421 | Heinens 422 | HelloKit 423 | helpscout 424 | HilltopNorfolk 425 | hm 426 | hmaustralia 427 | hmcanada 428 | hm_custserv 429 | HMSHost 430 | hmsouthafrica 431 | hmunitedkingdom 432 | holden_aus 433 | holidayautos 434 | holidaytaxisCS 435 | HollisterCoHelp 436 | HomeDepotCanada 437 | HomesenseUK 438 | Honda 439 | HondaCustSvc 440 | HondaPowersprts 441 | Hootsuite_Help 442 | Hotwire 443 | houseoffraser 444 | HQhair_Help 445 | HRBlockAnswers 446 | hsamueljeweller 447 | HSBC_Sport 448 | HSBC_UAE 449 | HSBC_UK 450 | HSBC_US 451 | hsr 452 | HSScustomercare 453 | HTCHelp 454 | Huawei 455 | HuaweiMobile 456 | HudsonshoesUK 457 | HullUni_ICT 458 | HumanaHelp 459 | HwnElectric 460 | HyattChurchill 461 | Hyken 462 | IcelandFoods 463 | IGearBrand 464 | i_hate_ironing 465 | iiNet 466 | IKEAIESupport 467 | IKEAUKSupport 468 | IKEAUSAHelp 469 | Incite_Group 470 | INDOCHINO 471 | INDOT 472 | IndyDPW 473 | INFINITICare 474 | INFINITIUSA 475 | InMotionCares 476 | instituteofcs 477 | inthestyleUK 478 | IPFW_ITS_HD 479 | IrishRail 480 | IslandAirHawaii 481 | itsemiel 482 | IWCare 483 | Iyengarish 484 | JabraEurope 485 | jabrasport 486 | Jabra_US 487 | jackd 488 | jackiecas1 489 | JackWills 490 | JagexInfinity 491 | JamboPayCare 492 | Jamie1973 493 | JapanHelpDesk 494 | jasoneden 495 | JawboneSupport 496 | jaybaer 497 | jazzpk 498 | jAzzyF_BaBy 499 | jcpenney 500 | jct600 501 | JDhelpteam 502 | Jeep 503 | JeepCares 504 | jeffreyboutique 505 | Jersey_City 506 | Jesshillcakes 507 | jessops 508 | JetBlue 509 | JetHeads 510 | JetstarAirways 511 | Jetstar_NZ 512 | JimatPlanters 513 | JimEllisAudi 514 | JIRAServiceDesk 515 | JKalena123 516 | JLcustserv 517 | JLove55 518 | jmspool 519 | JMstore 520 | joepindar 521 | JonesClocks 522 | joythestore 523 | jscheel 524 | justhype 525 | k9cuisine 526 | KakaoTalkPH 527 | KateNasser 528 | KauaiSA 529 | Kazel_Kimpo 530 | kellie_brooks 531 | kenmore 532 | KenyaPower 533 | ketpoole 534 | kevinGEEdavis 535 | KFC_UKI 536 | KFC_UKI_Help 537 | kiddicare 538 | KimiNozoGuy 539 | KingFlyKOTD 540 | kisluvkis 541 | KitbagCS 542 | KitchenAid_CA 543 | KitchenAid_CAre 544 | KLM_UK 545 | Kohls 546 | kongacare 547 | KrogerSupport 548 | LandRover_UK 549 | latterash 550 | LeapCard 551 | LenovoANZ 552 | LibertyHelpDesk 553 | LibertyMutual 554 | lidlcustomerc 555 | lidl_ireland 556 | lidl_ni 557 | lids 
558 | LidsAssist 559 | lidscanada 560 | LinkedInHelp 561 | LinkedInUK 562 | LiveChat 563 | LiveNationON 564 | LivePhishCS 565 | Ljbpieces 566 | Lo573 567 | Logitech_ANZ 568 | lordandtaylor 569 | Lovehoney 570 | LoveWilko 571 | LowesCares 572 | LozHarvey 573 | LQcontactus 574 | LubeStop 575 | LVcares 576 | MACcosmetics 577 | MaclarenService 578 | MacsalesCSTS 579 | Macys 580 | MadameTussauds 581 | MallforAfrica 582 | mamasandpapas 583 | mandkbutchers 584 | ManiereDeVoir 585 | MarcGoodmanBos 586 | marilynsuttle 587 | markmadison 588 | MarshaCollier 589 | martindentist 590 | masnHelpDesk 591 | MasterofMaltCS 592 | Matalan 593 | mattlatmatt 594 | mattr 595 | MaxiCosiUK 596 | MaytagBrand 597 | MaytagCare 598 | Mazda_SA 599 | _MBService 600 | MBTA_CR 601 | McDonalds 602 | Mcheza_Care 603 | MCMComicCon 604 | MDSHA 605 | Medela_US 606 | megabus 607 | melbournemuseum 608 | MeteorPR 609 | MetroBank_Help 610 | METROHouAlerts 611 | MetroTransitMN 612 | miasampietro 613 | micahsolomon 614 | michaelshearer 615 | MicrosoftHelps 616 | Minted 617 | Misfit 618 | MissSelfridge 619 | Missytohbadt 620 | mitsucars 621 | mjanczewski 622 | MLBFanSupport 623 | ModCloth_Cares 624 | monkiworld 625 | moreforURdollar 626 | MrCabinetCareOC 627 | mrpatto 628 | MRPfashion 629 | MRPHelp 630 | MrsGammaLabs 631 | MTA 632 | MTGOTraders 633 | MTN180 634 | musicMagpie 635 | Musictoday 636 | MUSupportDesk 637 | MVPHealthCare 638 | MwaveAu 639 | MWilbanks 640 | mycellcom 641 | MyDoncaster 642 | myUHC 643 | Nathane_Jackson 644 | nationalcares 645 | nationalgridus 646 | NationalPro 647 | NBASTORE 648 | ncbja 649 | NealTopf 650 | neimanmarcus 651 | netflix 652 | NetflixANZ 653 | Netflix_CA 654 | NetflixUK 655 | NetoneSupport 656 | NeweggService 657 | nextofficial 658 | Nipponyasancom 659 | NISMO_USA 660 | NissanUSA 661 | nmkrobinson 662 | nokiamobile 663 | NOOK_Care 664 | NOOK_Care_UK 665 | Nordstrom 666 | Norfolkholidays 667 | notonthehighst 668 | npowerhq 669 | nspowerinc 670 | nuerasupport 671 | Nutrisystem 672 | nvidiacc 673 | nxcare 674 | NYSC 675 | NYTCare 676 | O2IRL 677 | OasisFashion 678 | OEcare 679 | OfficeCandyGals 680 | officedepot 681 | OfficeDivvy 682 | OfficeShoes 683 | officialpescm 684 | OldNavyCA 685 | olympicfinishes 686 | _omocat 687 | Ooma 688 | OoredooCare 689 | OoredooQatar 690 | OpenMike_TV 691 | OptumRx 692 | OrangeKe_Care 693 | OrbitzCareTeam 694 | originenergy 695 | OtterBox 696 | OtterBoxCS 697 | overseasescape 698 | oxygenfreejump 699 | PaddyPowerShops 700 | PalaceMovies 701 | PanagoCares 702 | Panago_Pizza 703 | PanaService_UK 704 | PanasonicUSA 705 | PapaJohns 706 | PaperMate 707 | ParkHyattChi 708 | parracity 709 | PASmithjr 710 | PatchworkUrchin 711 | PatriotsProShop 712 | paulodetarso24 713 | PaychexService 714 | paycity 715 | PayPalInfoSec 716 | paysafecard 717 | Paytmcare 718 | PBCares 719 | PCRichardandSon 720 | PeabodyLDN 721 | PECOconnect 722 | PenskeCares 723 | petedenton 724 | PeteFyfe 725 | PeterboroughCC 726 | PeterPanBus 727 | PeugeotZA 728 | pfgregg 729 | PhilipsCare_UK 730 | Photobox 731 | PIA_Cust_unCare 732 | picturehouses 733 | PINGTourEurope 734 | PioneerDJ 735 | placesforpeople 736 | PlayDoh 737 | PlayStationAU 738 | PLC_Singapore 739 | PlentyOfFish 740 | plusnethelp 741 | PMInstitute 742 | PNCBank_Help 743 | PNMtalk 744 | Pokemon 745 | pond5 746 | Porsche 747 | portlandwater 748 | PostOfficeNews 749 | POTUS_CustServ 750 | PoweradeGB 751 | premestateswine 752 | pretavoir 753 | Primark 754 | princessdvonb 755 | PrincesTrust 756 | PRISM_NSA 757 | ProFlowers 758 | ProtectionOne 759 | 
Publix 760 | PublixHelps 761 | PurolatorHelp 762 | qatarairways 763 | QDStores 764 | Quinny_UK 765 | RAC_Care 766 | RACWA 767 | RamCares 768 | RamseyCare 769 | Rand_Water 770 | RaneDJ 771 | RareLondon 772 | RatedPeople 773 | RBC 774 | RBC_Canada 775 | RBWM 776 | RCommCare 777 | Reachout_mcd 778 | redbox 779 | ReebokUK 780 | Relish 781 | rhapsaadic 782 | RideUTA 783 | _rightio 784 | riteaid 785 | ritual_co 786 | riverisland 787 | RobSylvan 788 | RogersBiz 789 | RogersHelps 790 | Ronseal 791 | RoyalMail 792 | RubyBlogVAA 793 | rwhelp 794 | Safaricom_Care 795 | SageSupport 796 | sageuk 797 | sainsburys 798 | SainsburysNews 799 | SaksService 800 | samanthastarmer 801 | SamsungCareSA 802 | samsungmobileng 803 | SamTrans 804 | SAS 805 | SAS_Cares 806 | SaskTel 807 | SBEGlobal 808 | SCG_CS 809 | Schnittgemuese 810 | ScholasticClub 811 | schuh 812 | scienceworks_mv 813 | scotiabank 814 | ScottishPower 815 | scottish_water 816 | scottrade 817 | SDixonVivint 818 | Sears 819 | searscares 820 | SeattleSPU 821 | secreteyesEW 822 | seetickets 823 | Selfridges 824 | SEPTA_SOCIAL 825 | ServiceDivaBren 826 | Service_Queen 827 | SFBayBikeShare 828 | SharisBerries 829 | shaws 830 | ShiseidoUSA 831 | ShoeboxedHelp 832 | ShoePalace 833 | shopmissa 834 | ShopRiteStores 835 | ShopRuche 836 | Sifter 837 | SilkFred 838 | SimplyhealthUK 839 | sizehelpteam 840 | skrill 841 | SkyCinemaUK 842 | SkyIreland 843 | SkypeSupport 844 | SkySaga 845 | SKYserves 846 | sleepnumber 847 | Sleepys 848 | SleepysCare 849 | smtickets 850 | SneakerRefresh 851 | SnoPUD 852 | sobeys 853 | Sofology 854 | SofologyHelp 855 | Sony 856 | SonySupportUSA 857 | SoundTransit 858 | SouthwestAir 859 | SouthWestWater 860 | spacecojo 861 | SparkNZ 862 | SpecializedCSUK 863 | Spencers_Retail 864 | SP_EnergyPeople 865 | Spinneys_Dubai 866 | SpiritAirlines 867 | Spokeo_Care 868 | sportingindex 869 | spreecoza 870 | spring 871 | SquareTrade 872 | SSE 873 | sseairtricity 874 | StanChart 875 | StaplesUK 876 | StarbucksUKCA 877 | StarHub 878 | statravelAU 879 | STATravel_UK 880 | stayhomeclub 881 | StockTwitsHelp 882 | StonecareMike 883 | StraubsMarkets 884 | Stylistpick 885 | SubaruCanada 886 | subaru_usa 887 | SubwayListens 888 | Suddenlink 889 | SuddenlinkHelp 890 | SunCares 891 | Suncorp 892 | sunglassesshop 893 | SunTzuHelpDesk 894 | superbalist 895 | Superdry 896 | Superdry_Care 897 | supermartie 898 | SuperShuttle 899 | SurflineGH 900 | SwingSetService 901 | swoonstars 902 | TacoBellTeam 903 | TandemDiabetes 904 | taportugal 905 | Tartineaubeurre 906 | TCoughlin 907 | TeamSanshee 908 | TeamSantone 909 | TeamShieldBSC 910 | TEAVANA 911 | Tech21Official 912 | TechUncensored 913 | teeofftimes 914 | TekCenter 915 | TelkomZA 916 | TELUS 917 | TEPenergy 918 | Tesco 919 | tescomobile 920 | tescomobilecare 921 | tescomobileire 922 | TessutiHelpTeam 923 | TextNowHelp 924 | tfbalerts 925 | thebitchdesk 926 | TheBodyShopUK 927 | thebondteam 928 | TheBookPeople 929 | TheFrontDesk 930 | TheGymGroup 931 | thekirbycompany 932 | TheLondonEye 933 | theMasterLink 934 | thenutribullet 935 | TheRAC_UK 936 | TheRTStore 937 | TheSharck 938 | ThomasCookCares 939 | ThomasCookUK 940 | thomaswales 941 | ThomsonCares 942 | ThreadSence 943 | ThreeUK 944 | Ticketmaster 945 | TicketmasterCA 946 | TicketmasterCS 947 | TicketWeb 948 | TiVo 949 | TiVoSupport 950 | tjmaxx 951 | TK_HelpDesk 952 | TKMaxx_UK 953 | TLGTourHelp 954 | TMLewin 955 | TMobileHelp 956 | TMRQld 957 | TNTUKCare 958 | toister 959 | TomTom_SA 960 | tonydataman 961 | Topheratl 962 | Topman 963 | TopshopHelp 964 | 
TotalGymDirect 965 | townshoes 966 | ToyotaCustCare 967 | ToyotaRacing 968 | ToysRUs 969 | traceychurray 970 | TradeMe 971 | trafficscotland 972 | TravelGuard 973 | travelocity 974 | trimet 975 | TruGreen 976 | Trustpilot 977 | TSA 978 | TTCsue 979 | TW2CayC 980 | TWC_Help 981 | TWimpeySupport 982 | twinklresources 983 | UCLMainLibrary 984 | UconnectCares 985 | UFhd 986 | uga_eits 987 | UHaul_Cares 988 | UKEnterprise 989 | UKMUJI 990 | UKVolkswagen 991 | UNB_ITS 992 | united 993 | UnityQAThomas 994 | UpDesk 995 | UPS_UK 996 | USAA_help 997 | USPS 998 | USPSHelp 999 | UWDoIT 1000 | vaillantuk 1001 | _valpow_ 1002 | Venture_Cycles 1003 | verabradleycare 1004 | VerizonSupport 1005 | VeryHelpers 1006 | vigglesupport 1007 | VirginAmerica 1008 | VirginMediaIE 1009 | Vitality_UK 1010 | VitamixUK 1011 | VIZIOsupport 1012 | vmbusinesshelp 1013 | VMUcare 1014 | Vodacom 1015 | Vodacom111 1016 | VodacomRugga 1017 | VodafoneAU_Help 1018 | VolvoCarUSA 1019 | vtaservice 1020 | vueling 1021 | vwcares 1022 | w1zz 1023 | WAGSocialCare 1024 | wahoofitness 1025 | Walgreens 1026 | Walmart 1027 | warjohi 1028 | WarThunder 1029 | WasteWR 1030 | WatchShop 1031 | WeChatZA 1032 | we_energies 1033 | WellsFargo 1034 | WhirlpoolCare 1035 | Whirlpool_CAre 1036 | whirlpoolusa 1037 | WholeFoods 1038 | WHSmith 1039 | WilsonGolf 1040 | wimrampen 1041 | WindowsSupport 1042 | WingateHotels 1043 | withazed 1044 | Wizards_Help 1045 | WMSpillett 1046 | wow_air 1047 | wowairsupport 1048 | WReynoldsYoung 1049 | wtzgoodPHL 1050 | XOCare 1051 | YahooCare 1052 | yoox 1053 | zaggcare 1054 | ZAGGdaily 1055 | Zapatosdesigner 1056 | ZapposLuxury 1057 | ZARA 1058 | ZARA_Care 1059 | ZeekMarketplace 1060 | ZoomSphere 1061 | Zopim 1062 | -------------------------------------------------------------------------------- /collect_twitter_dialogs/collect.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | target=account_names_for_dstc6.txt 4 | datetime=`date +"%Y-%m-%d_%H-%M-%S"` 5 | collect_twitter_dialogs.py -t $target -o ./stored_data -l ./collect_${datetime}.log 6 | -------------------------------------------------------------------------------- /collect_twitter_dialogs/collect_twitter_dialogs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """collect_twitter_dialogs.py: 4 | A script to acquire twitter dialogs with REST API 1.1. 5 | 6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com) 7 | 8 | This software is released under the MIT License. 
9 | http://opensource.org/licenses/mit-license.php
10 | 
11 | """
12 | 
13 | import argparse
14 | import json
15 | import sys
16 | import six
17 | import os
18 | import re
19 | import time
20 | import logging
21 | from requests_oauthlib import OAuth1Session
22 | from twitter_api import GETStatusesUserTimeline
23 | from twitter_api import GETStatusesLookup
24 | 
25 | try:
26 |     from configparser import ConfigParser
27 | except ImportError:
28 |     from ConfigParser import SafeConfigParser as ConfigParser
29 | 
30 | # create logger object
31 | logger = logging.getLogger("root")
32 | logger.setLevel(logging.INFO)
33 | 
34 | def Main(args):
35 |     # get access keys from a config file
36 |     config = ConfigParser()
37 |     config.read(args.config)
38 |     ConsumerKey = config.get('AccessKeys','ConsumerKey')
39 |     ConsumerSecret = config.get('AccessKeys','ConsumerSecret')
40 |     AccessToken = config.get('AccessKeys','AccessToken')
41 |     AccessTokenSecret = config.get('AccessKeys','AccessTokenSecret')
42 | 
43 |     # obtain targets
44 |     targets = args.names
45 |     if args.target:
46 |         for line in open(args.target,'r').readlines():
47 |             name = line.strip()
48 |             if not name.startswith('#'):
49 |                 targets.append(name)
50 | 
51 |     # make a directory to store acquired dialogs
52 |     if args.outdir:
53 |         if not os.path.exists(args.outdir):
54 |             os.mkdir(args.outdir)
55 | 
56 |     # open a session
57 |     session = OAuth1Session(ConsumerKey, ConsumerSecret, AccessToken, AccessTokenSecret)
58 | 
59 |     # setup API object
60 |     get_user_timeline = GETStatusesUserTimeline(session)
61 |     get_user_timeline.setParams(target_count=args.count, reply_only=True)
62 |     get_lookup = GETStatusesLookup(session)
63 | 
64 |     # collect dialogs from each target
65 |     num_dialogs = 0
66 |     num_past_dialogs = 0
67 | 
68 |     for name in targets:
69 |         logger.info('-----------------------------')
70 |         outfile = name + '.json'
71 |         if args.outdir:
72 |             outfile = os.path.join(args.outdir, outfile)
73 | 
74 |         ## collect tweets from an account
75 |         logger.info('collecting tweets from ' + name)
76 |         if os.path.exists(outfile):
77 |             logger.info('restoring acquired tweets from ' + outfile)
78 |             dialog_set = json.load(open(outfile,'r'))
79 |             since_id = max([int(s) for s in dialog_set.keys()])
80 |             num_past_dialogs += len(dialog_set)
81 |         else:
82 |             since_id = None
83 |             dialog_set = {}
84 | 
85 |         get_user_timeline.setParams(name, max_id=None, since_id=since_id)
86 |         get_user_timeline.waitReady()
87 |         timeline_tweets = get_user_timeline.call()
88 |         if timeline_tweets is None:
89 |             logger.warn('skip %s with an error' % name)
90 |             num_dialogs += len(dialog_set)
91 |             continue
92 | 
93 |         logger.info('obtained %d new tweet(s)' % len(timeline_tweets))
94 |         if len(timeline_tweets) == 0:
95 |             logger.info('no dialogs have been added to ' + outfile)
96 |             num_dialogs += len(dialog_set)
97 |             continue
98 | 
99 |         ## collect source tweets
100 |         logger.info("collecting source tweets of replies recursively")
101 |         tweet_set = {}
102 |         ## to avoid getting the same tweets again, add tweets we already have
103 |         for tid,dialog in dialog_set.items():
104 |             for tweet in dialog:
105 |                 tweet_set[tweet['id']] = tweet
106 |         ## add new tweets and collect reply-ids as necessary
107 |         source_ids = set()
108 |         for tweet in timeline_tweets:
109 |             tweet_set[tweet['id']] = tweet
110 |             reply_id = tweet['in_reply_to_status_id']
111 |             if reply_id is not None and reply_id not in tweet_set:
112 |                 source_ids.add(reply_id)
113 |         ## acquire source tweets
114 |         get_lookup.waitReady()
115 |         while len(source_ids) > 0:
116 | 
get_lookup.setParams(source_ids) 117 | result = get_lookup.call() 118 | logger.info('obtained %d/%d tweets' % (len(result),len(source_ids))) 119 | new_source_ids = set() 120 | for tweet in result: 121 | tweet_set[tweet['id']] = tweet 122 | reply_id = tweet['in_reply_to_status_id'] 123 | if reply_id is not None and reply_id not in tweet_set: 124 | new_source_ids.add(reply_id) 125 | source_ids = new_source_ids 126 | 127 | ## reconstruct dialogs 128 | logger.info("restructuring the collected tweets as a set of dialogs") 129 | visited = set() 130 | new_dialogs = 0 131 | for tweet in timeline_tweets: 132 | tid = tweet['id'] 133 | if tid not in visited: # ignore visited node (it's not a terminal) 134 | visited.add(tid) 135 | # backtrack source tweets and make a dialog 136 | dialog = [tweet] 137 | reply_id = tweet_set[tid]['in_reply_to_status_id'] 138 | while reply_id is not None: 139 | visited.add(reply_id) 140 | # if there already exists a dialog associated with reply_id, 141 | # the dialog is deleted because it's not a complete dialog. 142 | if str(reply_id) in dialog_set: 143 | del dialog_set[str(reply_id)] 144 | # insert a source tweet to the dialog 145 | if reply_id in tweet_set: 146 | dialog.insert(0,tweet_set[reply_id]) 147 | else: 148 | break 149 | # move to the previous tweet 150 | reply_id = tweet_set[reply_id]['in_reply_to_status_id'] 151 | 152 | # add the dialog only if it contains two or more turns, 153 | # where it is associated with its terminal tweet id. 154 | if len(dialog) > 1: 155 | dialog_set[str(tid)] = dialog 156 | new_dialogs += 1 157 | 158 | logger.info('obtained %d new dialogs' % new_dialogs) 159 | if new_dialogs > 0: 160 | logger.info('writing to file %s' % outfile) 161 | json.dump(dialog_set, open(outfile,'w'), indent=2) 162 | else: 163 | logger.info('no dialogs have been added to ' + outfile) 164 | 165 | num_dialogs += len(dialog_set) 166 | 167 | logger.info('-----------------------------') 168 | logger.info('obtained %d new dialogs' % (num_dialogs - num_past_dialogs)) 169 | logger.info('now you have %d dialogs in total' % num_dialogs) 170 | 171 | 172 | if __name__ =="__main__": 173 | # parse command line 174 | parser = argparse.ArgumentParser() 175 | parser.add_argument('-c', '--config', default='config.ini', help="config file") 176 | parser.add_argument('-t', '--target', help="read account names from a file") 177 | parser.add_argument('-o', '--outdir', help="output directory") 178 | parser.add_argument('-l', '--logfile', help="set a log file") 179 | parser.add_argument('-n', '--count', default=-1, type=int, 180 | help="maximum number of tweets acquired from each account") 181 | parser.add_argument('-d', '--debug', action='store_true', help="debug mode") 182 | parser.add_argument('-s', '--silent', action='store_true', help="silent mode") 183 | parser.add_argument('names', metavar='NAME', nargs='*', help='account names') 184 | args = parser.parse_args() 185 | 186 | # set up the logger 187 | stdhandler = logging.StreamHandler() 188 | stdhandler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s")) 189 | if args.silent: 190 | stdhandler.setLevel(logging.WARN) 191 | logger.addHandler(stdhandler) 192 | 193 | if args.logfile: 194 | filehandler = logging.FileHandler(args.logfile, mode='w') 195 | filehandler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s")) 196 | logger.addHandler(filehandler) 197 | 198 | if args.debug: 199 | logger.setLevel(logging.DEBUG) 200 | 201 | logger.info('started to collect twitter dialogs') 202 | 
logger.debug('args=' + str(args))
203 |
204 | # call main process
205 | try:
206 | Main(args)
207 | except:
208 | logger.exception('exited with an error')
209 | sys.exit(1)
210 |
211 | logger.info('done')
212 |
213 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/config.ini: --------------------------------------------------------------------------------
1 | ; you need to set your own access keys
2 | [AccessKeys]
3 | ConsumerKey: *************************
4 | ConsumerSecret: **************************************************
5 | AccessToken: **************************************************
6 | AccessTokenSecret: *********************************************
7 |
8 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/official_collect.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | target=official_account_names_for_dstc6.txt
4 | datetime=`date +"%Y-%m-%d_%H-%M-%S"`
5 | ./collect_twitter_dialogs.py -t $target -o ./official_stored_data -l ./official_collect_${datetime}.log
6 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/search_twitter_accounts.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """search_twitter_accounts.py:
4 | A script to search twitter accounts with REST API 1.1.
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import argparse
14 | import json
15 | import sys
16 | import os
17 | import logging
18 | from requests_oauthlib import OAuth1Session
19 | from twitter_api import GETUsersSearch
20 |
21 | try:
22 | from configparser import ConfigParser
23 | except ImportError:
24 | from ConfigParser import SafeConfigParser as ConfigParser
25 |
26 | # create logger object
27 | logger = logging.getLogger("root")
28 | logger.setLevel(logging.INFO)
29 |
30 | def Main(args):
31 | # get access keys from a config file
32 | config = ConfigParser()
33 | config.read(args.config)
34 | ConsumerKey = config.get('AccessKeys','ConsumerKey')
35 | ConsumerSecret = config.get('AccessKeys','ConsumerSecret')
36 | AccessToken = config.get('AccessKeys','AccessToken')
37 | AccessTokenSecret = config.get('AccessKeys','AccessTokenSecret')
38 |
39 | # open a session
40 | session = OAuth1Session(ConsumerKey, ConsumerSecret, AccessToken, AccessTokenSecret)
41 |
42 | # collect users from the queries
43 | user_search = GETUsersSearch(session)
44 | user_search.setParams(' '.join(args.queries), target_count=args.count)
45 | user_search.waitReady()
46 | result = user_search.call()
47 | logger.info('obtained %d users' % len(result))
48 |
49 | if args.dump:
50 | logger.info('writing raw data to file %s' % args.dump)
51 | json.dump(result, open(args.dump,'w'), indent=2)
52 |
53 | if args.output:
54 | logger.info('writing screen names to file %s' % args.output)
55 | with open(args.output,'w') as f:
56 | for user in result:
57 | f.write(user['screen_name'] + '\n')
58 | else:
59 | for user in result:
60 | sys.stdout.write(user['screen_name'] + '\n')
61 |
62 |
63 | if __name__ =="__main__":
64 | # parse command line
65 | parser = argparse.ArgumentParser()
66 | parser.add_argument('-c', '--config', default='config.ini', help="config file")
67 | 
parser.add_argument('-o', '--output', help="output screen names to a file")
68 | parser.add_argument('-D', '--dump', help="dump raw data to a file")
69 | parser.add_argument('-l', '--logfile', help="set a log file")
70 | parser.add_argument('-n', '--count', default=100, type=int,
71 | help="maximum number of users acquired by the search")
72 | parser.add_argument('-d', '--debug', action='store_true', help="debug mode")
73 | parser.add_argument('queries', metavar='KW', nargs='+', help='query keywords')
74 | args = parser.parse_args()
75 |
76 | # set up the logger
77 | stdhandler = logging.StreamHandler()
78 | stdhandler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
79 | logger.addHandler(stdhandler)
80 | if args.logfile:
81 | filehandler = logging.FileHandler(args.logfile, mode='w')
82 | filehandler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
83 | logger.addHandler(filehandler)
84 |
85 | if args.debug:
86 | logger.setLevel(logging.DEBUG)
87 |
88 | logger.info('started to collect twitter accounts')
89 | logger.debug('args=' + str(args))
90 |
91 | # call main process
92 | try:
93 | Main(args)
94 | except:
95 | logger.exception('exited with an error')
96 | sys.exit(1)
97 |
98 | logger.info('done')
99 |
100 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/twitter_api.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """python class to acquire twitter data with REST API 1.1.
4 |
5 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
6 |
7 | This software is released under the MIT License.
8 | http://opensource.org/licenses/mit-license.php
9 |
10 | """
11 |
12 | import argparse
13 | import json
14 | import sys
15 | import six
16 | import os
17 | import re
18 | import time
19 | import logging
20 | from datetime import datetime
21 |
22 | # get a logger object
23 | logger = logging.getLogger('root')
24 |
25 | # base API caller
26 | class TwitterAPI(object):
27 | def __init__(self, command, session):
28 | self.rest_api_url = 'https://api.twitter.com/1.1'
29 | self.error_code_url = 'https://dev.twitter.com/overview/api/response-codes'
30 | self.check_rate_limits = '/application/rate_limit_status'
31 | self.command = command
32 | self.session = session
33 | self.params = {}
34 |
35 | def call(self, retry=5):
36 | '''
37 | acquire data by a given method
38 | '''
39 | if len(self.params) == 0:
40 | raise Exception('parameters are not set for %s' % self.command)
41 |
42 | url = self.rest_api_url + self.command + '.json'
43 | n_errors = 0
44 | self.result = []
45 | while True:
46 | logger.debug('URL: ' + url)
47 | logger.debug('params: ' + str(self.params))
48 | res = self.session.get(url, params = self.params)
49 | if res.status_code == 200: # Success
50 | data = json.loads(res.text)
51 | if len(data) == 0:
52 | break
53 | if self.extract(data) == False: # if no more data need to be acquired
54 | break
55 | n_errors = 0
56 |
57 | # check if header includes 'X-Rate-Limit-Remaining'
58 | if 'X-Rate-Limit-Remaining' in res.headers \
59 | and 'X-Rate-Limit-Reset' in res.headers:
60 | if int(res.headers['X-Rate-Limit-Remaining']) == 0:
61 | waittime = int(res.headers['X-Rate-Limit-Reset']) \
62 | - time.mktime(datetime.now().timetuple())
63 | logger.info('reached the rate limit ... 
wait %d seconds', waittime + 5)
64 | time.sleep(waittime + 5)
65 | self.waitReady()  # waitReady() takes a retry count, not a session
66 | else:
67 | self.waitReady()
68 |
69 | elif res.status_code==401 or res.status_code==404:
70 | logger.warn('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
71 | logger.warn('error occurred in %s' % self.command)
72 | return None
73 | else:
74 | n_errors += 1
75 | if n_errors > retry:
76 | raise Exception('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
77 |
78 | logger.warn('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
79 | logger.warn('Service Unavailable ... wait 15 minutes')
80 | time.sleep(905)
81 |
82 | return self.result
83 |
84 | # parameter setting
85 | def _set_param(self, key, value, default=None):
86 | if value is None: # if value is None, the parameter is removed
87 | if key in self.params:
88 | del self.params[key]
89 | elif default is None or value != default: # if not default, set the parameter
90 | self.params[key] = value
91 |
92 |
93 | def getWaitTime(self, res_text):
94 | waittime = 0
95 | for command in [self.command, self.check_rate_limits]:
96 | category = re.sub(r'^/([^\s\/]+)/.*$', '\\1', command)
97 | remaining = int(res_text['resources'][category][command]['remaining'])
98 | reset = int(res_text['resources'][category][command]['reset'])
99 | if remaining == 0:
100 | waittime = max(waittime, reset - time.mktime(datetime.now().timetuple()))
101 |
102 | return waittime
103 |
104 |
105 | def waitReady(self, retry=5):
106 | '''
107 | check status, and wait until it gets available
108 | '''
109 | n_errors = 0
110 | while True:
111 | res = self.session.get(self.rest_api_url + self.check_rate_limits + '.json')
112 | if res.status_code == 200: # Success
113 | res_text = json.loads(res.text)
114 | waittime = self.getWaitTime(res_text)
115 | if (waittime > 0):
116 | logger.info('reached the rate limit ... wait %d seconds' % (waittime+5))
117 | time.sleep(waittime+5)
118 | n_errors = 0
119 | else:
120 | break
121 |
122 | else:
123 | n_errors += 1
124 | if n_errors > retry:
125 | raise Exception('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
126 |
127 | logger.warn('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
128 | logger.warn('Service Unavailable ... 
wait 15 minutes') 129 | time.sleep(905) 130 | 131 | 132 | ## some methods to get data 133 | 134 | class GETSearchTweets(TwitterAPI): 135 | ''' 136 | Object to acquire tweets by a query 137 | see https://dev.twitter.com/rest/reference/get/search/tweets 138 | Limit: Requests / 15-min window (app auth) <= 450 139 | ''' 140 | def __init__(self, session): 141 | super(GETSearchTweets, self).__init__('/search/tweets', session) 142 | self.query = {} 143 | self.target_count = -1 # default: unlimited 144 | self.reply_only = False 145 | 146 | def setParams(self, query='', target_count=-1, since_id=0, max_id=0, 147 | reply_only=None): 148 | self._set_param('q', query, '') 149 | self._set_param('since_id', since_id, 0) 150 | self._set_param('max_id', max_id, 0) 151 | if target_count > 0: 152 | self._set_param('count', min(target_count,100)) 153 | self.target_count = target_count 154 | else: 155 | self._set_param('count', 100) 156 | 157 | if reply_only is not None: 158 | self.reply_only = reply_only 159 | 160 | def extract(self, tweets): 161 | logger.debug('search_metadata:' + str(tweets['search_metadata'])) 162 | for tweet in tweets['statuses']: 163 | # extract reply tweets only 164 | if self.reply_only and tweet['in_reply_to_status_id'] is None: 165 | continue 166 | # extract entries 167 | self.result.append(tweet) 168 | if len(self.result) % 100 == 0: 169 | logger.info('...acquired %d tweets ' % len(self.result)) 170 | if self.target_count > 0 and len(self.result) >= self.target_count: 171 | return False 172 | 173 | # for the next call 174 | if len(tweets['statuses']) > 0: 175 | self._set_param('max_id', tweets['statuses'][-1]['id'] - 1) # update max_id 176 | return True 177 | else: 178 | return False 179 | 180 | 181 | class GETStatusesUserTimeline(TwitterAPI): 182 | ''' 183 | Object to acquire tweets along a user time line 184 | see https://dev.twitter.com/rest/reference/get/statuses/user_timeline 185 | Limit: Requests / 15-min window (app auth) <= 1500 186 | ''' 187 | def __init__(self, session): 188 | super(GETStatusesUserTimeline, self).__init__('/statuses/user_timeline', session) 189 | self.params['include_rts'] = 'false' 190 | self.params['exclude_replies'] = 'false' 191 | self.target_count = 0 192 | self.reply_only = False 193 | 194 | def setParams(self, name='', target_count=0, since_id=0, max_id=0, 195 | reply_only=None): 196 | self._set_param('screen_name', name, '') 197 | self._set_param('since_id', since_id, 0) 198 | self._set_param('max_id', max_id, 0) 199 | if target_count > 0: 200 | self._set_param('count', min(target_count,100)) 201 | self.target_count = target_count 202 | else: 203 | self._set_param('count', 100) 204 | 205 | if reply_only is not None: 206 | self.reply_only = reply_only 207 | 208 | def extract(self, tweets): 209 | for tweet in tweets: 210 | # extract reply tweets only 211 | if self.reply_only and tweet['in_reply_to_status_id'] is None: 212 | continue 213 | # store tweets 214 | self.result.append(tweet) 215 | if len(self.result) % 100 == 0: 216 | logger.info('...acquired %d tweets ' % len(self.result)) 217 | if self.target_count>0 and len(self.result) >= self.target_count: 218 | return False 219 | 220 | # for the next call 221 | if len(tweets) > 0: 222 | self._set_param('max_id', tweets[-1]['id'] - 1) # update max_id 223 | return True 224 | else: 225 | return False 226 | 227 | 228 | class GETStatusesLookup(TwitterAPI): 229 | ''' 230 | Object to acquire tweets by ID numbers 231 | see https://dev.twitter.com/rest/reference/get/statuses/lookup 232 | Limit: Requests / 
15-min window (app auth) <= 300
233 | '''
234 | def __init__(self, session):
235 | super(GETStatusesLookup, self).__init__('/statuses/lookup', session)
236 | self.count = 100 # can get up to 100 tweets at once
237 |
238 | def setParams(self, id_set=None):
239 | if id_set is not None:
240 | self.id_list = list(id_set)
241 | self.params['id'] = ','.join([str(n) for n in self.id_list[0:self.count]])
242 | self.total_count = 0
243 |
244 | def extract(self, tweets):
245 | for tweet in tweets:
246 | self.result.append(tweet)
247 | if len(self.result) % 100 == 0:
248 | logger.info('...acquired %d tweets ' % len(self.result))
249 |
250 | self.total_count += self.count
251 | if self.total_count >= len(self.id_list):
252 | return False
253 |
254 | # for the next call
255 | sub_ids = self.id_list[self.total_count:self.total_count+self.count]
256 | self.params['id'] = ','.join([str(n) for n in sub_ids])
257 | return True
258 |
259 |
260 | class GETUsersSearch(TwitterAPI):
261 | '''
262 | Object to search users by a query
263 | see https://dev.twitter.com/rest/reference/get/users/search
264 | Limit: Requests / 15-min window (user auth) <= 900
265 | '''
266 | def __init__(self, session):
267 | super(GETUsersSearch, self).__init__('/users/search', session)
268 | self.target_count = 100 # default
269 | self.params['count'] = 20 # can get 20 entries per page
270 |
271 | def setParams(self, query='', target_count=0):
272 | self._set_param('q', query, '')
273 | if target_count > 0:
274 | self.target_count = target_count
275 | self.params['page'] = 1
276 |
277 | def extract(self, text):
278 | for user in text:
279 | self.result.append(user)
280 | if len(self.result) % 100 == 0:
281 | logger.info('...acquired %d users ' % len(self.result))
282 | if len(self.result) >= self.target_count:
283 | return False
284 |
285 | # for the next call
286 | self.params['page'] += 1
287 | return True
288 |
289 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/view_dialogs.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """view twitter dialogs.
4 |
5 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
6 |
7 | This software is released under the MIT License.
8 | http://opensource.org/licenses/mit-license.php
9 |
10 | """
11 |
12 | import json
13 | import sys
14 | import six
15 |
16 | if six.PY2:
17 | reload(sys)
18 | sys.setdefaultencoding('utf-8')
19 |
20 | if len(sys.argv) < 2:
21 | print ('usage: view_dialogs.py dialogs.json ...')
22 | sys.exit(1)
23 |
24 | for fn in sys.argv[1:]:
25 | dialog_set = json.load(open(fn,'r'))
26 | for tid in sorted([int(s) for s in dialog_set.keys()]):
27 | dialog = dialog_set[str(tid)]
28 | lang = dialog[0]['lang']
29 | if lang == 'en':
30 | print ('--- ID:%d (length=%d) ---\n' % (tid, len(dialog)))
31 | for utterance in dialog:
32 | screen_name = utterance['user']['screen_name']
33 | name = utterance['user']['name']
34 | text = utterance['text']
35 | print ('%s (%s)' % (utterance['created_at'],utterance['id']))
36 | print ('%s (@%s) : %s\n' % (name, screen_name, text))
37 |
38 |
-------------------------------------------------------------------------------- /tasks/opensubs/extract_opensubs_dialogs.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #-*- coding: utf-8 -*-
3 | """extract_opensubs_dialogs.py:
4 | A script to extract text from OpenSubtitles. 
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import xml.etree.ElementTree as ET
14 | from gzip import GzipFile
15 | import sys
16 | import re
17 | import glob
18 | import random
19 | from tqdm import tqdm
20 | import six
21 | import argparse
22 |
23 |
24 | def preprocess(sent):
25 | """ text preprocessing using regular expressions
26 | """
27 | # remove tags
28 | new_sent = re.sub(r'(<[^>]*>|{[^}]*}|\([^\)]*\)|\[[^\]]*\])',' ', sent)
29 | # replace apostrophe and convert letters to lower case
30 | new_sent = new_sent.replace('\\\'','\'').lower()
31 | # delete a space right after an isolated apostrophe
32 | new_sent = re.sub(r' \' (?=(em|im|s|t|bout|cause)\s+)', ' \'', new_sent)
33 | # delete a space right before an isolated apostrophe
34 | new_sent = re.sub(r'(?<=n) \' ', '\' ', new_sent)
35 | # delete a space right before a period for titles
36 | new_sent = re.sub(r'(?<=( mr| jr| ms| dr| st|mrs)) \.', '. ', new_sent)
37 | # remove speaker tag "xxx: "
38 | new_sent = re.sub(r'^\s*[A-z]*\s*:', '', new_sent)
39 | # remove unnecessary symbols
40 | new_sent = re.sub(u'([-–—]+$| [-–—]+|[-–—]+ |% %|#+|\'\'|``| \' |[\(\)\"])', ' ', new_sent)
41 | # convert i̇->i
42 | new_sent = re.sub(u'i̇','i', new_sent)
43 | # convert multiple spaces to a single space
44 | new_sent = re.sub(r'\s+', ' ', new_sent).strip()
45 | # ignore sentence with only space or some symbols
46 | if not re.match(r'^(\s*|[\.\?$%!,:;])$', new_sent):
47 | return new_sent
48 | else:
49 | return ''
50 |
51 |
52 | def extract(filePath, corpus):
53 | """extract text from an XML file and tokenize it
54 | """
55 | if filePath.endswith('.gz'):
56 | tree = ET.parse(GzipFile(filename=filePath))
57 | else:
58 | tree = ET.parse(filePath)
59 |
60 | root = tree.getroot()
61 | sent = ''
62 | for child in root:
63 | for elem in child:
64 | if elem.tag == 'w':
65 | sent += ' ' + elem.text
66 |
67 | if not sent.strip().endswith(':'):
68 | if six.PY2:
69 | new_sent = preprocess(sent).encode('utf-8')
70 | else:
71 | new_sent = preprocess(sent)
72 |
73 | if new_sent:
74 | corpus.append((new_sent, len(new_sent.split())))
75 | sent = ''
76 |
77 | corpus.append(('',0))
78 |
79 |
80 | if __name__ == '__main__':
81 | parser = argparse.ArgumentParser()
82 | parser.add_argument('--output', default=['train.txt'], nargs='+', type=str,
83 | help='Filenames of data')
84 | parser.add_argument('--ratio', default=[0.01], nargs='+', type=float,
85 | help='Extraction rate for each data set')
86 | parser.add_argument('--max-length', default=20, type=int,
87 | help='Maximum length of sentences')
88 | parser.add_argument('--rootdir', default='.',
89 | help='root directory of data source')
90 |
91 | args = parser.parse_args()
92 |
93 | if len(args.output) != len(args.ratio):
94 | raise Exception('The number of output files (%d) and the number of extraction ratios (%d) should be the same.' 
% (len(args.output), len(args.ratio)))
95 | random.seed(99)
96 |
97 | rootdir = args.rootdir
98 | print('collecting files from ' + rootdir)
99 | xmlfiles = glob.glob(rootdir + '/*.xml.gz') + glob.glob(rootdir + '/*/*.xml.gz') + glob.glob(rootdir + '/*/*/*.xml.gz')
100 |
101 | random.shuffle(xmlfiles, random.random)
102 | total_ratio = sum(args.ratio)
103 | n_files = int( len(xmlfiles) * total_ratio )
104 | print('loading text from %d/%d files' % (n_files, len(xmlfiles)))
105 | corpus = []
106 | for n in tqdm(six.moves.range(n_files)):
107 | extract(xmlfiles[n], corpus)
108 |
109 | print('%d sentences loaded' % len(corpus))
110 | indices = list(six.moves.range(len(corpus)-1))
111 | random.shuffle(indices, random.random)
112 |
113 | partition = [0]
114 | acc_ratio = 0.0
115 | for r in args.ratio:
116 | acc_ratio += r / total_ratio
117 | partition.append(int(acc_ratio * len(corpus)))
118 |
119 | for n in six.moves.range(len(partition)-1):
120 | print('writing %d sentence pairs to %s' % (partition[n+1]-partition[n], args.output[n]))
121 | with open(args.output[n],'w') as f:
122 | for idx in indices[partition[n]:partition[n+1]]:
123 | len1 = corpus[idx][1]
124 | len2 = corpus[idx+1][1]
125 | if 0 < len1 < args.max_length and 0 < len2 < args.max_length:
126 | six.print_('U: %s\nS: %s\n' % (corpus[idx][0],corpus[idx+1][0]), file=f)
127 |
128 |
-------------------------------------------------------------------------------- /tasks/opensubs/make_trial_data.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # you can download English data of OpenSubtitles2016 from
4 | #
5 | # http://opus.lingfil.uu.se/download.php?f=OpenSubtitles2016/en.tar.gz
6 | #
7 | # and extract xml files by
8 | #
9 | # tar zxvf en.tar.gz
10 | #
11 | # you need to specify the location of the data
12 | stored_data=./OpenSubtitles2016/xml/en
13 |
14 | # extract train and dev sets
15 | echo extracting training, development, and test sets
16 | ./extract_opensubs_dialogs.py \
17 | --output opensubs_trial_data_train.txt opensubs_trial_data_dev.txt \
18 | --ratio 0.1 0.001 \
19 | --rootdir $stored_data
20 |
21 | echo extracting evaluation data
22 | # extract 500 samples from dev set randomly for evaluation
23 | ./utils/sample_dialogs.py opensubs_trial_data_dev.txt 500 > opensubs_trial_data_eval.txt
24 |
25 | echo done
26 |
-------------------------------------------------------------------------------- /tasks/opensubs/utils: --------------------------------------------------------------------------------
1 | ../utils
-------------------------------------------------------------------------------- /tasks/twitter/account_names_dev.txt: --------------------------------------------------------------------------------
1 | 1800flowers
2 | ALCATEL1TOUCH
3 | ATTCares
4 | AetnaHelp
5 | AskCapitalOne
6 | AskMTNGhana
7 | AsurionCares
8 | AvidSupport
9 | Avis
10 | BookChicClub
11 | BostonMarki
12 | CIPD
13 | CMLFerry
14 | CadburyIreland
15 | Camper_CustCare
16 | CastelliCafe
17 | ChaseSupport
18 | ChevyCustCare
19 | CollectorCorps
20 | DollarGeneral
21 | DolphinCares
22 | ENMAXenergy
23 | EmiratesSupport
24 | Entergy
25 | Equinox_Service
26 | FCUK
27 | FedExCanada
28 | FoodLion
29 | FootballIndexUK
30 | FordSouthAfrica
31 | GapCA
32 | GiltService
33 | GrandHyattNYC
34 | GreyhoundBus
35 | HDCares
36 | HTCHelp
37 | Heinens
38 | IPFW_ITS_HD
39 | IWCare
40 | IrishRail
41 | JKalena123
42 | JawboneSupport
43 | JeepCares
44 | Jersey_City
45 | KitchenAid_CA
46 | Kohls
47 | LQcontactus
48 | 
LinkedInUK 49 | Lo573 50 | LoveWilko 51 | LozHarvey 52 | MDSHA 53 | MaclarenService 54 | MarshaCollier 55 | MeteorPR 56 | NBASTORE 57 | O2IRL 58 | OfficeShoes 59 | OrbitzCareTeam 60 | PRISM_NSA 61 | PanagoCares 62 | PhilipsCare_UK 63 | PostOfficeNews 64 | PoweradeGB 65 | RareLondon 66 | RubyBlogVAA 67 | SAS_Cares 68 | SCG_CS 69 | SDixonVivint 70 | SKYserves 71 | SSE 72 | ShoeboxedHelp 73 | SkyIreland 74 | SkySaga 75 | SonySupportUSA 76 | SouthWestWater 77 | StarbucksUKCA 78 | SunTzuHelpDesk 79 | SurflineGH 80 | Tech21Official 81 | TheBodyShopUK 82 | TicketmasterCS 83 | TravelGuard 84 | UconnectCares 85 | VolvoCarUSA 86 | Walmart 87 | WasteWR 88 | _omocat 89 | askevanscycles 90 | atmosenergy 91 | audiblesupport 92 | austin_reed 93 | catalysthousing 94 | champssports 95 | consult53 96 | dan_malarkey 97 | digicelbarbados 98 | dish 99 | edfenergycs 100 | feelunique 101 | hmunitedkingdom 102 | justhype 103 | kenmore 104 | lidl_ireland 105 | marilynsuttle 106 | miasampietro 107 | officedepot 108 | originenergy 109 | paycity 110 | paysafecard 111 | placesforpeople 112 | portlandwater 113 | pretavoir 114 | searscares 115 | skrill 116 | spreecoza 117 | stayhomeclub 118 | superbalist 119 | -------------------------------------------------------------------------------- /tasks/twitter/account_names_train.txt: -------------------------------------------------------------------------------- 1 | 1800flowershelp 2 | 64audio 3 | 888sport 4 | ABCustomerCare 5 | ABTAtravel 6 | AFC_Amy 7 | AFCustomerCare 8 | AIRNZUSA 9 | APCcustserv 10 | ASHelpMe 11 | ASOS 12 | ASOS_Au 13 | ASOS_Us 14 | ASUHelpCenter 15 | ASUSUSA 16 | AVIVAIRELAND 17 | AbercrombieHelp 18 | AddisonLeeCabs 19 | AdobeCare 20 | AflacPhyllis 21 | AirAsia 22 | AirChinaNA 23 | AirtelNigeria 24 | Alamo 25 | AlaskaAir 26 | Albertsons 27 | AlderTraining 28 | AldiUK 29 | Aldi_Ireland 30 | AlfaRomeoCareUK 31 | AliExpress_EN 32 | AlissaDosSantos 33 | Allegiant 34 | AllianzTravelUS 35 | Allstate 36 | AllyCare 37 | AlshayaHelpDesk 38 | AmanaCare 39 | AmazonHelp 40 | AmericanAir 41 | AmexAU 42 | AndreaSWilson 43 | AnglianHelp 44 | AnkiSupport 45 | AnthemBC_News 46 | AppleSupport 47 | AppleyardLondon 48 | ArqivaWiFi 49 | AsdaCustCare 50 | Ask123ie 51 | AskAmex 52 | AskCiti 53 | AskDyson 54 | AskEmblemHealth 55 | AskKBCIreland 56 | AskLandsEnd 57 | AskMarstons 58 | AskMeHelpDesk 59 | AskPS_UK 60 | AskSmythsToys 61 | AskSubaruCanada 62 | AskTarget 63 | AskTeamUA 64 | Ask_Spectrum 65 | AskeBay 66 | Askvodafonegh 67 | AsmodeeNA 68 | Atomos_News 69 | AudiUK 70 | AudiUKCare 71 | AutodeskHelp 72 | Ayres_Hotels 73 | BCBSLA 74 | BCC_Help 75 | BCSSupport 76 | BJsWholesale 77 | BN_care 78 | BOOKSetc_online 79 | BP_UK 80 | BP_plc 81 | BRCustServ 82 | BSSHelpDesk 83 | BabiesRUs 84 | BabyJogger 85 | BananaRepublic 86 | BandQ 87 | BarclaycardNews 88 | BarclaysUK 89 | Baxter_Auto 90 | BeautyBarDotCom 91 | Beaverbrooks 92 | BedBathBeyond 93 | BenefitUK 94 | BernardBoutique 95 | BiffaService 96 | BikeMetro 97 | BitcoinHelpDesk 98 | BlackBerryHelp 99 | Blanchford 100 | Blendtec 101 | BluestarHQ 102 | Boars_Head 103 | BoostCare 104 | BootsUK 105 | BordGaisEnergy 106 | BoseService 107 | BounceEnergy 108 | BritishGas 109 | BritishGasNews 110 | Budget 111 | Buick 112 | BuildaHelpDesk 113 | BupaUK 114 | BurgerKing 115 | Burton_Menswear 116 | CDCustService 117 | COTABus 118 | CRCustomerChat 119 | CRTContactUs 120 | CTS___ 121 | CableBill 122 | CadburyUK 123 | CalMac_Updates 124 | CallawayGolfCS 125 | CanadianPacific 126 | CapMetroATX 127 | CareShopBunzl 128 | CaroKopp 129 | Cartii 
130 | CheapTixHearsU 131 | ChiccoUSA 132 | ChryslerCares 133 | Cignaquestions 134 | CiteThisForMe 135 | CitiBikeNYC 136 | Citibank 137 | CitySprint_help 138 | ClaireLBSmith 139 | ClairesEurope 140 | CoastCath 141 | ComcastILLINOIS 142 | ComcastOrSWWa 143 | CoopEnergy 144 | Costco 145 | CoxHelp 146 | DCBService 147 | DEWALT_UK 148 | DEWALTtough 149 | DFSCare 150 | DIBsupport 151 | DICKS 152 | DIGICELJamaica 153 | DIRECTV 154 | DNBcares 155 | DPDCustomerCare 156 | DSGSupport 157 | DStvNg 158 | DTSatWIT 159 | DarkBunnyTees 160 | DawnCarillion 161 | Deliveroo 162 | Deliveroo_IE 163 | Desk 164 | DevonCC 165 | DianaHSmith 166 | DigitalRiverInc 167 | Dillards 168 | Discover 169 | Discovery_SA 170 | DivvyBikes 171 | Dreams_Beds 172 | DressCircleShop 173 | DrinkSparkletts 174 | DunelmUK 175 | ENMAX 176 | EPCOR 177 | EYellin 178 | EarthLink 179 | EatNaturalBars 180 | EllenKeeble 181 | EmbracePetIns 182 | EmersonIT 183 | EntergyArk 184 | Enterprise 185 | EvansCycles 186 | EversourceCT 187 | EviteSupport 188 | Expedia 189 | ExpressHelp 190 | FFGames 191 | FH_CustomerCare 192 | FLBlueCares 193 | FLLFlyer 194 | FTcare 195 | FUT_COINSTORE 196 | FabFitFunCS 197 | FabSupportTeam 198 | Fanatics 199 | Fandom_Insider 200 | FedExCanadaHelp 201 | FedExEurope 202 | FedExHelpEU 203 | FeelGoodPark 204 | FiatCareUK 205 | FiguresToyCo 206 | FinnairHelps 207 | Fly_Norwegian 208 | Fon 209 | FonCare 210 | Footasylum 211 | Ford 212 | Forever21Help 213 | FortisBC 214 | FrankEliason 215 | FreeviewTV 216 | FromYouFlowers 217 | FrontierCare 218 | FunjetVacations 219 | FunkoDCLegion 220 | GEICO_Service 221 | GETMEIN 222 | GM 223 | GRT_ROW 224 | GSMA_Care 225 | Gap 226 | GarudaCares 227 | GenesisHousing 228 | GeoffRamm 229 | GeorgiaPower 230 | GeoxCares 231 | GlideUK 232 | GloCare 233 | GlossyboxUK 234 | GoSmartNC 235 | GoTriangle 236 | Go_CheshireWest 237 | Go_Wireless 238 | GongshowService 239 | Google 240 | GrandHyattSD 241 | GreatClipsCares 242 | Groove 243 | Grubhub_Care 244 | Gymshark 245 | Gymshark_Help 246 | HEB 247 | HMSHost 248 | HQhair_Help 249 | HRBlockAnswers 250 | HSBC_Sport 251 | HSBC_UAE 252 | HSBC_UK 253 | HSBC_US 254 | HSScustomercare 255 | HalfordsCycling 256 | HarrisTeeter 257 | HarrodsService 258 | HarryandDavid 259 | HawaiianAir 260 | HeatherJStrout 261 | HelloKit 262 | HilltopNorfolk 263 | HollisterCoHelp 264 | HomeDepotCanada 265 | HomesenseUK 266 | Honda 267 | HondaCustSvc 268 | HondaPowersprts 269 | Hootsuite_Help 270 | Hotwire 271 | Huawei 272 | HuaweiMobile 273 | HudsonshoesUK 274 | HullUni_ICT 275 | HumanaHelp 276 | HwnElectric 277 | HyattChurchill 278 | Hyken 279 | IGearBrand 280 | IKEAIESupport 281 | IKEAUKSupport 282 | IKEAUSAHelp 283 | INDOCHINO 284 | INDOT 285 | INFINITICare 286 | INFINITIUSA 287 | IcelandFoods 288 | InMotionCares 289 | Incite_Group 290 | IndyDPW 291 | IslandAirHawaii 292 | Iyengarish 293 | JDhelpteam 294 | JIRAServiceDesk 295 | JLcustserv 296 | JLove55 297 | JMstore 298 | JabraEurope 299 | Jabra_US 300 | JackWills 301 | JagexInfinity 302 | JamboPayCare 303 | Jamie1973 304 | JapanHelpDesk 305 | Jeep 306 | Jesshillcakes 307 | JetBlue 308 | JetHeads 309 | JetstarAirways 310 | Jetstar_NZ 311 | JimEllisAudi 312 | JimatPlanters 313 | JonesClocks 314 | KFC_UKI 315 | KFC_UKI_Help 316 | KLM_UK 317 | KakaoTalkPH 318 | KateNasser 319 | KauaiSA 320 | Kazel_Kimpo 321 | KenyaPower 322 | KimiNozoGuy 323 | KingFlyKOTD 324 | KitbagCS 325 | KitchenAid_CAre 326 | KrogerSupport 327 | LVcares 328 | LandRover_UK 329 | LeapCard 330 | LenovoANZ 331 | LibertyHelpDesk 332 | LibertyMutual 333 | 
LidsAssist 334 | LinkedInHelp 335 | LiveChat 336 | LiveNationON 337 | LivePhishCS 338 | Ljbpieces 339 | Logitech_ANZ 340 | Lovehoney 341 | LowesCares 342 | LubeStop 343 | MACcosmetics 344 | MBTA_CR 345 | MCMComicCon 346 | METROHouAlerts 347 | MLBFanSupport 348 | MRPHelp 349 | MRPfashion 350 | MTA 351 | MTGOTraders 352 | MTN180 353 | MUSupportDesk 354 | MVPHealthCare 355 | MWilbanks 356 | MacsalesCSTS 357 | Macys 358 | MadameTussauds 359 | MallforAfrica 360 | ManiereDeVoir 361 | MarcGoodmanBos 362 | MasterofMaltCS 363 | Matalan 364 | MaxiCosiUK 365 | MaytagBrand 366 | MaytagCare 367 | Mazda_SA 368 | McDonalds 369 | Mcheza_Care 370 | Medela_US 371 | MetroBank_Help 372 | MetroTransitMN 373 | MicrosoftHelps 374 | Minted 375 | Misfit 376 | MissSelfridge 377 | Missytohbadt 378 | ModCloth_Cares 379 | MrCabinetCareOC 380 | MrsGammaLabs 381 | Musictoday 382 | MwaveAu 383 | MyDoncaster 384 | NISMO_USA 385 | NOOK_Care 386 | NOOK_Care_UK 387 | NYSC 388 | NYTCare 389 | Nathane_Jackson 390 | NationalPro 391 | NealTopf 392 | NetflixANZ 393 | NetflixUK 394 | Netflix_CA 395 | NetoneSupport 396 | NeweggService 397 | Nipponyasancom 398 | NissanUSA 399 | Nordstrom 400 | Norfolkholidays 401 | Nutrisystem 402 | OEcare 403 | OasisFashion 404 | OfficeCandyGals 405 | OfficeDivvy 406 | OldNavyCA 407 | Ooma 408 | OoredooCare 409 | OoredooQatar 410 | OpenMike_TV 411 | OptumRx 412 | OrangeKe_Care 413 | OtterBox 414 | OtterBoxCS 415 | PASmithjr 416 | PBCares 417 | PCRichardandSon 418 | PECOconnect 419 | PIA_Cust_unCare 420 | PINGTourEurope 421 | PLC_Singapore 422 | PMInstitute 423 | PNCBank_Help 424 | PNMtalk 425 | POTUS_CustServ 426 | PaddyPowerShops 427 | PalaceMovies 428 | PanaService_UK 429 | Panago_Pizza 430 | PanasonicUSA 431 | PapaJohns 432 | PaperMate 433 | ParkHyattChi 434 | PatchworkUrchin 435 | PatriotsProShop 436 | PayPalInfoSec 437 | PaychexService 438 | Paytmcare 439 | PeabodyLDN 440 | PenskeCares 441 | PeteFyfe 442 | PeterPanBus 443 | PeterboroughCC 444 | PeugeotZA 445 | Photobox 446 | PioneerDJ 447 | PlayDoh 448 | PlayStationAU 449 | PlentyOfFish 450 | Pokemon 451 | Porsche 452 | Primark 453 | PrincesTrust 454 | ProFlowers 455 | ProtectionOne 456 | Publix 457 | PublixHelps 458 | PurolatorHelp 459 | QDStores 460 | Quinny_UK 461 | RACWA 462 | RAC_Care 463 | RBC 464 | RBC_Canada 465 | RBWM 466 | RCommCare 467 | RamCares 468 | RamseyCare 469 | Rand_Water 470 | RaneDJ 471 | RatedPeople 472 | Reachout_mcd 473 | ReebokUK 474 | Relish 475 | RideUTA 476 | RobSylvan 477 | RogersBiz 478 | RogersHelps 479 | Ronseal 480 | RoyalMail 481 | SAS 482 | SBEGlobal 483 | SEPTA_SOCIAL 484 | SFBayBikeShare 485 | SP_EnergyPeople 486 | STATravel_UK 487 | Safaricom_Care 488 | SageSupport 489 | SainsburysNews 490 | SaksService 491 | SamTrans 492 | SamsungCareSA 493 | SaskTel 494 | Schnittgemuese 495 | ScholasticClub 496 | ScottishPower 497 | Sears 498 | SeattleSPU 499 | Selfridges 500 | ServiceDivaBren 501 | Service_Queen 502 | SharisBerries 503 | ShiseidoUSA 504 | ShoePalace 505 | ShopRiteStores 506 | ShopRuche 507 | Sifter 508 | SilkFred 509 | SimplyhealthUK 510 | SkyCinemaUK 511 | SkypeSupport 512 | Sleepys 513 | SleepysCare 514 | SneakerRefresh 515 | SnoPUD 516 | Sofology 517 | SofologyHelp 518 | Sony 519 | SoundTransit 520 | SouthwestAir 521 | SparkNZ 522 | SpecializedCSUK 523 | Spencers_Retail 524 | Spinneys_Dubai 525 | SpiritAirlines 526 | Spokeo_Care 527 | SquareTrade 528 | StanChart 529 | StaplesUK 530 | StarHub 531 | StockTwitsHelp 532 | StonecareMike 533 | StraubsMarkets 534 | Stylistpick 535 | SubaruCanada 536 | 
SubwayListens 537 | Suddenlink 538 | SuddenlinkHelp 539 | SunCares 540 | Suncorp 541 | SuperShuttle 542 | Superdry 543 | Superdry_Care 544 | SwingSetService 545 | TCoughlin 546 | TEAVANA 547 | TELUS 548 | TEPenergy 549 | TKMaxx_UK 550 | TK_HelpDesk 551 | TLGTourHelp 552 | TMLewin 553 | TMRQld 554 | TMobileHelp 555 | TNTUKCare 556 | TSA 557 | TTCsue 558 | TW2CayC 559 | TWC_Help 560 | TWimpeySupport 561 | TacoBellTeam 562 | TandemDiabetes 563 | Tartineaubeurre 564 | TeamSanshee 565 | TeamSantone 566 | TeamShieldBSC 567 | TechUncensored 568 | TekCenter 569 | TelkomZA 570 | Tesco 571 | TessutiHelpTeam 572 | TextNowHelp 573 | TheBookPeople 574 | TheFrontDesk 575 | TheGymGroup 576 | TheLondonEye 577 | TheRAC_UK 578 | TheRTStore 579 | TheSharck 580 | ThomasCookCares 581 | ThomasCookUK 582 | ThomsonCares 583 | ThreadSence 584 | ThreeUK 585 | TiVo 586 | TiVoSupport 587 | TicketWeb 588 | Ticketmaster 589 | TicketmasterCA 590 | TomTom_SA 591 | Topheratl 592 | Topman 593 | TopshopHelp 594 | TotalGymDirect 595 | ToyotaCustCare 596 | ToyotaRacing 597 | ToysRUs 598 | TradeMe 599 | TruGreen 600 | Trustpilot 601 | UCLMainLibrary 602 | UFhd 603 | UHaul_Cares 604 | UKEnterprise 605 | UKMUJI 606 | UKVolkswagen 607 | UNB_ITS 608 | UPS_UK 609 | USAA_help 610 | USPS 611 | USPSHelp 612 | UWDoIT 613 | UnityQAThomas 614 | UpDesk 615 | VIZIOsupport 616 | VMUcare 617 | Venture_Cycles 618 | VerizonSupport 619 | VeryHelpers 620 | VirginAmerica 621 | VirginMediaIE 622 | Vitality_UK 623 | VitamixUK 624 | Vodacom 625 | Vodacom111 626 | VodacomRugga 627 | VodafoneAU_Help 628 | WAGSocialCare 629 | WHSmith 630 | WMSpillett 631 | WReynoldsYoung 632 | Walgreens 633 | WarThunder 634 | WatchShop 635 | WeChatZA 636 | WellsFargo 637 | WhirlpoolCare 638 | Whirlpool_CAre 639 | WholeFoods 640 | WilsonGolf 641 | WindowsSupport 642 | WingateHotels 643 | Wizards_Help 644 | XOCare 645 | YahooCare 646 | ZAGGdaily 647 | ZARA 648 | ZARA_Care 649 | Zapatosdesigner 650 | ZapposLuxury 651 | ZeekMarketplace 652 | ZoomSphere 653 | Zopim 654 | _MBService 655 | _rightio 656 | _valpow_ 657 | abbeygroup 658 | abbylynne19 659 | abellio_surrey 660 | acemtp 661 | ack 662 | acmemarkets 663 | acnestudios 664 | adamholz 665 | adidasGhana 666 | adidasUK 667 | adrianflux 668 | adrianswinscoe 669 | advocare 670 | airnzuk 671 | airtel_care 672 | alabamapower 673 | alamocares 674 | alpharooms 675 | alwaysriding 676 | americangiant 677 | andertonsmusic 678 | andrew_heister 679 | annabelkarmel 680 | asbcares 681 | ask_progressive 682 | askpermanenttsb 683 | asksurfline 684 | askvisa 685 | astros 686 | atomtickets 687 | audibleuk 688 | audiireland 689 | beautybay 690 | belk 691 | bestdealtv 692 | bexdeep 693 | beyondthedesk 694 | bigmikeyvegas 695 | blackanddecker 696 | bobbyrayburns 697 | boohoo_cshelp 698 | bookdepository 699 | booksamillion 700 | bravern 701 | brooksrunning 702 | bryanz85 703 | builds_io 704 | cableONE 705 | cafepress 706 | calorireland 707 | cam4_gay 708 | cars_portsmouth 709 | casualclassics 710 | cesarkeller 711 | cheftyler 712 | chevrolet 713 | chokemetoo 714 | chrisfenech1 715 | clarkshelp 716 | comcastcares 717 | comparemkt_care 718 | comparethemkt 719 | craftsman 720 | cstmrsvc 721 | ctcustomercare 722 | cvspharmacy 723 | danaddicott 724 | danandphilshop 725 | dancathy 726 | debthelpdesk 727 | deliverydotcom 728 | devinfinlay 729 | dgingiss 730 | dimonet 731 | directvnow 732 | djccwl 733 | dnataSupport 734 | dodgecares 735 | dongmsd 736 | doxiecare 737 | durafloruk 738 | eBay_UK 739 | easons 740 | easternbank 741 | edfenergy 742 | 
edfenergycomms 743 | eflow_freeflow 744 | eh_custcare 745 | eirNews 746 | elfcosmetics 747 | emikathure 748 | enterprisecares 749 | epicgeargaming 750 | epointsjordan 751 | esetcares 752 | etisalat 753 | eventbritehelp 754 | express 755 | farfetch 756 | faux_punk 757 | fdarenahelp 758 | fiatcares 759 | firstdirect 760 | fisherpaykelaus 761 | flySAA_US 762 | fontspring 763 | forduk 764 | foxrentcar 765 | freshlypicked 766 | fryselectronics 767 | gadventures 768 | geniusfoods 769 | getrespond 770 | getsatisfaction 771 | gigaclear 772 | glasses_direct 773 | gogreenride 774 | googlemaps 775 | graveshambc 776 | handtec 777 | hbonow 778 | helpscout 779 | hm 780 | hm_custserv 781 | hmaustralia 782 | hmcanada 783 | hmsouthafrica 784 | holden_aus 785 | holidayautos 786 | holidaytaxisCS 787 | houseoffraser 788 | hsamueljeweller 789 | hsr 790 | i_hate_ironing 791 | iiNet 792 | instituteofcs 793 | inthestyleUK 794 | itsemiel 795 | jAzzyF_BaBy 796 | jabrasport 797 | jackd 798 | jackiecas1 799 | jasoneden 800 | jaybaer 801 | jazzpk 802 | jcpenney 803 | jct600 804 | jeffreyboutique 805 | jessops 806 | jmspool 807 | joepindar 808 | joythestore 809 | jscheel 810 | k9cuisine 811 | kellie_brooks 812 | ketpoole 813 | kevinGEEdavis 814 | kiddicare 815 | kisluvkis 816 | kongacare 817 | latterash 818 | lidl_ni 819 | lidlcustomerc 820 | lids 821 | lidscanada 822 | lordandtaylor 823 | mamasandpapas 824 | mandkbutchers 825 | markmadison 826 | martindentist 827 | masnHelpDesk 828 | mattlatmatt 829 | mattr 830 | megabus 831 | melbournemuseum 832 | micahsolomon 833 | michaelshearer 834 | mitsucars 835 | mjanczewski 836 | monkiworld 837 | moreforURdollar 838 | mrpatto 839 | musicMagpie 840 | myUHC 841 | mycellcom 842 | nationalcares 843 | nationalgridus 844 | ncbja 845 | neimanmarcus 846 | netflix 847 | nextofficial 848 | nmkrobinson 849 | nokiamobile 850 | notonthehighst 851 | npowerhq 852 | nspowerinc 853 | nuerasupport 854 | nvidiacc 855 | nxcare 856 | officialpescm 857 | olympicfinishes 858 | overseasescape 859 | oxygenfreejump 860 | parracity 861 | paulodetarso24 862 | petedenton 863 | pfgregg 864 | picturehouses 865 | plusnethelp 866 | pond5 867 | premestateswine 868 | princessdvonb 869 | qatarairways 870 | redbox 871 | rhapsaadic 872 | riteaid 873 | ritual_co 874 | riverisland 875 | rwhelp 876 | sageuk 877 | sainsburys 878 | samanthastarmer 879 | samsungmobileng 880 | schuh 881 | scienceworks_mv 882 | scotiabank 883 | scottish_water 884 | scottrade 885 | secreteyesEW 886 | seetickets 887 | shaws 888 | shopmissa 889 | sizehelpteam 890 | sleepnumber 891 | smtickets 892 | sobeys 893 | spacecojo 894 | sportingindex 895 | spring 896 | sseairtricity 897 | statravelAU 898 | subaru_usa 899 | sunglassesshop 900 | supermartie 901 | swoonstars 902 | taportugal 903 | teeofftimes 904 | tescomobile 905 | tescomobilecare 906 | tescomobileire 907 | tfbalerts 908 | theMasterLink 909 | thebitchdesk 910 | thebondteam 911 | thekirbycompany 912 | thenutribullet 913 | thomaswales 914 | tjmaxx 915 | toister 916 | tonydataman 917 | townshoes 918 | traceychurray 919 | trafficscotland 920 | travelocity 921 | trimet 922 | twinklresources 923 | uga_eits 924 | united 925 | vaillantuk 926 | verabradleycare 927 | vigglesupport 928 | vmbusinesshelp 929 | vtaservice 930 | vueling 931 | vwcares 932 | w1zz 933 | wahoofitness 934 | warjohi 935 | we_energies 936 | whirlpoolusa 937 | wimrampen 938 | withazed 939 | wow_air 940 | wowairsupport 941 | wtzgoodPHL 942 | yoox 943 | zaggcare 944 | 
-------------------------------------------------------------------------------- /tasks/twitter/extract_official_twitter_dialogs.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """extract_official_twitter_dialogs.py:
4 | A script to extract text from twitter dialogs.
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import json
14 | import sys
15 | import os
16 | import six
17 | import re
18 | import argparse
19 | from datetime import datetime
20 | from tqdm import tqdm
21 | from nltk.tokenize import casual_tokenize
22 |
23 | if six.PY2:
24 | reload(sys)
25 | sys.setdefaultencoding('utf-8')
26 |
27 |
28 | def find_sequential_tweets(tweet, group):
29 | """ Check if the tweet is in multiple tweets of a single turn
30 | Args:
31 | tweet: the target tweet
32 | group: list of groups of multiple tweets
33 | Return:
34 | index of group in which the target is included
35 | if no group found, return -1
36 | """
37 | if len(group) == 0:
38 | return -1
39 | tfm = '%a %b %d %H:%M:%S +0000 %Y'
40 | tw_time = datetime.strptime(tweet['created_at'], tfm)
41 | #print (time1)
42 | for m,elm in enumerate(group):
43 | for et in elm[1].values():
44 | #print(et['created_at'])
45 | tdiff = (tw_time - datetime.strptime(et['created_at'], tfm))
46 | if tweet['in_reply_to_status_id'] is not None \
47 | and tweet['in_reply_to_status_id']==et['in_reply_to_status_id'] \
48 | and tweet['user']['id']==et['user']['id'] \
49 | and abs(tdiff.total_seconds()) < 600:
50 | return m
51 | return -1
52 |
53 |
54 | def validate_dialog(dialog, max_turns):
55 | """ Check if the dialog consists of no more than max_turns turns
56 | between two users, without truncated tweets.
57 | Args:
58 | dialog: target dialog
59 | max_turns: upper bound of #turns per dialog
60 | Return:
61 | True if the conditions are all satisfied
62 | False, otherwise
63 | """
64 | if len(dialog) > max_turns:
65 | return False
66 | # skip dialogs including truncated tweets or more users
67 | users = set()
68 | for utterance in dialog:
69 | for tid,tweet in utterance.items():
70 | if tweet['truncated'] == True:
71 | return False
72 | users.add(tweet['user']['id'])
73 | if len(users) != 2:
74 | return False
75 | return True
76 |
77 |
78 | def preprocess(text, name, speaker='U', first_name=None):
79 | """ normalize and tokenize raw text
80 | args:
81 | text: input raw text (str)
82 | name: user name (str)
83 | first_name: user's first name (str)
84 | speaker: 'S' if this is a system turn, 'U' otherwise (str)
85 | return:
86 | normalized text (str)
87 | """
88 | # modify apostrophe character
89 | text = re.sub(u'’',"'",text)
90 | text = re.sub(u'(“|”)','',text)
91 | # remove handle names in the beginning
92 | text = re.sub(r'^(@[A-Za-z0-9_]+[\.;, ])+','',text)
93 | # remove connected tweets indicator e.g. 
(1/2) (2/2)
94 | text = re.sub(r'(^|[\(\[ ])[1234]\/[2345]([\)\] ]|$)',' ',text)
95 | # replace long numbers
96 | text = re.sub(r'(?<=[ A-Z])(\+\d|\d\-|\d\d\d+|\(\d\d+\))[\d\- ]+\d\d\d','<NUMBER>',text)
97 | # replace user name in system response
98 | if speaker == 'S':
99 | if name:
100 | text = re.sub('@'+name, '<USER>', text)
101 | if first_name:
102 | text = re.sub('(^|[^A-Za-z0-9])'+first_name+'($|[^A-Za-z0-9])', '\\1<USER>\\2', text)
103 |
104 | # tokenize and replace entities
105 | words = casual_tokenize(text, preserve_case=False,reduce_len=True)
106 | for n in six.moves.range(len(words)):
107 | token = words[n]
108 | # replace entities with tags (E-MAIL, URL, NUMBERS, USER, etc)
109 | token = re.sub(r'^([a-z0-9_\.\-]+@[a-z0-9_\.\-]+\.[a-z]+)$','<E-MAIL>',token)
110 | token = re.sub(r'^https?:\S+$','<URL>',token)
111 | token = re.sub('^<number>$','<NUMBER>',token)
112 | token = re.sub('^<user>$','<USER>',token)
113 | # make spaces for apostrophe and period
114 | token = re.sub(r'^([a-z]+)\'([a-z]+)$','\\1 \'\\2',token)
115 | token = re.sub(r'^([a-z]+)\.([a-z]+)$','\\1 . \\2',token)
116 | words[n] = token
117 | # join
118 | text = ' '.join(words)
119 | # remove signature of tweets (e.g. ... ^TH, - John, etc.)
120 | if speaker == 'S':
121 | text = re.sub(u'[\\^\\-~–][\\-– ]*([a-z]+\\s*|[a-z ]{2,8})(\\s*$|\\.\\s*$|$)','\\2',text)
122 | if not re.search(r' (thanks|thnks|thx)\s*$', text):
123 | text = re.sub(u'(?<= [\\-,!?.–])\\s*[a-z]+\\s*$','',text)
124 |
125 | return text
126 |
127 |
128 | def print_dialog(dialog, fo, sys_name='', debug=False):
129 | """ print a dialog
130 | args:
131 | dialog (list): list of tweet groups
132 | sys_name (str): screen_name of system
133 | fo (object): file object to write text
134 | debug (bool): debug mode
135 | """
136 | user_name = ''
137 | user_first_name = ''
138 | prev_speaker = ''
139 | # print a dialog
140 | for utterance in dialog:
141 | for n,tweet in enumerate(sorted(utterance.values(), key=lambda u:u['id'])):
142 | screen_name = tweet['user']['screen_name']
143 | name = tweet['user']['name']
144 | text = tweet['text']
145 |
146 | if screen_name.lower() == sys_name.lower():
147 | speaker = 'S'
148 | if not debug:
149 | text = preprocess(text, user_name, speaker=speaker, first_name=user_first_name)
150 | else:
151 | speaker = 'U'
152 | if not debug:
153 | text = preprocess(text, system_name, speaker=speaker)
154 | # set user's screen name and first name to replace the names
155 | # to a common symbol (e.g. <USER>
) in system utterances
156 | user_name = screen_name
157 | tokens = name.split()
158 | if len(tokens) > 0 and len(tokens[0]) > 2:
159 | m = re.match(r'([A-Za-z0-9]+)$', tokens[0])
160 | if m:
161 | user_first_name = m.group(1)
162 | #print tokens[0],user_first_name
163 |
164 | if prev_speaker:
165 | if prev_speaker != speaker:
166 | six.print_('\n%s: %s' % (speaker,text), file=fo, end='')
167 | else: # connect to the previous tweet
168 | six.print_(' %s' % (text), file=fo, end='')
169 | else:
170 | six.print_('%s: %s' % (speaker,text), file=fo, end='')
171 |
172 | prev_speaker = speaker
173 |
174 | six.print_('\n', file=fo)
175 |
176 |
177 | def limit_dialogs(dialog_set, id_info):
178 | """
179 | limit dialog extraction by IDs
180 | """
181 | new_dialog_set = {}
182 | for did in dialog_set:
183 | dialog = dialog_set[did]
184 | new_dialog = []
185 | end = None
186 | for n in six.moves.range(len(dialog)-1,-1,-1):
187 | id_str = dialog[n]["id_str"]
188 | if end is None and id_str in id_info:
189 | end = id_info[id_str]
190 | if end is not None:
191 | new_dialog.insert(0,dialog[n])
192 | if dialog[n]['id'] == end:
193 | break
194 | if len(new_dialog) > 0:
195 | new_id = new_dialog[-1]['id_str']
196 | new_dialog_set[new_id] = new_dialog
197 | return new_dialog_set
198 |
199 |
200 | if __name__ == "__main__":
201 | # parse command line
202 | parser = argparse.ArgumentParser()
203 | parser.add_argument('-t', '--target', help='read screen names from a file')
204 | parser.add_argument('-o', '--output', help='output processed text into a file')
205 | parser.add_argument('-d', '--debug', action='store_true', help='debug mode')
206 | parser.add_argument('--data-dir', help='specify data directory')
207 | parser.add_argument('--max-turns', default=20, type=int,
208 | help='exclude long dialogs by this number')
209 | parser.add_argument('--id-file', default='', help='use begin-end ID file')
210 | parser.add_argument('--no-progress-bar', action='store_true',
211 | help='do not show a progress bar')
212 | parser.add_argument('files', metavar='FN', nargs='*',
213 | help='filenames of twitter dialogs in json format')
214 | args = parser.parse_args()
215 |
216 | # prepare filenames
217 | target_files = set()
218 | if args.target:
219 | for ln in open(args.target, 'r').readlines():
220 | target_files.add(os.path.join(args.data_dir, ln.strip() + '.json'))
221 | for fn in args.files:
222 | target_files.add(fn)
223 |
224 | target_files = sorted(list(target_files))
225 | # open output file
226 | if args.output:
227 | fo = open(args.output,'w')
228 | else:
229 | fo = sys.stdout
230 | # use id range file
231 | if args.id_file:
232 | id_info = json.load(open(args.id_file,'r'))
233 | else:
234 | id_info = None
235 |
236 | # extract dialogs
237 | # (1) merge separated tweets that can be considered one utterance
238 | # (2) filter irregular dialogs, e.g. too long, including truncated tweets, etc. 
239 | # (3) tokenize and normalize sentence text
240 | if not args.no_progress_bar:
241 | target_files = tqdm(target_files)
242 |
243 | for fn in target_files:
244 | dialog_set = json.load(open(fn,'r'))
245 | m = re.search(r'(^|\/)([^\/]+)\.json',fn)
246 | if m:
247 | system_name = m.group(2)
248 | else:
249 | raise Exception('no match to a screen name in %s' % fn)
250 |
251 | if id_info is not None:
252 | dialog_set = limit_dialogs(dialog_set, id_info[system_name])
253 |
254 | # make a tweet tree to merge sequential tweets by the same user
255 | root = {}
256 | for tid_str in sorted(dialog_set.keys()):
257 | dialog = dialog_set[tid_str]
258 | if dialog[0]['lang'] == 'en':
259 | tid = dialog[0]['id']
260 | if tid not in root:
261 | root[tid] = ([],{tid:dialog[0]})
262 | node = root[tid][0]
263 | for tweet in dialog[1:]:
264 | tid = tweet['id']
265 | m = find_sequential_tweets(tweet,node)
266 | if m >= 0:
267 | node[m][1][tid] = tweet
268 | else:
269 | node.append(([],{tid:tweet}))
270 | node = node[m][0] # when m == -1, this selects the group just appended
271 |
272 | # depth-first search and extract dialogs
273 | stack = []
274 | for tid in sorted(root.keys()):
275 | stack.append((root[tid][0], [root[tid][1]]))
276 |
277 | while len(stack)>0:
278 | node,dseq = stack.pop()
279 | if len(node) == 0:
280 | if validate_dialog(dseq, args.max_turns):
281 | print_dialog(dseq, fo, sys_name=system_name, debug=args.debug)
282 | else:
283 | for elm in node:
284 | stack.append((elm[0], dseq + [elm[1]]))
285 |
286 | if args.output:
287 | fo.close()
288 |
289 |
-------------------------------------------------------------------------------- /tasks/twitter/extract_official_twitter_testset.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """extract_official_twitter_testset.py:
4 | A script to extract the official test set from twitter dialogs.
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import json
14 | import sys
15 | import os
16 | import six
17 | import re
18 | import argparse
19 | from datetime import datetime
20 | from tqdm import tqdm
21 | from nltk.tokenize import casual_tokenize
22 |
23 | if six.PY2:
24 | reload(sys)
25 | sys.setdefaultencoding('utf-8')
26 |
27 | def preprocess(text, name, speaker='U', first_name=None):
28 | """ normalize and tokenize raw text
29 | args:
30 | text: input raw text (str)
31 | name: user name (str)
32 | first_name: user's first name (str)
33 | speaker: 'S' if this is a system turn, 'U' otherwise (str)
34 | return:
35 | normalized text (str)
36 | """
37 | # modify apostrophe character
38 | text = re.sub(u'’',"'",text)
39 | text = re.sub(u'(“|”)','',text)
40 | # remove handle names in the beginning
41 | text = re.sub(r'^(@[A-Za-z0-9_]+[\.;, ])+','',text)
42 | # remove connected tweets indicator e.g. 
(1/2) (2/2)
43 | text = re.sub(r'(^|[\(\[ ])[1234]\/[2345]([\)\] ]|$)',' ',text)
44 | # replace long numbers
45 | text = re.sub(r'(?<=[ A-Z])(\+\d|\d\-|\d\d\d+|\(\d\d+\))[\d\- ]+\d\d\d','<NUMBER>',text)
46 | # replace user name in system response
47 | if speaker == 'S':
48 | if name:
49 | text = re.sub('@'+name, '<USER>', text)
50 | if first_name:
51 | text = re.sub('(^|[^A-Za-z0-9])'+first_name+'($|[^A-Za-z0-9])', '\\1<USER>\\2', text)
52 |
53 | # tokenize and replace entities
54 | words = casual_tokenize(text, preserve_case=False,reduce_len=True)
55 | for n in six.moves.range(len(words)):
56 | token = words[n]
57 | # replace entities with tags (E-MAIL, URL, NUMBERS, USER, etc)
58 | token = re.sub(r'^([a-z0-9_\.\-]+@[a-z0-9_\.\-]+\.[a-z]+)$','<E-MAIL>',token)
59 | token = re.sub(r'^https?:\S+$','<URL>',token)
60 | token = re.sub('^<number>$','<NUMBER>',token)
61 | token = re.sub('^<user>$','<USER>',token)
62 | # make spaces for apostrophe and period
63 | token = re.sub(r'^([a-z]+)\'([a-z]+)$','\\1 \'\\2',token)
64 | token = re.sub(r'^([a-z]+)\.([a-z]+)$','\\1 . \\2',token)
65 | words[n] = token
66 | # join
67 | text = ' '.join(words)
68 | # remove signature of tweets (e.g. ... ^TH, - John, etc.)
69 | if speaker == 'S':
70 | text = re.sub(u'[\\^\\-~–][\\-– ]*([a-z]+\\s*|[a-z ]{2,8})(\\s*$|\\.\\s*$|$)','\\2',text)
71 | if not re.search(r' (thanks|thnks|thx)\s*$', text):
72 | text = re.sub(u'(?<= [\\-,!?.–])\\s*[a-z]+\\s*$','',text)
73 |
74 | return text
75 |
76 |
77 | if __name__ == "__main__":
78 | # parse command line
79 | parser = argparse.ArgumentParser()
80 | parser.add_argument('-t', '--target', help='read screen names from a file')
81 | parser.add_argument('-o', '--output', help='output processed text into a file')
82 | parser.add_argument('-d', '--debug', action='store_true', help='debug mode')
83 | parser.add_argument('--data-dir', help='specify data directory')
84 | parser.add_argument('--id-list', required=True, help='use ID list file')
85 | parser.add_argument('--no-progress-bar', action='store_true',
86 | help='do not show a progress bar')
87 | parser.add_argument('files', metavar='FN', nargs='*',
88 | help='filenames of twitter dialogs in json format')
89 | args = parser.parse_args()
90 |
91 | # prepare filenames
92 | target_files = set()
93 | if args.target:
94 | for ln in open(args.target, 'r').readlines():
95 | target_files.add(os.path.join(args.data_dir, ln.strip() + '.json'))
96 | for fn in args.files:
97 | target_files.add(fn)
98 |
99 | target_files = sorted(list(target_files))
100 | # read ID list
101 | id_list = json.load(open(args.id_list,'r'))
102 | id_pool = set()
103 | for di in id_list:
104 | for turn in di:
105 | id_pool |= set(turn['ids'])
106 | # extract necessary tweets
107 | if not args.no_progress_bar:
108 | target_files = tqdm(target_files)
109 | tweet_pool = {}
110 | for fn in target_files:
111 | dialog_set = json.load(open(fn,'r'))
112 | for tid_str in dialog_set:
113 | for tweet in dialog_set[tid_str]:
114 | tid = tweet['id']
115 | if tid in id_pool:
116 | tweet_pool[tid] = tweet
117 | # write text to file
118 | if args.output:
119 | fo = open(args.output,'w')
120 | else:
121 | fo = sys.stdout
122 | # always define tweet_id_list; wrap it with tqdm only when the progress bar is enabled
123 | tweet_id_list = id_list if args.no_progress_bar else tqdm(id_list)
124 |
125 | n_dialogs = 0
126 | n_turns = 0
127 | n_turns_in_list = 0
128 | for dialog in tweet_id_list:
129 | user_name = ''
130 | user_first_name = ''
131 | system_name = ''
132 | # print a dialog
133 | missed = False
134 | for ti,turn in enumerate(dialog):
135 | speaker = turn['speaker']
136 | six.print_('%s:' % speaker, file=fo, end='')
137 | if sum([tid in 
tweet_pool for tid in turn['ids']]) < len(turn['ids']):
138 | six.print_(' __MISSING__', file=fo)
139 | missed = True
140 |
141 | elif len(turn['ids'])==0 and ti==len(dialog)-1:
142 | six.print_(' __UNDISCLOSED__', file=fo)
143 | n_turns += 1
144 |
145 | else:
146 | for tid in turn['ids']:
147 | tweet = tweet_pool[tid]
148 | screen_name = tweet['user']['screen_name']
149 | name = tweet['user']['name']
150 | text = tweet['text']
151 | if speaker == 'S':
152 | if not args.debug:
153 | text = preprocess(text, user_name, speaker=speaker, first_name=user_first_name)
154 | system_name = screen_name
155 | else:
156 | if not args.debug:
157 | text = preprocess(text, system_name, speaker=speaker)
158 | # set user's screen name and first name to replace the names
159 | # to a common symbol (e.g. <USER>) in system utterances
160 | user_name = screen_name
161 | tokens = name.split()
162 | if len(tokens) > 0 and len(tokens[0]) > 2:
163 | m = re.match(r'([A-Za-z0-9]+)$', tokens[0])
164 | if m:
165 | user_first_name = m.group(1)
166 |
167 | six.print_(' %s' % text, file=fo, end='')
168 | six.print_('', file=fo)
169 | n_turns += 1
170 |
171 | n_turns_in_list += len(dialog)
172 | if not missed:
173 | n_dialogs += 1
174 | six.print_('', file=fo)
175 |
176 | if args.output:
177 | fo.close()
178 |
179 | print('--------------------------------------------------------')
180 | print('Number of successfully extracted dialogs: %d in %d' % (n_dialogs, len(tweet_id_list)))
181 | print('Number of successfully extracted turns: %d in %d' % (n_turns, n_turns_in_list))
182 | print('--------------------------------------------------------')
183 |
184 |
-------------------------------------------------------------------------------- /tasks/twitter/extract_twitter_dialogs.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """extract_twitter_dialogs.py:
4 | A script to extract text from twitter dialogs.
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import json
14 | import sys
15 | import os
16 | import six
17 | import re
18 | import argparse
19 | from datetime import datetime
20 | from tqdm import tqdm
21 | from nltk.tokenize import casual_tokenize
22 |
23 | if six.PY2:
24 | reload(sys)
25 | sys.setdefaultencoding('utf-8')
26 |
27 |
28 | def find_sequential_tweets(tweet, group):
29 | """ Check if the tweet is in multiple tweets of a single turn
30 | Args:
31 | tweet: the target tweet
32 | group: list of groups of multiple tweets
33 | Return:
34 | index of group in which the target is included
35 | if no group found, return -1
36 | """
37 | if len(group) == 0:
38 | return -1
39 | tfm = '%a %b %d %H:%M:%S +0000 %Y'
40 | tw_time = datetime.strptime(tweet['created_at'], tfm)
41 | #print (time1)
42 | for m,elm in enumerate(group):
43 | for et in elm[1].values():
44 | #print(et['created_at'])
45 | tdiff = (tw_time - datetime.strptime(et['created_at'], tfm))
46 | if tweet['in_reply_to_status_id'] is not None \
47 | and tweet['in_reply_to_status_id']==et['in_reply_to_status_id'] \
48 | and tweet['user']['id']==et['user']['id'] \
49 | and abs(tdiff.total_seconds()) < 600:
50 | return m
51 | return -1
52 |
53 |
54 | def validate_dialog(dialog, max_turns):
55 | """ Check if the dialog consists of no more than max_turns turns
56 | between two users, without truncated tweets. 
57 | Args: 58 | dialog: target dialog 59 | max_turns: upper bound of #turns per dialog 60 | Return: 61 | True if the conditions are all satisfied 62 | False, otherwise 63 | """ 64 | if len(dialog) > max_turns: 65 | return False 66 | # skip dialogs that include truncated tweets or involve more than two users 67 | users = set() 68 | for utterance in dialog: 69 | for tid,tweet in utterance.items(): 70 | if tweet['truncated']: 71 | return False 72 | users.add(tweet['user']['id']) 73 | if len(users) != 2: 74 | return False 75 | return True 76 | 77 | 78 | def preprocess(text, name, speaker='U', first_name=None): 79 | """ normalize and tokenize raw text 80 | args: 81 | text: input raw text (str) 82 | name: screen name to be masked as <USER> in system utterances (str) 83 | speaker: 'S' if this is a system turn, 'U' otherwise (str) 84 | first_name: user's first name (str) 85 | return: 86 | normalized text (str) 87 | """ 88 | # modify apostrophe character 89 | text = re.sub(u'’',"'",text) 90 | text = re.sub(u'(“|”)','',text) 91 | # remove handle names in the beginning 92 | text = re.sub(r'^(@[A-Za-z0-9_]+[\.;, ])+','',text) 93 | # remove connected tweets indicator e.g. (1/2) (2/2) 94 | text = re.sub(r'(^|[\(\[ ])[1234]\/[2345]([\)\] ]|$)',' ',text) 95 | # replace long numbers 96 | text = re.sub(r'(?<=[ A-Z])(\+\d|\d\-|\d\d\d+|\(\d\d+\))[\d\- ]+\d\d\d','<NUMBERS>',text) 97 | # replace user name in system response 98 | if speaker == 'S': 99 | if name: 100 | text = re.sub('@'+name, '<USER>', text) 101 | if first_name: 102 | text = re.sub('(^|[^A-Za-z0-9])'+first_name+'($|[^A-Za-z0-9])', '\\1<USER>\\2', text) 103 | 104 | # tokenize and replace entities 105 | words = casual_tokenize(text, preserve_case=False,reduce_len=True) 106 | for n in six.moves.range(len(words)): 107 | token = words[n] 108 | # replace entities with tags (E-MAIL, URL, NUMBERS, USER, etc) 109 | token = re.sub(r'^([a-z0-9_\.\-]+@[a-z0-9_\.\-]+\.[a-z]+)$','<E-MAIL>',token) 110 | token = re.sub(r'^https?:\S+$','<URL>',token) 111 | token = re.sub('^<numbers>$','<NUMBERS>',token)  # restore tags lowercased by the tokenizer 112 | token = re.sub('^<user>$','<USER>',token) 113 | # make spaces for apostrophe and period 114 | token = re.sub(r'^([a-z]+)\'([a-z]+)$','\\1 \'\\2',token) 115 | token = re.sub(r'^([a-z]+)\.([a-z]+)$','\\1 . \\2',token) 116 | words[n] = token 117 | # join 118 | text = ' '.join(words) 119 | # remove signature of tweets (e.g. ... ^TH, - John, etc.)
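# (illustrative only; hypothetical tweets, not from the corpus: the rules below
#  turn 'glad we could help ! ^td' into 'glad we could help !' and strip the
#  trailing '- john' from 'sure , we can do that . - john', while the
#  (thanks|thnks|thx) check keeps a genuine closing 'thanks' from being deleted)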
120 | if speaker == 'S': 121 | text = re.sub(u'[\\^\\-~–][\\-– ]*([a-z]+\\s*|[a-z ]{2,8})(\\s*$|\\.\\s*$|$)','\\2',text) 122 | if not re.search(r' (thanks|thnks|thx)\s*$', text): 123 | text = re.sub(u'(?<= [\\-,!?.–])\\s*[a-z]+\\s*$','',text) 124 | 125 | return text 126 | 127 | 128 | def print_dialog(dialog, fo, sys_name='', debug=False): 129 | """ print a dialog 130 | args: 131 | dialog (list): list of tweet groups 132 | fo (object): file object to write text 133 | sys_name (str): screen_name of the system account 134 | debug (bool): debug mode 135 | """ 136 | user_name = '' 137 | user_first_name = '' 138 | prev_speaker = '' 139 | # print a dialog 140 | for utterance in dialog: 141 | for tweet in sorted(utterance.values(), key=lambda u:u['id']): 142 | screen_name = tweet['user']['screen_name'] 143 | name = tweet['user']['name'] 144 | text = tweet['text'] 145 | if debug: 146 | six.print_('### %s: %s' % (name,text), file=fo) 147 | 148 | if screen_name.lower() == sys_name.lower(): 149 | speaker = 'S' 150 | text = preprocess(text, user_name, speaker=speaker, first_name=user_first_name) 151 | else: 152 | speaker = 'U' 153 | text = preprocess(text, sys_name, speaker=speaker)  # use the argument, not a global 154 | # set user's screen name and first name to replace the names 155 | # with a common symbol (e.g. <USER>) in system utterances 156 | user_name = screen_name 157 | tokens = name.split() 158 | if len(tokens) > 0 and len(tokens[0]) > 2: 159 | m = re.match(r'([A-Za-z0-9]+)$', tokens[0]) 160 | if m: 161 | user_first_name = m.group(1) 162 | 163 | 164 | if prev_speaker: 165 | if prev_speaker != speaker: 166 | six.print_('\n%s: %s' % (speaker,text), file=fo, end='') 167 | else: # connect to the previous tweet 168 | six.print_(' %s' % (text), file=fo, end='') 169 | else: 170 | six.print_('%s: %s' % (speaker,text), file=fo, end='') 171 | 172 | prev_speaker = speaker 173 | 174 | six.print_('\n', file=fo) 175 | 176 | 177 | if __name__ == "__main__": 178 | # parse command line 179 | parser = argparse.ArgumentParser() 180 | parser.add_argument('-t', '--target', help='read screen names from a file') 181 | parser.add_argument('-o', '--output', help='output processed text into a file') 182 | parser.add_argument('-d', '--debug', action='store_true', help='debug mode') 183 | parser.add_argument('--data-dir', help='specify data directory') 184 | parser.add_argument('--max-turns', type=int, default=20, 185 | help='exclude dialogs with more turns than this number') 186 | parser.add_argument('--no-progress-bar', action='store_true', 187 | help='do not show progress bar') 188 | parser.add_argument('files', metavar='FN', nargs='*', 189 | help='filenames of twitter dialogs in json format') 190 | args = parser.parse_args() 191 | 192 | # prepare filenames 193 | target_files = set() 194 | if args.target: 195 | for ln in open(args.target, 'r').readlines(): 196 | target_files.add(os.path.join(args.data_dir, ln.strip() + '.json')) 197 | for fn in args.files: 198 | target_files.add(fn) 199 | 200 | target_files = sorted(list(target_files)) 201 | # open output file 202 | if args.output: 203 | fo = open(args.output,'w') 204 | else: 205 | fo = sys.stdout 206 | 207 | # extract dialogs 208 | # (1) merge separated tweets that can be considered one utterance 209 | # (2) filter irregular dialogs, e.g. too long, including truncated tweets, etc.
210 | # (3) tokenize and normalize sentence text 211 | if not args.no_progress_bar: 212 | target_files = tqdm(target_files) 213 | 214 | for fn in target_files: 215 | dialog_set = json.load(open(fn,'r')) 216 | m = re.search(r'(^|\/)([^\/]+)\.json',fn) 217 | if m: 218 | system_name = m.group(2) 219 | else: 220 | raise Exception('no match to a screen name in %s' % fn) 221 | 222 | # make a tweet tree to merge sequential tweets by the same user 223 | root = {} 224 | for tid_str in sorted(dialog_set.keys()): 225 | dialog = dialog_set[tid_str] 226 | if dialog[0]['lang'] == 'en': 227 | tid = dialog[0]['id'] 228 | if tid not in root: 229 | root[tid] = ([],{tid:dialog[0]}) 230 | node = root[tid][0] 231 | for tweet in dialog[1:]: 232 | tid = tweet['id'] 233 | m = find_sequential_tweets(tweet,node) 234 | if m >= 0: 235 | node[m][1][tid] = tweet 236 | else: 237 | node.append(([],{tid:tweet})) 238 | node = node[m][0]  # m is -1 after append, so this also selects the new group 239 | 240 | # depth-first search and extract dialogs 241 | stack = [] 242 | for tid in sorted(root.keys()): 243 | stack.append((root[tid][0], [root[tid][1]])) 244 | 245 | while len(stack)>0: 246 | node,dseq = stack.pop() 247 | if len(node) == 0: 248 | if validate_dialog(dseq, args.max_turns): 249 | print_dialog(dseq, fo, sys_name=system_name, debug=args.debug) 250 | else: 251 | for elm in node: 252 | stack.append((elm[0], dseq + [elm[1]])) 253 | 254 | if args.output: 255 | fo.close() 256 | 257 | -------------------------------------------------------------------------------- /tasks/twitter/make_official_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # you can change the directory path according to the location of stored data 4 | stored_data=../../collect_twitter_dialogs/official_stored_data 5 | 6 | # you need to download tweet ID info to specify the data sets 7 | idfile=official_begin_end_ids.json 8 | idlink=https://www.dropbox.com/s/8lmpu5dfw2hmpys/official_begin_end_ids.json.gz 9 | echo downloading begin/end IDs for extracting data 10 | wget $idlink 11 | gunzip -f ${idfile}.gz 12 | 13 | # extract train and dev sets 14 | echo extracting training set 15 | ./extract_official_twitter_dialogs.py --data-dir $stored_data --id-file $idfile -t official_account_names_train.txt -o twitter_official_data_train.txt 16 | 17 | echo extracting development set 18 | ./extract_official_twitter_dialogs.py --data-dir $stored_data --id-file $idfile -t official_account_names_dev.txt -o twitter_official_data_dev.txt 19 | 20 | # extract 500 samples from dev set randomly for tentative evaluation 21 | echo extracting test set 22 | ./utils/sample_dialogs.py twitter_official_data_dev.txt 500 > twitter_official_data_dev500.txt 23 | 24 | echo done 25 | 26 | echo checking data size 27 | ./utils/check_dialogs.py twitter_official_data_train.txt twitter_official_data_dev.txt 28 | 29 | -------------------------------------------------------------------------------- /tasks/twitter/make_official_testset+refs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # you can change the directory path according to the location of stored data 4 | stored_data=../../collect_twitter_dialogs/official_stored_data 5 | 6 | # tweet ID list of the test set 7 | # (reference text is included since the challenge period has already finished) 8 | idlist=official_testset_ids+refs.json 9 | gunzip -c ${idlist}.gz > $idlist 10 | 11 | echo extracting the official test set 12 | ./extract_official_twitter_testset.py \ 13 | --id-list
$idlist \ 14 | --data-dir $stored_data \ 15 | --target official_account_names_test.txt \ 16 | --output twitter_official_data_test+refs.txt 17 | 18 | echo done 19 | -------------------------------------------------------------------------------- /tasks/twitter/make_official_testset.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # you can change the directory path according to the location of stored data 4 | stored_data=../../collect_twitter_dialogs/official_stored_data 5 | 6 | # tweet ID list of the test set 7 | # (reference text is not provided, i.e., last turn of each dialog is empty) 8 | idlist=official_testset_ids.json 9 | gunzip -c ${idlist}.gz > $idlist 10 | 11 | echo extracting the official test set 12 | ./extract_official_twitter_testset.py \ 13 | --id-list $idlist \ 14 | --data-dir $stored_data \ 15 | --target official_account_names_test.txt \ 16 | --output twitter_official_data_test.txt 17 | 18 | echo done 19 | -------------------------------------------------------------------------------- /tasks/twitter/make_trial_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # you can change the directory path according to the location of stored data 4 | stored_data=../../collect_twitter_dialogs/stored_data 5 | 6 | # extract train and dev sets 7 | echo extracting training set 8 | ./extract_twitter_dialogs.py --data-dir $stored_data -t account_names_train.txt -o twitter_trial_data_train.txt 9 | echo extracting development set 10 | ./extract_twitter_dialogs.py --data-dir $stored_data -t account_names_dev.txt -o twitter_trial_data_dev.txt 11 | 12 | # extract 500 samples from dev set randomly for evaluation 13 | echo extracting test set 14 | ./utils/sample_dialogs.py twitter_trial_data_dev.txt 500 > twitter_trial_data_eval.txt 15 | 16 | echo done 17 | -------------------------------------------------------------------------------- /tasks/twitter/official_account_names_dev.txt: -------------------------------------------------------------------------------- 1 | 1800flowers 2 | ALCATEL1TOUCH 3 | ATTCares 4 | AetnaHelp 5 | AskCapitalOne 6 | AskMTNGhana 7 | AsurionCares 8 | AvidSupport 9 | Avis 10 | BookChicClub 11 | BostonMarki 12 | CIPD 13 | CMLFerry 14 | CadburyIreland 15 | Camper_CustCare 16 | CastelliCafe 17 | ChaseSupport 18 | CollectorCorps 19 | DollarGeneral 20 | DolphinCares 21 | ENMAXenergy 22 | EmiratesSupport 23 | Entergy 24 | Equinox_Service 25 | FCUK 26 | FedExCanada 27 | FoodLion 28 | FootballIndexUK 29 | FordSouthAfrica 30 | GapCA 31 | GiltService 32 | GrandHyattNYC 33 | GreyhoundBus 34 | HDCares 35 | HTCHelp 36 | Heinens 37 | IPFW_ITS_HD 38 | IWCare 39 | IrishRail 40 | JKalena123 41 | JawboneSupport 42 | JeepCares 43 | Jersey_City 44 | KitchenAid_CA 45 | Kohls 46 | LQcontactus 47 | LinkedInUK 48 | Lo573 49 | LoveWilko 50 | LozHarvey 51 | MDSHA 52 | MaclarenService 53 | MarshaCollier 54 | MeteorPR 55 | NBASTORE 56 | O2IRL 57 | OfficeShoes 58 | OrbitzCareTeam 59 | PRISM_NSA 60 | PanagoCares 61 | PhilipsCare_UK 62 | PostOfficeNews 63 | PoweradeGB 64 | RareLondon 65 | RubyBlogVAA 66 | SAS_Cares 67 | SCG_CS 68 | SDixonVivint 69 | SKYserves 70 | SSE 71 | ShoeboxedHelp 72 | SkyIreland 73 | SkySaga 74 | SonySupportUSA 75 | SouthWestWater 76 | StarbucksUKCA 77 | SunTzuHelpDesk 78 | SurflineGH 79 | Tech21Official 80 | TheBodyShopUK 81 | TicketmasterCS 82 | TravelGuard 83 | UconnectCares 84 | VolvoCarUSA 85 | Walmart 86 | WasteWR 87 | _omocat 88 | askevanscycles 89 |
atmosenergy 90 | audiblesupport 91 | austin_reed 92 | catalysthousing 93 | champssports 94 | consult53 95 | dan_malarkey 96 | digicelbarbados 97 | dish 98 | edfenergycs 99 | feelunique 100 | hmunitedkingdom 101 | justhype 102 | kenmore 103 | lidl_ireland 104 | marilynsuttle 105 | miasampietro 106 | officedepot 107 | originenergy 108 | paycity 109 | paysafecard 110 | placesforpeople 111 | portlandwater 112 | pretavoir 113 | searscares 114 | skrill 115 | spreecoza 116 | stayhomeclub 117 | superbalist 118 | -------------------------------------------------------------------------------- /tasks/twitter/official_account_names_test.txt: -------------------------------------------------------------------------------- 1 | 1Sale 2 | ALawFirmTrainer 3 | AUKEYofficial 4 | AldiUSA 5 | AmyLolaM 6 | AshleyHomeHelp 7 | AskAmexCanada 8 | AskTigogh 9 | BKLNIndustries 10 | BestBuySupport 11 | Blu_GH 12 | CVSHealth 13 | CXSocialCare 14 | CanyonUK 15 | ChipRBell 16 | CustomerSure 17 | DrinkCrystal 18 | EdgarsFashion 19 | EntergyLA 20 | Everlane 21 | FCHolidayCares 22 | FamilyVideo 23 | FedExHelp 24 | FragranceShopUK 25 | FreedomMobile 26 | GlasgowSubway 27 | Glimpsei 28 | GrabTH 29 | GreenFlagUK 30 | HartmannLuggage 31 | Hebrideangirl 32 | HoldenSupport 33 | HouseofHelpers 34 | HydeHousing 35 | IKEACASupport 36 | IrishWater 37 | JeepCareUK 38 | KFCAustralia 39 | KarmaloopHelp 40 | LGUS 41 | LMNmakeup 42 | LeonelDorisca 43 | Lexus 44 | LogMeIn 45 | LoveLullaBellz 46 | LushWeCare 47 | MIsabelgarcia65 48 | MOO 49 | MatalanHelp 50 | MercedesUKTeam 51 | Moto_Support 52 | MyWakefield 53 | NJTRANSIT 54 | NetflixNordic 55 | OneDayOnlycoza 56 | PLDT_Cares 57 | PNCBank 58 | PayUmoney 59 | PitneyBowes 60 | SeattlesBest 61 | Shayna 62 | SubaruCustCare 63 | SupergaUK 64 | TOMSsupport 65 | TTChelps 66 | TelephoneDoctor 67 | TheQAssist 68 | TicketmasterIre 69 | UKFast 70 | UKNikon 71 | UPMCHealthPlan 72 | UlsterBank 73 | UniqloUSA 74 | UrbanWaxx 75 | VodacomSoccer 76 | WHGSupport 77 | WW_Canada 78 | WarbsNewburnBH 79 | blockbuster 80 | daysinn 81 | enbridgegas 82 | keybank 83 | narrativecare 84 | ntelcare 85 | ntelng 86 | pdbutterworth 87 | poweradeau 88 | redboxcare 89 | reef84 90 | sarahlennon08 91 | service 92 | servicenow 93 | socalgas 94 | sprintcare 95 | starbuckshelp 96 | superdrug 97 | vapouriz 98 | vauxhall 99 | vmbusiness 100 | zumiezhelp 101 | -------------------------------------------------------------------------------- /tasks/twitter/official_account_names_train.txt: -------------------------------------------------------------------------------- 1 | 1800PetMeds 2 | 1800flowershelp 3 | 1DFAQ 4 | 64audio 5 | 888sport 6 | ABCustomerCare 7 | ABTAtravel 8 | AFC_Amy 9 | AFCustomerCare 10 | APCcustserv 11 | ASHelpMe 12 | ASOS 13 | ASOS_Au 14 | ASOS_Us 15 | ASUHelpCenter 16 | ASUSUSA 17 | AVIVAIRELAND 18 | AbercrombieHelp 19 | AddisonLeeCabs 20 | AdobeCare 21 | AflacPhyllis 22 | AirAsia 23 | AirChinaNA 24 | AirtelNigeria 25 | Alamo 26 | AlaskaAir 27 | Albertsons 28 | AlderTraining 29 | AldiUK 30 | Aldi_Ireland 31 | AlfaRomeoCareUK 32 | AlfaRomeoUSA 33 | AliExpress_EN 34 | AlissaDosSantos 35 | AllClearID 36 | Allegiant 37 | AllianzTravelUS 38 | Allstate 39 | AllyCare 40 | AlshayaHelpDesk 41 | AmanaCare 42 | AmazonHelp 43 | AmazonVideoUK 44 | AmbitiousPrints 45 | AmericanAir 46 | AmexAU 47 | AndreaSWilson 48 | AnglianHelp 49 | AnkiSupport 50 | AnthemBC_News 51 | AntwanBarnes42 52 | AppleSupport 53 | AppleyardLondon 54 | ArqivaWiFi 55 | AsdaCustCare 56 | Ask123ie 57 | AskAmex 58 | AskAmexAU 59 | AskAnnaNouyou 60 | 
AskCiti 61 | AskDyson 62 | AskEmblemHealth 63 | AskExplorer 64 | AskKBCIreland 65 | AskLandsEnd 66 | AskMarstons 67 | AskMeHelpDesk 68 | AskPS_UK 69 | AskPaddyPower 70 | AskSmythsToys 71 | AskSubaruCanada 72 | AskTSA 73 | AskTarget 74 | AskTeamUA 75 | Ask_Spectrum 76 | AskeBay 77 | Askvodafonegh 78 | AsmodeeNA 79 | Atomos_News 80 | AudiUK 81 | AudiUKCare 82 | AutodeskHelp 83 | Ayres_Hotels 84 | BCBSLA 85 | BCC_Help 86 | BCSSupport 87 | BJsWholesale 88 | BN_care 89 | BOOKSetc_online 90 | BP_UK 91 | BP_plc 92 | BRCustServ 93 | BSSHelpDesk 94 | BabiesRUs 95 | BabyJogger 96 | Bajaj_Finance 97 | BananaRepublic 98 | BandQ 99 | BarclaycardNews 100 | BarclaysUK 101 | BeautyBarDotCom 102 | Beaverbrooks 103 | BedBathBeyond 104 | BenefitUK 105 | BernardBoutique 106 | Bernstein 107 | BiffaService 108 | BikeMetro 109 | BitcoinHelpDesk 110 | BlackBerryHelp 111 | Blanchford 112 | Blendtec 113 | BluestarHQ 114 | Boars_Head 115 | BoostCare 116 | BootsUK 117 | BordGaisEnergy 118 | BoseService 119 | BounceEnergy 120 | BritishGas 121 | BritishGasNews 122 | Budget 123 | Buick 124 | BuildaHelpDesk 125 | BupaUK 126 | BurberryService 127 | BurgerKing 128 | Burton_Menswear 129 | CCP_Help 130 | CDCustService 131 | COTABus 132 | CRCustomerChat 133 | CRTContactUs 134 | CTS___ 135 | CableBill 136 | CadburyUK 137 | CalMac_Updates 138 | CallawayGolfCS 139 | CanadianPacific 140 | CapMetroATX 141 | CareShopBunzl 142 | CareerBuilder 143 | CaroKopp 144 | Cartii 145 | CellC_Support 146 | CheapTixHearsU 147 | ChiccoUSA 148 | Chomperbee 149 | ChrissyChong 150 | ChryslerCares 151 | Cignaquestions 152 | CiteThisForMe 153 | CitiBikeNYC 154 | Citibank 155 | CitySprint_help 156 | ClaireLBSmith 157 | ClairesEurope 158 | CoastCath 159 | ComcastILLINOIS 160 | ComcastOrSWWa 161 | CoopEnergy 162 | Costco 163 | CoxHelp 164 | DCBService 165 | DEWALT_UK 166 | DEWALTtough 167 | DFSCare 168 | DIBsupport 169 | DICKS 170 | DIGICELJamaica 171 | DIRECTV 172 | DK_Assist 173 | DNBcares 174 | DPDCustomerCare 175 | DSGSupport 176 | DStvNg 177 | DTSatWIT 178 | DarielleRose 179 | DarkBunnyTees 180 | DawnCarillion 181 | Deliveroo 182 | Deliveroo_IE 183 | DevonCC 184 | DianaHSmith 185 | DigitalRiverInc 186 | Dillards 187 | Discover 188 | Discovery_SA 189 | DivvyBikes 190 | DocumentDB 191 | DoorDash 192 | Dreams_Beds 193 | DressCircleShop 194 | DrinkSparkletts 195 | DunelmUK 196 | DustoMan 197 | ENMAX 198 | EPCOR 199 | EYellin 200 | EarthLink 201 | EatNaturalBars 202 | EllenKeeble 203 | EmbracePetIns 204 | EmersonIT 205 | EntergyArk 206 | Enterprise 207 | EvansCycles 208 | EversourceCT 209 | EviteSupport 210 | Expedia 211 | ExpressHelp 212 | FFGames 213 | FH_CustomerCare 214 | FLBlueCares 215 | FLLFlyer 216 | FTcare 217 | FUT_COINSTORE 218 | FabFitFunCS 219 | FabSupportTeam 220 | FamilyBankKenya 221 | Fanatics 222 | Fandom_Insider 223 | FantasticCleanL 224 | FedExCanadaHelp 225 | FedExEurope 226 | FedExHelpEU 227 | FeelGoodPark 228 | FiatCareUK 229 | FiguresToyCo 230 | FinnairHelps 231 | Fly_Norwegian 232 | Fon 233 | FonCare 234 | Footasylum 235 | Ford 236 | Forever21Help 237 | FortisBC 238 | FrankEliason 239 | FreeviewTV 240 | FromYouFlowers 241 | FrontierCare 242 | FunjetVacations 243 | FunkoDCLegion 244 | GEICO_Service 245 | GETMEIN 246 | GM 247 | GRT_ROW 248 | GSMA_Care 249 | Gap 250 | GarudaCares 251 | GenesisHousing 252 | GeoffRamm 253 | GeorgiaPower 254 | GeoxCares 255 | GlideUK 256 | GloCare 257 | GlossyboxUK 258 | GoIdahHelpdesk 259 | GoTriangle 260 | Go_CheshireWest 261 | Go_Wireless 262 | GongshowService 263 | Google 264 | GrandHyattSD 265 | 
GreatClipsCares 266 | Groove 267 | Grubhub_Care 268 | Gymshark 269 | Gymshark_Help 270 | HEB 271 | HMSHost 272 | HQhair_Help 273 | HRBlockAnswers 274 | HSBC_Sport 275 | HSBC_UAE 276 | HSBC_UK 277 | HSBC_US 278 | HSScustomercare 279 | HalfordsCycling 280 | HarrisTeeter 281 | HarrodsService 282 | HarryandDavid 283 | HawaiianAir 284 | HeatherJStrout 285 | HelloKit 286 | HilltopNorfolk 287 | HollisterCoHelp 288 | HomeDepotCanada 289 | HomesenseUK 290 | Honda 291 | HondaCustSvc 292 | HondaPowersprts 293 | Hootsuite_Help 294 | Hotwire 295 | Huawei 296 | HuaweiMobile 297 | HudsonshoesUK 298 | HullUni_ICT 299 | HumanaHelp 300 | HwnElectric 301 | HyattChicago 302 | HyattChurchill 303 | Hyken 304 | IGearBrand 305 | IKEAIESupport 306 | IKEAUKSupport 307 | IKEAUSAHelp 308 | INDOCHINO 309 | INDOT 310 | INFINITICare 311 | INFINITIUSA 312 | IcelandFoods 313 | InMotionCares 314 | Incite_Group 315 | IndyDPW 316 | IslandAirHawaii 317 | Iyengarish 318 | JDhelpteam 319 | JIRAServiceDesk 320 | JLcustserv 321 | JLove55 322 | JMstore 323 | JabraEurope 324 | Jabra_US 325 | JackWills 326 | JagexInfinity 327 | JamboPayCare 328 | Jamie1973 329 | JapanHelpDesk 330 | JasonWeigandt 331 | Jeep 332 | Jesshillcakes 333 | JetBlue 334 | JetHeads 335 | JetstarAirways 336 | Jetstar_NZ 337 | JimEllisAudi 338 | JimatPlanters 339 | JoinRelate 340 | JonesClocks 341 | JordanLeggett16 342 | Jostens 343 | KFC_UKI 344 | KFC_UKI_Help 345 | KLM_UK 346 | KakaoTalkPH 347 | KateNasser 348 | KauaiSA 349 | Kazel_Kimpo 350 | KenyaPower 351 | Kia 352 | KimiNozoGuy 353 | KingFlyKOTD 354 | KitbagCS 355 | KitchenAid_CAre 356 | KrogerSupport 357 | LPTechTeam 358 | LVcares 359 | LandRover_UK 360 | LavishAlice 361 | LeapCard 362 | LenovoANZ 363 | Level99Games 364 | LibertyHelpDesk 365 | LibertyMutual 366 | LidsAssist 367 | LinkedInHelp 368 | LiveChat 369 | LiveNationON 370 | LivePhishCS 371 | Ljbpieces 372 | Logitech_ANZ 373 | Lovehoney 374 | LowesCares 375 | LubeStop 376 | MACcosmetics 377 | MBTA_CR 378 | MCMComicCon 379 | METROHouAlerts 380 | MLBFanSupport 381 | MRPHelp 382 | MRPfashion 383 | MSE_Forum 384 | MTA 385 | MTGOTraders 386 | MTN180 387 | MUSupportDesk 388 | MVPHealthCare 389 | MWilbanks 390 | Macys 391 | MadameTussauds 392 | MallforAfrica 393 | MandT_Bank 394 | ManiereDeVoir 395 | MarcGoodmanBos 396 | MariNBCSD 397 | MarkSuppNet 398 | MasterofMaltCS 399 | Matalan 400 | MaxiCosiUK 401 | MaytagBrand 402 | MaytagCare 403 | Mazda_SA 404 | McDonalds 405 | Mcheza_Care 406 | Medela_US 407 | MetroBank_Help 408 | MetroTransitMN 409 | MicrosoftHelps 410 | Minted 411 | Misfit 412 | MissSelfridge 413 | Missytohbadt 414 | ModCloth_Cares 415 | MrCabinetCareOC 416 | MrsGammaLabs 417 | Musictoday 418 | MwaveAu 419 | MyDoncaster 420 | NISMO_USA 421 | NOOK_Care 422 | NOOK_Care_UK 423 | NYSC 424 | NYTCare 425 | Nathane_Jackson 426 | NationalPro 427 | NealTopf 428 | NetflixANZ 429 | NetflixUK 430 | Netflix_CA 431 | NetoneSupport 432 | NeweggService 433 | Nipponyasancom 434 | NissanUSA 435 | Nordstrom 436 | Norfolkholidays 437 | Nutrisystem 438 | OEcare 439 | OasisFashion 440 | OfficeCandyGals 441 | OfficeDivvy 442 | OldNavyCA 443 | One_Bobby_Moore 444 | Ooma 445 | OoredooCare 446 | OoredooQatar 447 | OpenMike_TV 448 | OptumRx 449 | OtelSupport 450 | OtterBox 451 | OtterBoxCS 452 | PASmithjr 453 | PBCares 454 | PCRichardandSon 455 | PECOconnect 456 | PIA_Cust_unCare 457 | PINGTourEurope 458 | PLC_Singapore 459 | PMInstitute 460 | PNCBank_Help 461 | PNMtalk 462 | POTUS_CustServ 463 | PaddyPowerShops 464 | PalaceMovies 465 | PanaService_UK 466 | Panago_Pizza 
467 | PanasonicUSA 468 | PapaJohns 469 | PaperMate 470 | ParkHyattChi 471 | PartyCity 472 | PatchworkUrchin 473 | PatriotsProShop 474 | PayPalInfoSec 475 | PaychexService 476 | Paytmcare 477 | PeabodyLDN 478 | PenskeCares 479 | PeteFyfe 480 | PeterPanBus 481 | PeterboroughCC 482 | PeugeotZA 483 | Photobox 484 | PioneerDJ 485 | PlayDoh 486 | PlayStationAU 487 | PlentyOfFish 488 | Pokemon 489 | Porsche 490 | Primark 491 | PrincesTrust 492 | ProFlowers 493 | ProtectionOne 494 | Publix 495 | PublixHelps 496 | PurolatorHelp 497 | QDStores 498 | Quinny_UK 499 | RACWA 500 | RAC_Care 501 | RBC 502 | RBKC_CS 503 | RBWM 504 | RCommCare 505 | RIPTA_RI 506 | RamCares 507 | RamseyCare 508 | Rand_Water 509 | RaneDJ 510 | RatedPeople 511 | Reachout_mcd 512 | ReebokUK 513 | Relish 514 | RideUTA 515 | RobSylvan 516 | RogersBiz 517 | RogersHelps 518 | Ronseal 519 | RoyalMail 520 | SACustCare 521 | SAS 522 | SBEGlobal 523 | SEPTA_SOCIAL 524 | SP_EnergyPeople 525 | STATravel_UK 526 | Safaricom_Care 527 | SageSupport 528 | SainsburysNews 529 | SaksService 530 | SallyBeautyUK 531 | SamTrans 532 | SamsungCareSA 533 | SaskTel 534 | Schnittgemuese 535 | ScholasticClub 536 | Scienceofsport 537 | ScottishPower 538 | Sears 539 | SearsHomeExpert 540 | SeattleSPU 541 | Selfridges 542 | ServiceDivaBren 543 | Service_Queen 544 | SharisBerries 545 | ShiseidoUSA 546 | ShoePalace 547 | ShopRiteStores 548 | ShopRuche 549 | Sifter 550 | SilkFred 551 | SimplyhealthUK 552 | SkyCinemaUK 553 | SkypeSupport 554 | Sleepys 555 | SleepysCare 556 | SneakerRefresh 557 | SnoPUD 558 | SoBiHamilton 559 | SocialCaddieGC 560 | Sofology 561 | SofologyHelp 562 | Sony 563 | SoundTransit 564 | SouthwestAir 565 | SparkNZ 566 | SpecializedCSUK 567 | Spencers_Retail 568 | Spinneys_Dubai 569 | SpiritAirlines 570 | Spokeo_Care 571 | SquareTrade 572 | StanChart 573 | StaplesUK 574 | StarHub 575 | Stixzadinia 576 | StockTwitsHelp 577 | StonecareMike 578 | StraubsMarkets 579 | Stylistpick 580 | SubaruCanada 581 | SubwayListens 582 | Suddenlink 583 | SuddenlinkHelp 584 | SunCares 585 | Suncorp 586 | SuperShuttle 587 | Superdry 588 | Superdry_Care 589 | SwingSetService 590 | TCoughlin 591 | TEAVANA 592 | TELUS 593 | TEPenergy 594 | TKMaxx_UK 595 | TK_HelpDesk 596 | TLGTourHelp 597 | TMLewin 598 | TMRQld 599 | TMobileHelp 600 | TNTUKCare 601 | TSA 602 | TTCsue 603 | TW2CayC 604 | TWC_Help 605 | TWimpeySupport 606 | TacoBellTeam 607 | TandemDiabetes 608 | Tartineaubeurre 609 | TeamSanshee 610 | TeamSantone 611 | TeamShieldBSC 612 | TeamStubHub 613 | TechUncensored 614 | TekCenter 615 | TelkomZA 616 | Tesco 617 | TessutiHelpTeam 618 | TextNowHelp 619 | TheBookPeople 620 | TheFrontDesk 621 | TheGymGroup 622 | TheLondonEye 623 | TheRAC_UK 624 | TheRTStore 625 | TheSharck 626 | ThomasCookCares 627 | ThomasCookUK 628 | ThomsonCares 629 | ThreadSence 630 | ThreeUK 631 | TiVo 632 | TiVoSupport 633 | TicketWeb 634 | Ticketmaster 635 | TicketmasterCA 636 | TomTom_SA 637 | Topheratl 638 | Topman 639 | TopshopHelp 640 | ToshibaUSAhelp 641 | TotalGymDirect 642 | ToyotaCustCare 643 | ToyotaRacing 644 | ToysRUs 645 | TradeMe 646 | TruGreen 647 | Trustpilot 648 | UCLMainLibrary 649 | UHaul_Cares 650 | UKEnterprise 651 | UKMUJI 652 | UKVolkswagen 653 | UNB_ITS 654 | UPS_UK 655 | USAA_help 656 | USPS 657 | USPSHelp 658 | UWDoIT 659 | UnityQAThomas 660 | UpDesk 661 | VIZIOsupport 662 | VMUcare 663 | VendHelp 664 | Venture_Cycles 665 | VerizonSupport 666 | VeryHelpers 667 | VirginAmerica 668 | VirginMediaIE 669 | Visit_Jax 670 | Vitality_UK 671 | VitamixUK 672 | Vodacom 673 
| Vodacom111 674 | VodacomRugga 675 | VodafoneAU_Help 676 | WAGSocialCare 677 | WB_Games_UK 678 | WHSmith 679 | WMSpillett 680 | WOW_WAY 681 | WReynoldsYoung 682 | Walgreens 683 | WarThunder 684 | WatchShop 685 | WeChatZA 686 | WellsFargo 687 | WesBooth 688 | WhirlpoolCare 689 | Whirlpool_CAre 690 | WholeFoods 691 | WilsonGolf 692 | WingateHotels 693 | Wizards_Help 694 | Wkc10 695 | XOCare 696 | YahooCare 697 | ZAGGdaily 698 | ZARA 699 | ZARA_Care 700 | Zalebs 701 | Zapatosdesigner 702 | ZapposLuxury 703 | ZeekMarketplace 704 | Zipbuds 705 | ZoomSphere 706 | _MBService 707 | _rightio 708 | _valpow_ 709 | abbeygroup 710 | abbylynne19 711 | abellio_surrey 712 | acemtp 713 | ack 714 | acmemarkets 715 | acnestudios 716 | adamholz 717 | adidasGhana 718 | adidasUK 719 | adrianflux 720 | adrianswinscoe 721 | advocare 722 | airnzuk 723 | airtel_care 724 | alabamapower 725 | alamocares 726 | alexanderlewis 727 | alexbarinka 728 | alpharooms 729 | alwaysriding 730 | alwayswithyoumw 731 | americangiant 732 | andertonsmusic 733 | andrew_heister 734 | annabelkarmel 735 | askMsQ 736 | ask_progressive 737 | askpermanenttsb 738 | asksurfline 739 | askvisa 740 | astros 741 | atomtickets 742 | audibleuk 743 | audiireland 744 | beautybay 745 | belk 746 | bestdealtv 747 | bexdeep 748 | beyondthedesk 749 | bigmikeyvegas 750 | blackanddecker 751 | bobbyrayburns 752 | boohoo_cshelp 753 | bookdepository 754 | booksamillion 755 | bravern 756 | brooksrunning 757 | builds_io 758 | cableONE 759 | cafepress 760 | calgary 761 | callummay 762 | calorireland 763 | cam4_gay 764 | cars_portsmouth 765 | casualclassics 766 | cesarkeller 767 | cheftyler 768 | chevrolet 769 | chokemetoo 770 | chrisfenech1 771 | clarkshelp 772 | comcastcares 773 | comparemkt_care 774 | comparethemkt 775 | craftsman 776 | cstmrsvc 777 | ctcustomercare 778 | cvspharmacy 779 | danaddicott 780 | danandphilshop 781 | dancathy 782 | debthelpdesk 783 | deliverydotcom 784 | devinfinlay 785 | dgingiss 786 | dimonet 787 | directvnow 788 | dj0s0 789 | djccwl 790 | dnataSupport 791 | dodgecares 792 | dongmsd 793 | doxiecare 794 | durafloruk 795 | eBay_UK 796 | easons 797 | easternbank 798 | econet_support 799 | edfenergy 800 | edfenergycomms 801 | eflow_freeflow 802 | eh_custcare 803 | eirNews 804 | elfcosmetics 805 | emikathure 806 | enterprisecares 807 | epicgeargaming 808 | epointsjordan 809 | esetcares 810 | espnplayer 811 | etisalat 812 | eventbritehelp 813 | express 814 | farfetch 815 | faux_punk 816 | fdarenahelp 817 | fiatcares 818 | firstdirect 819 | fisherpaykelaus 820 | flySAA_US 821 | fontspring 822 | forduk 823 | foxrentcar 824 | freshlypicked 825 | fryselectronics 826 | gadventures 827 | gamoid 828 | gardenknowhow 829 | garyteamasics 830 | geniusfoods 831 | getsatisfaction 832 | gigaclear 833 | glasses_direct 834 | gogreenride 835 | googlemaps 836 | graveshambc 837 | handtec 838 | hbonow 839 | helpscout 840 | hhavrilesky 841 | hm 842 | hm_custserv 843 | hmaustralia 844 | hmcanada 845 | hmsouthafrica 846 | holden_aus 847 | holidayautos 848 | holidaytaxisCS 849 | houseoffraser 850 | hsamueljeweller 851 | hsr 852 | i_hate_ironing 853 | iiNet 854 | instituteofcs 855 | inthestyleUK 856 | ironicaccount 857 | itsemiel 858 | jAzzyF_BaBy 859 | jabrasport 860 | jackd 861 | jackiecas1 862 | jasoneden 863 | jaybaer 864 | jazzpk 865 | jcpenney 866 | jct600 867 | jeffreyboutique 868 | jessops 869 | jmspool 870 | joepindar 871 | joythestore 872 | jscheel 873 | junebugbatticus 874 | k9cuisine 875 | kellie_brooks 876 | ketpoole 877 | kevinGEEdavis 878 | 
kiddicare 879 | kimryan1974 880 | kisluvkis 881 | kongacare 882 | latterash 883 | leseadzimah 884 | lesleybarnes 885 | lesmicek 886 | lidl_ni 887 | lids 888 | lidscanada 889 | lootcrate 890 | lordandtaylor 891 | mamasandpapas 892 | mandkbutchers 893 | markmadison 894 | martindentist 895 | masnHelpDesk 896 | mattlatmatt 897 | mattr 898 | mdltech 899 | megabus 900 | melbournemuseum 901 | micahsolomon 902 | michaelshearer 903 | mitsucars 904 | mjanczewski 905 | monkiworld 906 | moreforURdollar 907 | mrpatto 908 | musicMagpie 909 | musicMagpieCS 910 | musicofbelle 911 | mycellcom 912 | nationalcares 913 | nationalgridus 914 | ncbja 915 | neimanmarcus 916 | netflix 917 | nextofficial 918 | nikkicarrtqs 919 | nmkrobinson 920 | nokiamobile 921 | notonthehighst 922 | notzakihas 923 | npowerhq 924 | nspowerinc 925 | nuerasupport 926 | nvidiacc 927 | nxcare 928 | officialpescm 929 | olympicfinishes 930 | oxygenfreejump 931 | parracity 932 | paulodetarso24 933 | petedenton 934 | pfgregg 935 | picturehouses 936 | plusnethelp 937 | pond5 938 | premestateswine 939 | princessdvonb 940 | qatarairways 941 | ramennoodlesMC 942 | redbox 943 | rhapsaadic 944 | rideact 945 | riteaid 946 | ritual_co 947 | riverisland 948 | rossrader 949 | rwhelp 950 | sageuk 951 | sainsburys 952 | samanthastarmer 953 | samsungmobileng 954 | schuh 955 | scienceworks_mv 956 | scotiabank 957 | scottish_water 958 | scottrade 959 | secreteyesEW 960 | seetickets 961 | shaws 962 | shopmissa 963 | simplybusiness 964 | sizehelpteam 965 | sleepnumber 966 | smtickets 967 | sobeys 968 | spacecojo 969 | sportingindex 970 | spring 971 | sproutsfm 972 | sseairtricity 973 | statravelAU 974 | subaru_usa 975 | sunglassesshop 976 | supermartie 977 | swannsecurity 978 | swoonstars 979 | tacey 980 | taportugal 981 | teeofftimes 982 | tescomobile 983 | tescomobilecare 984 | tescomobileire 985 | tfbalerts 986 | theMasterLink 987 | thebitchdesk 988 | thebondteam 989 | thekirbycompany 990 | thenutribullet 991 | thinkmedialabs 992 | thomaswales 993 | tineywristwatch 994 | tjmaxx 995 | toister 996 | tonydataman 997 | townshoes 998 | traceychurray 999 | trafficscotland 1000 | travelocity 1001 | trimet 1002 | twinklresources 1003 | uga_eits 1004 | united 1005 | vaillantuk 1006 | verabradleycare 1007 | verygoodservice 1008 | vigglesupport 1009 | vmbusinesshelp 1010 | vtaservice 1011 | vueling 1012 | vwcares 1013 | w1zz 1014 | wahoofitness 1015 | warjohi 1016 | we_energies 1017 | whirlpoolusa 1018 | wimrampen 1019 | withazed 1020 | wow_air 1021 | wowairsupport 1022 | wtzgoodPHL 1023 | yoox 1024 | zaggcare 1025 | -------------------------------------------------------------------------------- /tasks/twitter/official_testset_ids+refs.json.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling/f142c3025e03adef99c0d97be663f0799adf5524/tasks/twitter/official_testset_ids+refs.json.gz -------------------------------------------------------------------------------- /tasks/twitter/official_testset_ids.json.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling/f142c3025e03adef99c0d97be663f0799adf5524/tasks/twitter/official_testset_ids.json.gz -------------------------------------------------------------------------------- /tasks/twitter/utils: -------------------------------------------------------------------------------- 1 | ../utils 
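The two gzipped ID lists above carry only tweet IDs; the tweet text itself must come from your own crawl under `official_stored_data`. As a rough sketch of the structure that `extract_official_twitter_testset.py` reads (the exact field set of the distributed files may differ, and the tweet IDs below are invented for illustration), the decompressed JSON is a list of dialogs, each dialog a list of turns holding a speaker role and the IDs of the tweets that make up that turn:

```python
import json

# Hypothetical miniature ID list (invented IDs), mirroring the structure the
# extractor iterates over: dialogs -> turns -> tweet IDs.
id_list = json.loads("""
[
  [
    {"speaker": "U", "ids": [861733678923001856]},
    {"speaker": "S", "ids": [861733812301332480, 861733905815911424]},
    {"speaker": "U", "ids": []}
  ]
]
""")

# Collect every referenced tweet ID, as the extractor does before scanning the
# stored JSON files; a final turn with no IDs is printed as __UNDISCLOSED__.
id_pool = set()
for dialog in id_list:
    for turn in dialog:
        id_pool |= set(turn['ids'])
print(sorted(id_pool))
```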
-------------------------------------------------------------------------------- /tasks/utils/bleu_score.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """BLEU score for dialog system output 4 | 5 | Copyright (c) 2017 Takaaki Hori (thori@merl.com) 6 | 7 | This software is released under the MIT License. 8 | http://opensource.org/licenses/mit-license.php 9 | 10 | """ 11 | 12 | import sys 13 | from nltk.translate.bleu_score import corpus_bleu 14 | 15 | refs = [] 16 | hyps = [] 17 | for utt in open(sys.argv[1],'r').readlines(): 18 | if utt.startswith('S_REF:'): 19 | ref = utt.replace('S_REF:','').split() 20 | refs.append([ref]) 21 | 22 | if utt.startswith('S_HYP:'): 23 | hyp = utt.replace('S_HYP:','').split() 24 | hyps.append(hyp) 25 | 26 | # obtain BLEU1-4 27 | print("--------------------------------") 28 | print("Evaluated file: " + sys.argv[1]) 29 | print("Number of references: %d" % len(refs)) 30 | print("Number of hypotheses: %d" % len(hyps)) 31 | print("--------------------------------") 32 | if len(refs) > 0 and len(hyps) > 0 and len(refs)==len(hyps): 33 | # uniform n-gram weights give BLEU-n for n = 1..4 34 | for n in [1,2,3,4]: 35 | weights = [1./n] * n 36 | print('Bleu%d: %f' % (n, corpus_bleu(refs,hyps,weights=weights))) 37 | else: 38 | print("Error: the numbers of references and hypotheses do not match.") 39 | print("--------------------------------") 40 | 41 | -------------------------------------------------------------------------------- /tasks/utils/check_dialogs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Check sizes of extracted dialog data 3 | 4 | Copyright (c) 2017 Takaaki Hori (thori@merl.com) 5 | 6 | This software is released under the MIT License.
7 | http://opensource.org/licenses/mit-license.php 8 | 9 | """ 10 | 11 | import sys 12 | 13 | class Dataset: 14 | def __init__(self, name, filename, n_dialogs, n_utters, n_words): 15 | self.name = name 16 | self.filename = filename 17 | self.n_dialogs = n_dialogs 18 | self.n_utters = n_utters 19 | self.n_words = n_words 20 | 21 | def check(self, n_dialogs, n_utters, n_words): 22 | print('--- %s set ---' % self.name) 23 | print('n_dialogs: %d (%.2f %% difference from reference: %d)' % (n_dialogs, float(abs(self.n_dialogs - n_dialogs))/n_dialogs*100, self.n_dialogs)) 24 | print('n_utters: %d (%.2f %% difference from reference: %d)' % (n_utters, float(abs(self.n_utters - n_utters))/n_utters*100, self.n_utters)) 25 | print('n_words: %d (%.2f %% difference from reference: %d)' % (n_words, float(abs(self.n_words - n_words))/n_words*100, self.n_words)) 26 | 27 | 28 | if __name__ == "__main__": 29 | # set data info (target sizes for the official data sets) 30 | train_set = Dataset('train', sys.argv[1], 888201, 2157389, 40073702) 31 | dev_set = Dataset('dev', sys.argv[2], 107506, 262228, 4900743) 32 | 33 | for data in [train_set, dev_set]: 34 | n_utters = 0 35 | n_dialogs = 0 36 | n_words = 0 37 | for ln in open(data.filename,'r').readlines(): 38 | tokens = ln.split() 39 | if len(tokens) > 0: 40 | n_words += len(tokens)-1 41 | n_utters += 1 42 | else: 43 | n_dialogs += 1 44 | 45 | data.check(n_dialogs, n_utters, n_words) 46 | print("----------------------------------------------------") 47 | print("If the data size differences are greater than 1%, ") 48 | print("please contact the track organizers (chori@merl.com, thori@merl.com)") 49 | print("with the above information.") 50 | print("----------------------------------------------------") 51 | -------------------------------------------------------------------------------- /tasks/utils/sample_dialogs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import sys 5 | import random 6 | import six 7 | import copy 8 | 9 | dialogs = [] 10 | dialog = [] 11 | for ln in open(sys.argv[1], 'r').readlines(): 12 | ln = ln.strip() 13 | if ln != '': 14 | dialog.append(ln) 15 | # partial dialogs are also included 16 | if len(dialog)>=2 and ln.startswith('S:'): 17 | dialogs.append(copy.copy(dialog)) 18 | else: 19 | dialog = [] 20 | 21 | random.seed(99) 22 | # the 2nd argument is necessary for compatibility between python 2.7 and 3.2+ 23 | random.shuffle(dialogs, random.random) 24 | n_samples = int(sys.argv[2]) 25 | 26 | for i in six.moves.range(n_samples): 27 | for d in dialogs[i]: 28 | print(d) 29 | print('') 30 | 31 | --------------------------------------------------------------------------------
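As a quick sanity check of the scoring pipeline, the snippet below feeds `S_REF:`/`S_HYP:` lines in the format bleu_score.py expects through the same `corpus_bleu` call. The two sentences are invented for illustration; real evaluation files are produced by the evaluation scripts under `ChatbotBaseline/tools`.

```python
from nltk.translate.bleu_score import corpus_bleu

# Invented example lines in the S_REF/S_HYP format read by bleu_score.py.
lines = [
    "S_REF: you 're welcome . have a great day .",
    "S_HYP: you 're welcome ! have a good day .",
]
refs, hyps = [], []
for utt in lines:
    if utt.startswith('S_REF:'):
        refs.append([utt.replace('S_REF:', '').split()])  # one reference per hypothesis
    elif utt.startswith('S_HYP:'):
        hyps.append(utt.replace('S_HYP:', '').split())

# BLEU-1 through BLEU-4 with uniform n-gram weights, as in bleu_score.py.
for n in [1, 2, 3, 4]:
    print('Bleu%d: %f' % (n, corpus_bleu(refs, hyps, weights=[1. / n] * n)))
```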