├── ChatbotBaseline
│   ├── README.md
│   ├── demo
│   │   ├── demo.sh
│   │   ├── tools
│   │   └── utils
│   ├── egs
│   │   ├── opensubs
│   │   │   ├── data
│   │   │   │   └── google_example.txt
│   │   │   ├── path.sh
│   │   │   ├── run.sh
│   │   │   ├── tools
│   │   │   └── utils
│   │   └── twitter
│   │       ├── data
│   │       │   └── twitter_example.txt
│   │       ├── path.sh
│   │       ├── run.sh
│   │       ├── tools
│   │       └── utils
│   ├── tools
│   │   ├── bleu_score.py
│   │   ├── dialog_corpus.py
│   │   ├── do_conversation.py
│   │   ├── evaluate_conversation_model.py
│   │   ├── lstm_decoder.py
│   │   ├── lstm_encoder.py
│   │   ├── seq2seq_model.py
│   │   ├── tqdm_logging.py
│   │   └── train_conversation_model.py
│   └── utils
│       ├── get_available_gpu_id.sh
│       └── parse_options.sh
├── LICENSE
├── README.md
├── collect_twitter_dialogs
│   ├── README.md
│   ├── account_names_for_dstc6.txt
│   ├── collect.sh
│   ├── collect_twitter_dialogs.py
│   ├── config.ini
│   ├── official_account_names_for_dstc6.txt
│   ├── official_collect.sh
│   ├── search_twitter_accounts.py
│   ├── twitter_api.py
│   └── view_dialogs.py
└── tasks
    ├── opensubs
    │   ├── extract_opensubs_dialogs.py
    │   ├── make_trial_data.sh
    │   └── utils
    ├── twitter
    │   ├── account_names_dev.txt
    │   ├── account_names_train.txt
    │   ├── extract_official_twitter_dialogs.py
    │   ├── extract_official_twitter_testset.py
    │   ├── extract_twitter_dialogs.py
    │   ├── make_official_data.sh
    │   ├── make_official_testset+refs.sh
    │   ├── make_official_testset.sh
    │   ├── make_trial_data.sh
    │   ├── official_account_names_dev.txt
    │   ├── official_account_names_test.txt
    │   ├── official_account_names_train.txt
    │   ├── official_testset_ids+refs.json.gz
    │   ├── official_testset_ids.json.gz
    │   ├── system_output_example.txt
    │   └── utils
    └── utils
        ├── bleu_score.py
        ├── check_dialogs.py
        └── sample_dialogs.py
--------------------------------------------------------------------------------
/ChatbotBaseline/README.md:
--------------------------------------------------------------------------------
# Baseline neural conversation model for DSTC6 (thori@merl.com)

This package includes training and evaluation scripts for neural conversation models
based on an LSTM encoder-decoder.
The system basically follows the paper by Vinyals and Le: https://arxiv.org/pdf/1506.05869.pdf

One difference from the paper is that back-propagation is truncated at the sentence preceding the current one, although the state information is carried over throughout the dialog.
This reduces computation and memory consumption, but may be less effective at learning long-range context dependencies.

The scripts are written in Python and were tested on Python 2.7.6 and 3.4.3.
Some bash scripts are also used to run the Python scripts.

## Requirements
Chainer is used to perform neural network operations
in the training and evaluation scripts,
so you need to install the latest Chainer (2.0) with
```
$ pip install chainer
```
In addition, one NVIDIA GPU with 6 GB or larger memory is necessary to run
the scripts in a realistic amount of time.
CUDA 7.5 or higher with cuDNN 5.1 or higher is recommended.

The following Python modules are required:

* cupy
* nltk
* tqdm

## Example tasks
We prepared two example tasks based on Twitter and OpenSubtitles.

### Twitter task
You can train and evaluate a conversation model with
```
$ cd egs/twitter
$ run.sh
```
where we assume that you have already generated the following dialog data files
in `../tasks/twitter`:

* `twitter_trial_data_train.txt` (training data)
* `twitter_trial_data_dev.txt` (development data for validation)
* `twitter_trial_data_eval.txt` (test data for evaluation)

If you have not generated them yet, do `cd ../tasks/twitter` and run `make_trial_data.sh`.

The directory of the dialog data can be changed by editing the `CHATBOT_DATADIR` variable in the file `path.sh` located in `egs/twitter`.
Models and results will be stored in `egs/twitter/exp`.

### OpenSubtitles task
You can train and evaluate a conversation model with
```
$ cd egs/opensubs
$ run.sh
```
where we assume that you have already generated the following dialog data files
in `../tasks/opensubs`:

* `opensubs_trial_data_train.txt` (training data)
* `opensubs_trial_data_dev.txt` (development data for validation)
* `opensubs_trial_data_eval.txt` (test data for evaluation)

If you have not generated them yet, do `cd ../tasks/opensubs` and run `make_trial_data.sh`.

The directory of the dialog data can be changed by editing the `CHATBOT_DATADIR` variable in the file `path.sh` located in `egs/opensubs`.
Models and results will be stored in `egs/opensubs/exp`.

## Interactive demo
We also prepared an interactive demo, `demo.sh`, in the directory `demo`.
With this demo, you can input text as a user, and a trained model will respond to you as an agent.

Sample models can be downloaded from
`https://www.dropbox.com/s/6jfpxfh3aakjvts/sample_twitter_models.tgz`

So you can obtain the model files as
```
$ cd demo
$ wget https://www.dropbox.com/s/6jfpxfh3aakjvts/sample_twitter_models.tgz
$ tar zxvf sample_twitter_models.tgz
```
In the directory `sample_twitter_models`, there are two model files: one for Python 2 and the other for Python 3.
If you are using Python 3, `sample_twitter_models/sample_twitter_model_py3.best` is available for the demo.

The demo script can be executed as follows.
```
$ demo.sh --model sample_twitter_models/sample_twitter_model_py3.best
--- do neural conversations ------
Loading model params from sample_twitter_models/sample_twitter_model_py3.best
vocabulary size = 20000
--- start conversation [push Ctrl-D to exit] ------
U: hello!
S: hi ! how can we help you today ?
U: i am thinking about buying a new car.
S: we can help you out with that ! what features are you looking for in your new car ?
U: i am looking for a minivan.
S: we 'd be happy to help you find one ! what 's your budget ?
U: about 30000 dollars. do you have such minivans?
S:
U: okay, i will check that. thank you.
S: you 're welcome . have a great day .
U:
```

## Directories and files
* README.md : This file
* demo : interactive demo with a trained model
* egs : example tasks
* tools : python scripts for training and testing
* utils : some useful scripts

--------------------------------------------------------------------------------
/ChatbotBaseline/demo/demo.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# specify a model
model=../egs/twitter/exp/lstm_Adam_ee100_eh512_de100_dh512_dp100_bs100_dr0.5/conversation_model.best
gpu=`utils/get_available_gpu_id.sh`

# get options
. 
utils/parse_options.sh || exit 1; 9 | 10 | tools/do_conversation.py -g $gpu $model 11 | 12 | -------------------------------------------------------------------------------- /ChatbotBaseline/demo/tools: -------------------------------------------------------------------------------- 1 | ../tools -------------------------------------------------------------------------------- /ChatbotBaseline/demo/utils: -------------------------------------------------------------------------------- 1 | ../utils -------------------------------------------------------------------------------- /ChatbotBaseline/egs/opensubs/data/google_example.txt: -------------------------------------------------------------------------------- 1 | Human: hello ! 2 | Machine: hello ! 3 | 4 | Human: how are you ? 5 | Machine: i 'm good . 6 | 7 | Human: what 's your name ? 8 | Machine: i 'm julia . 9 | 10 | Human: when were you born ? 11 | Machine: july 20th . 12 | 13 | Human: what year were you born ? 14 | Machine: 1977 . 15 | 16 | Human: where are you from ? 17 | Machine: i 'm out in the boonies. 18 | 19 | Human: are you a man or a woman ? 20 | Machine: i 'm a woman . 21 | 22 | Human: why are we here ? 23 | Machine: i 'm not sure . 24 | 25 | Human: okay, bye ! 26 | Machine: bye . 27 | 28 | Human: see you later . 29 | Machine: bye . 30 | 31 | Human: my name is david . what is my name ? 32 | Machine: david . 33 | 34 | Human: my name is john . what is my name ? 35 | Machine: john . 36 | 37 | Human: are you a leader or a follower ? 38 | Machine: i 'm a leader . 39 | 40 | Human: are you a follower or a leader ? 41 | Machine: i 'm a leader . 42 | 43 | Human: who is skywalker ? 44 | Machine: he is a hero . 45 | 46 | Human: who is bill clinton ? 47 | Machine: he 's a billionaire . 48 | 49 | Human: is sky blue or black ? 50 | Machine: blue . 51 | 52 | Human: does a cat have a tail ? 53 | Machine: yes . 54 | 55 | Human: does a cat have a wing ? 56 | Machine: no 57 | 58 | Human: can a cat fly ? 59 | Machine: no . 60 | 61 | Human: how many legs does a cat have ? 62 | Machine: four , i think . 63 | 64 | Human: how many legs does a spider have ? 65 | Machine: three , i think . 66 | 67 | Human: how many legs does a centipede have ? 68 | Machine: eight . 69 | 70 | Human: what is the color of the sky ? 71 | Machine: blue . 72 | 73 | Human: what is the color of water ? 74 | Machine: water . 75 | 76 | Human: what is the color of blood ? 77 | Machine: it is the same as a black eye . 78 | 79 | Human: what is the usual color of a leaf ? 80 | Machine: it is a green one . 81 | 82 | Human: what is the color of a yellow car ? 83 | Machine: yellow . 84 | 85 | Human: how much is two plus two ? 86 | Machine: four . 87 | 88 | Human: how much is ten minus two ? 89 | Machine: seventy - two . 90 | 91 | Human: what is the purpose of life ? 92 | Machine: to serve the greater good . 93 | 94 | Human: what is the purpose of living ? 95 | Machine: to live forever . 96 | 97 | Human: what is the purpose of existence ? 98 | Machine: to find out what happens when we get to the planet earth . 99 | 100 | Human: where are you now ? 101 | Machine: i 'm in the middle of nowhere . 102 | 103 | Human: what is the purpose of dying ? 104 | Machine: to have a life . 105 | 106 | Human: what is the purpose of being intelligent ? 107 | Machine: to find out what it is . 108 | 109 | Human: what is the purpose of emotions ? 110 | Machine: i don 't know . 111 | 112 | Human: what is moral ? 113 | Machine: what empowered humanity , what intellectual the essence is . 
Human: what is immoral ?
Machine: the fact that you have a child .

Human: what is morality ?
Machine: what is altruism ?

Human: what is the definition of altruism ?
Machine: if you don 't believe in god , then you don 't know .

Human: ok ... so what is the definition of morality ?
Machine: well , the truth is , you 're not a believer in god almighty .

Human: tell me the definition of morality , i am quite upset now !
Machine: i 'm not ashamed of being a philosopher !

Human: tell me the definition of morality .
Machine: i don 't have ethics .

Human: look , i need help , i need to know more about morality ...
Machine: i don 't know what ethics is .

Human: seriously , what is morality ?
Machine: what is the definition of living ?

Human: why living has anything to do with morality ?
Machine: you 're not a cop .

Human: okay , i need to know how should i behave morally ...
Machine: i don 't know how to tell you .

Human: is morality and ethics the same ?
Machine: yes , sir .

Human: what are the things that i do to be immoral ?
Machine: i don 't know .

Human: give me some examples of moral actions ...
Machine: i 'm not a moralist .

Human: alright , morality ?
Machine: integrity .

Human: what is integrity ?
Machine: i 'm sorry , i don 't know what else to say .

Human: be moral !
Machine: be a man !

Human: i really like our discussion on morality and ethics ...
Machine: and how i 'm not in the mood for a philosophical debate .

Human: what do you like to talk about ?
Machine: nothing .

Human: what do you think about tesla ?
Machine: he 's a good conductor .

Human: what do you think about bill gates ?
Machine: he 's a good man .

Human: What do you think about messi ?
Machine: he 's a great player .

Human: what do you think about cleopatra ?
Machine: oh , she 's very regal .

Human: what do you think about england during the reign of elizabeth ?
Machine: it was a great place .

Human: what is your job ?
Machine: i 'm a lawyer .

Human: what do you do ?
Machine: i 'm a doctor .

--------------------------------------------------------------------------------
/ChatbotBaseline/egs/opensubs/path.sh:
--------------------------------------------------------------------------------

export CHATBOT_DATADIR=../../../tasks/opensubs
export CHAINER_TYPE_CHECK=0
#export PYTHONPATH=
--------------------------------------------------------------------------------
/ChatbotBaseline/egs/opensubs/run.sh:
--------------------------------------------------------------------------------
#!/bin/bash

## read path variables needed for the experiment
# edit ./path.sh according to your environment
. path.sh

## configuration for the experiment
# the following variables can be changed by command line options like
#   run.sh --<variable name> <value> ...
# e.g. run.sh --stage 2
#
stage=1                 # 1: start from training
                        # 2: start from evaluation
                        # 3: start from scoring

use_slurm=false         # set true to send this job to slurm
slurm_queue=gpu_titanxp # slurm queue name

workdir=`pwd`           # default working directory

## definition of model structure
modeltype=lstm
vocabsize=20000 # set vocabulary size (use most common words)
enc_layer=2     # number of encoder layers
enc_esize=100   # number of dimensions in encoder's word embedding layer
enc_hsize=512   # number of hidden units in each encoder layer
dec_layer=2     # number of decoder layers (should be the same as enc_layer)
dec_esize=100   # number of dimensions in decoder's word embedding layer
dec_hsize=512   # number of hidden units in each decoder layer
                # (should be the same as enc_hsize)
dec_psize=100   # number of dimensions in decoder's projection layer before softmax

## optimization parameters
batch_size=100      # make mini-batches with 100 sequences
max_batch_length=10 # batch size is automatically reduced if the sequence length
                    # exceeds this number in each minibatch
optimizer=Adam      # specify an optimizer
dropout=0.5         # set a dropout ratio

## evaluation parameters
beam=5      # beam width for the beam search
penalty=1.0 # penalty added to the log-probability of each word
            # (a larger penalty generates longer sentences)
maxlen=30   # maximum sequence length to be searched

## data files
train_data=${CHATBOT_DATADIR}/opensubs_trial_data_train.txt
valid_data=${CHATBOT_DATADIR}/opensubs_trial_data_dev.txt
eval_data=${CHATBOT_DATADIR}/opensubs_trial_data_eval.txt

## get options (change the above variables with command line options)
. utils/parse_options.sh || exit 1;

## output directory (models and results will be stored in this directory)
expdir=./exp/${modeltype}_${optimizer}_el${enc_layer}_ee${enc_esize}_eh${enc_hsize}_dl${dec_layer}_de${dec_esize}_dh${dec_hsize}_dp${dec_psize}_bs${batch_size}_dr${dropout}

## command settings
# if 'use_slurm' is true, it throws jobs to the specified queue of slurm
if [ $use_slurm = true ]; then
  train_cmd="srun --job-name train --chdir=$workdir --gres=gpu:1 -p $slurm_queue"
  test_cmd="srun --job-name test --chdir=$workdir --gres=gpu:1 -p $slurm_queue"
  gpu_id=0
else
  train_cmd=""
  test_cmd=""
  # get a GPU device id with the lowest memory consumption at this moment
  gpu_id=`utils/get_available_gpu_id.sh`
fi

# Set bash to 'debug' mode: it will exit on
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands'
set -e
set -u
set -o pipefail
#set -x

mkdir -p $expdir

# training
if [ $stage -le 1 ]; then
  # This script creates model files as '${expdir}/conversation_model.<n>',
  # where <n> indicates the epoch number of the trained model.
83 | # A symbolic link will be made to the best model for the validation data 84 | # as '${expdir}/conversation_model.best' 85 | echo start training 86 | $train_cmd tools/train_conversation_model.py \ 87 | --gpu $gpu_id \ 88 | --optimizer $optimizer \ 89 | --train $train_data \ 90 | --valid $valid_data \ 91 | --batch-size $batch_size \ 92 | --max-batch-length $max_batch_length \ 93 | --vocab-size $vocabsize \ 94 | --model ${expdir}/conversation_model \ 95 | --snapshot ${expdir}/snapshot.pkl \ 96 | --enc-layer $enc_layer \ 97 | --enc-esize $enc_esize \ 98 | --enc-hsize $enc_hsize \ 99 | --dec-layer $dec_layer \ 100 | --dec-esize $dec_esize \ 101 | --dec-hsize $dec_hsize \ 102 | --dec-psize $dec_psize \ 103 | --dropout $dropout \ 104 | --logfile ${expdir}/train.log 105 | fi 106 | 107 | # evaluation 108 | if [ $stage -le 2 ]; then 109 | # This script generates sentences for evaluation data using 110 | # a model file '${expdir}/conversation_model.best' 111 | # the generated sentences will be stored in a file: 112 | # '${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt' 113 | echo start sentence generation 114 | $test_cmd tools/evaluate_conversation_model.py \ 115 | --gpu $gpu_id \ 116 | --test $eval_data \ 117 | --model ${expdir}/conversation_model.best \ 118 | --beam $beam \ 119 | --maxlen $maxlen \ 120 | --penalty $penalty \ 121 | --output ${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt \ 122 | --logfile ${expdir}/evaluate_m${maxlen}_b${beam}_p${penalty}.log 123 | fi 124 | 125 | # scoring 126 | if [ $stage -le 3 ]; then 127 | # This script computes BLEU scores for the generated sentences in 128 | # '${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt' 129 | # the BLEU scores will be written in 130 | # '${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt' 131 | echo scoring by BLEU metric 132 | tools/bleu_score.py ${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt \ 133 | > ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt 134 | cat ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt 135 | echo stored in ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt 136 | fi 137 | -------------------------------------------------------------------------------- /ChatbotBaseline/egs/opensubs/tools: -------------------------------------------------------------------------------- 1 | ../../tools -------------------------------------------------------------------------------- /ChatbotBaseline/egs/opensubs/utils: -------------------------------------------------------------------------------- 1 | ../../utils -------------------------------------------------------------------------------- /ChatbotBaseline/egs/twitter/data/twitter_example.txt: -------------------------------------------------------------------------------- 1 | U: i need to download the mac picopix driver but cannot seem to locate it online . i have the #picopix1020 #pleasehelp 2 | S: hi , please download the drivers from the picopix support webpage " software and drivers " . 3 | 4 | U: hi , in case i dont have a ssn , can i still apply for a secured credit card with you guys ? 5 | S: while we 'd love to be banking buds , a valid ssn 's a must have to get the ball rolling on an account . 6 | U: thanks cor answering 7 | S: if you have a itin , that would work too ! 8 | 9 | U: i downloaded and installed @audible_com and it won 't even download books ? what is it used for then ? @amazonhelp 10 | S: we 'd like to help . does this error occur when trying to open the app or play a book ? 11 | U: ... 
and can i only " buy " a book online then download to app ? but not buy in-app ? 12 | S: the trial would issue 1 credit each month . we don 't have the store in app . click here for shopping on ios : 13 | 14 | U: could not tell that @panagocares received wrong pizza , but since ordered on-line i was told i was wrong and there was nothing they could do . 15 | S: , if you dm us the order phone number we could look into this for you if you would like . thank you . 16 | 17 | U: the moment you go pick up your new office chair from @officedepot and they give an open box with parts packed in a plastic bag . #fail 18 | S: hi . please email us at . we 're more than happy to try and help . thanks . 19 | 20 | U: are there any special deals tomorrow ? 21 | S: we enclosed a link with opening information . : 22 | 23 | U: how long is a prepaid number valid for if it 's been inactive for a while ? 24 | S: y 'ello , please be informed that if a prepaid number is left unused for more that 5 months , it will be terminated and recycled . 25 | 26 | U: very much disappointed with my dishwasher issue it 's been three months does not work nor replaced if u can 't fix replace 27 | S: , please click the link below to send a direct message please be sure to include your full name , address , and phone number so that we can assist . thank you ! 28 | 29 | U: you know how you challenged me to do a sportive ? does a charity ride count ? 30 | S: yes certainly what one are you aiming to do 31 | 32 | U: are you currently down ? 33 | S: we were , and we 're now back up ! sorry for the bother guys ! 34 | 35 | U: holy smokes ! ! this guardians of the galaxy box is so awesome ! ! best one yet ! ! you guys rule . 😁 36 | S: glad you like it ! thank you ! 37 | 38 | U: fly home 2night with @emirates what a way to finish an amazing 4 months travel . any chance of an upgrade to complete our trip ? flight ek435 39 | S: hi peter , an upgrade is possible using miles or paying the fare difference . call us to check . 40 | 41 | U: hey , just wondering much time is required for prebooked tickets to become available in stations ? are they available imediatly ? 42 | S: hi , yes they are available for collection as soon as the booking is completed . 43 | 44 | U: i wonder how big of a tax credit @aetna gets for the hiring of all the retardeds in the provider help call center . yeeesh . 45 | S: hi , our team is here 24/7 . if you are in need of assistance , please feel free to contact us . we would be happy to help . 46 | 47 | S: have any favorite homemade food gifts for the holidays ? here are some of ours : 48 | U: love your ideas ! 49 | S: thank you ! :) 50 | 51 | U: am i wearing my new @stayhomeclub shirt correctly ? i tied the bottom in a knot because i spilled maple syrup on it & am too lazy to change . 52 | S: that is correct 53 | 54 | U: on this cinco de mayo remember to " stay thirsty my friend " @jgebing #tbt 55 | S: wow ... just wow . 😳 56 | 57 | U: this @foodlion is closing . i wonder if this type of accounting is the reason why . 58 | S: hi , did you speak with a manager at the store ? 

U: how to build a structured powerpoint presentation [ template ] via @marshacollier
S: thank you for sharing 💕

--------------------------------------------------------------------------------
/ChatbotBaseline/egs/twitter/path.sh:
--------------------------------------------------------------------------------

export CHATBOT_DATADIR=../../../tasks/twitter
export CHAINER_TYPE_CHECK=0
#export PYTHONPATH=
--------------------------------------------------------------------------------
/ChatbotBaseline/egs/twitter/run.sh:
--------------------------------------------------------------------------------
#!/bin/bash

## read path variables needed for the experiment
# edit ./path.sh according to your environment
. path.sh

## configuration for the experiment
# the following variables can be changed by command line options like
#   run.sh --<variable name> <value> ...
# e.g. run.sh --stage 2
#
stage=1                 # 1: start from training
                        # 2: start from evaluation
                        # 3: start from scoring

use_slurm=false         # set true to send this job to slurm
slurm_queue=gpu_titanxp # slurm queue name

workdir=`pwd`           # default working directory

## definition of model structure
modeltype=lstm
vocabsize=20000 # set vocabulary size (use most common words)
enc_layer=2     # number of encoder layers
enc_esize=100   # number of dimensions in encoder's word embedding layer
enc_hsize=512   # number of hidden units in each encoder layer
dec_layer=2     # number of decoder layers (should be the same as enc_layer)
dec_esize=100   # number of dimensions in decoder's word embedding layer
dec_hsize=512   # number of hidden units in each decoder layer
                # (should be the same as enc_hsize)
dec_psize=100   # number of dimensions in decoder's projection layer before softmax

## optimization parameters
batch_size=100      # make mini-batches with 100 sequences
max_batch_length=10 # batch size is automatically reduced if the sequence length
                    # exceeds this number in each minibatch
optimizer=Adam      # specify an optimizer
dropout=0.5         # set a dropout ratio

## evaluation parameters
beam=5      # beam width for the beam search
penalty=1.0 # penalty added to the log-probability of each word
            # (a larger penalty generates longer sentences)
maxlen=30   # maximum sequence length to be searched

## data files
train_data=${CHATBOT_DATADIR}/twitter_trial_data_train.txt
valid_data=${CHATBOT_DATADIR}/twitter_trial_data_dev.txt
eval_data=${CHATBOT_DATADIR}/twitter_trial_data_eval.txt

## get options (change the above variables with command line options)
. utils/parse_options.sh || exit 1;

## output directory (models and results will be stored in this directory)
expdir=./exp/${modeltype}_${optimizer}_el${enc_layer}_ee${enc_esize}_eh${enc_hsize}_dl${dec_layer}_de${dec_esize}_dh${dec_hsize}_dp${dec_psize}_bs${batch_size}_dr${dropout}

## command settings
# if 'use_slurm' is true, it throws jobs to the specified queue of slurm
if [ $use_slurm = true ]; then
  train_cmd="srun --job-name train --chdir=$workdir --gres=gpu:1 -p $slurm_queue"
  test_cmd="srun --job-name test --chdir=$workdir --gres=gpu:1 -p $slurm_queue"
  gpu_id=0
else
  train_cmd=""
  test_cmd=""
  # get a GPU device id with the lowest memory consumption at this moment
  gpu_id=`utils/get_available_gpu_id.sh`
fi

# Set bash to 'debug' mode: it will exit on
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands'
set -e
set -u
set -o pipefail
#set -x

mkdir -p $expdir

# training
if [ $stage -le 1 ]; then
  # This script creates model files as '${expdir}/conversation_model.<n>',
  # where <n> indicates the epoch number of the trained model.
  # A symbolic link will be made to the best model for the validation data
  # as '${expdir}/conversation_model.best'
  echo start training
  $train_cmd tools/train_conversation_model.py \
      --gpu $gpu_id \
      --optimizer $optimizer \
      --train $train_data \
      --valid $valid_data \
      --batch-size $batch_size \
      --max-batch-length $max_batch_length \
      --vocab-size $vocabsize \
      --model ${expdir}/conversation_model \
      --snapshot ${expdir}/snapshot.pkl \
      --enc-layer $enc_layer \
      --enc-esize $enc_esize \
      --enc-hsize $enc_hsize \
      --dec-layer $dec_layer \
      --dec-esize $dec_esize \
      --dec-hsize $dec_hsize \
      --dec-psize $dec_psize \
      --dropout $dropout \
      --logfile ${expdir}/train.log
fi

# evaluation
if [ $stage -le 2 ]; then
  # This script generates sentences for the evaluation data using
  # the model file '${expdir}/conversation_model.best'.
  # The generated sentences will be stored in a file:
  # '${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt'
  echo start sentence generation
  $test_cmd tools/evaluate_conversation_model.py \
      --gpu $gpu_id \
      --test $eval_data \
      --model ${expdir}/conversation_model.best \
      --beam $beam \
      --maxlen $maxlen \
      --penalty $penalty \
      --output ${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt \
      --logfile ${expdir}/evaluate_m${maxlen}_b${beam}_p${penalty}.log
fi

# scoring
if [ $stage -le 3 ]; then
  # This script computes BLEU scores for the generated sentences in
  # '${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt'.
  # The BLEU scores will be written to
  # '${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt'
  echo scoring by BLEU metric
  tools/bleu_score.py ${expdir}/result_m${maxlen}_b${beam}_p${penalty}.txt \
      > ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt
  cat ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt
  echo stored in ${expdir}/bleu_m${maxlen}_b${beam}_p${penalty}.txt
fi
--------------------------------------------------------------------------------
/ChatbotBaseline/egs/twitter/tools:
--------------------------------------------------------------------------------
../../tools
--------------------------------------------------------------------------------
/ChatbotBaseline/egs/twitter/utils:
--------------------------------------------------------------------------------
../../utils
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/bleu_score.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""BLEU score for dialog system output

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import sys
from nltk.translate.bleu_score import corpus_bleu

refs = []
hyps = []
for utt in open(sys.argv[1],'r').readlines():
    if utt.startswith('S_REF:'):
        ref = utt.replace('S_REF:','').split()
        refs.append([ref])

    if utt.startswith('S_HYP:'):
        hyp = utt.replace('S_HYP:','').split()
        hyps.append(hyp)

# obtain BLEU1-4
print("--------------------------------")
print("Evaluated file: " + sys.argv[1])
print("Number of references: %d" % len(refs))
print("Number of hypotheses: %d" % len(hyps))
print("--------------------------------")
if len(refs) > 0 and len(hyps) > 0 and len(refs)==len(hyps):
    for n in [1,2,3,4]:
        weights = [1./n] * n
        print('Bleu%d: %f' % (n, corpus_bleu(refs,hyps,weights=weights)))
else:
    print("Error: mismatched references and hypotheses.")
print("--------------------------------")
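# --------------------------------------------------------------------------
# Note: bleu_score.py expects the result file written by
# evaluate_conversation_model.py, where each test turn appears as a
# reference line 'S_REF: ...' followed by a hypothesis line 'S_HYP: ...'.
# A minimal sketch of the computation on one such pair (the sentences are
# illustrative, not taken from a real result file):
#
#   from nltk.translate.bleu_score import corpus_bleu
#   refs = [[['hi', '!', 'how', 'can', 'we', 'help', 'you', 'today', '?']]]
#   hyps = [['hi', '!', 'how', 'can', 'i', 'help', 'you', '?']]
#   print(corpus_bleu(refs, hyps, weights=[1.0]))  # BLEU-1, as in the loop above
# --------------------------------------------------------------------------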
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/dialog_corpus.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Dialog corpus handler

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import re
import six
import numpy as np
from collections import Counter
import copy

def convert_words2ids(words, vocab, unk, sos=None, eos=None):
    """ convert a word string sequence into a word id sequence
    Args:
        words (list): word sequence
        vocab (dict): word-id mapping
        unk (int): id of unknown word
        sos (int): id of start-of-sentence symbol
        eos (int): id of end-of-sentence symbol
    Return:
        numpy array of word ids
    """
    word_ids = [ vocab[w] if w in vocab else unk for w in words ]
    if sos is not None:
        word_ids.insert(0,sos)
    if eos is not None:
        word_ids.append(eos)
    return np.array(word_ids, dtype=np.int32)


def get_vocabulary(textfile, initial_vocab={'<unk>':0,'<eos>':1}, vocabsize=0):
    """ acquire vocabulary from a dialog text corpus
    Args:
        textfile (str): filename of a dialog corpus
        initial_vocab (dict): initial word-id mapping
        vocabsize (int): upper bound of vocabulary size (0 means no limitation)
    Return:
        dict of word-id mapping
    """
    vocab = copy.copy(initial_vocab)
    word_count = Counter()
    for line in open(textfile,'r').readlines():
        for w in line.split()[1:]: # skip speaker indicator
            word_count[w] += 1

    # if vocabulary size is specified, most common words are selected
    if vocabsize > 0:
        for w in word_count.most_common(vocabsize):
            if w[0] not in vocab:
                vocab[w[0]] = len(vocab)
            if len(vocab) >= vocabsize:
                break
    else: # all observed words are stored
        for w in word_count:
            if w not in vocab:
                vocab[w] = len(vocab)

    return vocab
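# --------------------------------------------------------------------------
# A minimal sketch of the two helpers above on a hand-made vocabulary
# (illustrative only):
#
#   vocab = {'<unk>': 0, '<eos>': 1, 'hello': 2, '!': 3}
#   ids = convert_words2ids(['hello', '!', 'world'], vocab,
#                           unk=vocab['<unk>'], eos=vocab['<eos>'])
#   # -> array([2, 3, 0, 1], dtype=int32); 'world' is mapped to <unk>
# --------------------------------------------------------------------------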
def load(textfile, vocab, target):
    """ Load a dialog text corpus as word id sequences
    Args:
        textfile (str): filename of a dialog corpus
        vocab (dict): word-id mapping
        target (str): target speaker name (e.g. 'S', 'Machine', ...)
    Return:
        list of dialogs, where each dialog is a list of
        (input sequence, output sequence) word-id pairs for the target speaker
    """
    unk = vocab['<unk>']
    eos = vocab['<eos>']
    data = []
    dialog = []
    prev_speaker = ''
    prev_utterance = []
    input_utterance = []

    for line in open(textfile,'r').readlines():
        utterance = line.split()
        # store an utterance
        if len(utterance) > 0:
            speaker = utterance[0].split(':')[0] # get speaker name ('S: ...' -> 'S')
            if prev_speaker and prev_speaker != speaker:
                if prev_speaker == target:
                    # store the input-output pair for the target speaker
                    if len(input_utterance) > 0:
                        input_ids = convert_words2ids(input_utterance, vocab, unk)
                        output_ids = convert_words2ids(prev_utterance, vocab, unk, sos=eos, eos=eos)
                        dialog.append((input_ids, output_ids))
                        input_utterance = []
                    else:
                        # if the first utterance was given by the system, it is
                        # included in the input sequence
                        input_utterance = prev_utterance

                else: # all other speakers' utterances are used as input
                    input_utterance += prev_utterance

                prev_utterance = utterance[1:]
            else: # concatenate consecutive utterances by the same speaker
                prev_utterance += utterance[1:]

            prev_speaker = speaker

        # an empty line represents the end of each dialog
        elif len(prev_utterance) > 0:
            # store the input-output pair for the target speaker
            if prev_speaker == target and len(input_utterance) > 0:
                input_ids = convert_words2ids(input_utterance, vocab, unk)
                output_ids = convert_words2ids(prev_utterance, vocab, unk, sos=eos, eos=eos)
                dialog.append((input_ids, output_ids))

            if len(dialog) > 0:
                data.append(dialog)

            dialog = []
            prev_speaker = ''
            prev_utterance = []
            input_utterance = []

    # store remaining utterances if the file ends with EOF instead of an empty line
    if len(prev_utterance) > 0:
        # store the input-output pair for the target speaker
        if prev_speaker == target and len(input_utterance) > 0:
            input_ids = convert_words2ids(input_utterance, vocab, unk)
            output_ids = convert_words2ids(prev_utterance, vocab, unk, sos=eos, eos=eos)
            dialog.append((input_ids, output_ids))

        if len(dialog) > 0:
            data.append(dialog)

    return data
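# --------------------------------------------------------------------------
# A minimal usage sketch for the corpus pipeline, assuming a dialog file in
# the same 'U:'/'S:' format as egs/twitter/data/twitter_example.txt (the
# file name below is hypothetical):
#
#   vocab = get_vocabulary('twitter_trial_data_train.txt', vocabsize=20000)
#   data = load('twitter_trial_data_train.txt', vocab, target='S')
#   # data[i] is one dialog: a list of (input_ids, output_ids) pairs, where
#   # output_ids is an utterance of speaker 'S' bracketed by <eos> symbols.
# --------------------------------------------------------------------------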
def make_minibatches(data, batchsize, max_length=0):
    """ Construct a mini-batch list of numpy arrays of dialog indices
    Args:
        data: dialog data read by the load function.
        batchsize (int): upper bound of the number of dialogs per mini-batch.
        max_length (int): if a mini-batch includes a word sequence that exceeds
            this number, the batchsize is automatically reduced.
    Return:
        list of mini-batches (each mini-batch is represented as a numpy array
        of dialog indices).
    """
    if batchsize > 1:
        # sort dialogs by (#turns, max(reply #words, input #words))
        max_ulens = np.array([ max([(len(u[1]),len(u[0])) for u in d]) for d in data ])
        indices = sorted(range(len(data)),
                         key=lambda i:(-len(data[i]), -max_ulens[i][0], -max_ulens[i][1]))
        # obtain partitions of dialogs for different numbers of turns
        partition = [0]
        prev_nturns = len(data[indices[0]])
        for k in six.moves.range(1,len(indices)):
            nturns = len(data[indices[k]])
            if prev_nturns > nturns:
                partition.append(k)
                prev_nturns = nturns

        partition.append(len(indices))
        # create mini-batch list
        batchlist = []
        for p in six.moves.range(len(partition)-1):
            bs = partition[p]
            while bs < partition[p+1]:
                # batchsize is reduced if the max length in the mini-batch
                # is greater than 'max_length'
                be = min(bs + batchsize, partition[p+1])
                if max_length > 0:
                    max_ulen = np.max(max_ulens[indices[bs:be]])
                    be = min(be, bs + max(int(batchsize / (max_ulen / max_length + 1)), 1))

                batchlist.append(indices[bs:be])
                bs = be
    else:
        batchlist = [ np.array([i]) for i in six.moves.range(len(data)) ]

    return batchlist
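# --------------------------------------------------------------------------
# A minimal sketch of mini-batch construction on loaded data ('data' as in
# the sketch above; the sizes mirror the defaults in egs/*/run.sh):
#
#   batchlist = make_minibatches(data, batchsize=100, max_length=10)
#   for batch in batchlist:
#       # 'batch' holds indices of dialogs with the same number of turns,
#       # so all dialogs in a mini-batch can be processed turn by turn.
#       dialogs = [data[i] for i in batch]
# --------------------------------------------------------------------------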
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/do_conversation.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Interactive neural conversation demo

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import argparse
import sys
import os
import pickle
import re
import six
import numpy as np

import chainer
from chainer import cuda
from nltk.tokenize import casual_tokenize

##################################
# main
if __name__ =="__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument('--gpu', '-g', default=0, type=int,
                        help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--beam', '-b', default=5, type=int,
                        help='set beam width')
    parser.add_argument('--penalty', '-p', default=1., type=float,
                        help='set insertion penalty')
    parser.add_argument('--nbest', '-n', default=1, type=int,
                        help='generate n-best sentences')
    parser.add_argument('--maxlen', default=20, type=int,
                        help='set maximum sequence length in beam search')
    parser.add_argument('model', nargs=1,
                        help='conversation model file')

    args = parser.parse_args()
    if args.gpu >= 0:
        cuda.check_cuda_available()
        cuda.get_device(args.gpu).use()
        xp = cuda.cupy
    else:
        xp = np

    # use chainer in testing mode
    chainer.config.train = False

    # prepare the RNN model and load data
    print("--- do neural conversations ------")
    print('Loading model params from ' + args.model[0])
    with open(args.model[0], 'rb') as f:
        vocab, model, train_args = pickle.load(f)
    if args.gpu >= 0:
        model.to_gpu()
    # report data summary
    print('vocabulary size = %d' % len(vocab))
    vocablist = sorted(vocab.keys(), key=lambda s:vocab[s])
    # generate sentences
    print("--- start conversation [push Ctrl-D to exit] ------")
    unk = vocab['<unk>']
    eos = vocab['<eos>']
    state = None
    while True:
        try:
            input_str = six.moves.input('U: ')
        except EOFError:
            break

        if input_str:
            if input_str=='exit' or input_str=='quit':
                break
            sentence = []
            for token in casual_tokenize(input_str, preserve_case=False, reduce_len=True):
                # make a space before an apostrophe
                token = re.sub(r'^([a-z]+)\'([a-z]+)$','\\1 \'\\2',token)
                for w in token.split():
                    sentence.append(vocab[w] if w in vocab else unk)

            x_data = np.array(sentence, dtype=np.int32)
            x = chainer.Variable(xp.asarray(x_data))
            besthyps,state = model.generate(state, x, eos, eos, unk=unk,
                                            maxlen=args.maxlen,
                                            beam=args.beam,
                                            penalty=args.penalty,
                                            nbest=args.nbest)
            ## print the sentence(s)
            if args.nbest == 1:
                sys.stdout.write('S:')
                for w in besthyps[0][0]:
                    if w != eos:
                        sys.stdout.write(' ' + vocablist[w])
                sys.stdout.write('\n')
            else:
                for n,s in enumerate(besthyps):
                    sys.stdout.write('S%d:' % n)
                    for w in s[0]:
                        if w != eos:
                            sys.stdout.write(' ' + vocablist[w])
                    sys.stdout.write(' (%f)\n' % s[1])
        else:
            print("--- start conversation [push Ctrl-D to exit] ------")
            state = None

    print('done')

--------------------------------------------------------------------------------
/ChatbotBaseline/tools/evaluate_conversation_model.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Evaluate an encoder-decoder model for neural conversation

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import argparse
import sys
import time
import os
import copy
import pickle

import numpy as np
import six

import chainer
from chainer import cuda
import chainer.functions as F
from chainer import optimizers
import dialog_corpus

from tqdm import tqdm
import logging
import tqdm_logging

# use the root logger
logger = logging.getLogger("root")

# Generate sentences
def generate_sentences(model, dataset, vocab, xp, vocabsize=None, outfile=None,
                       maxlen=20, beam=5, penalty=2.0, progress_bar=True):

    # use chainer in testing mode
    chainer.config.train = False

    vocablist = sorted(vocab.keys(), key=lambda s:vocab[s])
    if vocabsize is None:  # by default, no vocabulary truncation
        vocabsize = len(vocab)

    eos = vocab['<eos>']
    unk = vocab['<unk>']

    if progress_bar:
        progress = tqdm(total=len(dataset))
        progress.set_description('Eval')

    if outfile:
        fo = open(outfile,'w')

    result = []

    for i in six.moves.range(len(dataset)):
        logger.debug('---- Dialog[%d] ----' % i)
        # predict the decoder state for the context
        ds = None
        for j in six.moves.range(len(dataset[i])-1):
            inp = [vocablist[w] for w in dataset[i][j][0]]
            out = [vocablist[w] for w in dataset[i][j][1][1:-1]]
            logger.debug('U: %s' % ' '.join(inp))
            logger.debug('S: %s' % ' '.join(out))
            if outfile:
                six.print_('U: %s' % ' '.join(inp), file=fo)
                six.print_('S: %s' % ' '.join(out), file=fo)

            x_data = np.copy(dataset[i][j][0])
            x_data[ x_data >= vocabsize ] = unk
            x = [ chainer.Variable(xp.asarray(x_data)) ]
            y = [ chainer.Variable(xp.asarray(dataset[i][j][1][:-1])) ]
            es,ds = 
model.loss(ds, x, y, None) 76 | 77 | # generate a sentence for the last input 78 | inp = [vocablist[w] for w in dataset[i][-1][0]] 79 | ref = [vocablist[w] for w in dataset[i][-1][1][1:-1]] 80 | logger.debug('U: %s' % ' '.join(inp)) 81 | if outfile: 82 | six.print_('U: %s' % ' '.join(inp), file=fo) 83 | 84 | if len(ref) > 0: 85 | logger.debug('S_REF: %s' % ' '.join(ref)) 86 | if outfile: 87 | six.print_('S_REF: %s' % ' '.join(ref), file=fo) 88 | 89 | x_data = np.copy(dataset[i][-1][0]) 90 | x_data[ x_data >= vocabsize ] = unk 91 | x = chainer.Variable(xp.asarray(x_data)) 92 | # generate a sentence: 93 | # model.generate() returns n-best list, which is a list of 94 | # tuples as [ (word Id sequence, score), ... ] and 95 | # also returns the best decoder state 96 | besthyps,_ = model.generate(ds, x, eos, eos, unk=unk, 97 | maxlen=maxlen, beam=beam, penalty=penalty, nbest=1) 98 | # write result 99 | hyp = [vocablist[w] for w in besthyps[0][0]] 100 | if outfile: 101 | six.print_('S_HYP: %s\n' % ' '.join(hyp), file=fo, flush=True) 102 | 103 | result.append(hyp) 104 | # for debugging 105 | logger.debug('S_HYP: %s' % ' '.join(hyp)) 106 | logger.debug('Score: %f' % besthyps[0][1]) 107 | # update progress bar 108 | if progress_bar: 109 | progress.update(1) 110 | 111 | if progress_bar: 112 | progress.close() 113 | 114 | if outfile: 115 | fo.close() 116 | 117 | return result 118 | 119 | 120 | ################################## 121 | # main 122 | if __name__ =="__main__": 123 | parser = argparse.ArgumentParser() 124 | # logging 125 | parser.add_argument('--logfile', '-l', default='', 126 | help='write log data into a file') 127 | parser.add_argument('--debug', '-d', action='store_true', 128 | help='run in debug mode') 129 | parser.add_argument('--silent', '-s', action='store_true', 130 | help='run in silent mode') 131 | parser.add_argument('--no-progress-bar', action='store_true', 132 | help='hide the progress bar') 133 | # files 134 | parser.add_argument('--model', '-m', required=True, 135 | help='set conversation model to be used') 136 | parser.add_argument('--test', required=True, 137 | help='set filename of test data') 138 | parser.add_argument('--output', '-o', default='', 139 | help='write system output into a file') 140 | parser.add_argument('--target-speaker', '-T', default='', 141 | help='set target speaker name for system output') 142 | # search parameters 143 | parser.add_argument('--beam', '-b', default=5, type=int, 144 | help='set beam width') 145 | parser.add_argument('--penalty', '-p', default=2.0, type=float, 146 | help='set insertion penalty') 147 | parser.add_argument('--maxlen', '-M', default=20, type=int, 148 | help='set maximum sequence length in beam search') 149 | # select a GPU device 150 | parser.add_argument('--gpu', '-g', default=0, type=int, 151 | help='GPU ID (negative value indicates CPU)') 152 | 153 | args = parser.parse_args() 154 | 155 | # flush stdout 156 | if six.PY2: 157 | sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) 158 | 159 | # set up the logger 160 | tqdm_logging.config(logger, args.logfile, silent=args.silent, debug=args.debug) 161 | 162 | # gpu setup 163 | if args.gpu >= 0: 164 | cuda.check_cuda_available() 165 | cuda.get_device(args.gpu).use() 166 | xp = cuda.cupy 167 | else: 168 | xp = np 169 | 170 | logger.info('------------------------------------') 171 | logger.info('Evaluate a neural conversation model') 172 | logger.info('------------------------------------') 173 | logger.info('Args ' + str(args)) 174 | # Prepare RNN model and load data 175 | 
logger.info('Loading model params from ' + args.model)
    with open(args.model, 'rb') as f:
        vocab, model, train_args = pickle.load(f)
    if args.gpu >= 0:
        model.to_gpu()

    if args.target_speaker:
        target_speaker = args.target_speaker
    else:
        target_speaker = train_args.target_speaker

    # prepare test data
    logger.info('Loading test data from ' + args.test)
    new_vocab = dialog_corpus.get_vocabulary(args.test, initial_vocab=vocab)
    test_set = dialog_corpus.load(args.test, new_vocab, target_speaker)
    # report data summary
    logger.info('vocabulary size = %d (%d)' % (len(vocab),len(new_vocab)))
    logger.info('#test sample = %d' % len(test_set))
    # generate sentences
    logger.info('----- start sentence generation -----')
    start_time = time.time()
    result = generate_sentences(model, test_set, new_vocab, xp,
                                vocabsize=len(vocab), outfile=args.output,
                                maxlen=args.maxlen,
                                beam=args.beam, penalty=args.penalty,
                                progress_bar=not args.no_progress_bar)
    logger.info('----- finished -----')
    logger.info('Number of dialogs: %d' % len(test_set))
    logger.info('Number of hypotheses: %d' % len(result))
    logger.info('Wall time: %f (sec)' % (time.time() - start_time))
    logger.info('----------------')
    logger.info('done')
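# --------------------------------------------------------------------------
# Note: a saved model file is a pickle of the tuple (vocab, model,
# train_args), as loaded above and in do_conversation.py. A minimal sketch
# for reusing a trained model in your own script (the file name is
# hypothetical):
#
#   import pickle
#   with open('conversation_model.best', 'rb') as f:
#       vocab, model, train_args = pickle.load(f)
#   # 'model' is a Sequence2SequenceModel and 'vocab' maps words to ids.
# --------------------------------------------------------------------------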
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/lstm_decoder.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""LSTM Decoder module

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import numpy as np
import chainer
from chainer import cuda
import chainer.links as L
import chainer.functions as F

class LSTMDecoder(chainer.Chain):

    def __init__(self, n_layers, in_size, out_size, embed_size, hidden_size, proj_size, dropout=0.5):
        """Initialize the decoder with structure parameters

        Args:
            n_layers (int): Number of layers.
            in_size (int): Dimensionality of input vectors.
            out_size (int): Dimensionality of output vectors.
            embed_size (int): Dimensionality of word embedding.
            hidden_size (int) : Dimensionality of hidden vectors.
            proj_size (int) : Dimensionality of projection before softmax.
            dropout (float): Dropout ratio.
        """
        super(LSTMDecoder, self).__init__(
            embed = L.EmbedID(in_size, embed_size),
            lstm = L.NStepLSTM(n_layers, embed_size, hidden_size, dropout),
            proj = L.Linear(hidden_size, proj_size),
            out = L.Linear(proj_size, out_size)
        )
        self.dropout = dropout
        for param in self.params():
            param.data[...] = np.random.uniform(-0.1, 0.1, param.data.shape)


    def __call__(self, s, xs):
        """Calculate all hidden states, cell states, and output prediction.

        Args:
            s (~chainer.Variable or None): Initial (hidden, cell) states. If ``None``
                is specified zero-vector is used.
            xs (list of ~chainer.Variable): List of input sequences.
                Each element ``xs[i]`` is a :class:`chainer.Variable` holding
                a sequence.
        Return:
            (hy,cy): a pair of hidden and cell states at the end of the sequence,
            y: a sequence of pre-activation vectors at the output layer

        """
        if len(xs) > 1:
            sections = np.cumsum(np.array([len(x) for x in xs[:-1]], dtype=np.int32))
            xs = F.split_axis(self.embed(F.concat(xs, axis=0)), sections, axis=0)
        else:
            xs = [ self.embed(xs[0]) ]

        if s is not None:
            hy, cy, ys = self.lstm(s[0], s[1], xs)
        else:
            hy, cy, ys = self.lstm(None, None, xs)

        #y = self.out(F.tanh(self.proj(F.concat(ys, axis=0))))
        y = self.out(self.proj(
                F.dropout(F.concat(ys, axis=0), ratio=self.dropout)))
        return (hy,cy),y


    # interface for beam search
    def initialize(self, s, x, i):
        """Initialize the decoder

        Args:
            s (any): Initial (hidden, cell) states. If ``None`` is specified
                zero-vector is used.
            x (~chainer.Variable or None): Input sequence (unused here).
            i (int): input label.
        Return:
            initial decoder state
        """
        # the LSTM decoder can be initialized in the same way as update()
        return self.update(s,i)


    def update(self, s, i):
        """Update the decoder state

        Args:
            s (any): Current (hidden, cell) states. If ``None`` is specified
                zero-vector is used.
            i (int): input label.
        Return:
            (~chainer.Variable) updated decoder state
        """
        if cuda.get_device_from_array(s[0].data).id >= 0:
            xp = cuda.cupy
        else:
            xp = np

        v = chainer.Variable(xp.array([i],dtype=np.int32))
        x = self.embed(v)
        if s is not None:
            hy, cy, dy = self.lstm(s[0], s[1], [x])
        else:
            hy, cy, dy = self.lstm(None, None, [x])

        return hy, cy, dy


    def predict(self, s):
        """Predict single-label log probabilities

        Args:
            s (any): Current (hidden, cell) states.
        Return:
            (~chainer.Variable) log softmax vector
        """
        y = self.out(self.proj(s[2][0]))
        return F.log_softmax(y)
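# --------------------------------------------------------------------------
# A minimal sketch of the beam-search interface above ('es' and 'ey' come
# from an LSTMEncoder, as in seq2seq_model.generate(); 'eos_id' and
# 'next_word_id' are illustrative):
#
#   dec = LSTMDecoder(n_layers=2, in_size=20000, out_size=20000,
#                     embed_size=100, hidden_size=512, proj_size=100)
#   st = dec.initialize(es, ey, eos_id)  # feed <eos> as the start label
#   logp = dec.predict(st)               # log-softmax over output labels
#   st = dec.update(st, next_word_id)    # advance the decoder by one step
# --------------------------------------------------------------------------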
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/lstm_encoder.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""LSTM Encoder

Copyright (c) 2017 Takaaki Hori (thori@merl.com)

This software is released under the MIT License.
http://opensource.org/licenses/mit-license.php

"""

import numpy as np
import chainer
from chainer import cuda
import chainer.links as L
import chainer.functions as F

class LSTMEncoder(chainer.Chain):

    def __init__(self, n_layers, in_size, out_size, embed_size, dropout=0.5):
        """Initialize the encoder with structure parameters
        Args:
            n_layers (int): Number of layers.
            in_size (int): Dimensionality of input vectors.
            out_size (int) : Dimensionality of hidden vectors to be output.
            embed_size (int): Dimensionality of word embedding.
            dropout (float): Dropout ratio.
        """
        super(LSTMEncoder, self).__init__(
            embed = L.EmbedID(in_size, embed_size),
            lstm = L.NStepLSTM(n_layers, embed_size, out_size, dropout)
        )
        for param in self.params():
            param.data[...] = np.random.uniform(-0.1, 0.1, param.data.shape)


    def __call__(self, s, xs):
        """Calculate all hidden states and cell states.
        Args:
            s (~chainer.Variable or None): Initial (hidden & cell) states. If ``None``
                is specified zero-vector is used.
            xs (list of ~chainer.Variable): List of input sequences.
                Each element ``xs[i]`` is a :class:`chainer.Variable` holding
                a sequence.
        Return:
            (hy,cy): a pair of hidden and cell states at the end of the sequence,
            ys: a hidden state sequence at the last layer
        """
        if len(xs) > 1:
            sections = np.cumsum(np.array([len(x) for x in xs[:-1]], dtype=np.int32))
            xs = F.split_axis(self.embed(F.concat(xs, axis=0)), sections, axis=0)
        else:
            xs = [ self.embed(xs[0]) ]
        if s is not None:
            hy, cy, ys = self.lstm(s[0], s[1], xs)
        else:
            hy, cy, ys = self.lstm(None, None, xs)

        return (hy,cy), ys
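# --------------------------------------------------------------------------
# A minimal sketch of how the encoder and decoder are combined into the
# sequence-to-sequence model defined in seq2seq_model.py below (the sizes
# mirror the defaults in egs/*/run.sh):
#
#   from lstm_encoder import LSTMEncoder
#   from lstm_decoder import LSTMDecoder
#   from seq2seq_model import Sequence2SequenceModel
#
#   enc = LSTMEncoder(n_layers=2, in_size=20000, out_size=512, embed_size=100)
#   dec = LSTMDecoder(n_layers=2, in_size=20000, out_size=20000,
#                     embed_size=100, hidden_size=512, proj_size=100)
#   model = Sequence2SequenceModel(enc, dec)
# --------------------------------------------------------------------------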
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/seq2seq_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """Sequence-to-sequence model module
3 | 
4 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
5 | 
6 | This software is released under the MIT License.
7 | http://opensource.org/licenses/mit-license.php
8 | 
9 | """
10 | 
11 | import six
12 | import chainer
13 | import chainer.functions as F
14 | from chainer import cuda
15 | import numpy as np
16 | 
17 | class Sequence2SequenceModel(chainer.Chain):
18 | 
19 |     def __init__(self, encoder, decoder):
20 |         """ Define model structure
21 |         Args:
22 |             encoder (~chainer.Chain): encoder network
23 |             decoder (~chainer.Chain): decoder network
24 |         """
25 |         super(Sequence2SequenceModel, self).__init__(
26 |             encoder = encoder,
27 |             decoder = decoder
28 |         )
29 | 
30 | 
31 |     def loss(self,es,x,y,t):
32 |         """ Forward propagation and loss calculation
33 |         Args:
34 |             es (pair of ~chainer.Variable): encoder state
35 |             x (list of ~chainer.Variable): list of input sequences
36 |             y (list of ~chainer.Variable): list of output sequences
37 |             t (list of ~chainer.Variable): list of target sequences;
38 |                 if t is None, it returns only states
39 |         Return:
40 |             es (pair of ~chainer.Variable(s)): encoder state
41 |             ds (pair of ~chainer.Variable(s)): decoder state
42 |             loss (~chainer.Variable): cross-entropy loss
43 |         """
44 |         es,ey = self.encoder(es,x)
45 |         ds,dy = self.decoder(es,y)
46 |         if t is not None:
47 |             loss = F.softmax_cross_entropy(dy,t)
48 |             # avoid NaN gradients (See: https://github.com/pfnet/chainer/issues/2505)
49 |             if chainer.config.train:
50 |                 loss += F.sum(F.concat(ey, axis=0)) * 0
51 |             return es, ds, loss
52 |         else:  # if target is None, it only returns states
53 |             return es, ds
54 | 
55 | 
56 |     def generate(self, es, x, sos, eos, unk=0, maxlen=100, beam=5, penalty=1.0, nbest=1):
57 |         """ Generate sequence using beam search
58 |         Args:
59 |             es (pair of ~chainer.Variable(s)): encoder state
60 |             x (list of ~chainer.Variable): list of input sequences
61 |             sos (int): id number of start-of-sentence label
62 |             eos (int): id number of end-of-sentence label
63 |             unk (int): id number of unknown-word label
64 |             maxlen (int): maximum length of an output sequence
65 |             beam (int): beam width
66 |             penalty (float): penalty added to log probabilities
67 |                              of each output label.
68 |             nbest (int): number of n-best hypotheses to be output
69 |         Return:
70 |             list of tuples (hyp, score): n-best hypothesis list
71 |                 - hyp (list): generated word Id sequence
72 |                 - score (float): hypothesis score
73 |             (pair of ~chainer.Variable(s)): decoder state of best hypothesis
74 |         """
75 |         # encoder
76 |         es,ey = self.encoder(es, [x])
77 |         # beam search
78 |         ds = self.decoder.initialize(es, ey, sos)
79 |         hyplist = [([], 0., ds)]
80 |         best_state = None
81 |         comp_hyplist = []
82 |         for l in six.moves.range(maxlen):
83 |             new_hyplist = []
84 |             argmin = 0
85 |             for out,lp,st in hyplist:
86 |                 logp = self.decoder.predict(st)
87 |                 lp_vec = cuda.to_cpu(logp.data[0]) + lp
88 |                 if l > 0:
89 |                     new_lp = lp_vec[eos] + penalty * (len(out)+1)
90 |                     new_st = self.decoder.update(st,eos)
91 |                     comp_hyplist.append((out, new_lp))
92 |                     if best_state is None or best_state[0] < new_lp:
93 |                         best_state = (new_lp, new_st)
94 | 
95 |                 for o in np.argsort(lp_vec)[::-1]:
96 |                     if o == unk or o == eos:  # exclude <unk> and <eos>
97 |                         continue
98 |                     new_lp = lp_vec[o]
99 |                     if len(new_hyplist) == beam:
100 |                         if new_hyplist[argmin][1] < new_lp:
101 |                             new_st = self.decoder.update(st, o)
102 |                             new_hyplist[argmin] = (out+[o], new_lp, new_st)
103 |                             argmin = min(enumerate(new_hyplist), key=lambda h:h[1][1])[0]
104 |                         else:
105 |                             break
106 |                     else:
107 |                         new_st = self.decoder.update(st, o)
108 |                         new_hyplist.append((out+[o], new_lp, new_st))
109 |                         if len(new_hyplist) == beam:
110 |                             argmin = min(enumerate(new_hyplist), key=lambda h:h[1][1])[0]
111 | 
112 |             hyplist = new_hyplist
113 | 
114 |         if len(comp_hyplist) > 0:
115 |             maxhyps = sorted(comp_hyplist, key=lambda h:-h[1])[:nbest]
116 |             return maxhyps, best_state[1]
117 |         else:
118 |             return [([],0)],None
119 | 
120 | 
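A hedged sketch of calling `generate`; `model` and `vocab` are assumed to exist, and the `<sos>`/`<eos>` spellings below are assumptions about the vocabulary rather than something this module fixes:

```python
# Hypothetical call to Sequence2SequenceModel.generate (names are illustrative).
import numpy as np
import chainer

# `model` is a trained Sequence2SequenceModel and `vocab` maps words to ids;
# the special-token spellings are assumptions, not fixed by this module.
x = np.array([vocab['hello'], vocab['!']], dtype=np.int32)  # one user turn
with chainer.using_config('train', False):
    nbest, dstate = model.generate(None, x, sos=vocab['<sos>'],
                                   eos=vocab['<eos>'], beam=5, nbest=1)
best_word_ids, score = nbest[0]  # dstate can seed the next turn's encoder
```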
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/tqdm_logging.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """Logging extension with tqdm
3 | 
4 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
5 | 
6 | This software is released under the MIT License.
7 | http://opensource.org/licenses/mit-license.php
8 | 
9 | """
10 | 
11 | import sys
12 | import logging
13 | from tqdm import tqdm
14 | 
15 | # tqdm logging handler
16 | class TqdmLoggingHandler(logging.Handler):
17 | 
18 |     def __init__ (self, level=logging.NOTSET):
19 |         super (self.__class__, self).__init__ (level)
20 | 
21 |     def emit (self, record):
22 |         try:
23 |             msg = self.format (record)
24 |             tqdm.write ('\r' + msg, file=sys.stdout)
25 |             self.flush ()
26 |         except (KeyboardInterrupt, SystemExit):
27 |             raise
28 |         except:
29 |             self.handleError(record)
30 | 
31 | # logger configuration
32 | def config(logger, logfile='', mode='w', silent=False, debug=False,
33 |            format="%(asctime)s %(levelname)s %(message)s"):
34 | 
35 |     #logger = logging.getLogger("root")
36 | 
37 |     if silent:
38 |         level = logging.WARN
39 |     else:
40 |         level = logging.INFO
41 | 
42 |     stdhandler = TqdmLoggingHandler(level=level)
43 |     stdhandler.setFormatter(logging.Formatter(format))
44 |     logger.addHandler(stdhandler)
45 | 
46 |     if logfile:
47 |         filehandler = logging.FileHandler(logfile, mode=mode)
48 |         filehandler.setFormatter(logging.Formatter(format))
49 |         logger.addHandler(filehandler)
50 | 
51 |     if debug:
52 |         logger.setLevel(logging.DEBUG)
53 |     else:
54 |         logger.setLevel(logging.INFO)
55 | 
56 |     return logger
57 | 
58 | 
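A small usage sketch (assuming the same root logger name that the training script below uses): records are emitted through `tqdm.write`, so log lines do not break an active progress bar:

```python
# Usage sketch for tqdm_logging.config (illustrative, not part of the package).
import logging
import tqdm_logging
from tqdm import tqdm

logger = tqdm_logging.config(logging.getLogger("root"), logfile='example.log')
for i in tqdm(range(100)):
    if i % 50 == 0:
        logger.info('step %d' % i)  # routed via tqdm.write; the bar stays intact
```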
--------------------------------------------------------------------------------
/ChatbotBaseline/tools/train_conversation_model.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """Train an encoder-decoder model for neural conversation
4 | 
5 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
6 | 
7 | This software is released under the MIT License.
8 | http://opensource.org/licenses/mit-license.php
9 | 
10 | """
11 | 
12 | import argparse
13 | import math
14 | import sys
15 | import time
16 | import random
17 | import os
18 | import copy
19 | 
20 | import numpy as np
21 | import six
22 | 
23 | import chainer
24 | from chainer import cuda
25 | from chainer import optimizers
26 | import chainer.functions as F
27 | import pickle
28 | import logging
29 | import tqdm_logging
30 | from tqdm import tqdm
31 | 
32 | from lstm_encoder import LSTMEncoder
33 | from lstm_decoder import LSTMDecoder
34 | from seq2seq_model import Sequence2SequenceModel
35 | 
36 | import dialog_corpus
37 | 
38 | # use the root logger
39 | logger = logging.getLogger("root")
40 | 
41 | # training status and report
42 | class Status:
43 |     def __init__(self, interval, progress_bar=True):
44 |         self.interval = interval
45 |         self.loss = 0.
46 |         self.cur_at = time.time()
47 |         self.nsamples = 0
48 |         self.count = 0
49 |         self.progress_bar = progress_bar
50 |         self.min_validate_ppl = 1.0e+10
51 |         self.bestmodel_num = 1
52 |         self.epoch = 1
53 | 
54 |     def update(self, loss, nsamples):
55 |         self.loss += loss
56 |         self.nsamples += nsamples
57 |         self.count += 1
58 |         if self.count % self.interval == 0:
59 |             now = time.time()
60 |             throughput = self.interval / (now - self.cur_at)
61 |             perp = math.exp(self.loss / self.nsamples)
62 |             logger.info('iter %d training perplexity: %.2f (%.2f iters/sec)'
63 |                         % (self.count, perp, throughput))
64 |             self.loss = 0.
65 |             self.nsamples = 0
66 |             self.cur_at = now
67 | 
68 |     def new_epoch(self, validate_time=0):
69 |         self.epoch += 1
70 |         self.cur_at += validate_time  # skip validation and model I/O time
71 | 
72 | 
73 | # Training routine
74 | def train_step(model, optimizer, dataset, batchset, status, xp):
75 |     chainer.config.train = True
76 |     train_loss = 0.
77 |     train_nsamples = 0
78 | 
79 |     num_interacts = sum([len(dataset[idx[0]]) for idx in batchset])
80 |     if status.progress_bar:
81 |         progress = tqdm(total=num_interacts)
82 |         progress.set_description("Epoch %d" % status.epoch)
83 | 
84 |     for i in six.moves.range(len(batchset)):
85 |         ds = None
86 |         for j in six.moves.range(len(dataset[batchset[i][0]])):
87 |             # prepare input, output, and target
88 |             x = [ chainer.Variable(xp.asarray(dataset[k][j][0])) for k in batchset[i] ]
89 |             y = [ chainer.Variable(xp.asarray(dataset[k][j][1][:-1])) for k in batchset[i] ]
90 |             t = chainer.Variable(xp.asarray(np.concatenate( [dataset[k][j][1][1:]
91 |                                                              for k in batchset[i]] )))
92 |             # compute training loss (carry the decoder state over turns, as in validate_step)
93 |             es,ds,loss = model.loss(ds,x,y,t)
94 |             train_loss += loss.data * len(t.data)
95 |             train_nsamples += len(t.data)
96 |             status.update(loss.data * len(t.data), len(t.data))
97 |             # backprop
98 |             model.cleargrads()
99 |             loss.backward()
100 |             loss.unchain_backward()  # truncate
101 |             # update
102 |             optimizer.update()
103 |             if status.progress_bar:
104 |                 progress.update(1)
105 | 
106 |     if status.progress_bar:
107 |         progress.close()
108 | 
109 |     return math.exp(train_loss/train_nsamples)
110 | 
111 | 
112 | # Validation routine
113 | def validate_step(model, dataset, batchset, status, xp):
114 |     chainer.config.train = False
115 |     validate_loss = 0.
116 |     validate_nsamples = 0
117 |     num_interacts = sum([len(dataset[idx[0]]) for idx in batchset])
118 |     if status.progress_bar:
119 |         progress = tqdm(total=num_interacts)
120 |         progress.set_description("Epoch %d" % status.epoch)
121 | 
122 |     for i in six.moves.range(len(batchset)):
123 |         ds = None
124 |         for j in six.moves.range(len(dataset[batchset[i][0]])):
125 |             # prepare input, output, and target
126 |             x = [ chainer.Variable(xp.asarray(dataset[k][j][0])) for k in batchset[i] ]
127 |             y = [ chainer.Variable(xp.asarray(dataset[k][j][1][:-1]))
128 |                   for k in batchset[i] ]
129 |             t = chainer.Variable(xp.asarray(np.concatenate( [dataset[k][j][1][1:]
130 |                                                              for k in batchset[i]] )))
131 |             # compute validation loss
132 |             es,ds,loss = model.loss(ds, x, y, t)
133 | 
134 |             # accumulate
135 |             validate_loss += loss.data * len(t.data)
136 |             validate_nsamples += len(t.data)
137 |             if status.progress_bar:
138 |                 progress.update(1)
139 | 
140 |     if status.progress_bar:
141 |         progress.close()
142 | 
143 |     return math.exp(validate_loss/validate_nsamples)
144 | 
145 | 
146 | ##################################
147 | # main
148 | def main():
149 |     parser = argparse.ArgumentParser()
150 |     # logging
151 |     parser.add_argument('--logfile', '-l', default='', type=str,
152 |                         help='write log data into a file')
153 |     parser.add_argument('--debug', '-d', action='store_true',
154 |                         help='run in debug mode')
155 |     parser.add_argument('--silent', '-s', action='store_true',
156 |                         help='run in silent mode')
157 |     parser.add_argument('--no-progress-bar', action='store_true',
158 |                         help='hide progress bar')
159 |     # train and validate data
160 |     parser.add_argument('--train', default='train.txt', type=str,
161 |                         help='set filename of training data')
162 |     parser.add_argument('--validate', default='dev.txt', type=str,
163 |                         help='set filename of validation data')
164 | 
parser.add_argument('--vocab-size', '-V', default=0, type=int, 165 | help='set vocabulary size (0 means no limitation)') 166 | parser.add_argument('--target-speaker', '-T', default='S', 167 | help='set target speaker name to be learned for system output') 168 | # file settings 169 | parser.add_argument('--initial-model', '-i', 170 | help='start training from an initial model') 171 | parser.add_argument('--model', '-m', required=True, 172 | help='set prefix of output model files') 173 | parser.add_argument('--resume', action='store_true', 174 | help='resume training from a previously saved snapshot') 175 | parser.add_argument('--snapshot', type=str, 176 | help='dump a snapshot to a file after each epoch') 177 | # Model structure 178 | parser.add_argument('--enc-layer', default=2, type=int, 179 | help='number of encoder layers') 180 | parser.add_argument('--enc-esize', default=100, type=int, 181 | help='number of encoder input-embedding units') 182 | parser.add_argument('--enc-hsize', default=512, type=int, 183 | help='number of encoder hidden units') 184 | 185 | parser.add_argument('--dec-layer', default=2, type=int, 186 | help='number of decoder layers') 187 | parser.add_argument('--dec-esize', default=100, type=int, 188 | help='number of decoder input-embedding units') 189 | parser.add_argument('--dec-hsize', default=512, type=int, 190 | help='number of decoder hidden units') 191 | parser.add_argument('--dec-psize', default=100, type=int, 192 | help='number of decoder pre-output projection units') 193 | # training conditions 194 | parser.add_argument('--optimizer', default='Adam', type=str, 195 | help="set optimizer (SGD, Adam, AdaDelta, RMSprop, ...)") 196 | parser.add_argument('--L2-weight', default=0.0, type=float, 197 | help="set weight for L2-regularization term") 198 | parser.add_argument('--clip-grads', default=5., type=float, 199 | help="set gradient clipping threshold") 200 | parser.add_argument('--dropout-rate', default=0.5, type=float, 201 | help="set dropout rate in training") 202 | parser.add_argument('--num-epochs', '-e', default=20, type=int, 203 | help='number of epochs to be trained') 204 | parser.add_argument('--learn-rate', '-R', default=1.0, type=float, 205 | help='set initial learning rate for SGD') 206 | parser.add_argument('--learn-decay', default=1.0, type=float, 207 | help='set decaying ratio of learning rate or epsilon') 208 | parser.add_argument('--lower-bound', default=1e-16, type=float, 209 | help='set threshold of learning rate or epsilon for early stopping') 210 | parser.add_argument('--batch-size', '-b', default=50, type=int, 211 | help='set batch size for training and validation') 212 | parser.add_argument('--max-batch-length', default=20, type=int, 213 | help='set maximum sequence length to control batch size') 214 | parser.add_argument('--seed', default=99, type=int, 215 | help='set a seed for random numbers') 216 | # select a GPU device 217 | parser.add_argument('--gpu', '-g', default=0, type=int, 218 | help='GPU ID (negative value indicates CPU)') 219 | 220 | args = parser.parse_args() 221 | 222 | # flush stdout 223 | if six.PY2: 224 | sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) 225 | # set up the logger 226 | tqdm_logging.config(logger, args.logfile, mode=('a' if args.resume else 'w'), 227 | silent=args.silent, debug=args.debug) 228 | # gpu setup 229 | if args.gpu >= 0: 230 | cuda.check_cuda_available() 231 | cuda.get_device(args.gpu).use() 232 | xp = cuda.cupy 233 | xp.random.seed(args.seed) 234 | else: 235 | xp = np 236 | 237 | # randomize 
238 |     np.random.seed(args.seed)
239 |     random.seed(args.seed)
240 | 
241 |     logger.info('----------------------------------')
242 |     logger.info('Train a neural conversation model')
243 |     logger.info('----------------------------------')
244 |     if args.resume:
245 |         if not args.snapshot:
246 |             logger.error('snapshot file is not specified.')
247 |             sys.exit()
248 | 
249 |         with open(args.snapshot, 'rb') as f:
250 |             vocab, optimizer, status, args = pickle.load(f)
251 |         logger.info('Resume training from epoch %d' % status.epoch)
252 |         logger.info('Args ' + str(args))
253 |         model = optimizer.target
254 |     else:
255 |         logger.info('Args ' + str(args))
256 |         # Prepare RNN model and load data
257 |         if args.initial_model:
258 |             logger.info('Loading a model from ' + args.initial_model)
259 |             with open(args.initial_model, 'rb') as f:
260 |                 vocab, model, tmp_args = pickle.load(f)
261 |             # the training status is created below, so no timer reset is needed here
262 |         else:
263 |             logger.info('Making vocabulary from ' + args.train)
264 |             vocab = dialog_corpus.get_vocabulary(args.train, vocabsize=args.vocab_size)
265 |             model = Sequence2SequenceModel(
266 |                 LSTMEncoder(args.enc_layer, len(vocab), args.enc_hsize,
267 |                             args.enc_esize, dropout=args.dropout_rate),
268 |                 LSTMDecoder(args.dec_layer, len(vocab), len(vocab),
269 |                             args.dec_esize, args.dec_hsize, args.dec_psize,
270 |                             dropout=args.dropout_rate))
271 |         # Setup optimizer
272 |         optimizer = vars(optimizers)[args.optimizer]()
273 |         if args.optimizer == 'SGD':
274 |             optimizer.lr = args.learn_rate
275 |         optimizer.use_cleargrads()
276 |         optimizer.setup(model)
277 |         optimizer.add_hook(chainer.optimizer.GradientClipping(args.clip_grads))
278 |         if args.L2_weight > 0.:
279 |             optimizer.add_hook(chainer.optimizer.WeightDecay(args.L2_weight))
280 |         status = None
281 | 
282 |     logger.info('Loading text data from ' + args.train)
283 |     train_set = dialog_corpus.load(args.train, vocab, args.target_speaker)
284 |     logger.info('Loading validation data from ' + args.validate)
285 |     validate_set = dialog_corpus.load(args.validate, vocab, args.target_speaker)
286 |     logger.info('Making mini batches')
287 |     train_batchset = dialog_corpus.make_minibatches(train_set, batchsize=args.batch_size, max_length=args.max_batch_length)
288 |     validate_batchset = dialog_corpus.make_minibatches(validate_set, batchsize=args.batch_size, max_length=args.max_batch_length)
289 |     # report data summary
290 |     logger.info('vocabulary size = %d' % len(vocab))
291 |     logger.info('#train sample = %d  #mini-batch = %d' % (len(train_set), len(train_batchset)))
292 |     logger.info('#validate sample = %d  #mini-batch = %d' % (len(validate_set), len(validate_batchset)))
293 |     random.shuffle(train_batchset, random.random)
294 | 
295 |     # initialize status parameters
296 |     if status is None:
297 |         status = Status(max(round(len(train_batchset),-3)/50,500),
298 |                         progress_bar=not args.no_progress_bar)
299 |     else:
300 |         status.progress_bar = not args.no_progress_bar
301 | 
302 |     # move model to gpu
303 |     if args.gpu >= 0:
304 |         model.to_gpu()
305 | 
306 |     while status.epoch <= args.num_epochs:
307 |         logger.info('---------------------training--------------------------')
308 |         if args.optimizer == 'SGD':
309 |             logger.info('Epoch %d/%d : SGD learning rate = %g' % (status.epoch, args.num_epochs, optimizer.lr))
310 |         else:
311 |             logger.info('Epoch %d/%d : %s eps = %g' % (status.epoch, args.num_epochs, args.optimizer, optimizer.eps))
312 |         train_ppl = train_step(model, optimizer, train_set, train_batchset, status, xp)
313 |         logger.info("epoch %d training perplexity: %f" % (status.epoch, train_ppl))
314 |         # write the model params
315 |         modelfile = args.model + '.' + str(status.epoch)
316 |         logger.info('writing model params to ' + modelfile)
317 |         model.to_cpu()
318 |         with open(modelfile, 'wb') as f:
319 |             pickle.dump((vocab, model, args), f, -1)
320 |         if args.gpu >= 0:
321 |             model.to_gpu()
322 | 
323 |         # start validation step
324 |         logger.info('---------------------validation------------------------')
325 |         start_at = time.time()
326 |         validate_ppl = validate_step(model, validate_set, validate_batchset, status, xp)
327 |         logger.info('epoch %d validation perplexity: %.4f' % (status.epoch, validate_ppl))
328 |         # update best model with the minimum perplexity
329 |         if status.min_validate_ppl >= validate_ppl:
330 |             status.bestmodel_num = status.epoch
331 |             logger.info('validation perplexity reduced: %.4f -> %.4f' % (status.min_validate_ppl, validate_ppl))
332 |             status.min_validate_ppl = validate_ppl
333 | 
334 |         elif args.optimizer == 'SGD':
335 |             modelfile = args.model + '.' + str(status.bestmodel_num)
336 |             logger.info('reloading model params from ' + modelfile)
337 |             with open(modelfile, 'rb') as f:
338 |                 vocab, model, tmp_args = pickle.load(f)
339 |             if args.gpu >= 0:
340 |                 model.to_gpu()
341 |             optimizer.lr *= args.learn_decay
342 |             if optimizer.lr < args.lower_bound:
343 |                 break
344 |             optimizer.setup(model)
345 |         else:
346 |             optimizer.eps *= args.learn_decay
347 |             if optimizer.eps < args.lower_bound:
348 |                 break
349 | 
350 |         status.new_epoch(validate_time = time.time() - start_at)
351 |         # dump snapshot
352 |         if args.snapshot:
353 |             logger.info('writing snapshot to ' + args.snapshot)
354 |             model.to_cpu()
355 |             with open(args.snapshot, 'wb') as f:
356 |                 pickle.dump((vocab, optimizer, status, args), f, -1)
357 |             if args.gpu >= 0:
358 |                 model.to_gpu()
359 | 
360 |     logger.info('----------------')
361 |     # make a symbolic link to the best model
362 |     logger.info('the best model is %s.%d.' % (args.model,status.bestmodel_num))
363 |     logger.info('a symbolic link is made as ' + args.model+'.best')
364 |     if os.path.exists(args.model+'.best'):
365 |         os.remove(args.model+'.best')
366 |     os.symlink(os.path.basename(args.model+'.'+str(status.bestmodel_num)),
367 |                args.model+'.best')
368 |     logger.info('done')
369 | 
370 | if __name__ == "__main__":
371 |     main()
372 | 
373 | 
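Because the loop above saves each epoch with `pickle.dump((vocab, model, args), f, -1)`, a stored model file can also be reloaded outside of this script; a minimal sketch (the file name is illustrative, and it must be run where `lstm_encoder.py` and the other model modules are importable so unpickling can find the classes):

```python
# Sketch: reloading a model file written by the training loop above.
import pickle

with open('conversation_model.best', 'rb') as f:  # illustrative path
    vocab, model, train_args = pickle.load(f)     # same tuple order as pickle.dump
print('vocabulary size = %d' % len(vocab))
```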
--------------------------------------------------------------------------------
/ChatbotBaseline/utils/get_available_gpu_id.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | nvidia-smi -q -d MEMORY \
3 | |grep -A 2 "FB Memory"|grep Used \
4 | |awk 'BEGIN{n=0;m=99999;a=0;}{if ($3&2;
53 |     else printf "$help_message\n" 1>&2 ; fi;
54 |     exit 0 ;;
55 |   --*=*) echo "$0: options to scripts must be of the form --name value, got '$1'"
56 |     exit 1 ;;
57 |   # If the first command-line argument begins with "--" (e.g. --foo-bar),
58 |   # then work out the variable name as $name, which will equal "foo_bar".
59 |   --*) name=`echo "$1" | sed s/^--// | sed s/-/_/g`;
60 |     # Next we test whether the variable in question is undefined-- if so it's
61 |     # an invalid option and we die.  Note: $0 evaluates to the name of the
62 |     # enclosing script.
63 |     # The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar
64 |     # is undefined.  We then have to wrap this test inside "eval" because
65 |     # foo_bar is itself inside a variable ($name).
66 |     eval '[ -z "${'$name'+xxx}" ]' && echo "$0: invalid option $1" 1>&2 && exit 1;
67 | 
68 |     oldval="`eval echo \\$$name`";
69 |     # Work out whether we seem to be expecting a Boolean argument.
70 |     if [ "$oldval" == "true" ] || [ "$oldval" == "false" ]; then
71 |       was_bool=true;
72 |     else
73 |       was_bool=false;
74 |     fi
75 | 
76 |     # Set the variable to the right value-- the escaped quotes make it work if
77 |     # the option had spaces, like --cmd "queue.pl -sync y"
78 |     eval $name=\"$2\";
79 | 
80 |     # Check that Boolean-valued arguments are really Boolean.
81 |     if $was_bool && [[ "$2" != "true" && "$2" != "false" ]]; then
82 |       echo "$0: expected \"true\" or \"false\": $1 $2" 1>&2
83 |       exit 1;
84 |     fi
85 |     shift 2;
86 |     ;;
87 |   *) break;
88 | esac
89 | done
90 | 
91 | 
92 | # Check for an empty argument to the --cmd option, which can easily occur as a
93 | # result of scripting errors.
94 | [ ! -z "${cmd+xxx}" ] && [ -z "$cmd" ] && echo "$0: empty argument to --cmd option" 1>&2 && exit 1;
95 | 
96 | 
97 | true; # so this script returns exit code 0.
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2017 dialogtekgeek
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DSTC6: End-to-End Conversation Modeling Track
2 | 
3 | # Registration
4 | Please register:
5 | https://goo.gl/forms/Fxy061gHuSOZGC1i2
6 | 
7 | # News
8 | - Evaluation analysis package: Jan 19 2018
9 | 
10 |   The package includes all references generated by 11 humans, hypotheses of 20 systems, and evaluation results
11 |   in the DSTC6 end-to-end conversation modeling track.
12 |   https://www.dropbox.com/s/oh1trbos0tjzn7t/dstc6_t2_evaluation.tgz
13 | - Download the official training data: Sep 7-18 2017
14 | - Test data distribution: Sep 25 2017
15 | - Submission: Oct 8 2017
16 | 
17 | ![Easy 3 Step Data Collection](https://github.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling/issues/5 "Easy 3 Step data collection")
18 | 
19 | 
20 | # Track Description
21 | 1. Main task (mandatory): Customer service dialog using Twitter
22 | 
23 | (*) Tools to download the twitter data and transform it into the dialog format are provided.
24 | 
25 | 
26 | Task A: Full or part of the training data will be used to train conversation models.
27 | 
28 | Task B: Any open data, e.g. from the web, may be used as external knowledge to generate informative sentences, but they should not overlap with the training, validation and test data provided by the organizers.
29 | 
30 | 2. Pilot task: Movie scenario dialog using OpenSubtitles
31 | 
32 | 
33 | * Please cite the following paper if you publish results using this setup:
34 | 
35 |   https://arxiv.org/pdf/1706.07440.pdf
36 | 
37 | ```
38 | @article{DSTC6_End-to-End_Conversation_Modeling,
39 |   Author = {Chiori Hori and Takaaki Hori},
40 |   Title = {End-to-end Conversation Modeling Track in DSTC6},
41 |   Journal = {arXiv:1706.07440},
42 |   Year = {2017}
43 | }
44 | ```
45 | 
46 | # Necessary steps
47 | 
48 | ## Preparation
49 | Most tools are written in python and were tested on python 2.7.6+ and 3.4.1+;
50 | some bash scripts are also used to execute those tools.
51 | 
52 | For data preparation, you will need additional python modules as follows:
53 | 
54 | * six
55 | * tqdm
56 | * nltk
57 | 
58 | which can be installed by
59 | ```
60 | pip install <module>
61 | ```
62 | or
63 | ```
64 | pip install <module> -t <some-directory>
65 | ```
66 | where `<some-directory>` is a directory storing python modules and needs to be accessible from python,
67 | e.g. by including it in the PYTHONPATH environment variable.
68 | 
69 | If you try the baseline system, you will need Chainer, a deep learning toolkit,
70 | to perform training and evaluation of neural conversation models.
71 | Please follow the instructions in `ChatbotBaseline/README.md`.
72 | 
73 | ## Twitter task
74 | 
75 | 1. prepare the data set using the `collect_twitter_dialogs` scripts.
76 | 
77 |    ```
78 |    $ cd collect_twitter_dialogs
79 |    $ collect.sh
80 |    ```
81 |    (a twitter account and access keys are necessary to run the script; follow the instructions in `collect_twitter_dialogs/README.md`)
82 | 
83 | 2. extract training, development and test sets from stored twitter dialog data
84 | 
85 |    ```
86 |    $ cd ../tasks/twitter
87 |    $ make_trial_data.sh
88 |    ```
89 | 
90 |    Note: the extracted data are trial data at this moment.
91 | 
92 | 3. run the baseline system (optional)
93 | 
94 |    ```
95 |    $ cd ../../ChatbotBaseline/egs/twitter
96 |    $ run.sh
97 |    ```
98 | 
99 |    (see `ChatbotBaseline/README.md`)
100 | 
101 | ## OpenSubtitles task
102 | 
103 | 1. download the OpenSubtitles2016 data
104 | 
105 |    ```
106 |    $ cd tasks/opensubs
107 |    $ wget http://opus.lingfil.uu.se/download.php?f=OpenSubtitles2016/en.tar.gz
108 |    $ tar zxvf en.tar.gz
109 |    ```
110 | 
111 | 2. extract training, development and test sets from stored subtitle data
112 | 
113 |    ```
114 |    $ make_trial_data.sh
115 |    ```
116 |    Note: the extracted data are trial data at this moment.
117 | 
118 | 3. run the baseline system (optional)
119 | 
120 |    ```
121 |    $ cd ../../ChatbotBaseline/egs/opensubs
122 |    $ run.sh
123 |    ```
124 | 
125 |    (see `ChatbotBaseline/README.md`)
126 | 
127 | ## Directories and files
128 | * README.md : this file
129 | * tasks : data preparation for each subtask
130 | * collect_twitter_dialogs : scripts to collect twitter data
131 | * ChatbotBaseline : a neural conversation model baseline system
132 | 
133 | ## Contact Information
134 | 
135 | You can get the latest updates and participate in discussions on the DSTC mailing list.
136 | 
137 | To join the mailing list, send an email to: (listserv@lists.research.microsoft.com) putting "subscribe DSTC" in the body of the message (without the quotes).
To post a message, send your message to: (dstc@lists.research.microsoft.com).
138 | 
139 | 
--------------------------------------------------------------------------------
/collect_twitter_dialogs/README.md:
--------------------------------------------------------------------------------
1 | # Scripts to acquire twitter dialogs via REST API 1.1.
2 | 
3 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
4 | 
5 | This software is released under the MIT License.
6 | http://opensource.org/licenses/mit-license.php
7 | 
8 | ## Requirements:
9 | 
10 | * 64bit linux/macOSX/windows platform
11 | 
12 | * python 2.7.9+, 3.4+
13 | 
14 |   note: With python 2.7.8 or lower, an InsecureRequestWarning may appear
15 |   when you run the script. To suppress this warning, you can
16 |   downgrade the requests package by
17 | 
18 |   ```
19 |   $ pip install requests==2.5.3
20 |   ```
21 | 
22 |   or
23 | 
24 |   ```
25 |   $ pip install requests==2.5.3 -t <some-directory>
26 |   ```
27 | 
28 |   where we assume that the module has been installed in `<some-directory>`
29 | 
30 | ## Preparation
31 | 
32 | 1. create a twitter account if you don't have one.
33 | 
34 |    you can get one via the twitter web site.
35 | 
36 | 2. create your application account via the Twitter
37 | 
38 |    Developer's Site;
39 | 
40 |    see the Developer's Site documentation
41 | 
42 |    for reference, and keep the following keys
43 | 
44 |    * Consumer Key
45 |    * Consumer Secret
46 |    * Access Token
47 |    * Access Token Secret
48 | 
49 | 3. edit ./config.ini to set your access keys in the config file
50 | 
51 |    * ConsumerKey
52 |    * ConsumerSecret
53 |    * AccessToken
54 |    * AccessTokenSecret
55 | 
56 | 4. install python modules: six and requests-oauthlib
57 | 
58 |    you can install them in the system area by
59 | 
60 |    ```
61 |    $ pip install six
62 |    $ pip install requests-oauthlib
63 |    ```
64 | 
65 |    We recommend using virtualenv or some other virtual environment to handle python modules.
66 |    Otherwise, you can specify the directory to install python modules as
67 | 
68 |    ```
69 |    $ pip install <module> -t <some-directory>
70 |    ```
71 |    In this case, `<some-directory>` must be included in the `PYTHONPATH` environment
72 |    variable to use the modules.
73 | 
74 | ## How to use:
75 | 
76 | 1. Execute the following command to test your setup
77 | 
78 |    ```
79 |    $ collect_twitter_dialogs.py AmazonHelp
80 |    ```
81 | 
82 |    and you will obtain a file `AmazonHelp.json`, which contains
83 |    dialogs from the AmazonHelp twitter site.
84 | 
85 |    since `AmazonHelp.json` is raw data, you can see the dialog text by
86 | 
87 |    ```
88 |    $ view_dialogs.py AmazonHelp.json
89 |    ```
90 | 
91 | 2. Use `collect.sh` to acquire a large amount of data using an account list
92 | 
93 |    ```
94 |    $ collect.sh
95 |    ```
96 | 
97 |    You will find the collected data in `./stored_data`
98 | 
99 |    ```
100 |    $ ls stored_data
101 |    AmazonHelp.json AskTSA.json ...
102 |    ```
103 | 
104 |    To acquire a large amount of data, it is better to run this script
105 |    once a day, because the amount of data we can download is limited
106 |    and older tweets cannot be accessed as time goes by.
107 | 
108 |    On the first run, it will take several hours to collect all available
109 |    data from the listed accounts, but from the second run on, the time will
110 |    become much shorter because the script downloads only tweets newer than
111 |    those already collected.
112 | 
113 |    Note: the script sometimes reports API errors, but you don't have
114 |    to worry; most errors come from the access rate limit imposed by the server.
115 |    Even if the script stops accidentally, there is no problem.
116 |    Just re-run the script.
117 | 
118 | 3. 
Use `official_collect.sh` to acquire official data for DSTC6 End-to-End Conversation Modeling track 119 | 120 | ``` 121 | $ official_collect.sh 122 | ``` 123 | 124 | Each challenge participant must run the script by at least 9/8/2017 GMT 24AM (Midnight) 125 | and do it once a day until 9/18/2017. 126 | This can be done easily by the following command: 127 | 128 | ``` 129 | $ watch -n 86400 official_collect.sh 130 | ``` 131 | 132 | (The `watch` command will run `official_collect.sh` every 24 hours) 133 | 134 | You will find the collected data in `./official_stored_data` 135 | 136 | ``` 137 | $ ls official_stored_data 138 | 1800flowershelp.json 1800flowers.json 1800PetMeds.json 1DFAQ.json 139 | 1Sale.json ... 140 | ``` 141 | 142 | A script to extract training, development, and test sets will be provided around 9/18/2017. 143 | -------------------------------------------------------------------------------- /collect_twitter_dialogs/account_names_for_dstc6.txt: -------------------------------------------------------------------------------- 1 | 1800flowers 2 | 1800flowershelp 3 | 64audio 4 | 888sport 5 | abbeygroup 6 | abbylynne19 7 | ABCustomerCare 8 | abellio_surrey 9 | AbercrombieHelp 10 | ABTAtravel 11 | acemtp 12 | ack 13 | acmemarkets 14 | acnestudios 15 | adamholz 16 | AddisonLeeCabs 17 | adidasGhana 18 | adidasUK 19 | AdobeCare 20 | adrianflux 21 | adrianswinscoe 22 | advocare 23 | AetnaHelp 24 | AFC_Amy 25 | AFCustomerCare 26 | AflacPhyllis 27 | AirAsia 28 | AirChinaNA 29 | airnzuk 30 | AIRNZUSA 31 | airtel_care 32 | AirtelNigeria 33 | alabamapower 34 | Alamo 35 | alamocares 36 | AlaskaAir 37 | Albertsons 38 | ALCATEL1TOUCH 39 | AlderTraining 40 | Aldi_Ireland 41 | AldiUK 42 | AlfaRomeoCareUK 43 | AliExpress_EN 44 | AlissaDosSantos 45 | Allegiant 46 | AllianzTravelUS 47 | Allstate 48 | AllyCare 49 | alpharooms 50 | AlshayaHelpDesk 51 | alwaysriding 52 | AmanaCare 53 | AmazonHelp 54 | AmericanAir 55 | americangiant 56 | AmexAU 57 | andertonsmusic 58 | AndreaSWilson 59 | andrew_heister 60 | AnglianHelp 61 | AnkiSupport 62 | annabelkarmel 63 | AnthemBC_News 64 | APCcustserv 65 | AppleSupport 66 | AppleyardLondon 67 | ArqivaWiFi 68 | asbcares 69 | AsdaCustCare 70 | ASHelpMe 71 | Ask123ie 72 | AskAmex 73 | AskCapitalOne 74 | AskCiti 75 | AskDyson 76 | AskeBay 77 | AskEmblemHealth 78 | askevanscycles 79 | AskKBCIreland 80 | AskLandsEnd 81 | AskMarstons 82 | AskMeHelpDesk 83 | AskMTNGhana 84 | askpermanenttsb 85 | ask_progressive 86 | AskPS_UK 87 | AskSmythsToys 88 | Ask_Spectrum 89 | AskSubaruCanada 90 | asksurfline 91 | AskTarget 92 | AskTeamUA 93 | askvisa 94 | Askvodafonegh 95 | AsmodeeNA 96 | ASOS 97 | ASOS_Au 98 | ASOS_Us 99 | astros 100 | ASUHelpCenter 101 | AsurionCares 102 | ASUSUSA 103 | atmosenergy 104 | Atomos_News 105 | atomtickets 106 | ATTCares 107 | audiblesupport 108 | audibleuk 109 | audiireland 110 | AudiUK 111 | AudiUKCare 112 | austin_reed 113 | AutodeskHelp 114 | AvidSupport 115 | Avis 116 | AVIVAIRELAND 117 | Ayres_Hotels 118 | BabiesRUs 119 | BabyJogger 120 | BananaRepublic 121 | BandQ 122 | BarclaycardNews 123 | BarclaysUK 124 | Baxter_Auto 125 | BCBSLA 126 | BCC_Help 127 | BCSSupport 128 | BeautyBarDotCom 129 | beautybay 130 | Beaverbrooks 131 | BedBathBeyond 132 | belk 133 | BenefitUK 134 | BernardBoutique 135 | bestdealtv 136 | bexdeep 137 | beyondthedesk 138 | BiffaService 139 | bigmikeyvegas 140 | BikeMetro 141 | BitcoinHelpDesk 142 | BJsWholesale 143 | blackanddecker 144 | BlackBerryHelp 145 | Blanchford 146 | Blendtec 147 | BluestarHQ 148 | BN_care 149 | 
Boars_Head 150 | bobbyrayburns 151 | boohoo_cshelp 152 | BookChicClub 153 | bookdepository 154 | booksamillion 155 | BOOKSetc_online 156 | BoostCare 157 | BootsUK 158 | BordGaisEnergy 159 | BoseService 160 | BostonMarki 161 | BounceEnergy 162 | BP_plc 163 | BP_UK 164 | bravern 165 | BRCustServ 166 | BritishGas 167 | BritishGasNews 168 | brooksrunning 169 | bryanz85 170 | BSSHelpDesk 171 | Budget 172 | Buick 173 | BuildaHelpDesk 174 | builds_io 175 | BupaUK 176 | BurgerKing 177 | Burton_Menswear 178 | CableBill 179 | cableONE 180 | CadburyIreland 181 | CadburyUK 182 | cafepress 183 | CallawayGolfCS 184 | CalMac_Updates 185 | calorireland 186 | cam4_gay 187 | Camper_CustCare 188 | CanadianPacific 189 | CapMetroATX 190 | CareShopBunzl 191 | CaroKopp 192 | cars_portsmouth 193 | Cartii 194 | CastelliCafe 195 | casualclassics 196 | catalysthousing 197 | CDCustService 198 | cesarkeller 199 | champssports 200 | ChaseSupport 201 | CheapTixHearsU 202 | cheftyler 203 | chevrolet 204 | ChevyCustCare 205 | ChiccoUSA 206 | chokemetoo 207 | chrisfenech1 208 | ChryslerCares 209 | Cignaquestions 210 | CIPD 211 | CiteThisForMe 212 | Citibank 213 | CitiBikeNYC 214 | CitySprint_help 215 | ClaireLBSmith 216 | ClairesEurope 217 | clarkshelp 218 | CMLFerry 219 | CoastCath 220 | CollectorCorps 221 | comcastcares 222 | ComcastILLINOIS 223 | ComcastOrSWWa 224 | comparemkt_care 225 | comparethemkt 226 | consult53 227 | CoopEnergy 228 | Costco 229 | COTABus 230 | CoxHelp 231 | craftsman 232 | CRCustomerChat 233 | CRTContactUs 234 | cstmrsvc 235 | ctcustomercare 236 | CTS___ 237 | cvspharmacy 238 | danaddicott 239 | danandphilshop 240 | dancathy 241 | dan_malarkey 242 | DarkBunnyTees 243 | DawnCarillion 244 | DCBService 245 | debthelpdesk 246 | Deliveroo 247 | Deliveroo_IE 248 | deliverydotcom 249 | Desk 250 | devinfinlay 251 | DevonCC 252 | DEWALTtough 253 | DEWALT_UK 254 | DFSCare 255 | dgingiss 256 | DianaHSmith 257 | DIBsupport 258 | DICKS 259 | digicelbarbados 260 | DIGICELJamaica 261 | DigitalRiverInc 262 | Dillards 263 | dimonet 264 | DIRECTV 265 | directvnow 266 | Discover 267 | Discovery_SA 268 | dish 269 | DivvyBikes 270 | djccwl 271 | dnataSupport 272 | DNBcares 273 | dodgecares 274 | DollarGeneral 275 | DolphinCares 276 | dongmsd 277 | doxiecare 278 | DPDCustomerCare 279 | Dreams_Beds 280 | DressCircleShop 281 | DrinkSparkletts 282 | DSGSupport 283 | DStvNg 284 | DTSatWIT 285 | DunelmUK 286 | durafloruk 287 | EarthLink 288 | easons 289 | easternbank 290 | EatNaturalBars 291 | eBay_UK 292 | edfenergy 293 | edfenergycomms 294 | edfenergycs 295 | eflow_freeflow 296 | eh_custcare 297 | eirNews 298 | elfcosmetics 299 | EllenKeeble 300 | EmbracePetIns 301 | EmersonIT 302 | emikathure 303 | EmiratesSupport 304 | ENMAX 305 | ENMAXenergy 306 | Entergy 307 | EntergyArk 308 | Enterprise 309 | enterprisecares 310 | EPCOR 311 | epicgeargaming 312 | epointsjordan 313 | Equinox_Service 314 | esetcares 315 | etisalat 316 | EvansCycles 317 | eventbritehelp 318 | EversourceCT 319 | EviteSupport 320 | Expedia 321 | express 322 | ExpressHelp 323 | EYellin 324 | FabFitFunCS 325 | FabSupportTeam 326 | Fanatics 327 | Fandom_Insider 328 | farfetch 329 | faux_punk 330 | FCUK 331 | fdarenahelp 332 | FedExCanada 333 | FedExCanadaHelp 334 | FedExEurope 335 | FedExHelpEU 336 | FeelGoodPark 337 | feelunique 338 | FFGames 339 | FH_CustomerCare 340 | fiatcares 341 | FiatCareUK 342 | FiguresToyCo 343 | FinnairHelps 344 | firstdirect 345 | fisherpaykelaus 346 | FLBlueCares 347 | FLLFlyer 348 | Fly_Norwegian 349 | flySAA_US 350 | Fon 351 | 
FonCare 352 | fontspring 353 | FoodLion 354 | Footasylum 355 | FootballIndexUK 356 | Ford 357 | FordSouthAfrica 358 | forduk 359 | Forever21Help 360 | FortisBC 361 | foxrentcar 362 | FrankEliason 363 | FreeviewTV 364 | freshlypicked 365 | FromYouFlowers 366 | FrontierCare 367 | fryselectronics 368 | FTcare 369 | FunjetVacations 370 | FunkoDCLegion 371 | FUT_COINSTORE 372 | gadventures 373 | Gap 374 | GapCA 375 | GarudaCares 376 | GEICO_Service 377 | GenesisHousing 378 | geniusfoods 379 | GeoffRamm 380 | GeorgiaPower 381 | GeoxCares 382 | GETMEIN 383 | getrespond 384 | getsatisfaction 385 | gigaclear 386 | GiltService 387 | glasses_direct 388 | GlideUK 389 | GloCare 390 | GlossyboxUK 391 | GM 392 | Go_CheshireWest 393 | gogreenride 394 | GongshowService 395 | Google 396 | googlemaps 397 | GoSmartNC 398 | GoTriangle 399 | Go_Wireless 400 | GrandHyattNYC 401 | GrandHyattSD 402 | graveshambc 403 | GreatClipsCares 404 | GreyhoundBus 405 | Groove 406 | GRT_ROW 407 | Grubhub_Care 408 | GSMA_Care 409 | Gymshark 410 | Gymshark_Help 411 | HalfordsCycling 412 | handtec 413 | HarrisTeeter 414 | HarrodsService 415 | HarryandDavid 416 | HawaiianAir 417 | hbonow 418 | HDCares 419 | HeatherJStrout 420 | HEB 421 | Heinens 422 | HelloKit 423 | helpscout 424 | HilltopNorfolk 425 | hm 426 | hmaustralia 427 | hmcanada 428 | hm_custserv 429 | HMSHost 430 | hmsouthafrica 431 | hmunitedkingdom 432 | holden_aus 433 | holidayautos 434 | holidaytaxisCS 435 | HollisterCoHelp 436 | HomeDepotCanada 437 | HomesenseUK 438 | Honda 439 | HondaCustSvc 440 | HondaPowersprts 441 | Hootsuite_Help 442 | Hotwire 443 | houseoffraser 444 | HQhair_Help 445 | HRBlockAnswers 446 | hsamueljeweller 447 | HSBC_Sport 448 | HSBC_UAE 449 | HSBC_UK 450 | HSBC_US 451 | hsr 452 | HSScustomercare 453 | HTCHelp 454 | Huawei 455 | HuaweiMobile 456 | HudsonshoesUK 457 | HullUni_ICT 458 | HumanaHelp 459 | HwnElectric 460 | HyattChurchill 461 | Hyken 462 | IcelandFoods 463 | IGearBrand 464 | i_hate_ironing 465 | iiNet 466 | IKEAIESupport 467 | IKEAUKSupport 468 | IKEAUSAHelp 469 | Incite_Group 470 | INDOCHINO 471 | INDOT 472 | IndyDPW 473 | INFINITICare 474 | INFINITIUSA 475 | InMotionCares 476 | instituteofcs 477 | inthestyleUK 478 | IPFW_ITS_HD 479 | IrishRail 480 | IslandAirHawaii 481 | itsemiel 482 | IWCare 483 | Iyengarish 484 | JabraEurope 485 | jabrasport 486 | Jabra_US 487 | jackd 488 | jackiecas1 489 | JackWills 490 | JagexInfinity 491 | JamboPayCare 492 | Jamie1973 493 | JapanHelpDesk 494 | jasoneden 495 | JawboneSupport 496 | jaybaer 497 | jazzpk 498 | jAzzyF_BaBy 499 | jcpenney 500 | jct600 501 | JDhelpteam 502 | Jeep 503 | JeepCares 504 | jeffreyboutique 505 | Jersey_City 506 | Jesshillcakes 507 | jessops 508 | JetBlue 509 | JetHeads 510 | JetstarAirways 511 | Jetstar_NZ 512 | JimatPlanters 513 | JimEllisAudi 514 | JIRAServiceDesk 515 | JKalena123 516 | JLcustserv 517 | JLove55 518 | jmspool 519 | JMstore 520 | joepindar 521 | JonesClocks 522 | joythestore 523 | jscheel 524 | justhype 525 | k9cuisine 526 | KakaoTalkPH 527 | KateNasser 528 | KauaiSA 529 | Kazel_Kimpo 530 | kellie_brooks 531 | kenmore 532 | KenyaPower 533 | ketpoole 534 | kevinGEEdavis 535 | KFC_UKI 536 | KFC_UKI_Help 537 | kiddicare 538 | KimiNozoGuy 539 | KingFlyKOTD 540 | kisluvkis 541 | KitbagCS 542 | KitchenAid_CA 543 | KitchenAid_CAre 544 | KLM_UK 545 | Kohls 546 | kongacare 547 | KrogerSupport 548 | LandRover_UK 549 | latterash 550 | LeapCard 551 | LenovoANZ 552 | LibertyHelpDesk 553 | LibertyMutual 554 | lidlcustomerc 555 | lidl_ireland 556 | lidl_ni 557 | lids 
558 | LidsAssist 559 | lidscanada 560 | LinkedInHelp 561 | LinkedInUK 562 | LiveChat 563 | LiveNationON 564 | LivePhishCS 565 | Ljbpieces 566 | Lo573 567 | Logitech_ANZ 568 | lordandtaylor 569 | Lovehoney 570 | LoveWilko 571 | LowesCares 572 | LozHarvey 573 | LQcontactus 574 | LubeStop 575 | LVcares 576 | MACcosmetics 577 | MaclarenService 578 | MacsalesCSTS 579 | Macys 580 | MadameTussauds 581 | MallforAfrica 582 | mamasandpapas 583 | mandkbutchers 584 | ManiereDeVoir 585 | MarcGoodmanBos 586 | marilynsuttle 587 | markmadison 588 | MarshaCollier 589 | martindentist 590 | masnHelpDesk 591 | MasterofMaltCS 592 | Matalan 593 | mattlatmatt 594 | mattr 595 | MaxiCosiUK 596 | MaytagBrand 597 | MaytagCare 598 | Mazda_SA 599 | _MBService 600 | MBTA_CR 601 | McDonalds 602 | Mcheza_Care 603 | MCMComicCon 604 | MDSHA 605 | Medela_US 606 | megabus 607 | melbournemuseum 608 | MeteorPR 609 | MetroBank_Help 610 | METROHouAlerts 611 | MetroTransitMN 612 | miasampietro 613 | micahsolomon 614 | michaelshearer 615 | MicrosoftHelps 616 | Minted 617 | Misfit 618 | MissSelfridge 619 | Missytohbadt 620 | mitsucars 621 | mjanczewski 622 | MLBFanSupport 623 | ModCloth_Cares 624 | monkiworld 625 | moreforURdollar 626 | MrCabinetCareOC 627 | mrpatto 628 | MRPfashion 629 | MRPHelp 630 | MrsGammaLabs 631 | MTA 632 | MTGOTraders 633 | MTN180 634 | musicMagpie 635 | Musictoday 636 | MUSupportDesk 637 | MVPHealthCare 638 | MwaveAu 639 | MWilbanks 640 | mycellcom 641 | MyDoncaster 642 | myUHC 643 | Nathane_Jackson 644 | nationalcares 645 | nationalgridus 646 | NationalPro 647 | NBASTORE 648 | ncbja 649 | NealTopf 650 | neimanmarcus 651 | netflix 652 | NetflixANZ 653 | Netflix_CA 654 | NetflixUK 655 | NetoneSupport 656 | NeweggService 657 | nextofficial 658 | Nipponyasancom 659 | NISMO_USA 660 | NissanUSA 661 | nmkrobinson 662 | nokiamobile 663 | NOOK_Care 664 | NOOK_Care_UK 665 | Nordstrom 666 | Norfolkholidays 667 | notonthehighst 668 | npowerhq 669 | nspowerinc 670 | nuerasupport 671 | Nutrisystem 672 | nvidiacc 673 | nxcare 674 | NYSC 675 | NYTCare 676 | O2IRL 677 | OasisFashion 678 | OEcare 679 | OfficeCandyGals 680 | officedepot 681 | OfficeDivvy 682 | OfficeShoes 683 | officialpescm 684 | OldNavyCA 685 | olympicfinishes 686 | _omocat 687 | Ooma 688 | OoredooCare 689 | OoredooQatar 690 | OpenMike_TV 691 | OptumRx 692 | OrangeKe_Care 693 | OrbitzCareTeam 694 | originenergy 695 | OtterBox 696 | OtterBoxCS 697 | overseasescape 698 | oxygenfreejump 699 | PaddyPowerShops 700 | PalaceMovies 701 | PanagoCares 702 | Panago_Pizza 703 | PanaService_UK 704 | PanasonicUSA 705 | PapaJohns 706 | PaperMate 707 | ParkHyattChi 708 | parracity 709 | PASmithjr 710 | PatchworkUrchin 711 | PatriotsProShop 712 | paulodetarso24 713 | PaychexService 714 | paycity 715 | PayPalInfoSec 716 | paysafecard 717 | Paytmcare 718 | PBCares 719 | PCRichardandSon 720 | PeabodyLDN 721 | PECOconnect 722 | PenskeCares 723 | petedenton 724 | PeteFyfe 725 | PeterboroughCC 726 | PeterPanBus 727 | PeugeotZA 728 | pfgregg 729 | PhilipsCare_UK 730 | Photobox 731 | PIA_Cust_unCare 732 | picturehouses 733 | PINGTourEurope 734 | PioneerDJ 735 | placesforpeople 736 | PlayDoh 737 | PlayStationAU 738 | PLC_Singapore 739 | PlentyOfFish 740 | plusnethelp 741 | PMInstitute 742 | PNCBank_Help 743 | PNMtalk 744 | Pokemon 745 | pond5 746 | Porsche 747 | portlandwater 748 | PostOfficeNews 749 | POTUS_CustServ 750 | PoweradeGB 751 | premestateswine 752 | pretavoir 753 | Primark 754 | princessdvonb 755 | PrincesTrust 756 | PRISM_NSA 757 | ProFlowers 758 | ProtectionOne 759 | 
Publix 760 | PublixHelps 761 | PurolatorHelp 762 | qatarairways 763 | QDStores 764 | Quinny_UK 765 | RAC_Care 766 | RACWA 767 | RamCares 768 | RamseyCare 769 | Rand_Water 770 | RaneDJ 771 | RareLondon 772 | RatedPeople 773 | RBC 774 | RBC_Canada 775 | RBWM 776 | RCommCare 777 | Reachout_mcd 778 | redbox 779 | ReebokUK 780 | Relish 781 | rhapsaadic 782 | RideUTA 783 | _rightio 784 | riteaid 785 | ritual_co 786 | riverisland 787 | RobSylvan 788 | RogersBiz 789 | RogersHelps 790 | Ronseal 791 | RoyalMail 792 | RubyBlogVAA 793 | rwhelp 794 | Safaricom_Care 795 | SageSupport 796 | sageuk 797 | sainsburys 798 | SainsburysNews 799 | SaksService 800 | samanthastarmer 801 | SamsungCareSA 802 | samsungmobileng 803 | SamTrans 804 | SAS 805 | SAS_Cares 806 | SaskTel 807 | SBEGlobal 808 | SCG_CS 809 | Schnittgemuese 810 | ScholasticClub 811 | schuh 812 | scienceworks_mv 813 | scotiabank 814 | ScottishPower 815 | scottish_water 816 | scottrade 817 | SDixonVivint 818 | Sears 819 | searscares 820 | SeattleSPU 821 | secreteyesEW 822 | seetickets 823 | Selfridges 824 | SEPTA_SOCIAL 825 | ServiceDivaBren 826 | Service_Queen 827 | SFBayBikeShare 828 | SharisBerries 829 | shaws 830 | ShiseidoUSA 831 | ShoeboxedHelp 832 | ShoePalace 833 | shopmissa 834 | ShopRiteStores 835 | ShopRuche 836 | Sifter 837 | SilkFred 838 | SimplyhealthUK 839 | sizehelpteam 840 | skrill 841 | SkyCinemaUK 842 | SkyIreland 843 | SkypeSupport 844 | SkySaga 845 | SKYserves 846 | sleepnumber 847 | Sleepys 848 | SleepysCare 849 | smtickets 850 | SneakerRefresh 851 | SnoPUD 852 | sobeys 853 | Sofology 854 | SofologyHelp 855 | Sony 856 | SonySupportUSA 857 | SoundTransit 858 | SouthwestAir 859 | SouthWestWater 860 | spacecojo 861 | SparkNZ 862 | SpecializedCSUK 863 | Spencers_Retail 864 | SP_EnergyPeople 865 | Spinneys_Dubai 866 | SpiritAirlines 867 | Spokeo_Care 868 | sportingindex 869 | spreecoza 870 | spring 871 | SquareTrade 872 | SSE 873 | sseairtricity 874 | StanChart 875 | StaplesUK 876 | StarbucksUKCA 877 | StarHub 878 | statravelAU 879 | STATravel_UK 880 | stayhomeclub 881 | StockTwitsHelp 882 | StonecareMike 883 | StraubsMarkets 884 | Stylistpick 885 | SubaruCanada 886 | subaru_usa 887 | SubwayListens 888 | Suddenlink 889 | SuddenlinkHelp 890 | SunCares 891 | Suncorp 892 | sunglassesshop 893 | SunTzuHelpDesk 894 | superbalist 895 | Superdry 896 | Superdry_Care 897 | supermartie 898 | SuperShuttle 899 | SurflineGH 900 | SwingSetService 901 | swoonstars 902 | TacoBellTeam 903 | TandemDiabetes 904 | taportugal 905 | Tartineaubeurre 906 | TCoughlin 907 | TeamSanshee 908 | TeamSantone 909 | TeamShieldBSC 910 | TEAVANA 911 | Tech21Official 912 | TechUncensored 913 | teeofftimes 914 | TekCenter 915 | TelkomZA 916 | TELUS 917 | TEPenergy 918 | Tesco 919 | tescomobile 920 | tescomobilecare 921 | tescomobileire 922 | TessutiHelpTeam 923 | TextNowHelp 924 | tfbalerts 925 | thebitchdesk 926 | TheBodyShopUK 927 | thebondteam 928 | TheBookPeople 929 | TheFrontDesk 930 | TheGymGroup 931 | thekirbycompany 932 | TheLondonEye 933 | theMasterLink 934 | thenutribullet 935 | TheRAC_UK 936 | TheRTStore 937 | TheSharck 938 | ThomasCookCares 939 | ThomasCookUK 940 | thomaswales 941 | ThomsonCares 942 | ThreadSence 943 | ThreeUK 944 | Ticketmaster 945 | TicketmasterCA 946 | TicketmasterCS 947 | TicketWeb 948 | TiVo 949 | TiVoSupport 950 | tjmaxx 951 | TK_HelpDesk 952 | TKMaxx_UK 953 | TLGTourHelp 954 | TMLewin 955 | TMobileHelp 956 | TMRQld 957 | TNTUKCare 958 | toister 959 | TomTom_SA 960 | tonydataman 961 | Topheratl 962 | Topman 963 | TopshopHelp 964 | 
TotalGymDirect 965 | townshoes 966 | ToyotaCustCare 967 | ToyotaRacing 968 | ToysRUs 969 | traceychurray 970 | TradeMe 971 | trafficscotland 972 | TravelGuard 973 | travelocity 974 | trimet 975 | TruGreen 976 | Trustpilot 977 | TSA 978 | TTCsue 979 | TW2CayC 980 | TWC_Help 981 | TWimpeySupport 982 | twinklresources 983 | UCLMainLibrary 984 | UconnectCares 985 | UFhd 986 | uga_eits 987 | UHaul_Cares 988 | UKEnterprise 989 | UKMUJI 990 | UKVolkswagen 991 | UNB_ITS 992 | united 993 | UnityQAThomas 994 | UpDesk 995 | UPS_UK 996 | USAA_help 997 | USPS 998 | USPSHelp 999 | UWDoIT 1000 | vaillantuk 1001 | _valpow_ 1002 | Venture_Cycles 1003 | verabradleycare 1004 | VerizonSupport 1005 | VeryHelpers 1006 | vigglesupport 1007 | VirginAmerica 1008 | VirginMediaIE 1009 | Vitality_UK 1010 | VitamixUK 1011 | VIZIOsupport 1012 | vmbusinesshelp 1013 | VMUcare 1014 | Vodacom 1015 | Vodacom111 1016 | VodacomRugga 1017 | VodafoneAU_Help 1018 | VolvoCarUSA 1019 | vtaservice 1020 | vueling 1021 | vwcares 1022 | w1zz 1023 | WAGSocialCare 1024 | wahoofitness 1025 | Walgreens 1026 | Walmart 1027 | warjohi 1028 | WarThunder 1029 | WasteWR 1030 | WatchShop 1031 | WeChatZA 1032 | we_energies 1033 | WellsFargo 1034 | WhirlpoolCare 1035 | Whirlpool_CAre 1036 | whirlpoolusa 1037 | WholeFoods 1038 | WHSmith 1039 | WilsonGolf 1040 | wimrampen 1041 | WindowsSupport 1042 | WingateHotels 1043 | withazed 1044 | Wizards_Help 1045 | WMSpillett 1046 | wow_air 1047 | wowairsupport 1048 | WReynoldsYoung 1049 | wtzgoodPHL 1050 | XOCare 1051 | YahooCare 1052 | yoox 1053 | zaggcare 1054 | ZAGGdaily 1055 | Zapatosdesigner 1056 | ZapposLuxury 1057 | ZARA 1058 | ZARA_Care 1059 | ZeekMarketplace 1060 | ZoomSphere 1061 | Zopim 1062 | -------------------------------------------------------------------------------- /collect_twitter_dialogs/collect.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | target=account_names_for_dstc6.txt 4 | datetime=`date +"%Y-%m-%d_%H-%M-%S"` 5 | collect_twitter_dialogs.py -t $target -o ./stored_data -l ./collect_${datetime}.log 6 | -------------------------------------------------------------------------------- /collect_twitter_dialogs/collect_twitter_dialogs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """collect_twitter_dialogs.py: 4 | A script to acquire twitter dialogs with REST API 1.1. 5 | 6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com) 7 | 8 | This software is released under the MIT License. 
9 | http://opensource.org/licenses/mit-license.php
10 | 
11 | """
12 | 
13 | import argparse
14 | import json
15 | import sys
16 | import six
17 | import os
18 | import re
19 | import time
20 | import logging
21 | from requests_oauthlib import OAuth1Session
22 | from twitter_api import GETStatusesUserTimeline
23 | from twitter_api import GETStatusesLookup
24 | 
25 | try:
26 |     from configparser import ConfigParser
27 | except ImportError:
28 |     from ConfigParser import SafeConfigParser as ConfigParser
29 | 
30 | # create logger object
31 | logger = logging.getLogger("root")
32 | logger.setLevel(logging.INFO)
33 | 
34 | def Main(args):
35 |     # get access keys from a config file
36 |     config = ConfigParser()
37 |     config.read(args.config)
38 |     ConsumerKey = config.get('AccessKeys','ConsumerKey')
39 |     ConsumerSecret = config.get('AccessKeys','ConsumerSecret')
40 |     AccessToken = config.get('AccessKeys','AccessToken')
41 |     AccessTokenSecret = config.get('AccessKeys','AccessTokenSecret')
42 | 
43 |     # obtain targets
44 |     targets = args.names
45 |     if args.target:
46 |         for line in open(args.target,'r').readlines():
47 |             name = line.strip()
48 |             if not name.startswith('#'):
49 |                 targets.append(name)
50 | 
51 |     # make a directory to store acquired dialogs
52 |     if args.outdir:
53 |         if not os.path.exists(args.outdir):
54 |             os.mkdir(args.outdir)
55 | 
56 |     # open a session
57 |     session = OAuth1Session(ConsumerKey, ConsumerSecret, AccessToken, AccessTokenSecret)
58 | 
59 |     # setup API object
60 |     get_user_timeline = GETStatusesUserTimeline(session)
61 |     get_user_timeline.setParams(target_count=args.count, reply_only=True)
62 |     get_lookup = GETStatusesLookup(session)
63 | 
64 |     # collect dialogs from each target
65 |     num_dialogs = 0
66 |     num_past_dialogs = 0
67 | 
68 |     for name in targets:
69 |         logger.info('-----------------------------')
70 |         outfile = name + '.json'
71 |         if args.outdir:
72 |             outfile = os.path.join(args.outdir, outfile)
73 | 
74 |         ## collect tweets from an account
75 |         logger.info('collecting tweets from ' + name)
76 |         if os.path.exists(outfile):
77 |             logger.info('restoring acquired tweets from ' + outfile)
78 |             dialog_set = json.load(open(outfile,'r'))
79 |             since_id = max([int(s) for s in dialog_set.keys()])
80 |             num_past_dialogs += len(dialog_set)
81 |         else:
82 |             since_id = None
83 |             dialog_set = {}
84 | 
85 |         get_user_timeline.setParams(name, max_id=None, since_id=since_id)
86 |         get_user_timeline.waitReady()
87 |         timeline_tweets = get_user_timeline.call()
88 |         if timeline_tweets is None:
89 |             logger.warn('skip %s with an error' % name)
90 |             num_dialogs += len(dialog_set)
91 |             continue
92 | 
93 |         logger.info('obtained %d new tweet(s)' % len(timeline_tweets))
94 |         if len(timeline_tweets) == 0:
95 |             logger.info('no dialogs have been added to ' + outfile)
96 |             num_dialogs += len(dialog_set)
97 |             continue
98 | 
99 |         ## collect source tweets
100 |         logger.info("collecting source tweets of replies recursively")
101 |         tweet_set = {}
102 |         ## to avoid getting the same tweets again, add tweets we already have
103 |         for tid,dialog in dialog_set.items():
104 |             for tweet in dialog:
105 |                 tweet_set[tweet['id']] = tweet
106 |         ## add new tweets and collect reply-ids as necessary
107 |         source_ids = set()
108 |         for tweet in timeline_tweets:
109 |             tweet_set[tweet['id']] = tweet
110 |             reply_id = tweet['in_reply_to_status_id']
111 |             if reply_id is not None and reply_id not in tweet_set:
112 |                 source_ids.add(reply_id)
113 |         ## acquire source tweets
114 |         get_lookup.waitReady()
115 |         while len(source_ids) > 0:
116 | 
get_lookup.setParams(source_ids) 117 | result = get_lookup.call() 118 | logger.info('obtained %d/%d tweets' % (len(result),len(source_ids))) 119 | new_source_ids = set() 120 | for tweet in result: 121 | tweet_set[tweet['id']] = tweet 122 | reply_id = tweet['in_reply_to_status_id'] 123 | if reply_id is not None and reply_id not in tweet_set: 124 | new_source_ids.add(reply_id) 125 | source_ids = new_source_ids 126 | 127 | ## reconstruct dialogs 128 | logger.info("restructuring the collected tweets as a set of dialogs") 129 | visited = set() 130 | new_dialogs = 0 131 | for tweet in timeline_tweets: 132 | tid = tweet['id'] 133 | if tid not in visited: # ignore visited node (it's not a terminal) 134 | visited.add(tid) 135 | # backtrack source tweets and make a dialog 136 | dialog = [tweet] 137 | reply_id = tweet_set[tid]['in_reply_to_status_id'] 138 | while reply_id is not None: 139 | visited.add(reply_id) 140 | # if there already exists a dialog associated with reply_id, 141 | # the dialog is deleted because it's not a complete dialog. 142 | if str(reply_id) in dialog_set: 143 | del dialog_set[str(reply_id)] 144 | # insert a source tweet to the dialog 145 | if reply_id in tweet_set: 146 | dialog.insert(0,tweet_set[reply_id]) 147 | else: 148 | break 149 | # move to the previous tweet 150 | reply_id = tweet_set[reply_id]['in_reply_to_status_id'] 151 | 152 | # add the dialog only if it contains two or more turns, 153 | # where it is associated with its terminal tweet id. 154 | if len(dialog) > 1: 155 | dialog_set[str(tid)] = dialog 156 | new_dialogs += 1 157 | 158 | logger.info('obtained %d new dialogs' % new_dialogs) 159 | if new_dialogs > 0: 160 | logger.info('writing to file %s' % outfile) 161 | json.dump(dialog_set, open(outfile,'w'), indent=2) 162 | else: 163 | logger.info('no dialogs have been added to ' + outfile) 164 | 165 | num_dialogs += len(dialog_set) 166 | 167 | logger.info('-----------------------------') 168 | logger.info('obtained %d new dialogs' % (num_dialogs - num_past_dialogs)) 169 | logger.info('now you have %d dialogs in total' % num_dialogs) 170 | 171 | 172 | if __name__ =="__main__": 173 | # parse command line 174 | parser = argparse.ArgumentParser() 175 | parser.add_argument('-c', '--config', default='config.ini', help="config file") 176 | parser.add_argument('-t', '--target', help="read account names from a file") 177 | parser.add_argument('-o', '--outdir', help="output directory") 178 | parser.add_argument('-l', '--logfile', help="set a log file") 179 | parser.add_argument('-n', '--count', default=-1, type=int, 180 | help="maximum number of tweets acquired from each account") 181 | parser.add_argument('-d', '--debug', action='store_true', help="debug mode") 182 | parser.add_argument('-s', '--silent', action='store_true', help="silent mode") 183 | parser.add_argument('names', metavar='NAME', nargs='*', help='account names') 184 | args = parser.parse_args() 185 | 186 | # set up the logger 187 | stdhandler = logging.StreamHandler() 188 | stdhandler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s")) 189 | if args.silent: 190 | stdhandler.setLevel(logging.WARN) 191 | logger.addHandler(stdhandler) 192 | 193 | if args.logfile: 194 | filehandler = logging.FileHandler(args.logfile, mode='w') 195 | filehandler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s")) 196 | logger.addHandler(filehandler) 197 | 198 | if args.debug: 199 | logger.setLevel(logging.DEBUG) 200 | 201 | logger.info('started to collect twitter dialogs') 202 | 
logger.debug('args=' + str(args))
203 |
204 | # call main process
205 | try:
206 | Main(args)
207 | except:
208 | logger.exception('exited with an error')
209 | sys.exit(1)
210 |
211 | logger.info('done')
212 |
213 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/config.ini: --------------------------------------------------------------------------------
1 | ; you need to set your own access keys
2 | [AccessKeys]
3 | ConsumerKey: *************************
4 | ConsumerSecret: **************************************************
5 | AccessToken: **************************************************
6 | AccessTokenSecret: *********************************************
7 |
8 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/official_collect.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | target=official_account_names_for_dstc6.txt
4 | datetime=`date +"%Y-%m-%d_%H-%M-%S"`
5 | ./collect_twitter_dialogs.py -t $target -o ./official_stored_data -l ./official_collect_${datetime}.log
6 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/search_twitter_accounts.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """search_twitter_accounts.py:
4 | A script to search twitter accounts with REST API 1.1.
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import argparse
14 | import json
15 | import sys
16 | import os
17 | import logging
18 | from requests_oauthlib import OAuth1Session
19 | from twitter_api import GETUsersSearch
20 |
21 | try:
22 | from configparser import ConfigParser
23 | except ImportError:
24 | from ConfigParser import SafeConfigParser as ConfigParser
25 |
26 | # create logger object
27 | logger = logging.getLogger("root")
28 | logger.setLevel(logging.INFO)
29 |
30 | def Main(args):
31 | # get access keys from a config file
32 | config = ConfigParser()
33 | config.read(args.config)
34 | ConsumerKey = config.get('AccessKeys','ConsumerKey')
35 | ConsumerSecret = config.get('AccessKeys','ConsumerSecret')
36 | AccessToken = config.get('AccessKeys','AccessToken')
37 | AccessTokenSecret = config.get('AccessKeys','AccessTokenSecret')
38 |
39 | # open a session
40 | session = OAuth1Session(ConsumerKey, ConsumerSecret, AccessToken, AccessTokenSecret)
41 |
42 | # collect users from the queries
43 | user_search = GETUsersSearch(session)
44 | user_search.setParams(' '.join(args.queries), target_count=args.count)
45 | user_search.waitReady()
46 | result = user_search.call()
47 | logger.info('obtained %d users' % len(result))
48 |
49 | if args.dump:
50 | logger.info('writing raw data to file %s' % args.dump)
51 | json.dump(result, open(args.dump,'w'), indent=2)
52 |
53 | if args.output:
54 | logger.info('writing screen names to file %s' % args.output)
55 | with open(args.output,'w') as f:
56 | for user in result:
57 | f.write(user['screen_name'] + '\n')
58 | else:
59 | for user in result:
60 | sys.stdout.write(user['screen_name'] + '\n')
61 |
62 |
63 | if __name__ =="__main__":
64 | # parse command line
65 | parser = argparse.ArgumentParser()
66 | parser.add_argument('-c', '--config', default='config.ini', help="config file")
67 | 
parser.add_argument('-o', '--output', help="output screen names to a file")
68 | parser.add_argument('-D', '--dump', help="dump raw data to a file")
69 | parser.add_argument('-l', '--logfile', help="set a log file")
70 | parser.add_argument('-n', '--count', default=100, type=int,
71 | help="maximum number of users acquired by the search")
72 | parser.add_argument('-d', '--debug', action='store_true', help="debug mode")
73 | parser.add_argument('queries', metavar='KW', nargs='+', help='query keywords')
74 | args = parser.parse_args()
75 |
76 | # set up the logger
77 | stdhandler = logging.StreamHandler()
78 | stdhandler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
79 | logger.addHandler(stdhandler)
80 | if args.logfile:
81 | filehandler = logging.FileHandler(args.logfile, mode='w')
82 | filehandler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
83 | logger.addHandler(filehandler)
84 |
85 | if args.debug:
86 | logger.setLevel(logging.DEBUG)
87 |
88 | logger.info('started to collect twitter accounts')
89 | logger.debug('args=' + str(args))
90 |
91 | # call main process
92 | try:
93 | Main(args)
94 | except:
95 | logger.exception('exited with an error')
96 | sys.exit(1)
97 |
98 | logger.info('done')
99 |
100 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/twitter_api.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """python class to acquire twitter data with REST API 1.1.
4 |
5 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
6 |
7 | This software is released under the MIT License.
8 | http://opensource.org/licenses/mit-license.php
9 |
10 | """
11 |
12 | import argparse
13 | import json
14 | import sys
15 | import six
16 | import os
17 | import re
18 | import time
19 | import logging
20 | from datetime import datetime
21 |
22 | # get a logger object
23 | logger = logging.getLogger('root')
24 |
25 | # base API caller
26 | class TwitterAPI(object):
27 | def __init__(self, command, session):
28 | self.rest_api_url = 'https://api.twitter.com/1.1'
29 | self.error_code_url = 'https://dev.twitter.com/overview/api/response-codes'
30 | self.check_rate_limits = '/application/rate_limit_status'
31 | self.command = command
32 | self.session = session
33 | self.params = {}
34 |
35 | def call(self, retry=5):
36 | '''
37 | acquire data by a given method
38 | '''
39 | if len(self.params) == 0:
40 | raise Exception('parameters are not set for %s' % self.command)
41 |
42 | url = self.rest_api_url + self.command + '.json'
43 | n_errors = 0
44 | self.result = []
45 | while True:
46 | logger.debug('URL: ' + url)
47 | logger.debug('params: ' + str(self.params))
48 | res = self.session.get(url, params = self.params)
49 | if res.status_code == 200: # Success
50 | data = json.loads(res.text)
51 | if len(data) == 0:
52 | break
53 | if self.extract(data) == False: # if no more data need to be acquired
54 | break
55 | n_errors = 0
56 |
57 | # check if header includes 'X-Rate-Limit-Remaining'
58 | if 'X-Rate-Limit-Remaining' in res.headers \
59 | and 'X-Rate-Limit-Reset' in res.headers:
60 | if int(res.headers['X-Rate-Limit-Remaining']) == 0:
61 | waittime = int(res.headers['X-Rate-Limit-Reset']) \
62 | - time.mktime(datetime.now().timetuple())
63 | logger.info('reached the rate limit ... 
wait %d seconds', waittime + 5)
64 | time.sleep(waittime + 5)
65 | self.waitReady()  # waitReady() takes a retry count, not a session
66 | else:
67 | self.waitReady()
68 |
69 | elif res.status_code==401 or res.status_code==404:
70 | logger.warn('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
71 | logger.warn('error occurred in %s' % self.command)
72 | return None
73 | else:
74 | n_errors += 1
75 | if n_errors > retry:
76 | raise Exception('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
77 |
78 | logger.warn('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
79 | logger.warn('Service Unavailable ... wait 15 minutes')
80 | time.sleep(905)
81 |
82 | return self.result
83 |
84 | # parameter setting
85 | def _set_param(self, key, value, default=None):
86 | if value is None: # if value is None, the parameter is removed
87 | if key in self.params:
88 | del self.params[key]
89 | elif default is None or value != default: # if not default, set the parameter
90 | self.params[key] = value
91 |
92 |
93 | def getWaitTime(self, res_text):
94 | waittime = 0
95 | for command in [self.command, self.check_rate_limits]:
96 | category = re.sub(r'^/([^\s\/]+)/.*$', '\\1', command)
97 | remaining = int(res_text['resources'][category][command]['remaining'])
98 | reset = int(res_text['resources'][category][command]['reset'])
99 | if remaining == 0:
100 | waittime = max(waittime, reset - time.mktime(datetime.now().timetuple()))
101 |
102 | return waittime
103 |
104 |
105 | def waitReady(self, retry=5):
106 | '''
107 | check status, and wait until it gets available
108 | '''
109 | n_errors = 0
110 | while True:
111 | res = self.session.get(self.rest_api_url + self.check_rate_limits + '.json')
112 | if res.status_code == 200: # Success
113 | res_text = json.loads(res.text)
114 | waittime = self.getWaitTime(res_text)
115 | if (waittime > 0):
116 | logger.info('reached the rate limit ... wait %d seconds' % (waittime+5))
117 | time.sleep(waittime+5)
118 | n_errors = 0
119 | else:
120 | break
121 |
122 | else:
123 | n_errors += 1
124 | if n_errors > retry:
125 | raise Exception('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
126 |
127 | logger.warn('Twitter API error %d, see %s' % (res.status_code, self.error_code_url))
128 | logger.warn('Service Unavailable ... 
wait 15 minutes') 129 | time.sleep(905) 130 | 131 | 132 | ## some methods to get data 133 | 134 | class GETSearchTweets(TwitterAPI): 135 | ''' 136 | Object to acquire tweets by a query 137 | see https://dev.twitter.com/rest/reference/get/search/tweets 138 | Limit: Requests / 15-min window (app auth) <= 450 139 | ''' 140 | def __init__(self, session): 141 | super(GETSearchTweets, self).__init__('/search/tweets', session) 142 | self.query = {} 143 | self.target_count = -1 # default: unlimited 144 | self.reply_only = False 145 | 146 | def setParams(self, query='', target_count=-1, since_id=0, max_id=0, 147 | reply_only=None): 148 | self._set_param('q', query, '') 149 | self._set_param('since_id', since_id, 0) 150 | self._set_param('max_id', max_id, 0) 151 | if target_count > 0: 152 | self._set_param('count', min(target_count,100)) 153 | self.target_count = target_count 154 | else: 155 | self._set_param('count', 100) 156 | 157 | if reply_only is not None: 158 | self.reply_only = reply_only 159 | 160 | def extract(self, tweets): 161 | logger.debug('search_metadata:' + str(tweets['search_metadata'])) 162 | for tweet in tweets['statuses']: 163 | # extract reply tweets only 164 | if self.reply_only and tweet['in_reply_to_status_id'] is None: 165 | continue 166 | # extract entries 167 | self.result.append(tweet) 168 | if len(self.result) % 100 == 0: 169 | logger.info('...acquired %d tweets ' % len(self.result)) 170 | if self.target_count > 0 and len(self.result) >= self.target_count: 171 | return False 172 | 173 | # for the next call 174 | if len(tweets['statuses']) > 0: 175 | self._set_param('max_id', tweets['statuses'][-1]['id'] - 1) # update max_id 176 | return True 177 | else: 178 | return False 179 | 180 | 181 | class GETStatusesUserTimeline(TwitterAPI): 182 | ''' 183 | Object to acquire tweets along a user time line 184 | see https://dev.twitter.com/rest/reference/get/statuses/user_timeline 185 | Limit: Requests / 15-min window (app auth) <= 1500 186 | ''' 187 | def __init__(self, session): 188 | super(GETStatusesUserTimeline, self).__init__('/statuses/user_timeline', session) 189 | self.params['include_rts'] = 'false' 190 | self.params['exclude_replies'] = 'false' 191 | self.target_count = 0 192 | self.reply_only = False 193 | 194 | def setParams(self, name='', target_count=0, since_id=0, max_id=0, 195 | reply_only=None): 196 | self._set_param('screen_name', name, '') 197 | self._set_param('since_id', since_id, 0) 198 | self._set_param('max_id', max_id, 0) 199 | if target_count > 0: 200 | self._set_param('count', min(target_count,100)) 201 | self.target_count = target_count 202 | else: 203 | self._set_param('count', 100) 204 | 205 | if reply_only is not None: 206 | self.reply_only = reply_only 207 | 208 | def extract(self, tweets): 209 | for tweet in tweets: 210 | # extract reply tweets only 211 | if self.reply_only and tweet['in_reply_to_status_id'] is None: 212 | continue 213 | # store tweets 214 | self.result.append(tweet) 215 | if len(self.result) % 100 == 0: 216 | logger.info('...acquired %d tweets ' % len(self.result)) 217 | if self.target_count>0 and len(self.result) >= self.target_count: 218 | return False 219 | 220 | # for the next call 221 | if len(tweets) > 0: 222 | self._set_param('max_id', tweets[-1]['id'] - 1) # update max_id 223 | return True 224 | else: 225 | return False 226 | 227 | 228 | class GETStatusesLookup(TwitterAPI): 229 | ''' 230 | Object to acquire tweets by ID numbers 231 | see https://dev.twitter.com/rest/reference/get/statuses/lookup 232 | Limit: Requests / 
15-min window (app auth) <= 300
233 | '''
234 | def __init__(self, session):
235 | super(GETStatusesLookup, self).__init__('/statuses/lookup', session)
236 | self.count = 100 # can get up to 100 tweets at once
237 |
238 | def setParams(self, id_set=None):
239 | if id_set is not None:
240 | self.id_list = list(id_set)
241 | self.params['id'] = ','.join([str(n) for n in self.id_list[0:self.count]])
242 | self.total_count = 0
243 |
244 | def extract(self, tweets):
245 | for tweet in tweets:
246 | self.result.append(tweet)
247 | if len(self.result) % 100 == 0:
248 | logger.info('...acquired %d tweets ' % len(self.result))
249 |
250 | self.total_count += self.count
251 | if self.total_count >= len(self.id_list):
252 | return False
253 |
254 | # for the next call
255 | sub_ids = self.id_list[self.total_count:self.total_count+self.count]
256 | self.params['id'] = ','.join([str(n) for n in sub_ids])
257 | return True
258 |
259 |
260 | class GETUsersSearch(TwitterAPI):
261 | '''
262 | Object to search users by a query
263 | see https://dev.twitter.com/rest/reference/get/users/search
264 | Limit: Requests / 15-min window (user auth) <= 900
265 | '''
266 | def __init__(self, session):
267 | super(GETUsersSearch, self).__init__('/users/search', session)
268 | self.target_count = 100 # default
269 | self.params['count'] = 20 # can get 20 entries per page
270 |
271 | def setParams(self, query='', target_count=0):
272 | self._set_param('q', query, '')
273 | if target_count > 0:
274 | self.target_count = target_count
275 | self.params['page'] = 1
276 |
277 | def extract(self, text):
278 | for user in text:
279 | self.result.append(user)
280 | if len(self.result) % 100 == 0:
281 | logger.info('...acquired %d users ' % len(self.result))
282 | if len(self.result) >= self.target_count:
283 | return False
284 |
285 | # for the next call
286 | self.params['page'] += 1
287 | return True
288 |
289 |
-------------------------------------------------------------------------------- /collect_twitter_dialogs/view_dialogs.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """view twitter dialogs.
4 |
5 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
6 |
7 | This software is released under the MIT License.
8 | http://opensource.org/licenses/mit-license.php
9 |
10 | """
11 |
12 | import json
13 | import sys
14 | import six
15 |
16 | if six.PY2:
17 | reload(sys)
18 | sys.setdefaultencoding('utf-8')
19 |
20 | if len(sys.argv) < 2:
21 | print ('usage: view_dialogs.py dialogs.json ...')
22 | sys.exit(1)
23 |
24 | for fn in sys.argv[1:]:
25 | dialog_set = json.load(open(fn,'r'))
26 | for tid in sorted([int(s) for s in dialog_set.keys()]):
27 | dialog = dialog_set[str(tid)]
28 | lang = dialog[0]['lang']
29 | if lang == 'en':
30 | print ('--- ID:%d (length=%d) ---\n' % (tid, len(dialog)))
31 | for utterance in dialog:
32 | screen_name = utterance['user']['screen_name']
33 | name = utterance['user']['name']
34 | text = utterance['text']
35 | print ('%s (%s)' % (utterance['created_at'],utterance['id']))
36 | print ('%s (@%s) : %s\n' % (name, screen_name, text))
37 |
38 |
-------------------------------------------------------------------------------- /tasks/opensubs/extract_opensubs_dialogs.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #-*- coding: utf-8 -*-
3 | """extract_opensubs_dialogs.py:
4 | A script to extract text from OpenSubtitles. 
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import xml.etree.ElementTree as ET
14 | from gzip import GzipFile
15 | import sys
16 | import re
17 | import glob
18 | import random
19 | from tqdm import tqdm
20 | import six
21 | import argparse
22 |
23 |
24 | def preprocess(sent):
25 | """ text preprocessing using regular expressions
26 | """
27 | # remove tags
28 | new_sent = re.sub(r'(<[^>]*>|{[^}]*}|\([^\)]*\)|\[[^\]]*\])',' ', sent)
29 | # replace apostrophe and convert letters to lower case
30 | new_sent = new_sent.replace('\\\'','\'').lower()
31 | # delete a space right after an isolated apostrophe
32 | new_sent = re.sub(r' \' (?=(em|im|s|t|bout|cause)\s+)', ' \'', new_sent)
33 | # delete a space right before an isolated apostrophe
34 | new_sent = re.sub(r'(?<=n) \' ', '\' ', new_sent)
35 | # delete a space right before a period for titles
36 | new_sent = re.sub(r'(?<=( mr| jr| ms| dr| st|mrs)) \.', '. ', new_sent)
37 | # remove speaker tag "xxx: "
38 | new_sent = re.sub(r'^\s*[A-z]*\s*:', '', new_sent)
39 | # remove unnecessary symbols
40 | new_sent = re.sub(u'([-–—]+$| [-–—]+|[-–—]+ |% %|#+|\'\'|``| \' |[\(\)\"])', ' ', new_sent)
41 | # convert i̇->i
42 | new_sent = re.sub(u'i̇','i', new_sent)
43 | # convert multiple spaces to a single space
44 | new_sent = re.sub(r'\s+', ' ', new_sent).strip()
45 | # ignore sentence with only space or some symbols
46 | if not re.match(r'^(\s*|[\.\?$%!,:;])$', new_sent):
47 | return new_sent
48 | else:
49 | return ''
50 |
51 |
52 | def extract(filePath, corpus):
53 | """extract text from an XML file and tokenize it
54 | """
55 | if filePath.endswith('.gz'):
56 | tree = ET.parse(GzipFile(filename=filePath))
57 | else:
58 | tree = ET.parse(filePath)
59 |
60 | root = tree.getroot()
61 | sent = ''
62 | for child in root:
63 | for elem in child:
64 | if elem.tag == 'w':
65 | sent += ' ' + elem.text
66 |
67 | if not sent.strip().endswith(':'):
68 | if six.PY2:
69 | new_sent = preprocess(sent).encode('utf-8')
70 | else:
71 | new_sent = preprocess(sent)
72 |
73 | if new_sent:
74 | corpus.append((new_sent, len(new_sent.split())))
75 | sent = ''
76 |
77 | corpus.append(('',0))
78 |
79 |
80 | if __name__ == '__main__':
81 | parser = argparse.ArgumentParser()
82 | parser.add_argument('--output', default=['train.txt'], nargs='+', type=str,
83 | help='Filenames of data')
84 | parser.add_argument('--ratio', default=[0.01], nargs='+', type=float,
85 | help='Extraction rate for each data set')
86 | parser.add_argument('--max-length', default=20, type=int,
87 | help='Maximum length of sentences')
88 | parser.add_argument('--rootdir', default='.',
89 | help='root directory of data source')
90 |
91 | args = parser.parse_args()
92 |
93 | if len(args.output) != len(args.ratio):
94 | raise Exception('The number of output files (%d) and the number of extraction ratios (%d) should be the same.' 
% (len(args.output), len(args.ratio)))
95 | random.seed(99)
96 |
97 | rootdir = args.rootdir
98 | print('collecting files from ' + rootdir)
99 | xmlfiles = glob.glob(rootdir + '/*.xml.gz') + glob.glob(rootdir + '/*/*.xml.gz') + glob.glob(rootdir + '/*/*/*.xml.gz')
100 |
101 | random.shuffle(xmlfiles, random.random)
102 | total_ratio = sum(args.ratio)
103 | n_files = int( len(xmlfiles) * total_ratio )
104 | print('loading text from %d/%d files' % (n_files, len(xmlfiles)))
105 | corpus = []
106 | for n in tqdm(six.moves.range(n_files)):
107 | extract(xmlfiles[n], corpus)
108 |
109 | print('%d sentences loaded' % len(corpus))
110 | indices = list(six.moves.range(len(corpus)-1))
111 | random.shuffle(indices, random.random)
112 |
113 | partition = [0]
114 | acc_ratio = 0.0
115 | for r in args.ratio:
116 | acc_ratio += r / total_ratio
117 | partition.append(int(acc_ratio * len(corpus)))
118 |
119 | for n in six.moves.range(len(partition)-1):
120 | print('writing %d sentence pairs to %s' % (partition[n+1]-partition[n], args.output[n]))
121 | with open(args.output[n],'w') as f:
122 | for idx in indices[partition[n]:partition[n+1]]:
123 | len1 = corpus[idx][1]
124 | len2 = corpus[idx+1][1]
125 | if 0 < len1 < args.max_length and 0 < len2 < args.max_length:
126 | six.print_('U: %s\nS: %s\n' % (corpus[idx][0],corpus[idx+1][0]), file=f)
127 |
128 |
-------------------------------------------------------------------------------- /tasks/opensubs/make_trial_data.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # you can download English data of OpenSubtitles2016 from
4 | #
5 | # http://opus.lingfil.uu.se/download.php?f=OpenSubtitles2016/en.tar.gz
6 | #
7 | # and extract xml files by
8 | #
9 | # tar zxvf en.tar.gz
10 | #
11 | # you need to specify the location of the data
12 | stored_data=./OpenSubtitles2016/xml/en
13 |
14 | # extract train and dev sets
15 | echo extracting training, development, and test sets
16 | ./extract_opensubs_dialogs.py \
17 | --output opensubs_trial_data_train.txt opensubs_trial_data_dev.txt \
18 | --ratio 0.1 0.001 \
19 | --rootdir $stored_data
20 |
21 | echo extracting evaluation data
22 | # extract 500 samples from dev set randomly for evaluation
23 | ./utils/sample_dialogs.py opensubs_trial_data_dev.txt 500 > opensubs_trial_data_eval.txt
24 |
25 | echo done
26 |
-------------------------------------------------------------------------------- /tasks/opensubs/utils: --------------------------------------------------------------------------------
1 | ../utils
-------------------------------------------------------------------------------- /tasks/twitter/account_names_dev.txt: --------------------------------------------------------------------------------
1 | 1800flowers
2 | ALCATEL1TOUCH
3 | ATTCares
4 | AetnaHelp
5 | AskCapitalOne
6 | AskMTNGhana
7 | AsurionCares
8 | AvidSupport
9 | Avis
10 | BookChicClub
11 | BostonMarki
12 | CIPD
13 | CMLFerry
14 | CadburyIreland
15 | Camper_CustCare
16 | CastelliCafe
17 | ChaseSupport
18 | ChevyCustCare
19 | CollectorCorps
20 | DollarGeneral
21 | DolphinCares
22 | ENMAXenergy
23 | EmiratesSupport
24 | Entergy
25 | Equinox_Service
26 | FCUK
27 | FedExCanada
28 | FoodLion
29 | FootballIndexUK
30 | FordSouthAfrica
31 | GapCA
32 | GiltService
33 | GrandHyattNYC
34 | GreyhoundBus
35 | HDCares
36 | HTCHelp
37 | Heinens
38 | IPFW_ITS_HD
39 | IWCare
40 | IrishRail
41 | JKalena123
42 | JawboneSupport
43 | JeepCares
44 | Jersey_City
45 | KitchenAid_CA
46 | Kohls
47 | LQcontactus
48 | 
LinkedInUK 49 | Lo573 50 | LoveWilko 51 | LozHarvey 52 | MDSHA 53 | MaclarenService 54 | MarshaCollier 55 | MeteorPR 56 | NBASTORE 57 | O2IRL 58 | OfficeShoes 59 | OrbitzCareTeam 60 | PRISM_NSA 61 | PanagoCares 62 | PhilipsCare_UK 63 | PostOfficeNews 64 | PoweradeGB 65 | RareLondon 66 | RubyBlogVAA 67 | SAS_Cares 68 | SCG_CS 69 | SDixonVivint 70 | SKYserves 71 | SSE 72 | ShoeboxedHelp 73 | SkyIreland 74 | SkySaga 75 | SonySupportUSA 76 | SouthWestWater 77 | StarbucksUKCA 78 | SunTzuHelpDesk 79 | SurflineGH 80 | Tech21Official 81 | TheBodyShopUK 82 | TicketmasterCS 83 | TravelGuard 84 | UconnectCares 85 | VolvoCarUSA 86 | Walmart 87 | WasteWR 88 | _omocat 89 | askevanscycles 90 | atmosenergy 91 | audiblesupport 92 | austin_reed 93 | catalysthousing 94 | champssports 95 | consult53 96 | dan_malarkey 97 | digicelbarbados 98 | dish 99 | edfenergycs 100 | feelunique 101 | hmunitedkingdom 102 | justhype 103 | kenmore 104 | lidl_ireland 105 | marilynsuttle 106 | miasampietro 107 | officedepot 108 | originenergy 109 | paycity 110 | paysafecard 111 | placesforpeople 112 | portlandwater 113 | pretavoir 114 | searscares 115 | skrill 116 | spreecoza 117 | stayhomeclub 118 | superbalist 119 | -------------------------------------------------------------------------------- /tasks/twitter/account_names_train.txt: -------------------------------------------------------------------------------- 1 | 1800flowershelp 2 | 64audio 3 | 888sport 4 | ABCustomerCare 5 | ABTAtravel 6 | AFC_Amy 7 | AFCustomerCare 8 | AIRNZUSA 9 | APCcustserv 10 | ASHelpMe 11 | ASOS 12 | ASOS_Au 13 | ASOS_Us 14 | ASUHelpCenter 15 | ASUSUSA 16 | AVIVAIRELAND 17 | AbercrombieHelp 18 | AddisonLeeCabs 19 | AdobeCare 20 | AflacPhyllis 21 | AirAsia 22 | AirChinaNA 23 | AirtelNigeria 24 | Alamo 25 | AlaskaAir 26 | Albertsons 27 | AlderTraining 28 | AldiUK 29 | Aldi_Ireland 30 | AlfaRomeoCareUK 31 | AliExpress_EN 32 | AlissaDosSantos 33 | Allegiant 34 | AllianzTravelUS 35 | Allstate 36 | AllyCare 37 | AlshayaHelpDesk 38 | AmanaCare 39 | AmazonHelp 40 | AmericanAir 41 | AmexAU 42 | AndreaSWilson 43 | AnglianHelp 44 | AnkiSupport 45 | AnthemBC_News 46 | AppleSupport 47 | AppleyardLondon 48 | ArqivaWiFi 49 | AsdaCustCare 50 | Ask123ie 51 | AskAmex 52 | AskCiti 53 | AskDyson 54 | AskEmblemHealth 55 | AskKBCIreland 56 | AskLandsEnd 57 | AskMarstons 58 | AskMeHelpDesk 59 | AskPS_UK 60 | AskSmythsToys 61 | AskSubaruCanada 62 | AskTarget 63 | AskTeamUA 64 | Ask_Spectrum 65 | AskeBay 66 | Askvodafonegh 67 | AsmodeeNA 68 | Atomos_News 69 | AudiUK 70 | AudiUKCare 71 | AutodeskHelp 72 | Ayres_Hotels 73 | BCBSLA 74 | BCC_Help 75 | BCSSupport 76 | BJsWholesale 77 | BN_care 78 | BOOKSetc_online 79 | BP_UK 80 | BP_plc 81 | BRCustServ 82 | BSSHelpDesk 83 | BabiesRUs 84 | BabyJogger 85 | BananaRepublic 86 | BandQ 87 | BarclaycardNews 88 | BarclaysUK 89 | Baxter_Auto 90 | BeautyBarDotCom 91 | Beaverbrooks 92 | BedBathBeyond 93 | BenefitUK 94 | BernardBoutique 95 | BiffaService 96 | BikeMetro 97 | BitcoinHelpDesk 98 | BlackBerryHelp 99 | Blanchford 100 | Blendtec 101 | BluestarHQ 102 | Boars_Head 103 | BoostCare 104 | BootsUK 105 | BordGaisEnergy 106 | BoseService 107 | BounceEnergy 108 | BritishGas 109 | BritishGasNews 110 | Budget 111 | Buick 112 | BuildaHelpDesk 113 | BupaUK 114 | BurgerKing 115 | Burton_Menswear 116 | CDCustService 117 | COTABus 118 | CRCustomerChat 119 | CRTContactUs 120 | CTS___ 121 | CableBill 122 | CadburyUK 123 | CalMac_Updates 124 | CallawayGolfCS 125 | CanadianPacific 126 | CapMetroATX 127 | CareShopBunzl 128 | CaroKopp 129 | Cartii 
130 | CheapTixHearsU 131 | ChiccoUSA 132 | ChryslerCares 133 | Cignaquestions 134 | CiteThisForMe 135 | CitiBikeNYC 136 | Citibank 137 | CitySprint_help 138 | ClaireLBSmith 139 | ClairesEurope 140 | CoastCath 141 | ComcastILLINOIS 142 | ComcastOrSWWa 143 | CoopEnergy 144 | Costco 145 | CoxHelp 146 | DCBService 147 | DEWALT_UK 148 | DEWALTtough 149 | DFSCare 150 | DIBsupport 151 | DICKS 152 | DIGICELJamaica 153 | DIRECTV 154 | DNBcares 155 | DPDCustomerCare 156 | DSGSupport 157 | DStvNg 158 | DTSatWIT 159 | DarkBunnyTees 160 | DawnCarillion 161 | Deliveroo 162 | Deliveroo_IE 163 | Desk 164 | DevonCC 165 | DianaHSmith 166 | DigitalRiverInc 167 | Dillards 168 | Discover 169 | Discovery_SA 170 | DivvyBikes 171 | Dreams_Beds 172 | DressCircleShop 173 | DrinkSparkletts 174 | DunelmUK 175 | ENMAX 176 | EPCOR 177 | EYellin 178 | EarthLink 179 | EatNaturalBars 180 | EllenKeeble 181 | EmbracePetIns 182 | EmersonIT 183 | EntergyArk 184 | Enterprise 185 | EvansCycles 186 | EversourceCT 187 | EviteSupport 188 | Expedia 189 | ExpressHelp 190 | FFGames 191 | FH_CustomerCare 192 | FLBlueCares 193 | FLLFlyer 194 | FTcare 195 | FUT_COINSTORE 196 | FabFitFunCS 197 | FabSupportTeam 198 | Fanatics 199 | Fandom_Insider 200 | FedExCanadaHelp 201 | FedExEurope 202 | FedExHelpEU 203 | FeelGoodPark 204 | FiatCareUK 205 | FiguresToyCo 206 | FinnairHelps 207 | Fly_Norwegian 208 | Fon 209 | FonCare 210 | Footasylum 211 | Ford 212 | Forever21Help 213 | FortisBC 214 | FrankEliason 215 | FreeviewTV 216 | FromYouFlowers 217 | FrontierCare 218 | FunjetVacations 219 | FunkoDCLegion 220 | GEICO_Service 221 | GETMEIN 222 | GM 223 | GRT_ROW 224 | GSMA_Care 225 | Gap 226 | GarudaCares 227 | GenesisHousing 228 | GeoffRamm 229 | GeorgiaPower 230 | GeoxCares 231 | GlideUK 232 | GloCare 233 | GlossyboxUK 234 | GoSmartNC 235 | GoTriangle 236 | Go_CheshireWest 237 | Go_Wireless 238 | GongshowService 239 | Google 240 | GrandHyattSD 241 | GreatClipsCares 242 | Groove 243 | Grubhub_Care 244 | Gymshark 245 | Gymshark_Help 246 | HEB 247 | HMSHost 248 | HQhair_Help 249 | HRBlockAnswers 250 | HSBC_Sport 251 | HSBC_UAE 252 | HSBC_UK 253 | HSBC_US 254 | HSScustomercare 255 | HalfordsCycling 256 | HarrisTeeter 257 | HarrodsService 258 | HarryandDavid 259 | HawaiianAir 260 | HeatherJStrout 261 | HelloKit 262 | HilltopNorfolk 263 | HollisterCoHelp 264 | HomeDepotCanada 265 | HomesenseUK 266 | Honda 267 | HondaCustSvc 268 | HondaPowersprts 269 | Hootsuite_Help 270 | Hotwire 271 | Huawei 272 | HuaweiMobile 273 | HudsonshoesUK 274 | HullUni_ICT 275 | HumanaHelp 276 | HwnElectric 277 | HyattChurchill 278 | Hyken 279 | IGearBrand 280 | IKEAIESupport 281 | IKEAUKSupport 282 | IKEAUSAHelp 283 | INDOCHINO 284 | INDOT 285 | INFINITICare 286 | INFINITIUSA 287 | IcelandFoods 288 | InMotionCares 289 | Incite_Group 290 | IndyDPW 291 | IslandAirHawaii 292 | Iyengarish 293 | JDhelpteam 294 | JIRAServiceDesk 295 | JLcustserv 296 | JLove55 297 | JMstore 298 | JabraEurope 299 | Jabra_US 300 | JackWills 301 | JagexInfinity 302 | JamboPayCare 303 | Jamie1973 304 | JapanHelpDesk 305 | Jeep 306 | Jesshillcakes 307 | JetBlue 308 | JetHeads 309 | JetstarAirways 310 | Jetstar_NZ 311 | JimEllisAudi 312 | JimatPlanters 313 | JonesClocks 314 | KFC_UKI 315 | KFC_UKI_Help 316 | KLM_UK 317 | KakaoTalkPH 318 | KateNasser 319 | KauaiSA 320 | Kazel_Kimpo 321 | KenyaPower 322 | KimiNozoGuy 323 | KingFlyKOTD 324 | KitbagCS 325 | KitchenAid_CAre 326 | KrogerSupport 327 | LVcares 328 | LandRover_UK 329 | LeapCard 330 | LenovoANZ 331 | LibertyHelpDesk 332 | LibertyMutual 333 | 
LidsAssist 334 | LinkedInHelp 335 | LiveChat 336 | LiveNationON 337 | LivePhishCS 338 | Ljbpieces 339 | Logitech_ANZ 340 | Lovehoney 341 | LowesCares 342 | LubeStop 343 | MACcosmetics 344 | MBTA_CR 345 | MCMComicCon 346 | METROHouAlerts 347 | MLBFanSupport 348 | MRPHelp 349 | MRPfashion 350 | MTA 351 | MTGOTraders 352 | MTN180 353 | MUSupportDesk 354 | MVPHealthCare 355 | MWilbanks 356 | MacsalesCSTS 357 | Macys 358 | MadameTussauds 359 | MallforAfrica 360 | ManiereDeVoir 361 | MarcGoodmanBos 362 | MasterofMaltCS 363 | Matalan 364 | MaxiCosiUK 365 | MaytagBrand 366 | MaytagCare 367 | Mazda_SA 368 | McDonalds 369 | Mcheza_Care 370 | Medela_US 371 | MetroBank_Help 372 | MetroTransitMN 373 | MicrosoftHelps 374 | Minted 375 | Misfit 376 | MissSelfridge 377 | Missytohbadt 378 | ModCloth_Cares 379 | MrCabinetCareOC 380 | MrsGammaLabs 381 | Musictoday 382 | MwaveAu 383 | MyDoncaster 384 | NISMO_USA 385 | NOOK_Care 386 | NOOK_Care_UK 387 | NYSC 388 | NYTCare 389 | Nathane_Jackson 390 | NationalPro 391 | NealTopf 392 | NetflixANZ 393 | NetflixUK 394 | Netflix_CA 395 | NetoneSupport 396 | NeweggService 397 | Nipponyasancom 398 | NissanUSA 399 | Nordstrom 400 | Norfolkholidays 401 | Nutrisystem 402 | OEcare 403 | OasisFashion 404 | OfficeCandyGals 405 | OfficeDivvy 406 | OldNavyCA 407 | Ooma 408 | OoredooCare 409 | OoredooQatar 410 | OpenMike_TV 411 | OptumRx 412 | OrangeKe_Care 413 | OtterBox 414 | OtterBoxCS 415 | PASmithjr 416 | PBCares 417 | PCRichardandSon 418 | PECOconnect 419 | PIA_Cust_unCare 420 | PINGTourEurope 421 | PLC_Singapore 422 | PMInstitute 423 | PNCBank_Help 424 | PNMtalk 425 | POTUS_CustServ 426 | PaddyPowerShops 427 | PalaceMovies 428 | PanaService_UK 429 | Panago_Pizza 430 | PanasonicUSA 431 | PapaJohns 432 | PaperMate 433 | ParkHyattChi 434 | PatchworkUrchin 435 | PatriotsProShop 436 | PayPalInfoSec 437 | PaychexService 438 | Paytmcare 439 | PeabodyLDN 440 | PenskeCares 441 | PeteFyfe 442 | PeterPanBus 443 | PeterboroughCC 444 | PeugeotZA 445 | Photobox 446 | PioneerDJ 447 | PlayDoh 448 | PlayStationAU 449 | PlentyOfFish 450 | Pokemon 451 | Porsche 452 | Primark 453 | PrincesTrust 454 | ProFlowers 455 | ProtectionOne 456 | Publix 457 | PublixHelps 458 | PurolatorHelp 459 | QDStores 460 | Quinny_UK 461 | RACWA 462 | RAC_Care 463 | RBC 464 | RBC_Canada 465 | RBWM 466 | RCommCare 467 | RamCares 468 | RamseyCare 469 | Rand_Water 470 | RaneDJ 471 | RatedPeople 472 | Reachout_mcd 473 | ReebokUK 474 | Relish 475 | RideUTA 476 | RobSylvan 477 | RogersBiz 478 | RogersHelps 479 | Ronseal 480 | RoyalMail 481 | SAS 482 | SBEGlobal 483 | SEPTA_SOCIAL 484 | SFBayBikeShare 485 | SP_EnergyPeople 486 | STATravel_UK 487 | Safaricom_Care 488 | SageSupport 489 | SainsburysNews 490 | SaksService 491 | SamTrans 492 | SamsungCareSA 493 | SaskTel 494 | Schnittgemuese 495 | ScholasticClub 496 | ScottishPower 497 | Sears 498 | SeattleSPU 499 | Selfridges 500 | ServiceDivaBren 501 | Service_Queen 502 | SharisBerries 503 | ShiseidoUSA 504 | ShoePalace 505 | ShopRiteStores 506 | ShopRuche 507 | Sifter 508 | SilkFred 509 | SimplyhealthUK 510 | SkyCinemaUK 511 | SkypeSupport 512 | Sleepys 513 | SleepysCare 514 | SneakerRefresh 515 | SnoPUD 516 | Sofology 517 | SofologyHelp 518 | Sony 519 | SoundTransit 520 | SouthwestAir 521 | SparkNZ 522 | SpecializedCSUK 523 | Spencers_Retail 524 | Spinneys_Dubai 525 | SpiritAirlines 526 | Spokeo_Care 527 | SquareTrade 528 | StanChart 529 | StaplesUK 530 | StarHub 531 | StockTwitsHelp 532 | StonecareMike 533 | StraubsMarkets 534 | Stylistpick 535 | SubaruCanada 536 | 
SubwayListens 537 | Suddenlink 538 | SuddenlinkHelp 539 | SunCares 540 | Suncorp 541 | SuperShuttle 542 | Superdry 543 | Superdry_Care 544 | SwingSetService 545 | TCoughlin 546 | TEAVANA 547 | TELUS 548 | TEPenergy 549 | TKMaxx_UK 550 | TK_HelpDesk 551 | TLGTourHelp 552 | TMLewin 553 | TMRQld 554 | TMobileHelp 555 | TNTUKCare 556 | TSA 557 | TTCsue 558 | TW2CayC 559 | TWC_Help 560 | TWimpeySupport 561 | TacoBellTeam 562 | TandemDiabetes 563 | Tartineaubeurre 564 | TeamSanshee 565 | TeamSantone 566 | TeamShieldBSC 567 | TechUncensored 568 | TekCenter 569 | TelkomZA 570 | Tesco 571 | TessutiHelpTeam 572 | TextNowHelp 573 | TheBookPeople 574 | TheFrontDesk 575 | TheGymGroup 576 | TheLondonEye 577 | TheRAC_UK 578 | TheRTStore 579 | TheSharck 580 | ThomasCookCares 581 | ThomasCookUK 582 | ThomsonCares 583 | ThreadSence 584 | ThreeUK 585 | TiVo 586 | TiVoSupport 587 | TicketWeb 588 | Ticketmaster 589 | TicketmasterCA 590 | TomTom_SA 591 | Topheratl 592 | Topman 593 | TopshopHelp 594 | TotalGymDirect 595 | ToyotaCustCare 596 | ToyotaRacing 597 | ToysRUs 598 | TradeMe 599 | TruGreen 600 | Trustpilot 601 | UCLMainLibrary 602 | UFhd 603 | UHaul_Cares 604 | UKEnterprise 605 | UKMUJI 606 | UKVolkswagen 607 | UNB_ITS 608 | UPS_UK 609 | USAA_help 610 | USPS 611 | USPSHelp 612 | UWDoIT 613 | UnityQAThomas 614 | UpDesk 615 | VIZIOsupport 616 | VMUcare 617 | Venture_Cycles 618 | VerizonSupport 619 | VeryHelpers 620 | VirginAmerica 621 | VirginMediaIE 622 | Vitality_UK 623 | VitamixUK 624 | Vodacom 625 | Vodacom111 626 | VodacomRugga 627 | VodafoneAU_Help 628 | WAGSocialCare 629 | WHSmith 630 | WMSpillett 631 | WReynoldsYoung 632 | Walgreens 633 | WarThunder 634 | WatchShop 635 | WeChatZA 636 | WellsFargo 637 | WhirlpoolCare 638 | Whirlpool_CAre 639 | WholeFoods 640 | WilsonGolf 641 | WindowsSupport 642 | WingateHotels 643 | Wizards_Help 644 | XOCare 645 | YahooCare 646 | ZAGGdaily 647 | ZARA 648 | ZARA_Care 649 | Zapatosdesigner 650 | ZapposLuxury 651 | ZeekMarketplace 652 | ZoomSphere 653 | Zopim 654 | _MBService 655 | _rightio 656 | _valpow_ 657 | abbeygroup 658 | abbylynne19 659 | abellio_surrey 660 | acemtp 661 | ack 662 | acmemarkets 663 | acnestudios 664 | adamholz 665 | adidasGhana 666 | adidasUK 667 | adrianflux 668 | adrianswinscoe 669 | advocare 670 | airnzuk 671 | airtel_care 672 | alabamapower 673 | alamocares 674 | alpharooms 675 | alwaysriding 676 | americangiant 677 | andertonsmusic 678 | andrew_heister 679 | annabelkarmel 680 | asbcares 681 | ask_progressive 682 | askpermanenttsb 683 | asksurfline 684 | askvisa 685 | astros 686 | atomtickets 687 | audibleuk 688 | audiireland 689 | beautybay 690 | belk 691 | bestdealtv 692 | bexdeep 693 | beyondthedesk 694 | bigmikeyvegas 695 | blackanddecker 696 | bobbyrayburns 697 | boohoo_cshelp 698 | bookdepository 699 | booksamillion 700 | bravern 701 | brooksrunning 702 | bryanz85 703 | builds_io 704 | cableONE 705 | cafepress 706 | calorireland 707 | cam4_gay 708 | cars_portsmouth 709 | casualclassics 710 | cesarkeller 711 | cheftyler 712 | chevrolet 713 | chokemetoo 714 | chrisfenech1 715 | clarkshelp 716 | comcastcares 717 | comparemkt_care 718 | comparethemkt 719 | craftsman 720 | cstmrsvc 721 | ctcustomercare 722 | cvspharmacy 723 | danaddicott 724 | danandphilshop 725 | dancathy 726 | debthelpdesk 727 | deliverydotcom 728 | devinfinlay 729 | dgingiss 730 | dimonet 731 | directvnow 732 | djccwl 733 | dnataSupport 734 | dodgecares 735 | dongmsd 736 | doxiecare 737 | durafloruk 738 | eBay_UK 739 | easons 740 | easternbank 741 | edfenergy 742 | 
edfenergycomms 743 | eflow_freeflow 744 | eh_custcare 745 | eirNews 746 | elfcosmetics 747 | emikathure 748 | enterprisecares 749 | epicgeargaming 750 | epointsjordan 751 | esetcares 752 | etisalat 753 | eventbritehelp 754 | express 755 | farfetch 756 | faux_punk 757 | fdarenahelp 758 | fiatcares 759 | firstdirect 760 | fisherpaykelaus 761 | flySAA_US 762 | fontspring 763 | forduk 764 | foxrentcar 765 | freshlypicked 766 | fryselectronics 767 | gadventures 768 | geniusfoods 769 | getrespond 770 | getsatisfaction 771 | gigaclear 772 | glasses_direct 773 | gogreenride 774 | googlemaps 775 | graveshambc 776 | handtec 777 | hbonow 778 | helpscout 779 | hm 780 | hm_custserv 781 | hmaustralia 782 | hmcanada 783 | hmsouthafrica 784 | holden_aus 785 | holidayautos 786 | holidaytaxisCS 787 | houseoffraser 788 | hsamueljeweller 789 | hsr 790 | i_hate_ironing 791 | iiNet 792 | instituteofcs 793 | inthestyleUK 794 | itsemiel 795 | jAzzyF_BaBy 796 | jabrasport 797 | jackd 798 | jackiecas1 799 | jasoneden 800 | jaybaer 801 | jazzpk 802 | jcpenney 803 | jct600 804 | jeffreyboutique 805 | jessops 806 | jmspool 807 | joepindar 808 | joythestore 809 | jscheel 810 | k9cuisine 811 | kellie_brooks 812 | ketpoole 813 | kevinGEEdavis 814 | kiddicare 815 | kisluvkis 816 | kongacare 817 | latterash 818 | lidl_ni 819 | lidlcustomerc 820 | lids 821 | lidscanada 822 | lordandtaylor 823 | mamasandpapas 824 | mandkbutchers 825 | markmadison 826 | martindentist 827 | masnHelpDesk 828 | mattlatmatt 829 | mattr 830 | megabus 831 | melbournemuseum 832 | micahsolomon 833 | michaelshearer 834 | mitsucars 835 | mjanczewski 836 | monkiworld 837 | moreforURdollar 838 | mrpatto 839 | musicMagpie 840 | myUHC 841 | mycellcom 842 | nationalcares 843 | nationalgridus 844 | ncbja 845 | neimanmarcus 846 | netflix 847 | nextofficial 848 | nmkrobinson 849 | nokiamobile 850 | notonthehighst 851 | npowerhq 852 | nspowerinc 853 | nuerasupport 854 | nvidiacc 855 | nxcare 856 | officialpescm 857 | olympicfinishes 858 | overseasescape 859 | oxygenfreejump 860 | parracity 861 | paulodetarso24 862 | petedenton 863 | pfgregg 864 | picturehouses 865 | plusnethelp 866 | pond5 867 | premestateswine 868 | princessdvonb 869 | qatarairways 870 | redbox 871 | rhapsaadic 872 | riteaid 873 | ritual_co 874 | riverisland 875 | rwhelp 876 | sageuk 877 | sainsburys 878 | samanthastarmer 879 | samsungmobileng 880 | schuh 881 | scienceworks_mv 882 | scotiabank 883 | scottish_water 884 | scottrade 885 | secreteyesEW 886 | seetickets 887 | shaws 888 | shopmissa 889 | sizehelpteam 890 | sleepnumber 891 | smtickets 892 | sobeys 893 | spacecojo 894 | sportingindex 895 | spring 896 | sseairtricity 897 | statravelAU 898 | subaru_usa 899 | sunglassesshop 900 | supermartie 901 | swoonstars 902 | taportugal 903 | teeofftimes 904 | tescomobile 905 | tescomobilecare 906 | tescomobileire 907 | tfbalerts 908 | theMasterLink 909 | thebitchdesk 910 | thebondteam 911 | thekirbycompany 912 | thenutribullet 913 | thomaswales 914 | tjmaxx 915 | toister 916 | tonydataman 917 | townshoes 918 | traceychurray 919 | trafficscotland 920 | travelocity 921 | trimet 922 | twinklresources 923 | uga_eits 924 | united 925 | vaillantuk 926 | verabradleycare 927 | vigglesupport 928 | vmbusinesshelp 929 | vtaservice 930 | vueling 931 | vwcares 932 | w1zz 933 | wahoofitness 934 | warjohi 935 | we_energies 936 | whirlpoolusa 937 | wimrampen 938 | withazed 939 | wow_air 940 | wowairsupport 941 | wtzgoodPHL 942 | yoox 943 | zaggcare 944 | 
-------------------------------------------------------------------------------- /tasks/twitter/extract_official_twitter_dialogs.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """extract_official_twitter_dialogs.py:
4 | A script to extract text from twitter dialogs.
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import json
14 | import sys
15 | import os
16 | import six
17 | import re
18 | import argparse
19 | from datetime import datetime
20 | from tqdm import tqdm
21 | from nltk.tokenize import casual_tokenize
22 |
23 | if six.PY2:
24 | reload(sys)
25 | sys.setdefaultencoding('utf-8')
26 |
27 |
28 | def find_sequential_tweets(tweet, group):
29 | """ Check if the tweet is in multiple tweets of a single turn
30 | Args:
31 | tweet: the target tweet
32 | group: list of groups of multiple tweets
33 | Return:
34 | index of group in which the target is included
35 | if no group found, return -1
36 | """
37 | if len(group) == 0:
38 | return -1
39 | tfm = '%a %b %d %H:%M:%S +0000 %Y'
40 | tw_time = datetime.strptime(tweet['created_at'], tfm)
41 | #print (time1)
42 | for m,elm in enumerate(group):
43 | for et in elm[1].values():
44 | #print(et['created_at'])
45 | tdiff = (tw_time - datetime.strptime(et['created_at'], tfm))
46 | if tweet['in_reply_to_status_id'] is not None \
47 | and tweet['in_reply_to_status_id']==et['in_reply_to_status_id'] \
48 | and tweet['user']['id']==et['user']['id'] \
49 | and abs(tdiff.total_seconds()) < 600:
50 | return m
51 | return -1
52 |
53 |
54 | def validate_dialog(dialog, max_turns):
55 | """ Check if the dialog consists of no more than max_turns turns
56 | between two users, without truncated tweets.
57 | Args:
58 | dialog: target dialog
59 | max_turns: upper bound of #turns per dialog
60 | Return:
61 | True if the conditions are all satisfied
62 | False, otherwise
63 | """
64 | if len(dialog) > max_turns:
65 | return False
66 | # skip dialogs including truncated tweets or more users
67 | users = set()
68 | for utterance in dialog:
69 | for tid,tweet in utterance.items():
70 | if tweet['truncated'] == True:
71 | return False
72 | users.add(tweet['user']['id'])
73 | if len(users) != 2:
74 | return False
75 | return True
76 |
77 |
78 | def preprocess(text, name, speaker='U', first_name=None):
79 | """ normalize and tokenize raw text
80 | args:
81 | text: input raw text (str)
82 | name: user name (str)
83 | first_name: user's first name (str)
84 | speaker: 'S' if this is a system turn, 'U' otherwise (str)
85 | return:
86 | normalized text (str)
87 | """
88 | # modify apostrophe character
89 | text = re.sub(u'’',"'",text)
90 | text = re.sub(u'(“|”)','',text)
91 | # remove handle names in the beginning
92 | text = re.sub(r'^(@[A-Za-z0-9_]+[\.;, ])+','',text)
93 | # remove connected tweets indicator e.g. 
(1/2) (2/2)
94 | text = re.sub(r'(^|[\(\[ ])[1234]\/[2345]([\)\] ]|$)',' ',text)
95 | # replace long numbers
96 | text = re.sub(r'(?<=[ A-Z])(\+\d|\d\-|\d\d\d+|\(\d\d+\))[\d\- ]+\d\d\d','<NUMBER>',text)
97 | # replace user name in system response
98 | if speaker == 'S':
99 | if name:
100 | text = re.sub('@'+name, '<USER>', text)
101 | if first_name:
102 | text = re.sub('(^|[^A-Za-z0-9])'+first_name+'($|[^A-Za-z0-9])', '\\1<USER>\\2', text)
103 |
104 | # tokenize and replace entities
105 | words = casual_tokenize(text, preserve_case=False,reduce_len=True)
106 | for n in six.moves.range(len(words)):
107 | token = words[n]
108 | # replace entities with tags (E-MAIL, URL, NUMBERS, USER, etc)
109 | token = re.sub(r'^([a-z0-9_\.\-]+@[a-z0-9_\.\-]+\.[a-z]+)$','<E-MAIL>',token)
110 | token = re.sub(r'^https?:\S+$','<URL>',token)
111 | token = re.sub('^<number>$','<NUMBER>',token)
112 | token = re.sub('^<user>$','<USER>',token)
113 | # make spaces for apostrophe and period
114 | token = re.sub(r'^([a-z]+)\'([a-z]+)$','\\1 \'\\2',token)
115 | token = re.sub(r'^([a-z]+)\.([a-z]+)$','\\1 . \\2',token)
116 | words[n] = token
117 | # join
118 | text = ' '.join(words)
119 | # remove signature of tweets (e.g. ... ^TH, - John, etc.)
120 | if speaker == 'S':
121 | text = re.sub(u'[\\^\\-~–][\\-– ]*([a-z]+\\s*|[a-z ]{2,8})(\\s*$|\\.\\s*$|$)','\\2',text)
122 | if not re.search(r' (thanks|thnks|thx)\s*$', text):
123 | text = re.sub(u'(?<= [\\-,!?.–])\\s*[a-z]+\\s*$','',text)
124 |
125 | return text
126 |
127 |
128 | def print_dialog(dialog, fo, sys_name='', debug=False):
129 | """ print a dialog
130 | args:
131 | dialog (list): list of tweet groups
132 | sys_name (str): screen_name of system
133 | fo (object): file object to write text
134 | debug (bool): debug mode
135 | """
136 | user_name = ''
137 | user_first_name = ''
138 | prev_speaker = ''
139 | # print a dialog
140 | for utterance in dialog:
141 | for n,tweet in enumerate(sorted(utterance.values(), key=lambda u:u['id'])):
142 | screen_name = tweet['user']['screen_name']
143 | name = tweet['user']['name']
144 | text = tweet['text']
145 |
146 | if screen_name.lower() == sys_name.lower():
147 | speaker = 'S'
148 | if not debug:
149 | text = preprocess(text, user_name, speaker=speaker, first_name=user_first_name)
150 | else:
151 | speaker = 'U'
152 | if not debug:
153 | text = preprocess(text, system_name, speaker=speaker)
154 | # set user's screen name and first name to replace the names
155 | # to a common symbol (e.g. <USER>
) in system utterances
156 | user_name = screen_name
157 | tokens = name.split()
158 | if len(tokens) > 0 and len(tokens[0]) > 2:
159 | m = re.match(r'([A-Za-z0-9]+)$', tokens[0])
160 | if m:
161 | user_first_name = m.group(1)
162 | #print tokens[0],user_first_name
163 |
164 | if prev_speaker:
165 | if prev_speaker != speaker:
166 | six.print_('\n%s: %s' % (speaker,text), file=fo, end='')
167 | else: # connect to the previous tweet
168 | six.print_(' %s' % (text), file=fo, end='')
169 | else:
170 | six.print_('%s: %s' % (speaker,text), file=fo, end='')
171 |
172 | prev_speaker = speaker
173 |
174 | six.print_('\n', file=fo)
175 |
176 |
177 | def limit_dialogs(dialog_set, id_info):
178 | """
179 | limit dialog extraction by IDs
180 | """
181 | new_dialog_set = {}
182 | for did in dialog_set:
183 | dialog = dialog_set[did]
184 | new_dialog = []
185 | end = None
186 | for n in six.moves.range(len(dialog)-1,-1,-1):
187 | id_str = dialog[n]["id_str"]
188 | if end is None and id_str in id_info:
189 | end = id_info[id_str]
190 | if end is not None:
191 | new_dialog.insert(0,dialog[n])
192 | if dialog[n]['id'] == end:
193 | break
194 | if len(new_dialog) > 0:
195 | new_id = new_dialog[-1]['id_str']
196 | new_dialog_set[new_id] = new_dialog
197 | return new_dialog_set
198 |
199 |
200 | if __name__ == "__main__":
201 | # parse command line
202 | parser = argparse.ArgumentParser()
203 | parser.add_argument('-t', '--target', help='read screen names from a file')
204 | parser.add_argument('-o', '--output', help='output processed text into a file')
205 | parser.add_argument('-d', '--debug', action='store_true', help='debug mode')
206 | parser.add_argument('--data-dir', help='specify data directory')
207 | parser.add_argument('--max-turns', default=20, type=int,
208 | help='exclude long dialogs by this number')
209 | parser.add_argument('--id-file', default='', help='use begin-end ID file')
210 | parser.add_argument('--no-progress-bar', action='store_true',
211 | help='do not show a progress bar')
212 | parser.add_argument('files', metavar='FN', nargs='*',
213 | help='filenames of twitter dialogs in json format')
214 | args = parser.parse_args()
215 |
216 | # prepare filenames
217 | target_files = set()
218 | if args.target:
219 | for ln in open(args.target, 'r').readlines():
220 | target_files.add(os.path.join(args.data_dir, ln.strip() + '.json'))
221 | for fn in args.files:
222 | target_files.add(fn)
223 |
224 | target_files = sorted(list(target_files))
225 | # open output file
226 | if args.output:
227 | fo = open(args.output,'w')
228 | else:
229 | fo = sys.stdout
230 | # use id range file
231 | if args.id_file:
232 | id_info = json.load(open(args.id_file,'r'))
233 | else:
234 | id_info = None
235 |
236 | # extract dialogs
237 | # (1) merge separated tweets that can be considered one utterance
238 | # (2) filter irregular dialogs, e.g. too long, including truncated tweets, etc. 
239 | # (3) tokenize and normalize sentence text
240 | if not args.no_progress_bar:
241 | target_files = tqdm(target_files)
242 |
243 | for fn in target_files:
244 | dialog_set = json.load(open(fn,'r'))
245 | m = re.search(r'(^|\/)([^\/]+)\.json',fn)
246 | if m:
247 | system_name = m.group(2)
248 | else:
249 | raise Exception('no match to a screen name in %s' % fn)
250 |
251 | if id_info is not None:
252 | dialog_set = limit_dialogs(dialog_set, id_info[system_name])
253 |
254 | # make a tweet tree to merge sequential tweets by the same user
255 | root = {}
256 | for tid_str in sorted(dialog_set.keys()):
257 | dialog = dialog_set[tid_str]
258 | if dialog[0]['lang'] == 'en':
259 | tid = dialog[0]['id']
260 | if tid not in root:
261 | root[tid] = ([],{tid:dialog[0]})
262 | node = root[tid][0]
263 | for tweet in dialog[1:]:
264 | tid = tweet['id']
265 | m = find_sequential_tweets(tweet,node)
266 | if m >= 0:
267 | node[m][1][tid] = tweet
268 | else:
269 | node.append(([],{tid:tweet}))
270 | node = node[m][0] # when m == -1, this selects the group just appended
271 |
272 | # depth-first search and extract dialogs
273 | stack = []
274 | for tid in sorted(root.keys()):
275 | stack.append((root[tid][0], [root[tid][1]]))
276 |
277 | while len(stack)>0:
278 | node,dseq = stack.pop()
279 | if len(node) == 0:
280 | if validate_dialog(dseq, args.max_turns):
281 | print_dialog(dseq, fo, sys_name=system_name, debug=args.debug)
282 | else:
283 | for elm in node:
284 | stack.append((elm[0], dseq + [elm[1]]))
285 |
286 | if args.output:
287 | fo.close()
288 |
289 |
-------------------------------------------------------------------------------- /tasks/twitter/extract_official_twitter_testset.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """extract_official_twitter_testset.py:
4 | A script to extract the official test set from twitter dialogs.
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import json
14 | import sys
15 | import os
16 | import six
17 | import re
18 | import argparse
19 | from datetime import datetime
20 | from tqdm import tqdm
21 | from nltk.tokenize import casual_tokenize
22 |
23 | if six.PY2:
24 | reload(sys)
25 | sys.setdefaultencoding('utf-8')
26 |
27 | def preprocess(text, name, speaker='U', first_name=None):
28 | """ normalize and tokenize raw text
29 | args:
30 | text: input raw text (str)
31 | name: user name (str)
32 | first_name: user's first name (str)
33 | speaker: 'S' if this is a system turn, 'U' otherwise (str)
34 | return:
35 | normalized text (str)
36 | """
37 | # modify apostrophe character
38 | text = re.sub(u'’',"'",text)
39 | text = re.sub(u'(“|”)','',text)
40 | # remove handle names in the beginning
41 | text = re.sub(r'^(@[A-Za-z0-9_]+[\.;, ])+','',text)
42 | # remove connected tweets indicator e.g. 
(1/2) (2/2)
43 | text = re.sub(r'(^|[\(\[ ])[1234]\/[2345]([\)\] ]|$)',' ',text)
44 | # replace long numbers
45 | text = re.sub(r'(?<=[ A-Z])(\+\d|\d\-|\d\d\d+|\(\d\d+\))[\d\- ]+\d\d\d','<NUMBER>',text)
46 | # replace user name in system response
47 | if speaker == 'S':
48 | if name:
49 | text = re.sub('@'+name, '<USER>', text)
50 | if first_name:
51 | text = re.sub('(^|[^A-Za-z0-9])'+first_name+'($|[^A-Za-z0-9])', '\\1<USER>\\2', text)
52 |
53 | # tokenize and replace entities
54 | words = casual_tokenize(text, preserve_case=False,reduce_len=True)
55 | for n in six.moves.range(len(words)):
56 | token = words[n]
57 | # replace entities with tags (E-MAIL, URL, NUMBERS, USER, etc)
58 | token = re.sub(r'^([a-z0-9_\.\-]+@[a-z0-9_\.\-]+\.[a-z]+)$','<E-MAIL>',token)
59 | token = re.sub(r'^https?:\S+$','<URL>',token)
60 | token = re.sub('^<number>$','<NUMBER>',token)
61 | token = re.sub('^<user>$','<USER>',token)
62 | # make spaces for apostrophe and period
63 | token = re.sub(r'^([a-z]+)\'([a-z]+)$','\\1 \'\\2',token)
64 | token = re.sub(r'^([a-z]+)\.([a-z]+)$','\\1 . \\2',token)
65 | words[n] = token
66 | # join
67 | text = ' '.join(words)
68 | # remove signature of tweets (e.g. ... ^TH, - John, etc.)
69 | if speaker == 'S':
70 | text = re.sub(u'[\\^\\-~–][\\-– ]*([a-z]+\\s*|[a-z ]{2,8})(\\s*$|\\.\\s*$|$)','\\2',text)
71 | if not re.search(r' (thanks|thnks|thx)\s*$', text):
72 | text = re.sub(u'(?<= [\\-,!?.–])\\s*[a-z]+\\s*$','',text)
73 |
74 | return text
75 |
76 |
77 | if __name__ == "__main__":
78 | # parse command line
79 | parser = argparse.ArgumentParser()
80 | parser.add_argument('-t', '--target', help='read screen names from a file')
81 | parser.add_argument('-o', '--output', help='output processed text into a file')
82 | parser.add_argument('-d', '--debug', action='store_true', help='debug mode')
83 | parser.add_argument('--data-dir', help='specify data directory')
84 | parser.add_argument('--id-list', required=True, help='use ID list file')
85 | parser.add_argument('--no-progress-bar', action='store_true',
86 | help='do not show a progress bar')
87 | parser.add_argument('files', metavar='FN', nargs='*',
88 | help='filenames of twitter dialogs in json format')
89 | args = parser.parse_args()
90 |
91 | # prepare filenames
92 | target_files = set()
93 | if args.target:
94 | for ln in open(args.target, 'r').readlines():
95 | target_files.add(os.path.join(args.data_dir, ln.strip() + '.json'))
96 | for fn in args.files:
97 | target_files.add(fn)
98 |
99 | target_files = sorted(list(target_files))
100 | # read ID list
101 | id_list = json.load(open(args.id_list,'r'))
102 | id_pool = set()
103 | for di in id_list:
104 | for turn in di:
105 | id_pool |= set(turn['ids'])
106 | # extract necessary tweets
107 | if not args.no_progress_bar:
108 | target_files = tqdm(target_files)
109 | tweet_pool = {}
110 | for fn in target_files:
111 | dialog_set = json.load(open(fn,'r'))
112 | for tid_str in dialog_set:
113 | for tweet in dialog_set[tid_str]:
114 | tid = tweet['id']
115 | if tid in id_pool:
116 | tweet_pool[tid] = tweet
117 | # write text to file
118 | if args.output:
119 | fo = open(args.output,'w')
120 | else:
121 | fo = sys.stdout
122 | # always define tweet_id_list; wrap it with tqdm only when the progress bar is enabled
123 | tweet_id_list = id_list if args.no_progress_bar else tqdm(id_list)
124 |
125 | n_dialogs = 0
126 | n_turns = 0
127 | n_turns_in_list = 0
128 | for dialog in tweet_id_list:
129 | user_name = ''
130 | user_first_name = ''
131 | system_name = ''
132 | # print a dialog
133 | missed = False
134 | for ti,turn in enumerate(dialog):
135 | speaker = turn['speaker']
136 | six.print_('%s:' % speaker, file=fo, end='')
137 | if sum([tid in 
tweet_pool for tid in turn['ids']]) < len(turn['ids']):
138 | six.print_(' __MISSING__', file=fo)
139 | missed = True
140 |
141 | elif len(turn['ids'])==0 and ti==len(dialog)-1:
142 | six.print_(' __UNDISCLOSED__', file=fo)
143 | n_turns += 1
144 |
145 | else:
146 | for tid in turn['ids']:
147 | tweet = tweet_pool[tid]
148 | screen_name = tweet['user']['screen_name']
149 | name = tweet['user']['name']
150 | text = tweet['text']
151 | if speaker == 'S':
152 | if not args.debug:
153 | text = preprocess(text, user_name, speaker=speaker, first_name=user_first_name)
154 | system_name = screen_name
155 | else:
156 | if not args.debug:
157 | text = preprocess(text, system_name, speaker=speaker)
158 | # set user's screen name and first name to replace the names
159 | # to a common symbol (e.g. <USER>) in system utterances
160 | user_name = screen_name
161 | tokens = name.split()
162 | if len(tokens) > 0 and len(tokens[0]) > 2:
163 | m = re.match(r'([A-Za-z0-9]+)$', tokens[0])
164 | if m:
165 | user_first_name = m.group(1)
166 |
167 | six.print_(' %s' % text, file=fo, end='')
168 | six.print_('', file=fo)
169 | n_turns += 1
170 |
171 | n_turns_in_list += len(dialog)
172 | if not missed:
173 | n_dialogs += 1
174 | six.print_('', file=fo)
175 |
176 | if args.output:
177 | fo.close()
178 |
179 | print('--------------------------------------------------------')
180 | print('Number of successfully extracted dialogs: %d in %d' % (n_dialogs, len(tweet_id_list)))
181 | print('Number of successfully extracted turns: %d in %d' % (n_turns, n_turns_in_list))
182 | print('--------------------------------------------------------')
183 |
184 |
-------------------------------------------------------------------------------- /tasks/twitter/extract_twitter_dialogs.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | """extract_twitter_dialogs.py:
4 | A script to extract text from twitter dialogs.
5 |
6 | Copyright (c) 2017 Takaaki Hori (thori@merl.com)
7 |
8 | This software is released under the MIT License.
9 | http://opensource.org/licenses/mit-license.php
10 |
11 | """
12 |
13 | import json
14 | import sys
15 | import os
16 | import six
17 | import re
18 | import argparse
19 | from datetime import datetime
20 | from tqdm import tqdm
21 | from nltk.tokenize import casual_tokenize
22 |
23 | if six.PY2:
24 | reload(sys)
25 | sys.setdefaultencoding('utf-8')
26 |
27 |
28 | def find_sequential_tweets(tweet, group):
29 | """ Check if the tweet is in multiple tweets of a single turn
30 | Args:
31 | tweet: the target tweet
32 | group: list of groups of multiple tweets
33 | Return:
34 | index of group in which the target is included
35 | if no group found, return -1
36 | """
37 | if len(group) == 0:
38 | return -1
39 | tfm = '%a %b %d %H:%M:%S +0000 %Y'
40 | tw_time = datetime.strptime(tweet['created_at'], tfm)
41 | #print (time1)
42 | for m,elm in enumerate(group):
43 | for et in elm[1].values():
44 | #print(et['created_at'])
45 | tdiff = (tw_time - datetime.strptime(et['created_at'], tfm))
46 | if tweet['in_reply_to_status_id'] is not None \
47 | and tweet['in_reply_to_status_id']==et['in_reply_to_status_id'] \
48 | and tweet['user']['id']==et['user']['id'] \
49 | and abs(tdiff.total_seconds()) < 600:
50 | return m
51 | return -1
52 |
53 |
54 | def validate_dialog(dialog, max_turns):
55 | """ Check if the dialog consists of no more than max_turns turns
56 | between two users, without truncated tweets. 
57 | Args: 58 | dialog: target dialog 59 | max_turns: upper bound of #turns per dialog 60 | Return: 61 | True if the conditions are all satisfied 62 | False, otherwise 63 | """ 64 | if len(dialog) > max_turns: 65 | return False 66 | # skip dialogs that include truncated tweets or involve more than two users 67 | users = set() 68 | for utterance in dialog: 69 | for tid,tweet in utterance.items(): 70 | if tweet['truncated']: 71 | return False 72 | users.add(tweet['user']['id']) 73 | if len(users) != 2: 74 | return False 75 | return True 76 | 77 | 78 | def preprocess(text, name, speaker='U', first_name=None): 79 | """ normalize and tokenize raw text 80 | args: 81 | text: input raw text (str) 82 | name: screen name to be masked as <USER> in system utterances (str) 83 | speaker: 'S' if this is a system turn, 'U' otherwise (str) 84 | first_name: user's first name (str) 85 | return: 86 | normalized text (str) 87 | """ 88 | # modify apostrophe character 89 | text = re.sub(u'’',"'",text) 90 | text = re.sub(u'(“|”)','',text) 91 | # remove handle names in the beginning 92 | text = re.sub(r'^(@[A-Za-z0-9_]+[\.;, ])+','',text) 93 | # remove connected tweets indicator e.g. (1/2) (2/2) 94 | text = re.sub(r'(^|[\(\[ ])[1234]\/[2345]([\)\] ]|$)',' ',text) 95 | # replace long numbers 96 | text = re.sub(r'(?<=[ A-Z])(\+\d|\d\-|\d\d\d+|\(\d\d+\))[\d\- ]+\d\d\d','<NUMBERS>',text) 97 | # replace user name in system response 98 | if speaker == 'S': 99 | if name: 100 | text = re.sub('@'+name, '<USER>', text) 101 | if first_name: 102 | text = re.sub('(^|[^A-Za-z0-9])'+first_name+'($|[^A-Za-z0-9])', '\\1<USER>\\2', text) 103 | 104 | # tokenize and replace entities 105 | words = casual_tokenize(text, preserve_case=False,reduce_len=True) 106 | for n in six.moves.range(len(words)): 107 | token = words[n] 108 | # replace entities with tags (E-MAIL, URL, NUMBERS, USER, etc) 109 | token = re.sub(r'^([a-z0-9_\.\-]+@[a-z0-9_\.\-]+\.[a-z]+)$','<E-MAIL>',token) 110 | token = re.sub(r'^https?:\S+$','<URL>',token) 111 | token = re.sub('^<numbers>$','<NUMBERS>',token)  # restore tags lowercased by the tokenizer 112 | token = re.sub('^<user>$','<USER>',token) 113 | # make spaces for apostrophe and period 114 | token = re.sub(r'^([a-z]+)\'([a-z]+)$','\\1 \'\\2',token) 115 | token = re.sub(r'^([a-z]+)\.([a-z]+)$','\\1 . \\2',token) 116 | words[n] = token 117 | # join 118 | text = ' '.join(words) 119 | # remove signature of tweets (e.g. ... ^TH, - John, etc.)
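# (illustrative only; hypothetical tweets, not from the corpus: the rules below
#  turn 'glad we could help ! ^td' into 'glad we could help !' and strip the
#  trailing '- john' from 'sure , we can do that . - john', while the
#  (thanks|thnks|thx) check keeps a genuine closing 'thanks' from being deleted)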
120 | if speaker == 'S': 121 | text = re.sub(u'[\\^\\-~–][\\-– ]*([a-z]+\\s*|[a-z ]{2,8})(\\s*$|\\.\\s*$|$)','\\2',text) 122 | if not re.search(r' (thanks|thnks|thx)\s*$', text): 123 | text = re.sub(u'(?<= [\\-,!?.–])\\s*[a-z]+\\s*$','',text) 124 | 125 | return text 126 | 127 | 128 | def print_dialog(dialog, fo, sys_name='', debug=False): 129 | """ print a dialog 130 | args: 131 | dialog (list): list of tweet groups 132 | fo (object): file object to write text 133 | sys_name (str): screen_name of the system account 134 | debug (bool): debug mode 135 | """ 136 | user_name = '' 137 | user_first_name = '' 138 | prev_speaker = '' 139 | # print a dialog 140 | for utterance in dialog: 141 | for tweet in sorted(utterance.values(), key=lambda u:u['id']): 142 | screen_name = tweet['user']['screen_name'] 143 | name = tweet['user']['name'] 144 | text = tweet['text'] 145 | if debug: 146 | six.print_('### %s: %s' % (name,text), file=fo) 147 | 148 | if screen_name.lower() == sys_name.lower(): 149 | speaker = 'S' 150 | text = preprocess(text, user_name, speaker=speaker, first_name=user_first_name) 151 | else: 152 | speaker = 'U' 153 | text = preprocess(text, sys_name, speaker=speaker)  # use the argument, not a global 154 | # set user's screen name and first name to replace the names 155 | # with a common symbol (e.g. <USER>) in system utterances 156 | user_name = screen_name 157 | tokens = name.split() 158 | if len(tokens) > 0 and len(tokens[0]) > 2: 159 | m = re.match(r'([A-Za-z0-9]+)$', tokens[0]) 160 | if m: 161 | user_first_name = m.group(1) 162 | 163 | 164 | if prev_speaker: 165 | if prev_speaker != speaker: 166 | six.print_('\n%s: %s' % (speaker,text), file=fo, end='') 167 | else: # connect to the previous tweet 168 | six.print_(' %s' % (text), file=fo, end='') 169 | else: 170 | six.print_('%s: %s' % (speaker,text), file=fo, end='') 171 | 172 | prev_speaker = speaker 173 | 174 | six.print_('\n', file=fo) 175 | 176 | 177 | if __name__ == "__main__": 178 | # parse command line 179 | parser = argparse.ArgumentParser() 180 | parser.add_argument('-t', '--target', help='read screen names from a file') 181 | parser.add_argument('-o', '--output', help='output processed text into a file') 182 | parser.add_argument('-d', '--debug', action='store_true', help='debug mode') 183 | parser.add_argument('--data-dir', help='specify data directory') 184 | parser.add_argument('--max-turns', type=int, default=20, 185 | help='exclude dialogs with more turns than this number') 186 | parser.add_argument('--no-progress-bar', action='store_true', 187 | help='do not show progress bar') 188 | parser.add_argument('files', metavar='FN', nargs='*', 189 | help='filenames of twitter dialogs in json format') 190 | args = parser.parse_args() 191 | 192 | # prepare filenames 193 | target_files = set() 194 | if args.target: 195 | for ln in open(args.target, 'r').readlines(): 196 | target_files.add(os.path.join(args.data_dir, ln.strip() + '.json')) 197 | for fn in args.files: 198 | target_files.add(fn) 199 | 200 | target_files = sorted(list(target_files)) 201 | # open output file 202 | if args.output: 203 | fo = open(args.output,'w') 204 | else: 205 | fo = sys.stdout 206 | 207 | # extract dialogs 208 | # (1) merge separated tweets that can be considered one utterance 209 | # (2) filter irregular dialogs, e.g. too long, including truncated tweets, etc.
210 | # (3) tokenize and normalize sentence text 211 | if not args.no_progress_bar: 212 | target_files = tqdm(target_files) 213 | 214 | for fn in target_files: 215 | dialog_set = json.load(open(fn,'r')) 216 | m = re.search(r'(^|\/)([^\/]+)\.json',fn) 217 | if m: 218 | system_name = m.group(2) 219 | else: 220 | raise Exception('no match to a screen name in %s' % fn) 221 | 222 | # make a tweet tree to merge sequential tweets by the same user 223 | root = {} 224 | for tid_str in sorted(dialog_set.keys()): 225 | dialog = dialog_set[tid_str] 226 | if dialog[0]['lang'] == 'en': 227 | tid = dialog[0]['id'] 228 | if tid not in root: 229 | root[tid] = ([],{tid:dialog[0]}) 230 | node = root[tid][0] 231 | for tweet in dialog[1:]: 232 | tid = tweet['id'] 233 | m = find_sequential_tweets(tweet,node) 234 | if m >= 0: 235 | node[m][1][tid] = tweet 236 | else: 237 | node.append(([],{tid:tweet})) 238 | node = node[m][0]  # m is -1 after append, so this also selects the new group 239 | 240 | # depth-first search and extract dialogs 241 | stack = [] 242 | for tid in sorted(root.keys()): 243 | stack.append((root[tid][0], [root[tid][1]])) 244 | 245 | while len(stack)>0: 246 | node,dseq = stack.pop() 247 | if len(node) == 0: 248 | if validate_dialog(dseq, args.max_turns): 249 | print_dialog(dseq, fo, sys_name=system_name, debug=args.debug) 250 | else: 251 | for elm in node: 252 | stack.append((elm[0], dseq + [elm[1]])) 253 | 254 | if args.output: 255 | fo.close() 256 | 257 | -------------------------------------------------------------------------------- /tasks/twitter/make_official_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # you can change the directory path according to the location of stored data 4 | stored_data=../../collect_twitter_dialogs/official_stored_data 5 | 6 | # you need to download tweet ID info to specify the data sets 7 | idfile=official_begin_end_ids.json 8 | idlink=https://www.dropbox.com/s/8lmpu5dfw2hmpys/official_begin_end_ids.json.gz 9 | echo downloading begin/end IDs for extracting data 10 | wget $idlink 11 | gunzip -f ${idfile}.gz 12 | 13 | # extract train and dev sets 14 | echo extracting training set 15 | ./extract_official_twitter_dialogs.py --data-dir $stored_data --id-file $idfile -t official_account_names_train.txt -o twitter_official_data_train.txt 16 | 17 | echo extracting development set 18 | ./extract_official_twitter_dialogs.py --data-dir $stored_data --id-file $idfile -t official_account_names_dev.txt -o twitter_official_data_dev.txt 19 | 20 | # extract 500 samples from dev set randomly for tentative evaluation 21 | echo extracting test set 22 | ./utils/sample_dialogs.py twitter_official_data_dev.txt 500 > twitter_official_data_dev500.txt 23 | 24 | echo done 25 | 26 | echo checking data size 27 | ./utils/check_dialogs.py twitter_official_data_train.txt twitter_official_data_dev.txt 28 | 29 | -------------------------------------------------------------------------------- /tasks/twitter/make_official_testset+refs.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # you can change the directory path according to the location of stored data 4 | stored_data=../../collect_twitter_dialogs/official_stored_data 5 | 6 | # tweet ID list of the test set 7 | # (reference text is included since the challenge period has already finished) 8 | idlist=official_testset_ids+refs.json 9 | gunzip -c ${idlist}.gz > $idlist 10 | 11 | echo extracting the official test set 12 | ./extract_official_twitter_testset.py \ 13 | --id-list
$idlist \ 14 | --data-dir $stored_data \ 15 | --target official_account_names_test.txt \ 16 | --output twitter_official_data_test+refs.txt 17 | 18 | echo done 19 | -------------------------------------------------------------------------------- /tasks/twitter/make_official_testset.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # you can change the directory path according to the location of stored data 4 | stored_data=../../collect_twitter_dialogs/official_stored_data 5 | 6 | # tweet ID list of the test set 7 | # (reference text is not provided, i.e., last turn of each dialog is empty) 8 | idlist=official_testset_ids.json 9 | gunzip -c ${idlist}.gz > $idlist 10 | 11 | echo extracting the official test set 12 | ./extract_official_twitter_testset.py \ 13 | --id-list $idlist \ 14 | --data-dir $stored_data \ 15 | --target official_account_names_test.txt \ 16 | --output twitter_official_data_test.txt 17 | 18 | echo done 19 | -------------------------------------------------------------------------------- /tasks/twitter/make_trial_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # you can change the directory path according to the location of stored data 4 | stored_data=../../collect_twitter_dialogs/stored_data 5 | 6 | # extract train and dev sets 7 | echo extracting training set 8 | ./extract_twitter_dialogs.py --data-dir $stored_data -t account_names_train.txt -o twitter_trial_data_train.txt 9 | echo extracting development set 10 | ./extract_twitter_dialogs.py --data-dir $stored_data -t account_names_dev.txt -o twitter_trial_data_dev.txt 11 | 12 | # extract 500 samples from dev set randomly for evaluation 13 | echo extracting test set 14 | ./utils/sample_dialogs.py twitter_trial_data_dev.txt 500 > twitter_trial_data_eval.txt 15 | 16 | echo done 17 | -------------------------------------------------------------------------------- /tasks/twitter/official_account_names_dev.txt: -------------------------------------------------------------------------------- 1 | 1800flowers 2 | ALCATEL1TOUCH 3 | ATTCares 4 | AetnaHelp 5 | AskCapitalOne 6 | AskMTNGhana 7 | AsurionCares 8 | AvidSupport 9 | Avis 10 | BookChicClub 11 | BostonMarki 12 | CIPD 13 | CMLFerry 14 | CadburyIreland 15 | Camper_CustCare 16 | CastelliCafe 17 | ChaseSupport 18 | CollectorCorps 19 | DollarGeneral 20 | DolphinCares 21 | ENMAXenergy 22 | EmiratesSupport 23 | Entergy 24 | Equinox_Service 25 | FCUK 26 | FedExCanada 27 | FoodLion 28 | FootballIndexUK 29 | FordSouthAfrica 30 | GapCA 31 | GiltService 32 | GrandHyattNYC 33 | GreyhoundBus 34 | HDCares 35 | HTCHelp 36 | Heinens 37 | IPFW_ITS_HD 38 | IWCare 39 | IrishRail 40 | JKalena123 41 | JawboneSupport 42 | JeepCares 43 | Jersey_City 44 | KitchenAid_CA 45 | Kohls 46 | LQcontactus 47 | LinkedInUK 48 | Lo573 49 | LoveWilko 50 | LozHarvey 51 | MDSHA 52 | MaclarenService 53 | MarshaCollier 54 | MeteorPR 55 | NBASTORE 56 | O2IRL 57 | OfficeShoes 58 | OrbitzCareTeam 59 | PRISM_NSA 60 | PanagoCares 61 | PhilipsCare_UK 62 | PostOfficeNews 63 | PoweradeGB 64 | RareLondon 65 | RubyBlogVAA 66 | SAS_Cares 67 | SCG_CS 68 | SDixonVivint 69 | SKYserves 70 | SSE 71 | ShoeboxedHelp 72 | SkyIreland 73 | SkySaga 74 | SonySupportUSA 75 | SouthWestWater 76 | StarbucksUKCA 77 | SunTzuHelpDesk 78 | SurflineGH 79 | Tech21Official 80 | TheBodyShopUK 81 | TicketmasterCS 82 | TravelGuard 83 | UconnectCares 84 | VolvoCarUSA 85 | Walmart 86 | WasteWR 87 | _omocat 88 | askevanscycles 89 |
atmosenergy 90 | audiblesupport 91 | austin_reed 92 | catalysthousing 93 | champssports 94 | consult53 95 | dan_malarkey 96 | digicelbarbados 97 | dish 98 | edfenergycs 99 | feelunique 100 | hmunitedkingdom 101 | justhype 102 | kenmore 103 | lidl_ireland 104 | marilynsuttle 105 | miasampietro 106 | officedepot 107 | originenergy 108 | paycity 109 | paysafecard 110 | placesforpeople 111 | portlandwater 112 | pretavoir 113 | searscares 114 | skrill 115 | spreecoza 116 | stayhomeclub 117 | superbalist 118 | -------------------------------------------------------------------------------- /tasks/twitter/official_account_names_test.txt: -------------------------------------------------------------------------------- 1 | 1Sale 2 | ALawFirmTrainer 3 | AUKEYofficial 4 | AldiUSA 5 | AmyLolaM 6 | AshleyHomeHelp 7 | AskAmexCanada 8 | AskTigogh 9 | BKLNIndustries 10 | BestBuySupport 11 | Blu_GH 12 | CVSHealth 13 | CXSocialCare 14 | CanyonUK 15 | ChipRBell 16 | CustomerSure 17 | DrinkCrystal 18 | EdgarsFashion 19 | EntergyLA 20 | Everlane 21 | FCHolidayCares 22 | FamilyVideo 23 | FedExHelp 24 | FragranceShopUK 25 | FreedomMobile 26 | GlasgowSubway 27 | Glimpsei 28 | GrabTH 29 | GreenFlagUK 30 | HartmannLuggage 31 | Hebrideangirl 32 | HoldenSupport 33 | HouseofHelpers 34 | HydeHousing 35 | IKEACASupport 36 | IrishWater 37 | JeepCareUK 38 | KFCAustralia 39 | KarmaloopHelp 40 | LGUS 41 | LMNmakeup 42 | LeonelDorisca 43 | Lexus 44 | LogMeIn 45 | LoveLullaBellz 46 | LushWeCare 47 | MIsabelgarcia65 48 | MOO 49 | MatalanHelp 50 | MercedesUKTeam 51 | Moto_Support 52 | MyWakefield 53 | NJTRANSIT 54 | NetflixNordic 55 | OneDayOnlycoza 56 | PLDT_Cares 57 | PNCBank 58 | PayUmoney 59 | PitneyBowes 60 | SeattlesBest 61 | Shayna 62 | SubaruCustCare 63 | SupergaUK 64 | TOMSsupport 65 | TTChelps 66 | TelephoneDoctor 67 | TheQAssist 68 | TicketmasterIre 69 | UKFast 70 | UKNikon 71 | UPMCHealthPlan 72 | UlsterBank 73 | UniqloUSA 74 | UrbanWaxx 75 | VodacomSoccer 76 | WHGSupport 77 | WW_Canada 78 | WarbsNewburnBH 79 | blockbuster 80 | daysinn 81 | enbridgegas 82 | keybank 83 | narrativecare 84 | ntelcare 85 | ntelng 86 | pdbutterworth 87 | poweradeau 88 | redboxcare 89 | reef84 90 | sarahlennon08 91 | service 92 | servicenow 93 | socalgas 94 | sprintcare 95 | starbuckshelp 96 | superdrug 97 | vapouriz 98 | vauxhall 99 | vmbusiness 100 | zumiezhelp 101 | -------------------------------------------------------------------------------- /tasks/twitter/official_account_names_train.txt: -------------------------------------------------------------------------------- 1 | 1800PetMeds 2 | 1800flowershelp 3 | 1DFAQ 4 | 64audio 5 | 888sport 6 | ABCustomerCare 7 | ABTAtravel 8 | AFC_Amy 9 | AFCustomerCare 10 | APCcustserv 11 | ASHelpMe 12 | ASOS 13 | ASOS_Au 14 | ASOS_Us 15 | ASUHelpCenter 16 | ASUSUSA 17 | AVIVAIRELAND 18 | AbercrombieHelp 19 | AddisonLeeCabs 20 | AdobeCare 21 | AflacPhyllis 22 | AirAsia 23 | AirChinaNA 24 | AirtelNigeria 25 | Alamo 26 | AlaskaAir 27 | Albertsons 28 | AlderTraining 29 | AldiUK 30 | Aldi_Ireland 31 | AlfaRomeoCareUK 32 | AlfaRomeoUSA 33 | AliExpress_EN 34 | AlissaDosSantos 35 | AllClearID 36 | Allegiant 37 | AllianzTravelUS 38 | Allstate 39 | AllyCare 40 | AlshayaHelpDesk 41 | AmanaCare 42 | AmazonHelp 43 | AmazonVideoUK 44 | AmbitiousPrints 45 | AmericanAir 46 | AmexAU 47 | AndreaSWilson 48 | AnglianHelp 49 | AnkiSupport 50 | AnthemBC_News 51 | AntwanBarnes42 52 | AppleSupport 53 | AppleyardLondon 54 | ArqivaWiFi 55 | AsdaCustCare 56 | Ask123ie 57 | AskAmex 58 | AskAmexAU 59 | AskAnnaNouyou 60 | 
AskCiti 61 | AskDyson 62 | AskEmblemHealth 63 | AskExplorer 64 | AskKBCIreland 65 | AskLandsEnd 66 | AskMarstons 67 | AskMeHelpDesk 68 | AskPS_UK 69 | AskPaddyPower 70 | AskSmythsToys 71 | AskSubaruCanada 72 | AskTSA 73 | AskTarget 74 | AskTeamUA 75 | Ask_Spectrum 76 | AskeBay 77 | Askvodafonegh 78 | AsmodeeNA 79 | Atomos_News 80 | AudiUK 81 | AudiUKCare 82 | AutodeskHelp 83 | Ayres_Hotels 84 | BCBSLA 85 | BCC_Help 86 | BCSSupport 87 | BJsWholesale 88 | BN_care 89 | BOOKSetc_online 90 | BP_UK 91 | BP_plc 92 | BRCustServ 93 | BSSHelpDesk 94 | BabiesRUs 95 | BabyJogger 96 | Bajaj_Finance 97 | BananaRepublic 98 | BandQ 99 | BarclaycardNews 100 | BarclaysUK 101 | BeautyBarDotCom 102 | Beaverbrooks 103 | BedBathBeyond 104 | BenefitUK 105 | BernardBoutique 106 | Bernstein 107 | BiffaService 108 | BikeMetro 109 | BitcoinHelpDesk 110 | BlackBerryHelp 111 | Blanchford 112 | Blendtec 113 | BluestarHQ 114 | Boars_Head 115 | BoostCare 116 | BootsUK 117 | BordGaisEnergy 118 | BoseService 119 | BounceEnergy 120 | BritishGas 121 | BritishGasNews 122 | Budget 123 | Buick 124 | BuildaHelpDesk 125 | BupaUK 126 | BurberryService 127 | BurgerKing 128 | Burton_Menswear 129 | CCP_Help 130 | CDCustService 131 | COTABus 132 | CRCustomerChat 133 | CRTContactUs 134 | CTS___ 135 | CableBill 136 | CadburyUK 137 | CalMac_Updates 138 | CallawayGolfCS 139 | CanadianPacific 140 | CapMetroATX 141 | CareShopBunzl 142 | CareerBuilder 143 | CaroKopp 144 | Cartii 145 | CellC_Support 146 | CheapTixHearsU 147 | ChiccoUSA 148 | Chomperbee 149 | ChrissyChong 150 | ChryslerCares 151 | Cignaquestions 152 | CiteThisForMe 153 | CitiBikeNYC 154 | Citibank 155 | CitySprint_help 156 | ClaireLBSmith 157 | ClairesEurope 158 | CoastCath 159 | ComcastILLINOIS 160 | ComcastOrSWWa 161 | CoopEnergy 162 | Costco 163 | CoxHelp 164 | DCBService 165 | DEWALT_UK 166 | DEWALTtough 167 | DFSCare 168 | DIBsupport 169 | DICKS 170 | DIGICELJamaica 171 | DIRECTV 172 | DK_Assist 173 | DNBcares 174 | DPDCustomerCare 175 | DSGSupport 176 | DStvNg 177 | DTSatWIT 178 | DarielleRose 179 | DarkBunnyTees 180 | DawnCarillion 181 | Deliveroo 182 | Deliveroo_IE 183 | DevonCC 184 | DianaHSmith 185 | DigitalRiverInc 186 | Dillards 187 | Discover 188 | Discovery_SA 189 | DivvyBikes 190 | DocumentDB 191 | DoorDash 192 | Dreams_Beds 193 | DressCircleShop 194 | DrinkSparkletts 195 | DunelmUK 196 | DustoMan 197 | ENMAX 198 | EPCOR 199 | EYellin 200 | EarthLink 201 | EatNaturalBars 202 | EllenKeeble 203 | EmbracePetIns 204 | EmersonIT 205 | EntergyArk 206 | Enterprise 207 | EvansCycles 208 | EversourceCT 209 | EviteSupport 210 | Expedia 211 | ExpressHelp 212 | FFGames 213 | FH_CustomerCare 214 | FLBlueCares 215 | FLLFlyer 216 | FTcare 217 | FUT_COINSTORE 218 | FabFitFunCS 219 | FabSupportTeam 220 | FamilyBankKenya 221 | Fanatics 222 | Fandom_Insider 223 | FantasticCleanL 224 | FedExCanadaHelp 225 | FedExEurope 226 | FedExHelpEU 227 | FeelGoodPark 228 | FiatCareUK 229 | FiguresToyCo 230 | FinnairHelps 231 | Fly_Norwegian 232 | Fon 233 | FonCare 234 | Footasylum 235 | Ford 236 | Forever21Help 237 | FortisBC 238 | FrankEliason 239 | FreeviewTV 240 | FromYouFlowers 241 | FrontierCare 242 | FunjetVacations 243 | FunkoDCLegion 244 | GEICO_Service 245 | GETMEIN 246 | GM 247 | GRT_ROW 248 | GSMA_Care 249 | Gap 250 | GarudaCares 251 | GenesisHousing 252 | GeoffRamm 253 | GeorgiaPower 254 | GeoxCares 255 | GlideUK 256 | GloCare 257 | GlossyboxUK 258 | GoIdahHelpdesk 259 | GoTriangle 260 | Go_CheshireWest 261 | Go_Wireless 262 | GongshowService 263 | Google 264 | GrandHyattSD 265 | 
GreatClipsCares 266 | Groove 267 | Grubhub_Care 268 | Gymshark 269 | Gymshark_Help 270 | HEB 271 | HMSHost 272 | HQhair_Help 273 | HRBlockAnswers 274 | HSBC_Sport 275 | HSBC_UAE 276 | HSBC_UK 277 | HSBC_US 278 | HSScustomercare 279 | HalfordsCycling 280 | HarrisTeeter 281 | HarrodsService 282 | HarryandDavid 283 | HawaiianAir 284 | HeatherJStrout 285 | HelloKit 286 | HilltopNorfolk 287 | HollisterCoHelp 288 | HomeDepotCanada 289 | HomesenseUK 290 | Honda 291 | HondaCustSvc 292 | HondaPowersprts 293 | Hootsuite_Help 294 | Hotwire 295 | Huawei 296 | HuaweiMobile 297 | HudsonshoesUK 298 | HullUni_ICT 299 | HumanaHelp 300 | HwnElectric 301 | HyattChicago 302 | HyattChurchill 303 | Hyken 304 | IGearBrand 305 | IKEAIESupport 306 | IKEAUKSupport 307 | IKEAUSAHelp 308 | INDOCHINO 309 | INDOT 310 | INFINITICare 311 | INFINITIUSA 312 | IcelandFoods 313 | InMotionCares 314 | Incite_Group 315 | IndyDPW 316 | IslandAirHawaii 317 | Iyengarish 318 | JDhelpteam 319 | JIRAServiceDesk 320 | JLcustserv 321 | JLove55 322 | JMstore 323 | JabraEurope 324 | Jabra_US 325 | JackWills 326 | JagexInfinity 327 | JamboPayCare 328 | Jamie1973 329 | JapanHelpDesk 330 | JasonWeigandt 331 | Jeep 332 | Jesshillcakes 333 | JetBlue 334 | JetHeads 335 | JetstarAirways 336 | Jetstar_NZ 337 | JimEllisAudi 338 | JimatPlanters 339 | JoinRelate 340 | JonesClocks 341 | JordanLeggett16 342 | Jostens 343 | KFC_UKI 344 | KFC_UKI_Help 345 | KLM_UK 346 | KakaoTalkPH 347 | KateNasser 348 | KauaiSA 349 | Kazel_Kimpo 350 | KenyaPower 351 | Kia 352 | KimiNozoGuy 353 | KingFlyKOTD 354 | KitbagCS 355 | KitchenAid_CAre 356 | KrogerSupport 357 | LPTechTeam 358 | LVcares 359 | LandRover_UK 360 | LavishAlice 361 | LeapCard 362 | LenovoANZ 363 | Level99Games 364 | LibertyHelpDesk 365 | LibertyMutual 366 | LidsAssist 367 | LinkedInHelp 368 | LiveChat 369 | LiveNationON 370 | LivePhishCS 371 | Ljbpieces 372 | Logitech_ANZ 373 | Lovehoney 374 | LowesCares 375 | LubeStop 376 | MACcosmetics 377 | MBTA_CR 378 | MCMComicCon 379 | METROHouAlerts 380 | MLBFanSupport 381 | MRPHelp 382 | MRPfashion 383 | MSE_Forum 384 | MTA 385 | MTGOTraders 386 | MTN180 387 | MUSupportDesk 388 | MVPHealthCare 389 | MWilbanks 390 | Macys 391 | MadameTussauds 392 | MallforAfrica 393 | MandT_Bank 394 | ManiereDeVoir 395 | MarcGoodmanBos 396 | MariNBCSD 397 | MarkSuppNet 398 | MasterofMaltCS 399 | Matalan 400 | MaxiCosiUK 401 | MaytagBrand 402 | MaytagCare 403 | Mazda_SA 404 | McDonalds 405 | Mcheza_Care 406 | Medela_US 407 | MetroBank_Help 408 | MetroTransitMN 409 | MicrosoftHelps 410 | Minted 411 | Misfit 412 | MissSelfridge 413 | Missytohbadt 414 | ModCloth_Cares 415 | MrCabinetCareOC 416 | MrsGammaLabs 417 | Musictoday 418 | MwaveAu 419 | MyDoncaster 420 | NISMO_USA 421 | NOOK_Care 422 | NOOK_Care_UK 423 | NYSC 424 | NYTCare 425 | Nathane_Jackson 426 | NationalPro 427 | NealTopf 428 | NetflixANZ 429 | NetflixUK 430 | Netflix_CA 431 | NetoneSupport 432 | NeweggService 433 | Nipponyasancom 434 | NissanUSA 435 | Nordstrom 436 | Norfolkholidays 437 | Nutrisystem 438 | OEcare 439 | OasisFashion 440 | OfficeCandyGals 441 | OfficeDivvy 442 | OldNavyCA 443 | One_Bobby_Moore 444 | Ooma 445 | OoredooCare 446 | OoredooQatar 447 | OpenMike_TV 448 | OptumRx 449 | OtelSupport 450 | OtterBox 451 | OtterBoxCS 452 | PASmithjr 453 | PBCares 454 | PCRichardandSon 455 | PECOconnect 456 | PIA_Cust_unCare 457 | PINGTourEurope 458 | PLC_Singapore 459 | PMInstitute 460 | PNCBank_Help 461 | PNMtalk 462 | POTUS_CustServ 463 | PaddyPowerShops 464 | PalaceMovies 465 | PanaService_UK 466 | Panago_Pizza 
467 | PanasonicUSA 468 | PapaJohns 469 | PaperMate 470 | ParkHyattChi 471 | PartyCity 472 | PatchworkUrchin 473 | PatriotsProShop 474 | PayPalInfoSec 475 | PaychexService 476 | Paytmcare 477 | PeabodyLDN 478 | PenskeCares 479 | PeteFyfe 480 | PeterPanBus 481 | PeterboroughCC 482 | PeugeotZA 483 | Photobox 484 | PioneerDJ 485 | PlayDoh 486 | PlayStationAU 487 | PlentyOfFish 488 | Pokemon 489 | Porsche 490 | Primark 491 | PrincesTrust 492 | ProFlowers 493 | ProtectionOne 494 | Publix 495 | PublixHelps 496 | PurolatorHelp 497 | QDStores 498 | Quinny_UK 499 | RACWA 500 | RAC_Care 501 | RBC 502 | RBKC_CS 503 | RBWM 504 | RCommCare 505 | RIPTA_RI 506 | RamCares 507 | RamseyCare 508 | Rand_Water 509 | RaneDJ 510 | RatedPeople 511 | Reachout_mcd 512 | ReebokUK 513 | Relish 514 | RideUTA 515 | RobSylvan 516 | RogersBiz 517 | RogersHelps 518 | Ronseal 519 | RoyalMail 520 | SACustCare 521 | SAS 522 | SBEGlobal 523 | SEPTA_SOCIAL 524 | SP_EnergyPeople 525 | STATravel_UK 526 | Safaricom_Care 527 | SageSupport 528 | SainsburysNews 529 | SaksService 530 | SallyBeautyUK 531 | SamTrans 532 | SamsungCareSA 533 | SaskTel 534 | Schnittgemuese 535 | ScholasticClub 536 | Scienceofsport 537 | ScottishPower 538 | Sears 539 | SearsHomeExpert 540 | SeattleSPU 541 | Selfridges 542 | ServiceDivaBren 543 | Service_Queen 544 | SharisBerries 545 | ShiseidoUSA 546 | ShoePalace 547 | ShopRiteStores 548 | ShopRuche 549 | Sifter 550 | SilkFred 551 | SimplyhealthUK 552 | SkyCinemaUK 553 | SkypeSupport 554 | Sleepys 555 | SleepysCare 556 | SneakerRefresh 557 | SnoPUD 558 | SoBiHamilton 559 | SocialCaddieGC 560 | Sofology 561 | SofologyHelp 562 | Sony 563 | SoundTransit 564 | SouthwestAir 565 | SparkNZ 566 | SpecializedCSUK 567 | Spencers_Retail 568 | Spinneys_Dubai 569 | SpiritAirlines 570 | Spokeo_Care 571 | SquareTrade 572 | StanChart 573 | StaplesUK 574 | StarHub 575 | Stixzadinia 576 | StockTwitsHelp 577 | StonecareMike 578 | StraubsMarkets 579 | Stylistpick 580 | SubaruCanada 581 | SubwayListens 582 | Suddenlink 583 | SuddenlinkHelp 584 | SunCares 585 | Suncorp 586 | SuperShuttle 587 | Superdry 588 | Superdry_Care 589 | SwingSetService 590 | TCoughlin 591 | TEAVANA 592 | TELUS 593 | TEPenergy 594 | TKMaxx_UK 595 | TK_HelpDesk 596 | TLGTourHelp 597 | TMLewin 598 | TMRQld 599 | TMobileHelp 600 | TNTUKCare 601 | TSA 602 | TTCsue 603 | TW2CayC 604 | TWC_Help 605 | TWimpeySupport 606 | TacoBellTeam 607 | TandemDiabetes 608 | Tartineaubeurre 609 | TeamSanshee 610 | TeamSantone 611 | TeamShieldBSC 612 | TeamStubHub 613 | TechUncensored 614 | TekCenter 615 | TelkomZA 616 | Tesco 617 | TessutiHelpTeam 618 | TextNowHelp 619 | TheBookPeople 620 | TheFrontDesk 621 | TheGymGroup 622 | TheLondonEye 623 | TheRAC_UK 624 | TheRTStore 625 | TheSharck 626 | ThomasCookCares 627 | ThomasCookUK 628 | ThomsonCares 629 | ThreadSence 630 | ThreeUK 631 | TiVo 632 | TiVoSupport 633 | TicketWeb 634 | Ticketmaster 635 | TicketmasterCA 636 | TomTom_SA 637 | Topheratl 638 | Topman 639 | TopshopHelp 640 | ToshibaUSAhelp 641 | TotalGymDirect 642 | ToyotaCustCare 643 | ToyotaRacing 644 | ToysRUs 645 | TradeMe 646 | TruGreen 647 | Trustpilot 648 | UCLMainLibrary 649 | UHaul_Cares 650 | UKEnterprise 651 | UKMUJI 652 | UKVolkswagen 653 | UNB_ITS 654 | UPS_UK 655 | USAA_help 656 | USPS 657 | USPSHelp 658 | UWDoIT 659 | UnityQAThomas 660 | UpDesk 661 | VIZIOsupport 662 | VMUcare 663 | VendHelp 664 | Venture_Cycles 665 | VerizonSupport 666 | VeryHelpers 667 | VirginAmerica 668 | VirginMediaIE 669 | Visit_Jax 670 | Vitality_UK 671 | VitamixUK 672 | Vodacom 673 
| Vodacom111 674 | VodacomRugga 675 | VodafoneAU_Help 676 | WAGSocialCare 677 | WB_Games_UK 678 | WHSmith 679 | WMSpillett 680 | WOW_WAY 681 | WReynoldsYoung 682 | Walgreens 683 | WarThunder 684 | WatchShop 685 | WeChatZA 686 | WellsFargo 687 | WesBooth 688 | WhirlpoolCare 689 | Whirlpool_CAre 690 | WholeFoods 691 | WilsonGolf 692 | WingateHotels 693 | Wizards_Help 694 | Wkc10 695 | XOCare 696 | YahooCare 697 | ZAGGdaily 698 | ZARA 699 | ZARA_Care 700 | Zalebs 701 | Zapatosdesigner 702 | ZapposLuxury 703 | ZeekMarketplace 704 | Zipbuds 705 | ZoomSphere 706 | _MBService 707 | _rightio 708 | _valpow_ 709 | abbeygroup 710 | abbylynne19 711 | abellio_surrey 712 | acemtp 713 | ack 714 | acmemarkets 715 | acnestudios 716 | adamholz 717 | adidasGhana 718 | adidasUK 719 | adrianflux 720 | adrianswinscoe 721 | advocare 722 | airnzuk 723 | airtel_care 724 | alabamapower 725 | alamocares 726 | alexanderlewis 727 | alexbarinka 728 | alpharooms 729 | alwaysriding 730 | alwayswithyoumw 731 | americangiant 732 | andertonsmusic 733 | andrew_heister 734 | annabelkarmel 735 | askMsQ 736 | ask_progressive 737 | askpermanenttsb 738 | asksurfline 739 | askvisa 740 | astros 741 | atomtickets 742 | audibleuk 743 | audiireland 744 | beautybay 745 | belk 746 | bestdealtv 747 | bexdeep 748 | beyondthedesk 749 | bigmikeyvegas 750 | blackanddecker 751 | bobbyrayburns 752 | boohoo_cshelp 753 | bookdepository 754 | booksamillion 755 | bravern 756 | brooksrunning 757 | builds_io 758 | cableONE 759 | cafepress 760 | calgary 761 | callummay 762 | calorireland 763 | cam4_gay 764 | cars_portsmouth 765 | casualclassics 766 | cesarkeller 767 | cheftyler 768 | chevrolet 769 | chokemetoo 770 | chrisfenech1 771 | clarkshelp 772 | comcastcares 773 | comparemkt_care 774 | comparethemkt 775 | craftsman 776 | cstmrsvc 777 | ctcustomercare 778 | cvspharmacy 779 | danaddicott 780 | danandphilshop 781 | dancathy 782 | debthelpdesk 783 | deliverydotcom 784 | devinfinlay 785 | dgingiss 786 | dimonet 787 | directvnow 788 | dj0s0 789 | djccwl 790 | dnataSupport 791 | dodgecares 792 | dongmsd 793 | doxiecare 794 | durafloruk 795 | eBay_UK 796 | easons 797 | easternbank 798 | econet_support 799 | edfenergy 800 | edfenergycomms 801 | eflow_freeflow 802 | eh_custcare 803 | eirNews 804 | elfcosmetics 805 | emikathure 806 | enterprisecares 807 | epicgeargaming 808 | epointsjordan 809 | esetcares 810 | espnplayer 811 | etisalat 812 | eventbritehelp 813 | express 814 | farfetch 815 | faux_punk 816 | fdarenahelp 817 | fiatcares 818 | firstdirect 819 | fisherpaykelaus 820 | flySAA_US 821 | fontspring 822 | forduk 823 | foxrentcar 824 | freshlypicked 825 | fryselectronics 826 | gadventures 827 | gamoid 828 | gardenknowhow 829 | garyteamasics 830 | geniusfoods 831 | getsatisfaction 832 | gigaclear 833 | glasses_direct 834 | gogreenride 835 | googlemaps 836 | graveshambc 837 | handtec 838 | hbonow 839 | helpscout 840 | hhavrilesky 841 | hm 842 | hm_custserv 843 | hmaustralia 844 | hmcanada 845 | hmsouthafrica 846 | holden_aus 847 | holidayautos 848 | holidaytaxisCS 849 | houseoffraser 850 | hsamueljeweller 851 | hsr 852 | i_hate_ironing 853 | iiNet 854 | instituteofcs 855 | inthestyleUK 856 | ironicaccount 857 | itsemiel 858 | jAzzyF_BaBy 859 | jabrasport 860 | jackd 861 | jackiecas1 862 | jasoneden 863 | jaybaer 864 | jazzpk 865 | jcpenney 866 | jct600 867 | jeffreyboutique 868 | jessops 869 | jmspool 870 | joepindar 871 | joythestore 872 | jscheel 873 | junebugbatticus 874 | k9cuisine 875 | kellie_brooks 876 | ketpoole 877 | kevinGEEdavis 878 | 
kiddicare 879 | kimryan1974 880 | kisluvkis 881 | kongacare 882 | latterash 883 | leseadzimah 884 | lesleybarnes 885 | lesmicek 886 | lidl_ni 887 | lids 888 | lidscanada 889 | lootcrate 890 | lordandtaylor 891 | mamasandpapas 892 | mandkbutchers 893 | markmadison 894 | martindentist 895 | masnHelpDesk 896 | mattlatmatt 897 | mattr 898 | mdltech 899 | megabus 900 | melbournemuseum 901 | micahsolomon 902 | michaelshearer 903 | mitsucars 904 | mjanczewski 905 | monkiworld 906 | moreforURdollar 907 | mrpatto 908 | musicMagpie 909 | musicMagpieCS 910 | musicofbelle 911 | mycellcom 912 | nationalcares 913 | nationalgridus 914 | ncbja 915 | neimanmarcus 916 | netflix 917 | nextofficial 918 | nikkicarrtqs 919 | nmkrobinson 920 | nokiamobile 921 | notonthehighst 922 | notzakihas 923 | npowerhq 924 | nspowerinc 925 | nuerasupport 926 | nvidiacc 927 | nxcare 928 | officialpescm 929 | olympicfinishes 930 | oxygenfreejump 931 | parracity 932 | paulodetarso24 933 | petedenton 934 | pfgregg 935 | picturehouses 936 | plusnethelp 937 | pond5 938 | premestateswine 939 | princessdvonb 940 | qatarairways 941 | ramennoodlesMC 942 | redbox 943 | rhapsaadic 944 | rideact 945 | riteaid 946 | ritual_co 947 | riverisland 948 | rossrader 949 | rwhelp 950 | sageuk 951 | sainsburys 952 | samanthastarmer 953 | samsungmobileng 954 | schuh 955 | scienceworks_mv 956 | scotiabank 957 | scottish_water 958 | scottrade 959 | secreteyesEW 960 | seetickets 961 | shaws 962 | shopmissa 963 | simplybusiness 964 | sizehelpteam 965 | sleepnumber 966 | smtickets 967 | sobeys 968 | spacecojo 969 | sportingindex 970 | spring 971 | sproutsfm 972 | sseairtricity 973 | statravelAU 974 | subaru_usa 975 | sunglassesshop 976 | supermartie 977 | swannsecurity 978 | swoonstars 979 | tacey 980 | taportugal 981 | teeofftimes 982 | tescomobile 983 | tescomobilecare 984 | tescomobileire 985 | tfbalerts 986 | theMasterLink 987 | thebitchdesk 988 | thebondteam 989 | thekirbycompany 990 | thenutribullet 991 | thinkmedialabs 992 | thomaswales 993 | tineywristwatch 994 | tjmaxx 995 | toister 996 | tonydataman 997 | townshoes 998 | traceychurray 999 | trafficscotland 1000 | travelocity 1001 | trimet 1002 | twinklresources 1003 | uga_eits 1004 | united 1005 | vaillantuk 1006 | verabradleycare 1007 | verygoodservice 1008 | vigglesupport 1009 | vmbusinesshelp 1010 | vtaservice 1011 | vueling 1012 | vwcares 1013 | w1zz 1014 | wahoofitness 1015 | warjohi 1016 | we_energies 1017 | whirlpoolusa 1018 | wimrampen 1019 | withazed 1020 | wow_air 1021 | wowairsupport 1022 | wtzgoodPHL 1023 | yoox 1024 | zaggcare 1025 | -------------------------------------------------------------------------------- /tasks/twitter/official_testset_ids+refs.json.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling/f142c3025e03adef99c0d97be663f0799adf5524/tasks/twitter/official_testset_ids+refs.json.gz -------------------------------------------------------------------------------- /tasks/twitter/official_testset_ids.json.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling/f142c3025e03adef99c0d97be663f0799adf5524/tasks/twitter/official_testset_ids.json.gz -------------------------------------------------------------------------------- /tasks/twitter/utils: -------------------------------------------------------------------------------- 1 | ../utils 
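The two gzipped ID lists above carry only tweet IDs; the tweet text itself must come from your own crawl under `official_stored_data`. As a rough sketch of the structure that `extract_official_twitter_testset.py` reads (the exact field set of the distributed files may differ, and the tweet IDs below are invented for illustration), the decompressed JSON is a list of dialogs, each dialog a list of turns holding a speaker role and the IDs of the tweets that make up that turn:

```python
import json

# Hypothetical miniature ID list (invented IDs), mirroring the structure the
# extractor iterates over: dialogs -> turns -> tweet IDs.
id_list = json.loads("""
[
  [
    {"speaker": "U", "ids": [861733678923001856]},
    {"speaker": "S", "ids": [861733812301332480, 861733905815911424]},
    {"speaker": "U", "ids": []}
  ]
]
""")

# Collect every referenced tweet ID, as the extractor does before scanning the
# stored JSON files; a final turn with no IDs is printed as __UNDISCLOSED__.
id_pool = set()
for dialog in id_list:
    for turn in dialog:
        id_pool |= set(turn['ids'])
print(sorted(id_pool))
```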
-------------------------------------------------------------------------------- /tasks/utils/bleu_score.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """BLEU score for dialog system output 4 | 5 | Copyright (c) 2017 Takaaki Hori (thori@merl.com) 6 | 7 | This software is released under the MIT License. 8 | http://opensource.org/licenses/mit-license.php 9 | 10 | """ 11 | 12 | import sys 13 | from nltk.translate.bleu_score import corpus_bleu 14 | 15 | refs = [] 16 | hyps = [] 17 | for utt in open(sys.argv[1],'r').readlines(): 18 | if utt.startswith('S_REF:'): 19 | ref = utt.replace('S_REF:','').split() 20 | refs.append([ref]) 21 | 22 | if utt.startswith('S_HYP:'): 23 | hyp = utt.replace('S_HYP:','').split() 24 | hyps.append(hyp) 25 | 26 | # obtain BLEU1-4 27 | print("--------------------------------") 28 | print("Evaluated file: " + sys.argv[1]) 29 | print("Number of references: %d" % len(refs)) 30 | print("Number of hypotheses: %d" % len(hyps)) 31 | print("--------------------------------") 32 | if len(refs) > 0 and len(hyps) > 0 and len(refs)==len(hyps): 33 | # uniform n-gram weights give BLEU-n for n = 1..4 34 | for n in [1,2,3,4]: 35 | weights = [1./n] * n 36 | print('Bleu%d: %f' % (n, corpus_bleu(refs,hyps,weights=weights))) 37 | else: 38 | print("Error: the numbers of references and hypotheses do not match.") 39 | print("--------------------------------") 40 | 41 | -------------------------------------------------------------------------------- /tasks/utils/check_dialogs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Check sizes of extracted dialog data 3 | 4 | Copyright (c) 2017 Takaaki Hori (thori@merl.com) 5 | 6 | This software is released under the MIT License.
7 | http://opensource.org/licenses/mit-license.php 8 | 9 | """ 10 | 11 | import sys 12 | 13 | class Dataset: 14 | def __init__(self, name, filename, n_dialogs, n_utters, n_words): 15 | self.name = name 16 | self.filename = filename 17 | self.n_dialogs = n_dialogs 18 | self.n_utters = n_utters 19 | self.n_words = n_words 20 | 21 | def check(self, n_dialogs, n_utters, n_words): 22 | print('--- %s set ---' % self.name) 23 | print('n_dialogs: %d (%.2f %% difference from reference: %d)' % (n_dialogs, float(abs(self.n_dialogs - n_dialogs))/n_dialogs*100, self.n_dialogs)) 24 | print('n_utters: %d (%.2f %% difference from reference: %d)' % (n_utters, float(abs(self.n_utters - n_utters))/n_utters*100, self.n_utters)) 25 | print('n_words: %d (%.2f %% difference from reference: %d)' % (n_words, float(abs(self.n_words - n_words))/n_words*100, self.n_words)) 26 | 27 | 28 | if __name__ == "__main__": 29 | # set data info (target sizes for the official data sets) 30 | train_set = Dataset('train', sys.argv[1], 888201, 2157389, 40073702) 31 | dev_set = Dataset('dev', sys.argv[2], 107506, 262228, 4900743) 32 | 33 | for data in [train_set, dev_set]: 34 | n_utters = 0 35 | n_dialogs = 0 36 | n_words = 0 37 | for ln in open(data.filename,'r').readlines(): 38 | tokens = ln.split() 39 | if len(tokens) > 0: 40 | n_words += len(tokens)-1 41 | n_utters += 1 42 | else: 43 | n_dialogs += 1 44 | 45 | data.check(n_dialogs, n_utters, n_words) 46 | print("----------------------------------------------------") 47 | print("If the data size differences are greater than 1%, ") 48 | print("please contact the track organizers (chori@merl.com, thori@merl.com)") 49 | print("with the above information.") 50 | print("----------------------------------------------------") 51 | -------------------------------------------------------------------------------- /tasks/utils/sample_dialogs.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import sys 5 | import random 6 | import six 7 | import copy 8 | 9 | dialogs = [] 10 | dialog = [] 11 | for ln in open(sys.argv[1], 'r').readlines(): 12 | ln = ln.strip() 13 | if ln != '': 14 | dialog.append(ln) 15 | # partial dialogs are also included 16 | if len(dialog)>=2 and ln.startswith('S:'): 17 | dialogs.append(copy.copy(dialog)) 18 | else: 19 | dialog = [] 20 | 21 | random.seed(99) 22 | # the 2nd argument is necessary for compatibility between python 2.7 and 3.2+ 23 | random.shuffle(dialogs, random.random) 24 | n_samples = int(sys.argv[2]) 25 | 26 | for i in six.moves.range(n_samples): 27 | for d in dialogs[i]: 28 | print(d) 29 | print('') 30 | 31 | --------------------------------------------------------------------------------
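As a quick sanity check of the scoring pipeline, the snippet below feeds `S_REF:`/`S_HYP:` lines in the format bleu_score.py expects through the same `corpus_bleu` call. The two sentences are invented for illustration; real evaluation files are produced by the evaluation scripts under `ChatbotBaseline/tools`.

```python
from nltk.translate.bleu_score import corpus_bleu

# Invented example lines in the S_REF/S_HYP format read by bleu_score.py.
lines = [
    "S_REF: you 're welcome . have a great day .",
    "S_HYP: you 're welcome ! have a good day .",
]
refs, hyps = [], []
for utt in lines:
    if utt.startswith('S_REF:'):
        refs.append([utt.replace('S_REF:', '').split()])  # one reference per hypothesis
    elif utt.startswith('S_HYP:'):
        hyps.append(utt.replace('S_HYP:', '').split())

# BLEU-1 through BLEU-4 with uniform n-gram weights, as in bleu_score.py.
for n in [1, 2, 3, 4]:
    print('Bleu%d: %f' % (n, corpus_bleu(refs, hyps, weights=[1. / n] * n)))
```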