├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 SLP-RL@HUJI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Spoken StoryCloze Benchmark
This is the official repository of the Spoken StoryCloze benchmark, introduced by Hassid, Michael, et al. 2023, [Textually Pretrained Speech Language Models](https://arxiv.org/pdf/2305.13009.pdf).

## Textual StoryCloze
The textual [StoryCloze benchmark](https://arxiv.org/pdf/1604.01696v1.pdf) contains 4k five-sentence commonsense stories (split into validation and test sets).
For each story, there is an additional negative sample, composed of the first four sentences followed by a negative continuation (ending sentence). The goal is to distinguish the original fifth sentence from the negative one.

## Spoken Benchmark
To generate the spoken benchmark, we synthesize the stories from the test set using a single-speaker TTS system provided by Wang, Changhan, et al., [fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit, EMNLP 2021](https://arxiv.org/pdf/2109.06912.pdf), comprising a [FastSpeech 2](https://arxiv.org/pdf/2006.04558.pdf) model (Ren et al., 2020) and a [HiFi-GAN](https://arxiv.org/pdf/2010.05646.pdf) vocoder (Kong et al., 2020).

We release two versions of the Spoken StoryCloze benchmark:
* Topic StoryCloze (tStoryCloze).
* Spoken StoryCloze (sStoryCloze).

For sStoryCloze, we follow the original StoryCloze negative samples. With this benchmark, researchers can evaluate models' ability to capture fine-grained causal and temporal commonsense relations.

For tStoryCloze, we randomly sample the negative ending sentence from the dataset. The premise behind tStoryCloze is to evaluate continuation coherence given a spoken prompt. This version is far easier for text-based language models, but quite challenging for speech-based language models. As with previous zero-shot speech metrics (e.g., sWUGGY), both speech segments are fed into the SpeechLM and the probability of each spoken sentence is measured; we report the percentage of examples where the probability of the positive sample is higher than that of the negative one.

### Download
You can download both benchmarks using the following links: [sStoryCloze](https://drive.google.com/file/d/19ZnkM4vjApCZipd7xQ1ESlOi5oBVrlFL/view?usp=sharing), [tStoryCloze](https://drive.google.com/file/d/17prYkldYb3w3Pyg3Pm77-VnE6nkD5jzG/view?usp=sharing).
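The zero-shot scoring procedure described above can be sketched in a few lines. This is a minimal illustration, not the official evaluation script: it assumes each spoken sentence has already been scored by a SpeechLM as a list of per-token log-probabilities (the function and variable names here are hypothetical).

```python
def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities to score one spoken sentence.
    In practice these values would come from a SpeechLM over
    discretized speech units; here they are plain floats."""
    return sum(token_logprobs)

def storycloze_accuracy(pairs):
    """pairs: list of (positive_logprobs, negative_logprobs) tuples,
    one per story. Returns the percentage of stories where the true
    continuation is assigned a higher probability than the negative one."""
    correct = sum(
        1 for pos, neg in pairs
        if sequence_logprob(pos) > sequence_logprob(neg)
    )
    return 100.0 * correct / len(pairs)

# Toy example with two stories: the model prefers the true ending in one.
pairs = [
    ([-1.0, -0.5], [-2.0, -1.5]),  # positive scored higher (correct)
    ([-3.0, -2.0], [-1.0, -0.5]),  # negative scored higher (incorrect)
]
print(storycloze_accuracy(pairs))  # 50.0
```

Summing log-probabilities compares full-sequence likelihoods directly; since both continuations share the same four-sentence prompt, the prompt term cancels out of the comparison.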
## Evaluation

### tStoryCloze
| Model | Reference | Accuracy |
|:-------------:|---------------------|:--------:|
| LLaMA-7B-text | Hassid, et al. 2023 | 98.3 |
| TWIST-1.3B | Hassid, et al. 2023 | 61.3 |
| TWIST-7B | Hassid, et al. 2023 | 64.4 |

### sStoryCloze
TBD

## Citation
If you use this benchmark and find it useful for your research/work, please cite our work using the following BibTeX:
```
@article{hassid2023textually,
  title={Textually Pretrained Speech Language Models},
  author={Hassid, Michael and Remez, Tal and Nguyen, Tu Anh and Gat, Itai and Conneau, Alexis and Kreuk, Felix and Copet, Jade and Defossez, Alexandre and Synnaeve, Gabriel and Dupoux, Emmanuel and others},
  journal={arXiv preprint arXiv:2305.13009},
  year={2023}
}
```
--------------------------------------------------------------------------------