├── CODE_OF_CONDUCT.md ├── LICENSE ├── README.md ├── SECURITY.md └── data ├── .keep ├── MIMICS-Click.tsv ├── MIMICS-ClickExplore.tsv └── MIMICS-Manual.tsv /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Microsoft Open Source Code of Conduct 2 | 3 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 4 | 5 | Resources: 6 | 7 | - [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/) 8 | - [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 9 | - Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) Microsoft Corporation. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MIMICS: A Large-Scale Data Collection for Search Clarification 2 | Asking clarifying questions has been recognized as a major component of conversational information seeking systems. MIMICS is a collection of search clarification datasets for real search queries sampled from the Bing query logs. Each clarification in MIMICS consists of a clarifying question and up to five candidate answers. Here is an example: 3 | 4 | | | | 5 | |-------------------------------------|-----------------------------------------------------------| 6 | | query | headaches | 7 | | question | What do you want to know about this medical condition? | 8 | | candidate answers (options) | symptom, treatment, causes, diagnosis, diet | 9 | 10 | 11 | MIMICS contains three datasets: 12 | + **MIMICS-Click** includes over 400k unique queries, their associated clarification panes, and the corresponding aggregated user interaction signals (i.e., clicks). 13 | + **MIMICS-ClickExplore** is an exploration dataset that includes aggregated user interaction signals for over 60k unique queries, each with multiple clarification panes. 14 | + **MIMICS-Manual** includes over 2k unique real search queries. Each query-clarification pair in this dataset has been manually labeled by at least three trained annotators. It contains graded quality labels for the clarifying question, the candidate answer set, and the landing result page for each candidate answer. 
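Each dataset is a TSV file with a header row (see Data Format below). As a minimal sketch, the following Python snippet parses a synthetic row mirroring the MIMICS-Click schema using only the standard library; the field names follow the column descriptions in this README, and the row values are illustrative, not taken from the released files:

```python
import csv
import io

# A synthetic example row following the MIMICS-Click column names
# described in this README (not real data from the released files).
sample = (
    "query\tquestion\toption_1\toption_2\toption_3\toption_4\toption_5\t"
    "impression_level\tengagement_level\t"
    "option_cctr_1\toption_cctr_2\toption_cctr_3\toption_cctr_4\toption_cctr_5\n"
    "headaches\tWhat do you want to know about this medical condition?\t"
    "symptom\ttreatment\tcauses\tdiagnosis\tdiet\t"
    "high\t7\t0.3\t0.2\t0.2\t0.2\t0.1\n"
)

# DictReader keys each field by the header name, so the code is robust
# to column order as long as the header row is present.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
for row in reader:
    # Queries may have fewer than five candidate answers; skip empty cells.
    options = [row[f"option_{i}"] for i in range(1, 6) if row[f"option_{i}"]]
    print(row["query"], options, row["engagement_level"])
```

To read one of the released files, replace the in-memory `StringIO` with `open("MIMICS-Click.tsv", encoding="utf-8")`.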
15 | 16 | MIMICS enables researchers to study a number of tasks related to search clarification, including clarification generation and selection, user engagement prediction for clarification, click models for clarification, and analyzing user interactions with search clarification. For more information, refer to the following paper: 17 | 18 | - Hamed Zamani, Gord Lueck, Everest Chen, Rodolfo Quispe, Flint Luu, and Nick Craswell. [**"MIMICS: A Large-Scale Data Collection for Search Clarification"**](https://arxiv.org/pdf/2006.10174.pdf), In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, 2020 (CIKM '20). 19 | 20 | For more information on clarification generation and user interactions with clarification, refer to the following articles: 21 | - Hamed Zamani, Susan T. Dumais, Nick Craswell, Paul N. Bennett, and Gord Lueck. [**"Generating Clarifying Questions for Information Retrieval"**](https://dl.acm.org/doi/abs/10.1145/3366423.3380126). In Proceedings of the Web Conference, 2020 (WWW '20). 22 | - Hamed Zamani, Bhaskar Mitra, Everest Chen, Gord Lueck, Fernando Diaz, Paul N. Bennett, Nick Craswell, and Susan T. Dumais. [**"Analyzing and Learning from User Interactions for Search Clarification"**](https://arxiv.org/pdf/2006.00166.pdf). In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020 (SIGIR ’20). 23 | 24 | 25 | 26 | 27 | ## Data Format 28 | The datasets are released in tab-separated values (TSV) format, with the header in the first row of each file. The column descriptions are given below. For more details, refer to the paper mentioned above. 29 | 30 | ### MIMICS-Click and MIMICS-ClickExplore 31 | 32 | | Column(s) | Description | 33 | |-------------------------------------|-----------------------------------------------------------------------| 34 | | query | (string) The query text. | 35 | | question | (string) The clarifying question. 
| 36 | option_1, ..., option_5 | (string) Up to five candidate answers. | 37 | | impression_level | (string) A three-level impression label (i.e., low, medium, or high). | 38 | | engagement_level | (integer) A label in [0, 10] representing total user engagements. | 39 | | option_cctr_1, ..., option_cctr_5 | (real) The conditional click probability on each candidate answer. | 40 | 41 | 42 | ### MIMICS-Manual 43 | 44 | | Column(s) | Description | 45 | |-------------------------------------|-----------------------------------------------------------------------| 46 | | query | (string) The query text. | 47 | | question | (string) The clarifying question. | 48 | | option_1, ..., option_5 | (string) Up to five candidate answers. | 49 | | question_label | (integer) A three-level quality label for the clarifying question. | 50 | | options_overall_label | (integer) A three-level quality label for the candidate answer set. | 51 | | option_label_1, ..., option_label_5 | (integer) A three-level quality label for the landing result page of each candidate answer. | 52 | 53 | 54 | ### The Bing API's Search Results for MIMICS Queries 55 | To enable researchers to use the search engine result page (SERP) information, we have also released the search results returned by Bing's Web Search API for all the queries in the MIMICS datasets. The SERP data can be downloaded from [here](http://ciir.cs.umass.edu/downloads/mimics-serp/MIMICS-BingAPI-results.zip) (3.2GB compressed, 16GB decompressed). Each line in the file can be loaded as a JSON object and contains all the information returned by Bing's Web Search API. 56 | 57 | 58 | ## Citation 59 | If you find MIMICS useful, please cite the following article: 60 | ``` 61 | Hamed Zamani, Gord Lueck, Everest Chen, Rodolfo Quispe, Flint Luu, and Nick Craswell. "MIMICS: A Large-Scale Data Collection for Search Clarification", In Proc. of CIKM 2020. 
62 | ``` 63 | 64 | bibtex: 65 | ``` 66 | @inproceedings{mimics, 67 | title={MIMICS: A Large-Scale Data Collection for Search Clarification}, 68 | author={Zamani, Hamed and Lueck, Gord and Chen, Everest and Quispe, Rodolfo and Luu, Flint and Craswell, Nick}, 69 | booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management}, 70 | series = {CIKM '20}, 71 | year={2020}, 72 | } 73 | ``` 74 | 75 | ## License 76 | MIMICS is distributed under the **MIT License**. See the `LICENSE` file for more information. 77 | 78 | 79 | ## Terms and Conditions 80 | The MIMICS datasets are intended for non-commercial research purposes only, to promote advancement in the field of information retrieval and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided "as is" without warranty, and usage of the data carries risk since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the datasets. Feedback is voluntarily given and can be used as we see fit. By using any of these datasets, you automatically agree to abide by these terms and conditions. Upon violation of any of these terms, your rights to use the datasets will end automatically. If you have questions about the use of the datasets or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us at hazamani@microsoft.com. 81 | 82 | 83 | ## Contributing 84 | 85 | This project welcomes contributions and suggestions. Most contributions require you to agree to a 86 | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us 87 | the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. 
88 | 89 | When you submit a pull request, a CLA bot will automatically determine whether you need to provide 90 | a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions 91 | provided by the bot. You will only need to do this once across all repos using our CLA. 92 | 93 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 94 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or 95 | contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. 96 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Security 4 | 5 | Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/). 6 | 7 | If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below. 8 | 9 | ## Reporting Security Issues 10 | 11 | **Please do not report security vulnerabilities through public GitHub issues.** 12 | 13 | Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report). 
14 | 15 | If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc). 16 | 17 | You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). 18 | 19 | Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue: 20 | 21 | * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.) 22 | * Full paths of source file(s) related to the manifestation of the issue 23 | * The location of the affected source code (tag/branch/commit or direct URL) 24 | * Any special configuration required to reproduce the issue 25 | * Step-by-step instructions to reproduce the issue 26 | * Proof-of-concept or exploit code (if possible) 27 | * Impact of the issue, including how an attacker might exploit the issue 28 | 29 | This information will help us triage your report more quickly. 30 | 31 | If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs. 32 | 33 | ## Preferred Languages 34 | 35 | We prefer all communications to be in English. 36 | 37 | ## Policy 38 | 39 | Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd). 
40 | 41 | -------------------------------------------------------------------------------- /data/.keep: -------------------------------------------------------------------------------- 1 | 2 | --------------------------------------------------------------------------------