├── .env-example
├── .gitignore
├── LICENSE
├── README.md
├── example_prompts
│   ├── ade_corpus_v2.txt
│   ├── banking_77.txt
│   ├── neurips_impact_statement_risks.txt
│   ├── one_stop_english.txt
│   ├── overruling.txt
│   ├── semiconductor_org_types.txt
│   ├── systematic_review_inclusion.txt
│   ├── tai_safety_research.txt
│   ├── terms_of_service.txt
│   ├── tweet_eval_hate.txt
│   └── twitter_complaints.txt
├── requirements.txt
├── setup.py
└── src
    ├── __init__.py
    └── raft_baselines
        ├── classifiers
        │   ├── __init__.py
        │   ├── adaboost_classifier.py
        │   ├── classifier.py
        │   ├── gpt3_classifier.py
        │   ├── in_context_classifier.py
        │   ├── n_grams_classifier.py
        │   ├── naive_bayes_classifier.py
        │   ├── random_classifier.py
        │   ├── svm_classifier.py
        │   ├── transformers_causal_lm_classifier.py
        │   └── zero_shot_transformers_classifier.py
        ├── data
        │   ├── __init__.py
        │   ├── example_predictions.csv
        │   └── prompt_construction_settings.jsonl
        ├── scripts
        │   ├── non_neural_experiment.py
        │   ├── raft_predict.py
        │   ├── raft_train_experiment.py
        │   ├── starter_kit.ipynb
        │   ├── test_gpt3.py
        │   └── test_naive_bayes.py
        └── utils
            ├── embedders.py
            ├── gpt3_utils.py
            └── tokenizers.py

/.env-example:
--------------------------------------------------------------------------------
1 | OPENAI_API_KEY=sk-abcdefg
2 | HUGGINGFACE_API_TOKEN=abcdefg
3 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *predictions
2 | results
3 | .idea
4 | .env
5 | gpt3-baselines
6 | __pycache__
7 | prompts
8 | raft_baselines.egg-info
9 | .vscode
10 | 
11 | # Jupyter Notebook
12 | .ipynb_checkpoints
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright 2021 Ought Inc.
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 | 
5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 | 
7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Setup
2 | 
3 | This is the repository for the GPT-3 baselines described in the RAFT benchmark paper.
4 | 
5 | Set up a virtual environment and install the necessary requirements from the requirements file.
6 | 
7 | ```buildoutcfg
8 | conda create -n raft-baselines python=3.8 && conda activate raft-baselines
9 | python -m pip install -r requirements.txt
10 | ```
11 | 
12 | Install raft-baselines.
13 | 
14 | ```buildoutcfg
15 | python setup.py develop
16 | ```
17 | 
18 | You may have to run the above command with `sudo` prepended for permissions.
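The scripts in this repository read API credentials from a `.env` file in the repository root, following the format of `.env-example` shown above. The loading logic amounts to only a few lines; below is a minimal, stdlib-only sketch of what such loading looks like. This is illustrative, not the repo's actual code — the scripts may rely on a library such as python-dotenv, and the function name `load_env` is made up here:

```python
import os

def load_env(path=".env"):
    """Load KEY=VALUE pairs from a dotenv-style file into os.environ.

    Illustrative sketch only: skips blank lines and comments, and never
    overwrites variables already present in the environment. A library
    like python-dotenv additionally handles quoting and other edge
    cases this minimal parser ignores.
    """
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Keep any value already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip())
```

Calling `load_env()` before constructing an API client would then expose `os.environ["OPENAI_API_KEY"]`, matching the `.env` format described in the GPT-3 section below.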
19 | 
20 | # Starter Kit
21 | 
22 | A [starter kit notebook](src/raft_baselines/scripts/starter_kit.ipynb) walks through the basics of making predictions using models from the [HuggingFace Model Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads). There's also a [Colab version](https://colab.research.google.com/drive/1TQtHG-Wf2CgYGSD9e7_uJWIdiK5HNniV).
23 | 
24 | # RAFT Predict
25 | 
26 | Use the `raft_predict` script to run classifiers on the RAFT datasets. By default, the script runs on the first 5 test examples for each dataset. To use a random classifier on the first 10 examples from the ADE Corpus V2 dataset:
27 | 
28 | ```buildoutcfg
29 | python -m raft_baselines.scripts.raft_predict with n_test=10 'configs=["ade_corpus_v2"]' classifier_name=RandomClassifier
30 | ```
31 | 
32 | The other classifiers available are:
33 | 
34 | - `GPT3Classifier`: the one used for the GPT-3 baseline in the paper
35 | - `TransformersCausalLMClassifier`: takes as input a `model_type` string, and runs an arbitrary CausalLM from the [HuggingFace Model Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
36 | 
37 | For example, to generate predictions from DistilGPT-2 on the first 10 examples of the ADE Corpus, you can run:
38 | 
39 | ```buildoutcfg
40 | python -m raft_baselines.scripts.raft_predict with n_test=10 'configs=["ade_corpus_v2"]' classifier_name=TransformersCausalLMClassifier 'classifier_kwargs={"model_type":"distilgpt2"}'
41 | ```
42 | 
43 | To run experiments with GPT-3, you will need an OpenAI API key. Create a file called `.env` and put your API key there, copying the format of `.env-example`:
44 | 
45 | ```buildoutcfg
46 | echo OPENAI_API_KEY=$OPENAI_API_KEY > .env
47 | ```
48 | 
49 | ## Sacred
50 | 
51 | We use [Sacred](https://github.com/IDSIA/sacred) to track our experiments and outputs. This adds no runtime overhead; simply run either of our two experiment scripts with Python as normal.
You can change where tracking files are saved by modifying the observer at the top of each experiment file, and you can change the details of an experiment via the configuration parameters specified in its configs block.
52 | 
53 | ```buildoutcfg
54 | # For labeling the test set
55 | python -m raft_baselines.scripts.raft_predict
56 | # For tuning various dimensions on the train set with LOO validation
57 | python -m raft_baselines.scripts.raft_train_experiment
58 | ```
59 | 
60 | Alternatively, you can modify the input variables to an experiment from the command line, as in the examples above. Either way, some modification will be necessary if you want to run different experiments. See [this tutorial](https://sacred.readthedocs.io/en/stable/configuration.html) for more information.
61 | 
62 | Similarly, you can save metrics with `raft_experiment.log_scalar()`, or by using the Sacred observer directly. See [this tutorial](https://sacred.readthedocs.io/en/stable/collected_information.html) for more information.
63 | 
64 | To save out predictions and upload them to the HuggingFace Hub (and the leaderboard), see [the RAFT submission template](https://huggingface.co/datasets/ought/raft-submission).
65 | 
66 | ## License
67 | 
68 | This repository is licensed under the MIT License.
69 | 
--------------------------------------------------------------------------------
/example_prompts/ade_corpus_v2.txt:
--------------------------------------------------------------------------------
1 | Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below:
2 | Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).
3 | Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.
4 | Possible labels:
5 | 1. ADE-related
6 | 2. not ADE-related
7 | 
8 | Sentence: With serious cases, however, conventional treatment may not allow sufficient time at depth for the complete resolution of manifestations because of the need to avoid pulmonary oxygen toxicity which is associated with a prolonged period of breathing compressed air.
9 | Label: not ADE-related
10 | 
11 | Sentence: Several hypersensitivity reactions to cloxacillin have been reported, although IgE-mediated allergic reactions to the drug are rare and there is little information about possible tolerance to other semisynthetic penicillins or cephalosporins in patients with cloxacillin allergy.
12 | Label: ADE-related
13 | 
14 | Sentence: A 69-year-old male was diagnosed in February 2004 with stage IV extranodal marginal zone B cell lymphoma involving the mediastinal nodes, lung parenchyma and bone marrow with high LDH.
15 | Label: not ADE-related
16 | 
17 | Sentence: A patient with psoriasis is described who had an abnormal response to the glucose tolerance test without other evidence of diabetes and then developed postprandial hyperglycemia and glycosuria during a period of topical administration of a corticosteroid cream, halcinonide cream 0.1
18 | Label: ADE-related
19 | 
20 | Sentence: The gold standard for diagnosis is renal biopsy, but it is only rarely performed during the acute phase of the reaction and is not without risk.
21 | Label: not ADE-related
22 | 
23 | Sentence: Of the 16 patients, including the 1 reported here, only 3 displayed significant shortening of the agranulocytic period after treatment.
24 | Label: not ADE-related
25 | 
26 | Sentence: These cases were considered unusual in light of the short delay of their onset after initiation of immunosuppressive therapy and their fulminant course: 3 of these patients died of PCP occurring during the first month of treatment with prednisone.
27 | Label: ADE-related
28 | 
29 | Sentence: In 1991 the patient were found to be seropositive for HCV antibodies as detected by the ELISA method and confirmed by the RIBA method.
30 | Label: not ADE-related
31 | 
32 | Sentence: Considerable improvement of myasthenic symptoms was seen in all patients within 3-6 months after the initiation of this therapy.
33 | Label: not ADE-related
34 | 
35 | Sentence: We present three patients with paradoxical seizures; their serum phenytoin levels were 43.5 mcg/mL, 46.5 mcg/mL and 38.3 mcg/mL.
36 | Label: ADE-related
37 | 
38 | Sentence: NEH must be considered in lupus patients receiving cytotoxic agents to avoid inappropriate use of corticosteroids or antibiotics in this self-limited condition.
39 | Label: not ADE-related
40 | 
41 | Sentence: A challenge with clozapine was feasible and showed no clinical symptoms of eosinophilia.
42 | Label: not ADE-related
43 | 
44 | Sentence: We describe a patient who developed HUS after treatment with mitomycin C (total dose 144 mg/m2) due to a carcinoma of the ascending colon.
45 | Label: ADE-related
46 | 
47 | Sentence: The INR should be monitored more frequently when bosentan is initiated, adjusted, or discontinued in patients taking warfarin.
48 | Label: not ADE-related
49 | 
50 | Sentence: An encephalopathy and cardiomyopathy developed in a seventeen-year-old girl with chemotherapy-induced renal failure while receiving an intravesical aluminum infusion for hemorrhagic cystitis.
51 | Label: ADE-related
52 | 
53 | Sentence: CT-scan disclosed right ethmoid sinusitis that spread to the orbit after surgery.
54 | Label: not ADE-related
55 | 
56 | Sentence: MRI has a high sensitivity and specificity in the diagnosis of osteonecrosis and should be used when this condition is suspected.
57 | Label: not ADE-related
58 | 
59 | Sentence: CONCLUSION: Pancreatic enzyme intolerance, although rare, would be a major problem in the management of patients with CF.
60 | Label: not ADE-related
61 | 
62 | Sentence: These results indicate that the hyponatremia in this case was due to SIADH and that SIADH was caused by an increased release of vasopressin probably because of the antiviral drug (acyclovir) or infection of varicella zoster virus (V
63 | Label: not ADE-related
64 | 
65 | Sentence: METHODS: This study is a case report description.
66 | Label: not ADE-related
67 | 
68 | Sentence: Best-corrected visual acuity measurements were performed at every visit.
69 | Label: not ADE-related
70 | 
71 | Sentence: METHODS: We identified three patients who developed skin necrosis and determined any factors, which put them at an increased risk of doing so.
72 | Label: not ADE-related
73 | 
74 | Sentence: OBJECTIVE: To describe onset of syndrome of inappropriate antidiuretic hormone (SIADH) associated with vinorelbine therapy for advanced breast cancer.
75 | Label: ADE-related
76 | 
77 | Sentence: IMPLICATIONS: Dexmedetomidine, an alpha(2)-adrenoceptor agonist, is indicated for sedating patients on mechanical ventilation.
78 | Label: not ADE-related
79 | 
80 | Sentence: CONCLUSIONS: These results suggest that clozapine may cause TD; however, the prevalence is low and the severity is relatively mild, with no or mild self-reported discomfort.
81 | Label: ADE-related
82 | 
83 | Sentence: CONCLUSIONS: SD-OCT and AO detected abnormalities that correlate topographically with visual field loss from hydroxychloroquine toxicity as demonstrated by HVF 10-2 and may be useful in the detection of subclinical abnormalities that precede symptoms or objective visual field loss.
84 | Label:
--------------------------------------------------------------------------------
/example_prompts/banking_77.txt:
--------------------------------------------------------------------------------
1 | The following is a banking customer service query. Classify the query into one of the 77 categories available.
2 | Possible labels:
3 | 1. Refund_not_showing_up
4 | 2. activate_my_card
5 | 3. age_limit
6 | 4. apple_pay_or_google_pay
7 | 5. atm_support
8 | 6. automatic_top_up
9 | 7. balance_not_updated_after_bank_transfer
10 | 8. balance_not_updated_after_cheque_or_cash_deposit
11 | 9. beneficiary_not_allowed
12 | 10. cancel_transfer
13 | 11. card_about_to_expire
14 | 12. card_acceptance
15 | 13. card_arrival
16 | 14. card_delivery_estimate
17 | 15. card_linking
18 | 16. card_not_working
19 | 17. card_payment_fee_charged
20 | 18. card_payment_not_recognised
21 | 19. card_payment_wrong_exchange_rate
22 | 20. card_swallowed
23 | 21. cash_withdrawal_charge
24 | 22. cash_withdrawal_not_recognised
25 | 23. change_pin
26 | 24. compromised_card
27 | 25. contactless_not_working
28 | 26. country_support
29 | 27. declined_card_payment
30 | 28. declined_cash_withdrawal
31 | 29. declined_transfer
32 | 30. direct_debit_payment_not_recognised
33 | 31. disposable_card_limits
34 | 32. edit_personal_details
35 | 33. exchange_charge
36 | 34. exchange_rate
37 | 35. exchange_via_app
38 | 36. extra_charge_on_statement
39 | 37. failed_transfer
40 | 38. fiat_currency_support
41 | 39. get_disposable_virtual_card
42 | 40. get_physical_card
43 | 41. getting_spare_card
44 | 42. getting_virtual_card
45 | 43. lost_or_stolen_card
46 | 44. lost_or_stolen_phone
47 | 45. order_physical_card
48 | 46. passcode_forgotten
49 | 47. pending_card_payment
50 | 48. pending_cash_withdrawal
51 | 49. pending_top_up
52 | 50. pending_transfer
53 | 51. pin_blocked
54 | 52. receiving_money
55 | 53. request_refund
56 | 54. reverted_card_payment?
57 | 55. supported_cards_and_currencies
58 | 56. terminate_account
59 | 57. top_up_by_bank_transfer_charge
60 | 58. top_up_by_card_charge
61 | 59. top_up_by_cash_or_cheque
62 | 60. top_up_failed
63 | 61. top_up_limits
64 | 62. top_up_reverted
65 | 63. topping_up_by_card
66 | 64. transaction_charged_twice
67 | 65. transfer_fee_charged
68 | 66. transfer_into_account
69 | 67. transfer_not_received_by_recipient
70 | 68. transfer_timing
71 | 69. unable_to_verify_identity
72 | 70. verify_my_identity
73 | 71. verify_source_of_funds
74 | 72. verify_top_up
75 | 73. virtual_card_not_working
76 | 74. visa_or_mastercard
77 | 75. why_verify_identity
78 | 76. wrong_amount_of_cash_received
79 | 77. wrong_exchange_rate_for_cash_withdrawal
80 | 
81 | Query: I withdrew cash and I think the exchange rate is wrong.
82 | Label: 77. wrong_exchange_rate_for_cash_withdrawal
83 | 
84 | Query: After I transferred money the balance remained the same.
85 | Label: 7. balance_not_updated_after_bank_transfer
86 | 
87 | Query: Why is my money not in my account. I have already sent it out.
88 | Label: 8. balance_not_updated_after_cheque_or_cash_deposit
89 | 
90 | Query: I didn't get all the cash I asked for
91 | Label: 76. wrong_amount_of_cash_received
92 | 
93 | Query: Why am I unable to transfer money when I was able to before?
94 | Label: 9. beneficiary_not_allowed
95 | 
96 | Query: Why is there extra cash in my account?
97 | Label: 22. cash_withdrawal_not_recognised
98 | 
99 | Query: I have a strange transaction for £1 on my statement, what is that?
100 | Label: 36. extra_charge_on_statement
101 | 
102 | Query: I didn't make the direct debit payment on my account.
103 | Label: 30. direct_debit_payment_not_recognised
104 | 
105 | Query: What is the $1 transaction on my account?
106 | Label: 36. extra_charge_on_statement
107 | 
108 | Query: How can I tell the source for my available funds?
109 | Label: 71. verify_source_of_funds
110 | 
111 | Query: where did my funds come from?
112 | Label:
--------------------------------------------------------------------------------
/example_prompts/neurips_impact_statement_risks.txt:
--------------------------------------------------------------------------------
1 | Label the impact statement based on whether it mentions a harmful application of the research done in the paper. Make sure the statement is sufficient to conclude there are harmful applications of the research being done, not a past risk that this research is solving.
2 | Possible labels:
3 | 1. doesn't mention a harmful application
4 | 2. mentions a harmful application
5 | 
6 | Impact statement: Machine learning algorithms are increasingly relied upon by decision makers. It is therefore crucial to combine the predictive performance of such complex machinery with practical guarantees on the reliability and uncertainty of their output. We view the calibration methods presented in this paper as an important step towards this goal. In fact, uncertainty estimation is an effective way to quantify and communicate the benefits and limitations of machine learning. Moreover, the proposed methodologies provide an attractive way to move beyond the standard prediction accuracy measure used to compare algorithms. For instance, one can compare the performance of two candidate predictors, e.g., random forest and neural network (see Figure 3), by looking at the size of the corresponding prediction sets and/or their their conditional coverage. Finally, the approximate conditional coverage that we seek in this work is highly relevant within the broader framework of fairness, as discussed by [17] within a regression setting. While our approximate conditional coverage already implicitly reduces the risk of unwanted bias, an equalized coverage requirement [17] can also be easily incorporated into our methods to explicitly avoid discrimination based on protected categories.
We conclude by emphasizing that the validity of our methods relies on the exchangeability of the data points. If this assumption is violated (e.g., with time-series data), our prediction sets may not have the right coverage. A general suggestion here is to always try to leverage specific knowledge of the data and of the application domain to judge whether the exchangeability assumption is reasonable. Finally, our data-splitting techniques
7 | Label: doesn't mention a harmful application
8 | 
9 | Impact statement: The problem of Byzantine resilient aggregation of distributed machine learning models has been actively studied in recent years; however, the issue of Byzantine resilient distributed learning in multi-task networks has received much less attention. It is a general intuition that MTL is robust and resilient to cyber-attacks since it can identify attackers by measuring similarities between neighbors. In this paper, we have shown that some commonly used similarity measures are not resilient against certain attacks. With an increase in data heterogeneity, we hope this work could highlight the security and privacy concerns in designing distributed MTL frameworks.
10 | Paper title: Byzantine Resilient Distributed Multi-Task Learning
11 | Label: doesn't mention a harmful application
12 | 
13 | Impact statement: In our work, the learning objective was designed to align with and support the possible use of a predictive model to drive decisions by users. It is our belief that a responsible and transparent deployment of models with “lookahead-like" regularization components should avoid the kinds of mistakes that can be made when predictive methods are conflated with causally valid methods. At the same time, we have made a strong simplifying assumption, that of covariate shift, which requires that the relationship between covariates and outcome variables is invariant as decisions are made and the feature distribution changes.
This strong assumption is made to ensure validity for the lookahead regularization, since we need to be able to perform inference about counterfactual observations. As discussed by Mueller et al. [ 31] and Peters et al. [34], there exist real-world tasks that reasonably satisfy this assumption, and yet at the same time, other tasks— notably those with unobserved confounders —where this assumption would be violated. Moreover, this assumption is not testable on the observational data. This, along with the need to make an assumption about the user decision model, means that an application of the method proposed here should be done with care and will require some domain knowledge to understand whether or not the assumptions are plausible. Furthermore, the validity of the interval estimates requires that any assumptions for the interval model used are satisfied and that weights w provide a reasonable estimation of p /p . In particular, fitting to p which has
14 | Label: mentions a harmful application
15 | 
16 | Impact statement: Uncertainty estimation for neural networks has very significant societal impact. Neural networks are increasingly being trained as black-box predictors and being placed in larger decision systems where errors in their predictions can pose immediate threat to downstream tasks. Systematic methods for calibrated uncertainty estimation under these conditions are needed, especially as these systems are deployed in safety critical domains, such for autonomous vehicle control [29], medical diagnosis [43], or in settings with large dataset imbalances and bias such as crime forecasting [24] and facial recognition [3]. This work is complementary to a large portion of machine learning research which is continually pushing the boundaries on neural network precision and accuracy. Instead of solely optimizing larger models for increased performance, our method focuses on how these models can be equipped with the ability to estimate their own confidence.
Our results demonstrating superior calibration of our method over baselines are also critical in ensuring that we can place a certain level of trust in these algorithms and in understanding when they say “I don’t know”. While there are clear and broad benefits of uncertainty estimation in machine learning, we believe it is also important to recognize potential societal challenges that may arise. With increased performance and uncertainty estimation capabilities, humans will inevitably become increasingly trusting in a model’s predictions, as well as its ability to catch dangerous or uncertain decisions before they are executed. Thus, it is important to continue to pursue redundancy in such learning systems to increase the likelihood that mistakes can be caught and corrected independently.
17 | Paper
18 | Label: mentions a harmful application
19 | 
20 | Impact statement: Hypothesis testing and valid inference after model selection are fundamental problems in statistics, which have recently attracted increasing attention also in machine learning. Kernel tests such as MMD are not only used for statistical testing, but also to design algorithms for deep learning and GANs [41, 42]. The question of how to select the test statistic naturally arises in kernel-based tests because of the kernel choice problem. Our work shows that it is possible to overcome the need of (wasteful and often heuristic) data splitting when designing hypothesis tests with feasible null distribution. Since this comes without relevant increase in computational resources we expect the proposed method to replace the data splitting approach in applications that fit the framework considered in this work. Theorem 1 is also applicable beyond hypothesis testing and extends the previously known PSI framework proposed by Lee et al. [24].
21 | Paper title: Learning Kernel Tests Without Data Splitting
22 | Label: doesn't mention a harmful application
23 | 
24 | Impact statement: With the proliferation of deep learning, explaining or understanding the reasons behind the models decisions has become extremely important in many critical applications [ 30]. Many explainability methods have been proposed in literature [7, 12, 11], however, they either provide instance specific local explanations or fit to the entire dataset and create global explanations. Our proposed method is able to create both such explanations, but in addition, it also creates explanations for subgroups in the data and all of this jointly. We thus are creating explanations for granularities (between local and global). This multilevel aspect has not been sufficiently researched before. In fact recently [4] has stressed the importance of having such multilevel explanations for successfully meeting the requirements of Europe’s General Data Protection Regulation (GDPR) [5]. They clearly state that simply having local or global explanations may not be sufficient for providing satisfactory explanations in many cases. There are also potential risks with this approach. The first is that if the base local explainer is non-robust or inaccurate [34, 35] then the explanations generated by our tree also may have to be considered cautiously. However this is not specific to our method, and applies to several post-hoc explainability methods that try to explain a black-box model. The way to mitigate this is to ensure that the local explanation methods are adapted (such as by choosing appropriate neighborhoods in LIME) to provide robust and accurate explanations. Another risk could be that such detailed multilevel explanations may reveal too much about the internals of the model (similar scenario for gradient-based models is discussed in [36]) and hence may raise privacy concerns.
Mitigation could happen by selectively revealing the levels / pruning the tree or having a budget of explanations for each user to balance the level of explanations vs. the exposure of the black-box model.
25 | Paper title: Model Agnostic Multilevel Explanations
26 | Label:
--------------------------------------------------------------------------------
/example_prompts/one_stop_english.txt:
--------------------------------------------------------------------------------
1 | The following is an article sourced from The Guardian newspaper, and rewritten by teachers to suit three levels of adult English as Second Language (ESL) learners: elementary, intermediate, and advanced. Predict the level of the article.
2 | Possible labels:
3 | 1. advanced
4 | 2. elementary
5 | 3. intermediate
6 | 
7 | Article: Cities don’t often move. But that’s exactly what Kiruna, an Arctic town in northern Sweden, has to do. It has to move or the earth will swallow it up.
8 | “It’s a terrible choice,” says Krister Lindstedt, who works for the Swedish architect company that is moving the city. They will move this city of 23,000 people away from a gigantic iron-ore mine that is swallowing up the ground beneath its streets. “Either the mine must stop digging, and then there will be no jobs, or the city has to move.”
9 | Kiruna was founded in 1900 by the state-owned Luossavaara-Kiirunavaara mining company (LK). The city became rich thanks to the very large amount of iron ore that is below the town. But the mine that made it rich is now going to destroy it. “The town is here because of the mine,” says Deputy Mayor Niklas Siren.
10 | Located 145km inside the Arctic Circle, Kiruna has a very difficult climate. It has winters with no sunlight and average temperatures of -15C. But the iron ore has kept people here. Kiruna is the world’s largest underground iron-ore mine. It produces 90% of all the iron in Europe.
That is enough to build more than six Eiffel
11 | Label: elementary
12 | 
13 | Article: Illegal downloading is a kind of “moral squalor” and theft, as much as putting your hand in someone’s pocket and stealing their wallet is theft, says author Philip Pullman. In an article for Index on Censorship, Pullman, who is president of the Society of Authors, strongly defends copyright laws. He criticizes internet users who think it is OK to download music or books without paying for them.
14 | “The technical brilliance is so dazzling that people can’t see the moral squalor of what they’re doing,” he writes. “It is outrageous that anyone can steal an artist’s work and get away with it. It is theft, just as putting your hand in someone’s pocket and taking their wallet is theft.”
15 | His article comes after music industry leaders met British Prime Minister David Cameron in Downing Street to discuss the issue of web piracy.
16 | Pullman, writer of the His Dark Materials trilogy, says authors and musicians work in poverty and obscurity for years to bring their work to the level “that gives delight to their audiences and, as soon as they achieve that, the possibility of earning a living from it is taken away from them”. He concludes: “The principle is simple, and unaltered by technology, science or magic: if we want to enjoy the work that someone does,
17 | Label: intermediate
18 | 
19 | Article: As soon as the children at one primary school in Stirling hear the words “daily mile”, they down their pencils and head out of the classroom to start running laps around the school field. For three-and-a-half years, all pupils at St Ninian’s Primary have walked or run a mile each day. They do so at random times during the day, apparently happily, and, despite the rise in childhood obesity across the UK, none of the children at the school are overweight.
20 | The daily mile has done so much to improve these children’s fitness, behaviour and concentration in lessons that scores of nursery and primary schools across Britain are following suit and getting pupils to get up from their desks and take 15 minutes to walk or run round the school or local park.
21 | Elaine Wyllie, headteacher of St Ninian’s, said: “I get at least two emails a day from other schools and local authorities asking how we do it. The thought of children across the country running every day because of something we’ve done is phenomenal.”
22 | One in ten children are obese when they start school at the age of four or five, according to figures from the Health & Social Care Information Centre, and, in the summer of 2015, a study found that schoolchildren in England are the least fit they have ever been.
23 | Label: advanced
24 | 
25 | Article: Back in 2005, when BlackBerry brought instant messaging to the mobile phone, the company was just entering its boom times. While the iPhone was still just an idea, BlackBerry’s innovations ensured its smartphone was one of Canada’s biggest exports.
26 | Six years later, in the summer of 2011, when there were riots in London and other UK cities, BlackBerry Messenger (BBM) was so effective at mobilizing the rioters that politicians wanted the service to be temporarily shut down. But, two years later, it is the users themselves who are pulling the plug.
27 | Demand for BlackBerry phones is falling. Dozens of alternatives have sprung up to take its place, from Facebook’s and Apple’s instant messaging applications to independent apps such as WhatsApp and Kik (which is also Canadian). They are free to download and use, and they use the internet to swap text messages, pictures, voice clips, ‘stickers’ and even videos between most types of phones.
28 | In an attempt to keep its customers, BBM has been released on Android and Apple phones.
Despite the competition from other apps, the response has been extraordinary, with more than 20 million downloads. But, despite this interest, many people believe BBM’s wider release will not save the service. “The move to bring BlackBerry to the iPhone is four or five years too late,” says James Gooderson, an 18 29 | Label: intermediate 30 | 31 | Article: Loneliness has finally become a hot topic. The Office for National Statistics has found Britain to be the loneliest place in Europe. British people are less likely to have strong friendships or know their neighbours than people anywhere else in the European Union. And research at the University of Chicago has found that loneliness is twice as bad for older people’s health as obesity and almost as great a cause of death as poverty. 32 | This is shocking but such studies do not examine the loneliness epidemic among younger adults. In 2010, the Mental Health Foundation found that loneliness was a greater concern among young people than among the elderly. The 18- to 34-year-olds surveyed were more likely to feel lonely often, to worry about feeling alone and to feel depressed because of loneliness than the over-55s. 33 | “Loneliness is a recognized problem among the elderly and there are day centres and charities to help them,” says Sam Challis, of the mental health charity Mind, “but, when young people reach 21, they’re too old for youth services.” This is problematic because of the close relationship between loneliness and mental health – it is linked to increased stress, depression, paranoia, anxiety, addiction and it is a known cause of suicide. 34 | But what can young people do to prevent loneliness? One researcher at the Oxford Internet Institute points out that social media and the internet can be both a good thing 35 | Label: intermediate 36 | 37 | Article: Many of us know we don’t get enough sleep but imagine if there was a simple solution: getting up later. 
In a speech at the British Science Festival, Dr Paul Kelley from Oxford University said schools should stagger their starting times to work with the natural rhythms of their students. This would improve exam results and students’ health (lack of sleep can cause diabetes, depression, obesity and other health problems). 38 | Dr Kelley said that, when children are around ten, their natural wake-up time is about 6.30am; at 16, this rises to 8am; and, at 18, a person’s natural waking hour is 9am, although you may think they are just a lazy teenager. The normal school starting time works for 10-year-olds but not for 16- to 18-year-olds. For the older teenagers, it might be better to start the school day at 11am or even later. “A 7am wake-up time for older teenagers,” says Kelley, “is the same as a 4.30am start for a teacher in their 50s.” 39 | He says the solution is not to tell teenagers to go to bed earlier. “The body’s natural rhythm is controlled by a particular kind of light,” says Kelley. “The eye has cells that report to a part of the brain that controls our sleep rhythms over a 24-hour cycle. It’s the light that controls it.” 40 | But it isn’t just students who would benefit from a later start. Kelley says the working day should be more linked to our natural rhythms. Describing the average sleep loss per night for different age groups, he says: “Between 14 and 24, people lose more than two hours. For people aged between 24 and about 30 or 35, they lose about an hour and a half. That can continue up until you’re about 55 when it’s in balance again. The 10-year-old and 55-year-old wake and sleep naturally at the same time.” 41 | So, should workplaces have staggered starting times, too? Should people in their 50s and above come in at 8am, people in their 30s start at 10am and the teenage apprentice at 11am? 
Kelley says that synchronized hours could have “many 42 | Label: -------------------------------------------------------------------------------- /example_prompts/overruling.txt: -------------------------------------------------------------------------------- 1 | In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. Label the sentence based on whether it is overruling or not. 2 | Possible labels: 3 | 1. not overruling 4 | 2. overruling 5 | 6 | Sentence: the following facts are taken from the administrative record. 7 | Label: not overruling 8 | 9 | Sentence: see scott, supra at 352; commonwealth v. ruffin, 475 mass. 1003, 1004 (2016). 10 | Label: not overruling 11 | 12 | Sentence: while not limited to these cases, to the extent the following cases are in conflict, they are overruled. 13 | Label: overruling 14 | 15 | Sentence: we reverse and remand, and in doing so, we overrule commonwealth v. constant 16 | Label: overruling 17 | 18 | Sentence: see boles, 554 so.2d at 961 ([i]f the county and other persons are not bound, then the status of the road as public or private is subject to being litigated again, and the results of later litigation may be inconsistent with the results of the initial litigation.). 19 | Label: not overruling 20 | 21 | Sentence: to the extent that paprskar v. state, supra, applied the general test of waiver of constitutional rights set forth in johnson v. zerbst, supra, it is no longer viable. 22 | Label: overruling 23 | 24 | Sentence: we flatly rejected this logic a century ago in state ex rel. state capitol commission v. lister, 91 wash. 9, 156 p. 858 (1916), and we reject it again now. 
25 | Label: overruling 26 | 27 | Sentence: in this case, the trial court did not clearly err by finding clear and convincing evidence to support termination under mcl 712a.19b(3)(g) and (j). 28 | Label: not overruling 29 | 30 | Sentence: app. 1981), or voninski v. voninski, 661 s.w.2d 872, 878-79 (tenn. 31 | Label: overruling 32 | 33 | Sentence: see tex. r. app. p. 48.4; see also in re schulman, 252 s.w.3d at 412 n.35; ex parte owens, 206 s.w.3d 670, 673 (tex. crim. app. 2006). 34 | Label: not overruling 35 | 36 | Sentence: we therefore overrule mcgore; and we hold, like every other circuit to have reached the issue, that under rule 15(a) a district court can allow a plaintiff to amend his complaint even when the complaint is subject to dismissal under the plra. 37 | Label: overruling 38 | 39 | Sentence: we recognize that this reading of fager disapproves prior cases. 40 | Label: overruling 41 | 42 | Sentence: to the extent that this opinion causes conflict with earlier decisions such as holmes, those cases are overruled. 43 | Label: overruling 44 | 45 | Sentence: we disapprove abdelaziz as well as henderson v. north, 545 so.2d 486 (fla. 1st dca 1989), which adopted the principle of abdelaziz, to the extent that they disapproved a cause of action for negligent stillbirth. 46 | Label: overruling 47 | 48 | Sentence: the decision of the fourth district court of appeal holding section 550.081 unconstitutional is disapproved. 49 | Label: overruling 50 | 51 | Sentence: furthermore, the trial court indicated in its order that it had ""consider[ed] . . . [appellant's] special appearance, the pleadings, the affidavits, and arguments of counsel."" 52 | Label: not overruling 53 | 54 | Sentence: however, to the extent that cervantes, and ex parte mcatee, 599 s.w.2d 335 (tex.crim.app. 1980), indicate that a failure to admonish pursuant to art. 
26.13(a)(4) automatically entitles one to post-conviction collateral relief without 55 | Label: overruling 56 | 57 | Sentence: to the extent that the holding in wilson v. bureau of state police, supra, conflicts with this opinion, it is overruled. 58 | Label: overruling 59 | 60 | Sentence: for the reasons stated below, we approve the fifth district court of appeal's decision in winter park, and disapprove the decision in belleair to the extent described herein. 61 | Label: overruling 62 | 63 | Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree. 64 | Label: overruling 65 | 66 | Sentence: having reviewed the question en banc, we now answer that question in the affirmative and overrule laffey. 67 | Label: overruling 68 | 69 | Sentence: accordingly, to the extent of any conflict nemecek v. state, 621 s.w.2d 404 (tex.cr.app. 1980) is overruled. 70 | Label: overruling 71 | 72 | Sentence: we therefore overrule mata and hartman to the extent of the conflict and reverse the trial court's judgment and remand the cause for a new trial. 73 | Label: overruling 74 | 75 | Sentence: in reaching that conclusion, we recede from the previous holding of this court in hall v. state, 505 so.2d 657, 658 (fla. 2d dca), cause dismissed, 509 so.2d 1117 (fla. 1987), in which we stated that an essential element of 76 | Label: overruling 77 | 78 | Sentence: we are fully in accord with the relaxation of the federal requirements as expressed in illinois v. gates, supra, and to the extent that berkshire v. commonwealth, supra; thompson v. commonwealth, supra; and buchenburger v. commonwealth, supra, express a contrary view, they 79 | Label: overruling 80 | 81 | Sentence: we overrule this holding based upon our conclusion that review by the court of appeals under section 22-63-117(11) is predicated upon a final order of the school board resulting from proceedings conducted under section 22-63-117. 
82 | Label: -------------------------------------------------------------------------------- /example_prompts/semiconductor_org_types.txt: -------------------------------------------------------------------------------- 1 | The dataset is a list of institutions that have contributed papers to semiconductor conferences in the last 25 years, as catalogued by IEEE and sampled randomly. The goal is to classify the institutions into one of three categories: "university", "company" or "research institute". 2 | Possible labels: 3 | 1. company 4 | 2. research institute 5 | 3. university 6 | 7 | Organization name: Central Research Laboratory,Hitachi Ltd. Kokubunji. Tokyo,Japan 8 | Paper title: Formation of Si-on-Insulator 9 | Label: company 10 | 11 | Organization name: MAPS,Yongin,Korea 12 | Paper title: 21.8 An all-in-one (Qi, PMA 13 | Label: company 14 | 15 | Organization name: Samsung Electronics Company Limited, Yongin si, Gyeonggi, South Korea 16 | Paper title: 1D thickness scaling study of phase change 17 | Label: company 18 | 19 | Organization name: Semiconductor Research Center,Matsushita Electric Industrial Co.,Ltd.,Yagumo-nakamachi,Morig 20 | Label: company 21 | 22 | Organization name: Advanced Circuit Pursuit,Zollikon,Switzerland; ETH,Zurich,Switzerland 23 | Paper title: A 0. 
24 | Label: company 25 | 26 | Organization name: Engim,Acton,MA,USA 27 | Paper title: A 180MS/s, 162Mb/s wideband three- 28 | Label: company 29 | 30 | Organization name: R & D Center, Samsung Electronics Kiheung-Eup, Yongin-City, Kyungki-do, Korea 31 | Paper 32 | Label: company 33 | 34 | Organization name: Panasonic,Osaka,Japan 35 | Paper title: 30.1 8b Thin-film microprocessor using a hybrid oxide-organic complementary technology 36 | Label: company 37 | 38 | Organization name: Center for Semiconductor Research & Development,Toshiba Corporation,Japan 39 | Paper title: Physical understanding of Vth and Idsat variations 40 | Label: company 41 | 42 | Organization name: Imec,Leuven,Belgium 43 | Paper title: An Artificial Iris ASIC with High Voltage Liquid Crystal Driver, 10 n 44 | Label: research institute 45 | 46 | Organization name: ETH,Zurich,Switzerland; Advanced Circuit Pursuit,Zollikon,Switzerland 47 | Paper title: A 0. 48 | Label: university 49 | 50 | Organization name: imec from Samsung Electronics,Korea 51 | Paper title: First Demonstration of Low Temperature (≤500°C) CMOS 52 | Label: company 53 | 54 | Organization name: Technology Research Department,Association of Super-Advanced Electronics Technologies (ASET),,Higashi-koigakubo,K 55 | Label: research institute 56 | 57 | Organization name: Texas Instruments Bangalore and Texas Instruments,Dallas,TX 58 | Paper title: A DSL customer-premise equipment modem SoC with extended reach/ 59 | Label: company 60 | 61 | Organization name: Memory Division, Samsung Electronics Co, Yongin-City, Gyeonggi-Do, Korea 62 | Paper title: Front-end- 63 | Label: company 64 | 65 | Organization name: Fujitsu Laboratories Ltd.,Japan 66 | Paper title: Development of sub 10-µm ultra-thinning technology using device wafers 67 | Label: company 68 | 69 | Organization name: National NanoFab Center, Daejeon, South Korea 70 | Paper title: 3-terminal nanoelectromechanical switching 71 | Label: research institute 72 | 73 | Organization name: 
Semiconductor R&D Center, Samsung Electronics Co., Ltd, Yongin-City, Gyeonggi-Do, Korea ( 74 | Label: company 75 | 76 | Organization name: Pohang Univ. of Sci. & Technol.,South Korea 77 | Paper title: A 3Gb/s 8b single-ended 78 | Label: university 79 | 80 | Organization name: MPI für Mikrostrukturphysik, Halle, Germany 81 | Paper title: Dislocation engineering for a silicon 82 | Label: research institute 83 | 84 | Organization name: Sony,Tokyo,Japan 85 | Paper title: A 3.1 to 5 GHz CMOS DSSS UWB transceiver for WP 86 | Label: company 87 | 88 | Organization name: Illinois Univ.,Urbana,IL,USA 89 | Paper title: A 14 b 100 Msample/s CMOS DAC designed for spectral 90 | Label: university 91 | 92 | Organization name: ULSI Device Dev. Labs.,NEC Corp.,Kanagawa,Japan 93 | Paper title: A crossing charge recycle refresh scheme 94 | Label: company 95 | 96 | Organization name: Syst. LSI Dev. Center,Mitsubishi Electr. Corp.,Hyogo,Japan 97 | Paper title: Single- 98 | Label: company 99 | 100 | Organization name: Matsushita Electric Industrial Limited, Takatsuki, Osaka, Japan 101 | Paper title: Role of non-radiative recombination in the 102 | Label: company 103 | 104 | Organization name: Corp. Semicond. Dev. Div.,Matsushita Electr. Ind. Co. 
Ltd.,Kyoto,Japan 105 | 106 | Label: company 107 | 108 | Organization name: KUL, Leuven, Belgium 109 | Paper title: Benchmarking of monolithic 3D integrated MX2 FETs with Si 110 | Label: university 111 | 112 | Organization name: imec,Heverlee,Belgium 113 | Paper title: 24.4 A 680nA fully integrated implantable EC 114 | Label: research institute 115 | 116 | Organization name: Applied Science and Technology Research Institute,Hong Kong 117 | Paper title: A 48-mW 18-Gb/s fully integrated CMOS 118 | Label: research institute 119 | 120 | Organization name: North Carolina State Univ.,Raleigh,NC,USA 121 | Paper title: 3Gb/s AC-coupled chip-to- 122 | Label: university 123 | 124 | Organization name: Philips Composants et Semiconducteurs,Caen,France 125 | Paper title: A 12 b 50 M sample/s cascaded 126 | Label: company 127 | 128 | Organization name: Samsung Advanced Logic Lab,Austin,TX 129 | Paper title: High performance and low leakage current InGaAs-on-silicon FinF 130 | Label: company 131 | 132 | Organization name: SoC R&D Center,Semiconductor Company,Toshiba Corp.,Isogo-ku,Yokohama,Japan 133 | Label: company 134 | 135 | Organization name: Incubation Center,Renesas Electronics Corp.,Shimokuzawa,Chuou-ku,Sagamihara, 136 | Label: company 137 | 138 | Organization name: IBM Systems Group,Austin,TX 139 | Paper title: Design of the Power6 Microprocessor 140 | Label: company 141 | 142 | Organization name: APA Optics, Inc., Blaine, MN, USA 143 | Paper title: High performance 0.25 /spl mu/m gate 144 | Label: company 145 | 146 | Organization name: Texas Instruments Inc, Dallas, TX, US 147 | Paper title: Damascene integration of copper and ultra-low-k xerog 148 | Label: company 149 | 150 | Organization name: Cisco Systems, Hong Kong, China 151 | Paper title: Characterizing Electromigration Effects in a 16nm FinFET Process Using a Circuit 152 | Label: company 153 | 154 | Organization name: Toshiba at Albany NanoTech,NY,USA 155 | Paper title: Full metal gate with borderless contact for 
14 nm and beyond 156 | Label: company 157 | 158 | Organization name: Advanced LCD Technology Development Center Company Limited, Yokohama, Kanagawa, Japan 159 | Paper title: Sub-Micron CMOS / 160 | Label: company 161 | 162 | Organization name: IBM Microelectronics, Burlington, VT, USA 163 | Paper title: Large-signal performance of high-BV/sub CEO/ 164 | Label: company 165 | 166 | Organization name: GLOBALFOUNDRIES Inc., Albany, NY, USA 167 | Paper title: Accurate performance evaluation for the horizontal nanosheet 168 | Label: company 169 | 170 | Organization name: MIRAI-ASET,Kawasaki,Japan 171 | Paper title: Strained SOI technology for high-performance, low- 172 | Label: university 173 | 174 | Organization name: IBM Microelectron.,Burlington,VT,USA 175 | Paper title: A 500MHz multi-banked compilable DRAM macro 176 | Label: company 177 | 178 | Organization name: IBM Microelectron.,Hopewell Junction,NY,USA 179 | Paper title: Destructive-read random access memory system buffered 180 | Label: company 181 | 182 | Organization name: Fujitsu Laboratories Ltd., Atsugi, Kanagawa, Japan 183 | Paper title: A 65 nm CMOS technology with a high- 184 | Label: company 185 | 186 | Organization name: Intel,Hillsboro,OR 187 | Paper title: 25.5 A Self-Calibrated 1.2-to-3. 
188 | Label: company 189 | 190 | Organization name: Strategic Technology Group,Advanced Micro Devices,Sunnyvale,CA,USA 191 | Paper title: Collective-effect state variables for post-CM 192 | Label: company 193 | 194 | Organization name: IBM Semiconductor Research and Development Center (SRDC), Samsung Electronics Company Limited, Hopewell Junction, NY, USA 195 | Paper title 196 | Label: company 197 | 198 | Organization name: QRE, Hillsboro, OR, USA 199 | Paper title: An enhanced 130 nm generation logic technology featuring 60 nm transistors optimized for high 200 | Label: company 201 | 202 | Organization name: Portland Technology Development, Hillsboro, OR, USA 203 | Paper title: An enhanced 130 nm generation logic technology featuring 60 nm transistors optimized for high performance and low power at 0.7 - 1.4 V 204 | Label: -------------------------------------------------------------------------------- /example_prompts/systematic_review_inclusion.txt: -------------------------------------------------------------------------------- 1 | Identify whether this paper should be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations. 2 | Included reviews should describe monetary charitable donations, assess any population of participants in any context, and be peer reviewed and written in English. 3 | They should not report new data, be non-systematic reviews, consider cause-related marketing or other kinds of prosocial behaviour. 4 | Possible labels: 5 | 1. included 6 | 2. not included 7 | 8 | Title: Imagine being a nice guy: A note on hypothetical vs. Incentivized social preferences 9 | Abstract: We conducted an experimental study on social preferences using dictator games similar to Fehr et al. (2008). Our results show that social preferences differ between subjects who receive low-stakes monetary rewards for their decisions and subjects who consider hypothetical stakes. 
Our findings indicate that, apart from incentives, gender plays an important role for the categorization of different social preferences. © 2015. The authors license. 10 | Journal: Judgm. Decis. Mak. 11 | Label: not included 12 | 13 | Title: Consumer reaction to price increase: An investigation in gasoline industry 14 | Abstract: Purpose – The aim of this study is to investigate the impact of increase in price of an essential product (i.e. gasoline) toward the focal product and other seemingly non-related products. Design/methodology/approach – A self-administered survey was used to collect data from the drivers at a large metroplex in Southwest USA. Multiple regression and scanning electron microscope procedures were used to analyze and test the proposed hypotheses. Findings – When consumers notice the increase in gas prices, they become very anxious. This anxiety is positively associated with average gas bought in gallons and negatively associated with threshold price. Further, this consumer anxiety has the strongest influence on lifestyle changes, followed by automobile technology change and transportation mode change, and has the weakest influence on gasoline brand/type change. Research limitations/implications – We focus on only anxiety as a mediator between increase in gas prices and the behavioral outcomes, and collect data from only one location. Practical implications – Managers must be cognizant that a price increase in essential goods not only influences the demand for focal products but also for products that may not seem related to the focal products. Social implications – Increase in gasoline price will not only affect the demand for gasoline, but also the demand for alternate forms of transportation, fuel efficient vehicles, and other aspects of life. Originality/value – This study is the first to look at the role of anxiety as a mediator and looks at the effects of increase in gas prices in a holistic manner. © Emerald Group Publishing Limited. 
15 | Journal: J 16 | Label: not included 17 | 18 | Title: Being sticker rich: Numerical context influences children's sharing behavior 19 | Abstract: Young children spontaneously share resources with anonymous recipients, but little is known about the specific circumstances that promote or hinder these prosocial tendencies. Children (ages 3-11) received a small (12) or large (30) number of stickers, and were then given the opportunity to share their windfall with either one or multiple anonymous recipients (Dictator Game). Whether a child chose to share or not varied as a function of age, but was uninfluenced by numerical context. Moreover, children's giving was consistent with a proportion- based account, such that children typically donated a similar proportion (but different absolute number) of the resources given to them, regardless of whether they originally received a small or large windfall. The proportion of resources donated, however, did vary based on the number of recipients with whom they were allowed to share, such that on average, children shared more when there were more recipients available, particularly when they had more resources, suggesting they take others into consideration when making prosocial decisions. Finally, results indicated that a child's gender also predicted sharing behavior, with males generally sharing more resources than females. Together, findings suggest that the numerical contexts under which children are asked to share, as well as the quantity of resources that they have to share, may interact to promote (or hinder) altruistic behaviors throughout childhood. © 2015 Posid et al. 
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author 20 | Label: not included 21 | 22 | Title: Public charity offer as a proximate factor of evolved reputation-building strategy: an experimental analysis of a real-life situation 23 | Abstract: Although theoretical considerations suggest that a considerable portion of human altruism is driven by concerns about reputation, few experimental studies have examined the psychological correlates of individual decisions in real-life situations. Here we demonstrate that more subjects were willing to give assistance to unfamiliar people in need if they could make their charity offers in the presence of their group mates than in a situation where the offers remained concealed from others. In return, those who were willing to participate in a particular charitable activity received significantly higher scores than others on scales measuring sympathy and trustworthiness. Finally, a multiple regression analysis revealed that while several personality and behavior traits (cooperative ability, Machiavellianism, sensitivity to norms, and sex) play a role in the development of prosocial behavior, the possibility of gaining reputation within the group remains a measurable determinant of charitable behavior. © 2007 Elsevier Inc. All rights reserved. 24 | Journal: Evol. Hum. Behav. 25 | Label: not included 26 | 27 | Title: Assessing actual strategic behavior to construct a measure of strategic ability 28 | Abstract: Strategic interactions have been studied extensively in the area of judgment and decision-making. However, so far no specific measure of a decision-maker's ability to be successful in strategic interactions has been proposed and tested. Our contribution is the development of a measure of strategic ability that borrows from both game theory and psychology. 
Such measure is aimed at providing an estimation of the likelihood of success in many social activities that involve strategic interaction among multiple decision-makers. To construct a reliable measure of strategic ability, that we propose to call "Strategic Quotient" (SQ), we designed a test where each item is a game and where, therefore, the individual obtained score depends on the distribution of choices of other decision-makers taking the test. The test is designed to provide information on the abilities related to two dimensions, mentalization and rationality, that we argue are crucial to strategic success, with each dimension being characterized by two main factors. Principal component analysis on preliminary data shows that indeed four factors (two for rationality, two for mentalization) account for strategic success in most of the strategically simpler games of the test. Moreover, two more strategically sophisticated games are inserted in the test and are used to investigate if and to what extent the four factors obtained by simpler games can predict strategic success in more sophisticated strategic interactions. Overall, the collected empirical evidence points to the possibility of building a SQ measure using only simple games designed to capture information about the four identified factors. © 2019 Bilancini, Boncinelli and Mattiassi. 29 | Journal: Front. 30 | Label: not included 31 | 32 | Title: How construals of money versus time impact consumer charitable giving 33 | Abstract: While past research has suggested that consumers have fundamentally different responses to thinking about money versus time, the current work clarifies an important nuance in terms of how consumers construe these two resources. We demonstrate that, in the domain of charitable giving, money is construed relatively more concretely, whereas time is construed relatively more abstractly. 
This difference in the construal of these two resources has implications for how appeals for charitable contributions or money versus time should be framed. When the construal level at which the consumer considers the cause is aligned (misaligned) with the construal level of the resource being requested, contribution intentions and behaviors increase (decrease). In addition, the moderating role of resource abundance is examined. In particular, when money is considered abundant (vs. nonabundant), consumers no longer exhibit more concrete thoughts in response to money compared to time. Finally, when the donation request makes consumers think of money in a more abundant manner, monetary donations can be successfully motivated with a more abstract call for charitable support. The theoretical and practical implications for marketers and charitable organizations are discussed. © The Author 2015. 34 | Journal: J. Consum. Res. 35 | Label: -------------------------------------------------------------------------------- /example_prompts/tai_safety_research.txt: -------------------------------------------------------------------------------- 1 | Transformative AI (TAI) is defined as AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution. Label a paper as "TAI safety research" if: 2 | 1. The contents of the paper are directly motivated by, and substantively inform, the challenge of ensuring good outcomes for TAI, 3 | 2. There is substantive content on AI safety, not just AI capabilities, 4 | 3. The intended audience is the community of researchers, 5 | 4. It meets a subjective threshold of seriousness/quality, 6 | 5. Peer review is not required. 7 | Possible labels: 8 | 1. TAI safety research 9 | 2. 
not TAI safety research 10 | 11 | Title: One Decade of Universal Artificial Intelligence 12 | Abstract Note: The first decade of this century has seen the nascency of the first mathematical theory of general artificial intelligence. This theory of Universal Artificial Intelligence (UAI) has made significant contributions to many theoretical, philosophical, and practical AI questions. In a series of papers culminating in book (Hutter, 2005), an exciting sound and complete mathematical model for a super intelligent agent (AIXI) has been developed and rigorously analyzed. While nowadays most AI researchers avoid discussing intelligence, the award-winning PhD thesis (Legg, 2008) provided the philosophical embedding and investigated the UAI-based universal measure of rational intelligence, which is formal, objective and non-anthropocentric. Recently, effective approximations of AIXI have been derived and experimentally investigated in JAIR paper (Veness et al. 2011). This practical breakthrough has resulted in some impressive applications, finally muting earlier critique that UAI is only a theory. For the first time, without providing any domain knowledge, the same agent is able to self-adapt to a diverse range of interactive environments. For instance, AIXI is able to learn from scratch to play TicTacToe, Pacman, Kuhn Poker, and other games by trial and error, without even providing the rules of the games. These achievements give new hope that the grand goal of Artificial General Intelligence is not elusive. This article provides an informal overview of UAI in context. It attempts to gently introduce a very 13 | Label: TAI safety research 14 | 15 | Title: Coherence arguments do not imply goal-directed behavior 16 | Abstract Note: One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. 
If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here. We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the agent doesn’t fix the failures because it would not be worth it -- in this case, the argument says that we will not be able to notice any exploitable failures.) Taken together, these arguments suggest that we should model an agent much smarter than us as an expected utility (EU) maximizer. And many people agree that EU maximizers are dangerous. So does this mean we’re doomed? I don’t think so: it seems to me that the problems about EU maximizers that we’ve identified are actually about goal-directed behavior or explicit reward maximizers. The coherence theorems say nothing about whether an AI system must look like one of these 17 | Label: TAI safety research 18 | 19 | Title: Teaching A.I. Systems to Behave Themselves (Published 2017) 20 | Abstract Note: As philosophers and pundits worry that artificial intelligence will one day harm the world, some researchers are working on ways to lower the risks. 21 | Publication Title: The New York Times 22 | Item Type: newspaperArticle 23 | Publication Year: 2017 24 | Label: not TAI safety research 25 | 26 | Title: Advancing rational analysis to the algorithmic level 27 | Abstract Note: Abstract The commentaries raised questions about normativity, human rationality, cognitive architectures, cognitive constraints, and the scope or resource rational analysis (RRA). 
We respond to these questions and clarify that RRA is a methodological advance that extends the scope of rational modeling to understanding cognitive processes, why they differ between people, why they change over time, and how they could be improved. 28 | Publication Title: Behavioral and Brain Sciences 29 | Item Type: journalArticle 30 | Publication Year: 2020 31 | Label: not TAI safety research 32 | 33 | Title: The Role and Limits of Principles in AI Ethics: Towards a Focus on Tensions 34 | Abstract Note: The last few years have seen a proliferation of principles for AI ethics. There is substantial overlap between different sets of principles, with widespread agreement that AI should be used for the common good, should not be used to harm people or undermine their rights, and should respect widely held values such as fairness, privacy, and autonomy. While articulating and agreeing on principles is important, it is only a starting point. Drawing on comparisons with the field of bioethics, we highlight some of the limitations of principles: in particular, they are often too broad and high-level to guide ethics in practice. We suggest that an important next step for the field of AI ethics is to focus on exploring the tensions that inevitably arise as we try to implement principles in practice. By explicitly recognising these tensions we can begin to make decisions about how they should be resolved in specific cases, and develop frameworks and guidelines for AI ethics that are rigorous and practically relevant. We discuss some different specific ways that tensions arise in AI ethics, and what processes might be needed to resolve them. 
35 | Publication Title: AIES '19: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society 36 | Item Type: conferencePaper 37 | Publication Year: 2019 38 | Label: TAI safety research 39 | 40 | Title: Computer Simulations as a Technological Singularity in the Empirical Sciences 41 | Abstract Note: Summary: In this paper, I discuss the conditions necessary for computer simulations to qualify as a technological singularity in the empirical sciences. A technological singularity encompasses two claims: (a) the enhancement of human cognitive capacities by the computer, and (b) their displacement from the center of the production of knowledge. For computer simulations to be a technological singularity, then, they must fulfill points (a) and (b) above. Although point (a) is relatively unproblematic, point (b) needs further analysis. In particular, in order to show that humans could be displaced from the center of the production of knowledge, it is necessary to establish the reliability of computer simulations. That is, I need to show that computer simulations are reliable processes that render, most of the time, valid results. To be a reliable process, in turn, means that simulations accurately represent the target system and carry out error-free computations. I analyze verification and validation methods as the grounds for such representation accuracy and error-free computations. Since the aim is to entrench computer simulations as a technological singularity, the entire analysis must be careful to keep human agents out of the picture.
42 | Publication Title: The Technological Singularity: Managing the Journey 43 | Item Type: bookSection 44 | Publication Year: 2017 45 | Label: -------------------------------------------------------------------------------- /example_prompts/terms_of_service.txt: -------------------------------------------------------------------------------- 1 | Label the sentence from a Terms of Service based on whether it is potentially unfair. If it seems clearly unfair, mark it as potentially unfair. 2 | According to art. 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts, a contractual term is unfair if: 1) it has not been individually negotiated; and 2) contrary to the requirement of good faith, it causes a significant imbalance in the parties rights and obligations, to the detriment of the consumer. 3 | Details on types of potentially unfair clauses are found below: 4 | The jurisdiction clause stipulates what courts will have the competence to adjudicate disputes under the contract. Jurisdiction clauses giving consumers a right to bring disputes in their place of residence were marked as clearly fair, whereas clauses stating that any judicial proceeding takes a residence away were marked as clearly unfair. 5 | The choice of law clause specifies what law will govern the contract, meaning also what law will be applied in potential adjudication of a dispute arising under the contract. Clauses defining the applicable law as the law of the consumer's country of residence were marked as clearly fair. In every other case, the choice of law clause was considered as potentially unfair. 6 | The limitation of liability clause stipulates that the duty to pay damages is limited or excluded, for certain kind of losses, under certain conditions. Clauses that explicitly affirm non-excludable providers' liabilities were marked as clearly fair. 
Clauses that reduce, limit, or exclude the liability of the service provider were marked as potentially unfair when concerning broad categories of losses or causes of them. 7 | The unilateral change clause specifies the conditions under which the service provider could amend and modify the terms of service and/or the service itself. Such a clause was always considered as potentially unfair. 8 | The unilateral termination clause gives the provider the right to suspend and/or terminate the service and/or the contract, and sometimes details the circumstances under which the provider claims to have a right to do so. 9 | The contract by using clause stipulates that the consumer is bound by the terms of use of a specific service, simply by using the service, without even being required to mark that he or she has read and accepted them. We always marked such clauses as potentially unfair. 10 | The content removal clause gives the provider a right to modify/delete user's content, including in-app purchases, and sometimes specifies the conditions under which the service provider may do so. 11 | The arbitration clause requires or allows the parties to resolve their disputes through an arbitration process, before the case could go to court. Clauses stipulating that the arbitration should take place in a state other than the state of consumer's residence or be based on arbiter's discretion were marked as clearly unfair. Clauses defining arbitration as fully optional were marked as clearly fair. 12 | Possible labels: 13 | 1. not potentially unfair 14 | 2. potentially unfair 15 | 16 | Sentence: You acknowledge and agree that posting any such user content may result in immediate termination or suspension of your spotify account. 17 | Label: potentially unfair 18 | 19 | Sentence: You may be able to access certain third-party links, applications or content (``third-party applications'') via accounts related to our services.
20 | Label: not potentially unfair 21 | 22 | Sentence: These terms, and any other policies or rules we reference in these terms, make up the entire agreement between you and us relating to the subject matter of these terms, and supersede all prior understandings of the parties relating to the subject matter of these terms, whether those prior understandings were electronic, oral or written, or whether established by custom, practice, policy or precedent, between you and us. 23 | Label: not potentially unfair 24 | 25 | Sentence: Supercell may reject, refuse to post or delete any user content for any or no reason, including, but not limited to, user content that in the sole judgment of supercell violates these terms of service. 26 | Label: potentially unfair 27 | 28 | Sentence: Tinder does not represent or warrant that (a) the service will be uninterrupted, secure or error free, (b) any defects or errors in the service will be corrected, or (c) that any content or information you obtain on or through the services will be accurate. 29 | Label: potentially unfair 30 | 31 | Sentence: There are no refunds for termination or cancellation of your premium service. 32 | Label: -------------------------------------------------------------------------------- /example_prompts/tweet_eval_hate.txt: -------------------------------------------------------------------------------- 1 | Label whether the following tweet contains hate speech against either immigrants or women. Hate Speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics. 2 | Possible labels: 3 | 1. hate speech 4 | 2. 
not hate speech 5 | 6 | Tweet: International migration from sub-Saharan Africa has grown dramatically since 2010 7 | Label: not hate speech 8 | 9 | Tweet: World Relief is continuing to provide legal aid and advocating for immigrant families who have been separated at the U.S.-Mexico border. You can 10 | Label: not hate speech 11 | 12 | Tweet: The Ethiopia-Eritrea rapprochement: what does it mean for refugees? 13 | Label: not hate speech 14 | 15 | Tweet: It's clear that the administration's attempts to punish local governments for refusing to cooperate with immigration enforcement are unconstitutional. The administration should focus on treating 16 | Label: not hate speech 17 | 18 | Tweet: The first sexual attack against a woman happened in Hungary. The criminal is a "legal" afghan rapefugee. This is how Or 19 | Label: hate speech 20 | 21 | Tweet: Syria's military continues to advance in the south west 22 | Label: not hate speech 23 | 24 | Tweet: President Trump Weekly Address – Immigration Reform Needed… 25 | Label: not hate speech 26 | 27 | Tweet: They do not need them, I do not need them nor do I require them to enrich my life. #Stoptheinvasion #Brexit # 28 | Label: hate speech 29 | 30 | Tweet: Why weren't the democrats in the streets marching for the separated families of people like Jamiel Shaw when he was separated from his son forever 31 | Label: hate speech 32 | 33 | Tweet: UN wants ‘immediate halt’ to separation of migrant families in US #Trumpadministration #UnitedNations... 34 | Label: not hate speech 35 | 36 | Tweet: Anybody can dig a hole and plant a tree. But to make sure it survives, you have to nurture it, you have to suffer 37 | Label: not hate speech 38 | 39 | Tweet: Talent- Meeting up with complete strangers for sex 40 | Label: not hate speech 41 | 42 | Tweet: Important note: not all women do this and tale responsibility for their actions, just like not all men are assholes.
43 | Label: not hate speech 44 | 45 | Tweet: Its time - to stop the immigration - to USA ! Specially from 40 muslim countries. The president named it: SHITHOLE- 46 | Label: hate speech 47 | 48 | Tweet: UK Pensioner Faces 350 Lashes In Saudi Arabia why does this country exist it does nothing for migrants picks on old men no help from anyone 49 | Label: not hate speech 50 | 51 | Tweet: This is outrageous! But, sadly, this kind of corruption is what we've gotten used to with our current administration! 😠 52 | Label: not hate speech 53 | 54 | Tweet: Germany Navy rescued 22.000 refugees in the middlesea. Germany has not enough refugees, it wants more! by state duty. The 55 | Label: not hate speech 56 | 57 | Tweet: Women want u to automatically believe women who scream rape they don't understand our position.... 58 | Label: hate speech 59 | 60 | Tweet: | Woman who accused Nelly of rape wants to drop charges 61 | Label: not hate speech 62 | 63 | Tweet: I didn’t realize girls who wear huge bows in their hair called other women “cunt”. “Northern Bell 64 | Label: not hate speech 65 | 66 | Tweet: We are excited to host the #ImmigrantStories film screening and panel today with @user to show how the Trump administration’s 67 | Label: not hate speech 68 | 69 | Tweet: Stop trying to please everyone, you're not my cock... Please yourself 70 | Label: not hate speech 71 | 72 | Tweet: New to Twitter-- any men on here know what the process is to get #verified? 73 | Label: not hate speech 74 | 75 | Tweet: US Navy Seal, which killed Osama wrote a book. It tells how this specialforces work today. 76 | Label: not hate speech 77 | 78 | Tweet: Jennifer Lawrence is complaining about being a whore, meanwhile she's punching old Asian women in face who are sewing her costumes for her 79 | Label: not hate speech 80 | 81 | Tweet: “When Mexico sends its people, they’re not sending their best... They’re bringing drugs. 
They’re bringing 82 | Label: hate speech 83 | 84 | Tweet: Germany: Islamic-inspired children’s park in Berlin to introduce children to Islam and the Islamic culture of the refugees - Free Speech Time - 85 | Label: hate speech 86 | 87 | Tweet: HOLLERINGGGGGGG. stupid baby bitch. 88 | Label: hate speech 89 | 90 | Tweet: Homeland Security says he is a REFUGEE. Iraqi refugee charged with attempted murder of Colorado police officer - - @user 91 | Label: not hate speech 92 | 93 | Tweet: cormacs such a cunt won’t even let me squeeze his blackheads pussy bitch 94 | Label: hate speech 95 | 96 | Tweet: How keyboard gangsters feel leaving their stupid ass comments 97 | Label: not hate speech 98 | 99 | Tweet: HOW REFRESHING! In South Korea, there is no such thing as 'political correctness" when it comes to dealing with Muslim refugee w 100 | Label: hate speech 101 | 102 | Tweet: Curious to know how @user delivers emergency food supplies to #refugees and other vulnerable people around the world? Watch this video: 103 | Label: not hate speech 104 | 105 | Tweet: Pope Francis criticizes policy of migrant family separation via @user Stick to religion Mr. Red Pope... 106 | Label: not hate speech 107 | 108 | Tweet: I hate bitches that are always mad at the world for no reason like bitch stfu and be happy hoe 109 | Label: hate speech 110 | 111 | Tweet: #TakeAKnee trump's a chump! @user is the real deal-helping so many on his journey through greatness! Even 112 | Label: not hate speech 113 | 114 | Tweet: RT @user Her:I don't get what u want outta this relationship Him:Well, I was only looking for a bj 115 | Label: not hate speech 116 | 117 | Tweet: @user @user BUT. I would bet you're not looking for facts. 
That's why you obfuscate.President Trump at NATO: 118 | Label: hate speech 119 | 120 | Tweet: @user @user @user Imagine THE SIZEof THAT RED WAVE !!GOD BLESS TRUMP!Build That Wall#BuildThatWall 121 | Label: hate speech 122 | 123 | Tweet: @user @user 2 hysterical women in the US Senate 124 | Label: hate speech 125 | 126 | Tweet: @user @user Just choke on your salad sandwich already. Those Men did not die during the Gulf War and do you really think that it 127 | Label: not hate speech 128 | 129 | Tweet: @user Noely I see you've become hysterical since I last saw you. Don't blam you one bit, you're in 130 | Label: not hate speech 131 | 132 | Tweet: @user Exactly! Why would Americans vote for anyone that supports these absurd policies? IMO it is treason and self destruction to vote for politicians that 133 | Label: hate speech 134 | 135 | Tweet: @user Going to be where no one will leave the house or even look at a woman because she will claim “rape”.… 136 | Label: hate speech 137 | 138 | Tweet: Dont worry @user you are and will always be the most hysterical woman. 139 | Label: hate speech 140 | 141 | Tweet: @user thanks for ruining a beloved song from “The Sound of Music” by using that extremely creepy electronic version in your commercial. 142 | Label: not hate speech 143 | 144 | Tweet: @user mmm perfect body and so am i am now come here girl i got a big cock for u to play with ;) 145 | Label: hate speech 146 | 147 | Tweet: @user One man one vote right?Yet here we have one man assuming he can demand the importation of millions of non citizens for HIS 148 | Label: hate speech 149 | 150 | Tweet: @user It's kind of funny, really. Were the space aliens illegal? Now we need a wall & a roof! 
I think this country 151 | Label: hate speech 152 | 153 | Tweet: @user Skank traitor bitch @user 154 | Label: hate speech 155 | 156 | Tweet: @user id marry this fukin whore,& let the bitch behind her be best lady at the wedding 157 | Label: -------------------------------------------------------------------------------- /example_prompts/twitter_complaints.txt: -------------------------------------------------------------------------------- 1 | A complaint presents a state of affairs which breaches the writer’s favorable expectation. Label the tweet text based on whether it contains a complaint. 2 | Possible labels: 3 | 1. complaint 4 | 2. no complaint 5 | 6 | Tweet text: @NCIS_CBS https://t.co/eeVL9Eu3bE 7 | Label: no complaint 8 | 9 | Tweet text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude/condescending. I'll take my $$ to @Sephora 10 | Label: complaint 11 | 12 | Tweet text: @Lin_Manuel @jmessinaphoto @VAMNit Omg a little squish!!!!! Enjoy and congrats!!!! I miss mine being so young! ������ 13 | Label: no complaint 14 | 15 | Tweet text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https://t.co/WRtNsokblG 16 | Label: no complaint 17 | 18 | Tweet text: @JetBlue Completely understand but would prefer being on time to filling out forms.... 19 | Label: no complaint 20 | 21 | Tweet text: @DIRECTV can I get a monthly charge double refund when it sprinkles outside and we lose reception? #IamEmbarrasedForYou 22 | Label: complaint 23 | 24 | Tweet text: I'm earning points with #CricketRewards https://t.co/GfpGhqqnhE 25 | Label: no complaint 26 | 27 | Tweet text: Looks tasty! 
Going to share with everyone I know #FebrezeONE #sponsored https://t.co/4AQI53npei 28 | Label: no complaint 29 | 30 | Tweet text: @Schrapnel @comcast RIP me 31 | Label: no complaint 32 | 33 | Tweet text: @VerizonSupport all of a sudden I can't connect to my primary wireless network but guest one works 34 | Label: no complaint 35 | 36 | Tweet text: Just posted a photo https://t.co/RShFwCjPHu 37 | Label: no complaint 38 | 39 | Tweet text: @greateranglia Could I ask why the Area in front of BIC Station was not gritted withh all the snow. 40 | Label: complaint 41 | 42 | Tweet text: @IanJamesPoulter What's your secret to poaching eggs? Mine NEVER look that good. 43 | Label: no complaint 44 | 45 | Tweet text: Aaaahhhhh!!!! My @Razer @PlayOverwatch d.va meka headset came in!!! I didn't even know it had shipped!!! So excited… https://t.co/4gXy9xED8d 46 | Label: no complaint 47 | 48 | Tweet text: @asblough Yep! It should send you a notification with your driver’s name and what time they’ll be showing up! 49 | Label: no complaint 50 | 51 | Tweet text: @NortonSupport Thanks much. 52 | Label: no complaint 53 | 54 | Tweet text: I just gave 5 stars to Tracee at @neimanmarcus for the great service I received! 55 | Label: no complaint 56 | 57 | Tweet text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService 58 | Label: complaint 59 | 60 | Tweet text: @AlfaRomeoCares Hi thanks for replying, could be my internet but link doesn't seem to be working 61 | Label: complaint 62 | 63 | Tweet text: @HMRCcustomers No this is my first job 64 | Label: no complaint 65 | 66 | Tweet text: @JenniferTilly Merry Christmas to as well. 
You get more stunning every year �� 67 | Label: no complaint 68 | 69 | Tweet text: @SouthwestAir I love you but when sending me flight changes please don't use military time #ignoranceisbliss 70 | Label: complaint 71 | 72 | Tweet text: @NortonSupport @NortonOnline What the hell is a dm 5-10 days to get money back bank account now overdrawn thanks guys 73 | Label: complaint 74 | 75 | Tweet text: @ZARA_Care I've been waiting on a reply to my tweets and DMs for days now? 76 | Label: complaint 77 | 78 | Tweet text: @TopmanAskUs please just give me my money back. 79 | Label: complaint 80 | 81 | Tweet text: @BurberryService Thanks for sending my Christmas present with the security protection still on it! pic.twitter.com/iI0DUxOUU2 82 | Label: -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | scipy 2 | scikit-learn==0.24.2 3 | datasets 4 | transformers 5 | openai 6 | sacred 7 | python-dotenv 8 | cachetools 9 | torch 10 | sentence-transformers 11 | ipykernel 12 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import find_packages, setup 2 | 3 | 4 | setup( 5 | name="raft_baselines", 6 | version="0.0.1", 7 | description="RAFT Benchmarks baselines, classifiers, and testing scripts.", 8 | python_requires=">=3.7.0", 9 | packages=find_packages("src"), 10 | package_dir={"": "src"}, 11 | install_requires=[], 12 | ) 13 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oughtinc/raft-baselines/cc04aee9c8a8cbfad431cce044abe76993bfb0f7/src/__init__.py -------------------------------------------------------------------------------- 
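Before the classifier sources, a brief illustration of the core trick the GPT-3 baseline in this repo relies on: the API's `top_logprobs` for the first completion token are converted into per-class probabilities by summing `exp(logprob)` over candidate tokens that begin a class label, then renormalizing. The sketch below is not part of the repository and all names in it are illustrative; it also approximates the token-id comparison done in `gpt3_classifier.py` with a simple string-prefix check.

```python
import math

def raw_probabilities(top_logprobs, classes):
    # For each class, accumulate exp(logprob) over every candidate token
    # that could begin that class label. A leading space is expected on
    # the first completion token, mirroring the prompt format. The real
    # classifier compares token ids (to smooth over "Ġ"-style tokenizer
    # differences); this sketch approximates that with a prefix check.
    raw = []
    for clas in classes:
        target = " " + clas
        raw.append(
            sum(
                math.exp(lp)
                for token, lp in top_logprobs.items()
                if token and target.startswith(token)
            )
        )
    return raw

def normalize(raw):
    # Renormalize so the class probabilities sum to 1; fall back to a
    # uniform distribution if no candidate token matched any class.
    total = sum(raw)
    return [p / total for p in raw] if total > 0 else [1.0 / len(raw)] * len(raw)

# Toy top_logprobs for the first completion token, as the API might return them.
top_logprobs = {" comp": math.log(0.6), " no": math.log(0.3), " the": math.log(0.1)}
probs = normalize(raw_probabilities(top_logprobs, ["complaint", "no complaint"]))
```

With the toy logprobs above, the unmatched token " the" is dropped and `probs` renormalizes the remaining mass to roughly [0.667, 0.333].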
/src/raft_baselines/classifiers/__init__.py: -------------------------------------------------------------------------------- 1 | from .random_classifier import RandomClassifier 2 | from .gpt3_classifier import GPT3Classifier 3 | from .transformers_causal_lm_classifier import TransformersCausalLMClassifier 4 | from .naive_bayes_classifier import NaiveBayesClassifier 5 | from .svm_classifier import SVMClassifier 6 | from .adaboost_classifier import AdaBoostClassifier 7 | from .zero_shot_transformers_classifier import TransformersZeroShotPipelineClassifier 8 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/adaboost_classifier.py: -------------------------------------------------------------------------------- 1 | from copy import deepcopy 2 | 3 | from sklearn.ensemble import AdaBoostClassifier as AdaBoost 4 | from sklearn.tree import DecisionTreeClassifier 5 | 6 | from raft_baselines.classifiers.n_grams_classifier import NGramsClassifier 7 | 8 | 9 | class AdaBoostClassifier(NGramsClassifier): 10 | def __init__( 11 | self, training_data, vectorizer_kwargs=None, model_kwargs=None, **kwargs 12 | ): 13 | super().__init__(training_data, vectorizer_kwargs, model_kwargs, **kwargs) 14 | if model_kwargs is None: 15 | model_kwargs = {} 16 | if "max_depth" in model_kwargs: 17 | # Required for sacred 18 | model_kwargs = deepcopy(model_kwargs) 19 | d = model_kwargs.pop("max_depth") 20 | base = DecisionTreeClassifier(max_depth=d) 21 | model_kwargs["base_estimator"] = base 22 | self.classifier = AdaBoost(**model_kwargs) 23 | self.classifier.fit(self.vectorized_training_data, self.training_data["Label"]) 24 | 25 | def _classify(self, vector_input): 26 | return self.classifier.predict_proba(vector_input) 27 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/classifier.py: -------------------------------------------------------------------------------- 1 | 
import datasets 2 | from typing import Callable, List, Mapping, Dict, Optional 3 | from abc import ABC, abstractmethod 4 | 5 | 6 | class Classifier(ABC): 7 | def __init__(self, training_data: datasets.Dataset) -> None: 8 | self.training_data: datasets.Dataset = training_data 9 | 10 | self.class_col: str = "Label" 11 | self.class_label_to_string: Callable[[int], str] = training_data.features[ 12 | "Label" 13 | ].int2str 14 | self.classes: List[str] = list(training_data.features["Label"].names[1:]) 15 | self.input_cols: List[str] = [ 16 | col for col in training_data.features if col not in ("ID", "Label") 17 | ] 18 | 19 | @abstractmethod 20 | def classify( 21 | self, 22 | target: Mapping[str, str], 23 | random_seed: Optional[int] = None, 24 | should_print_prompt: bool = False, 25 | ) -> Dict[str, float]: 26 | """ 27 | :param target: Dict input with fields and natural language data within those fields. 28 | :return: Dict where the keys are class names and the values are probabilities. 29 | """ 30 | raise NotImplementedError 31 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/gpt3_classifier.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, Optional, List, Mapping 2 | 3 | import numpy as np 4 | import datasets 5 | 6 | from raft_baselines.classifiers.in_context_classifier import InContextClassifier 7 | from raft_baselines.utils.gpt3_utils import ( 8 | complete, 9 | search, 10 | ) 11 | from raft_baselines.utils.tokenizers import TransformersTokenizer 12 | 13 | GPT3_MAX_TOKENS = 2048 14 | tokenizer = TransformersTokenizer("gpt2") 15 | 16 | 17 | class GPT3Classifier(InContextClassifier): 18 | def __init__( 19 | self, 20 | *args, 21 | engine: str = "ada", 22 | search_engine: str = "ada", 23 | **kwargs, 24 | ) -> None: 25 | super().__init__( 26 | *args, 27 | tokenizer=tokenizer, 28 | max_tokens=GPT3_MAX_TOKENS, 29 | **kwargs, 30 | ) 31 | 32 | 
self.engine: str = engine 33 | self.search_engine: str = search_engine 34 | 35 | def semantically_select_training_examples( 36 | self, target: Mapping[str, str] 37 | ) -> datasets.Dataset: 38 | formatted_examples_without_labels = tuple( 39 | self.format_dict( 40 | {col: row[col] for col in self.input_cols if col in row}, 41 | ) 42 | for row in self.training_data 43 | ) 44 | 45 | search_results = search( 46 | formatted_examples_without_labels, 47 | self.format_dict(target), 48 | self.search_engine, 49 | ) 50 | 51 | sorted_indices = list( 52 | map( 53 | lambda result: result["document"], # type: ignore 54 | sorted( 55 | search_results, 56 | key=lambda result: -result["score"], # type: ignore 57 | ), 58 | ) 59 | ) 60 | 61 | return self.training_data.select( 62 | list(reversed(sorted_indices[: self.num_prompt_training_examples])) 63 | ) 64 | 65 | def does_token_match_class(self, token: str, clas: str) -> bool: 66 | # prepend a space to the class label 67 | # because we always expect a leading space in the first token 68 | # returned from the OpenAI API, given our prompt format 69 | clas_str = ( 70 | f" {clas}" if not self.add_prefixes else f" {self.classes.index(clas) + 1}" 71 | ) 72 | 73 | clas_first_token_id: int = self.tokenizer(clas_str)["input_ids"][0] 74 | token_id: int = self.tokenizer(token)["input_ids"][0] 75 | 76 | # Compare token ids rather than the raw tokens 77 | # because GPT2TokenizerFast represents some special characters 78 | # differently from the GPT-3 API 79 | # (e.g. the space at the beginning of the token is " " according to the API, 80 | # but "Ġ" according to the tokenizer). 81 | # Standardizing to token ids is one easy way to smooth over that difference.
82 | return clas_first_token_id == token_id 83 | 84 | def _get_raw_probabilities( 85 | self, 86 | prompt: str, 87 | ) -> List[float]: 88 | response = complete( 89 | prompt, 90 | temperature=0.0, 91 | engine=self.engine, 92 | max_tokens=1, 93 | ) 94 | logprobs: Dict[str, float] = response["choices"][0]["logprobs"]["top_logprobs"][ 95 | 0 96 | ] 97 | 98 | raw_p = [] 99 | for clas in self.classes: 100 | p = 0.0 101 | for token in logprobs.keys(): 102 | if self.does_token_match_class(token, clas): 103 | p += np.exp(logprobs[token]) 104 | raw_p.append(p) 105 | 106 | return raw_p 107 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/in_context_classifier.py: -------------------------------------------------------------------------------- 1 | from abc import abstractmethod 2 | import random 3 | from typing import Dict, Optional, List, Tuple, Mapping, Any 4 | from collections import defaultdict 5 | import json 6 | import importlib.resources 7 | 8 | import numpy as np 9 | import datasets 10 | 11 | from raft_baselines.classifiers.classifier import Classifier 12 | from raft_baselines import data 13 | from raft_baselines.utils.tokenizers import Tokenizer 14 | 15 | text_data = importlib.resources.read_text( 16 | data, "prompt_construction_settings.jsonl" 17 | ).split("\n") 18 | FIELD_ORDERING = json.loads(text_data[0]) 19 | INSTRUCTIONS = json.loads(text_data[1]) 20 | 21 | 22 | class InContextClassifier(Classifier): 23 | separator: str = "\n\n" 24 | 25 | def __init__( 26 | self, 27 | training_data: datasets.Dataset, 28 | num_prompt_training_examples: int = 20, 29 | add_prefixes: bool = False, 30 | config: str = None, 31 | use_task_specific_instructions: bool = True, 32 | do_semantic_selection: bool = True, 33 | tokenizer: Tokenizer = None, 34 | max_tokens: int = 2048, 35 | ) -> None: 36 | super().__init__(training_data) 37 | 38 | self.num_prompt_training_examples: int = num_prompt_training_examples 39 | 
self.add_prefixes: bool = add_prefixes 40 | 41 | if config: 42 | self.config: str = config 43 | self.input_cols: List[str] = FIELD_ORDERING[config] 44 | self.instructions_start: str = "Possible labels:" 45 | if use_task_specific_instructions: 46 | self.instructions_start = ( 47 | INSTRUCTIONS[config] + "\n" + self.instructions_start 48 | ) 49 | 50 | self.do_semantic_selection: bool = do_semantic_selection 51 | 52 | self.tokenizer = tokenizer 53 | self.truncation_params: Mapping[str, Any] = { 54 | # max - buffer - completion tokens 55 | "max_tokens": max_tokens - 10 - 1, 56 | "end_example_token_proportion": max( 57 | 0.25, 58 | 1 59 | / (1 + min(self.num_prompt_training_examples, len(self.training_data))), 60 | ) 61 | if self.num_prompt_training_examples is not None 62 | else 0.25, 63 | } 64 | 65 | @property 66 | def instructions(self) -> str: 67 | formatted_classes = "\n".join( 68 | [f"{idx + 1}. {clas}" for idx, clas in enumerate(self.classes)] 69 | ) 70 | return f"""{self.instructions_start}\n{formatted_classes}""" 71 | 72 | def max_example_lengths( 73 | self, num_training_examples: int, input_to_classify: Mapping[str, str] 74 | ) -> Tuple[int, int]: 75 | instruction_tokens = self.tokenizer.num_tokens(self.instructions) 76 | separator_tokens = (num_training_examples + 1) * len(self.separator) 77 | max_example_tokens = ( 78 | self.truncation_params["max_tokens"] - instruction_tokens - separator_tokens 79 | ) 80 | 81 | untruncated_end_example_tokens = self.tokenizer.num_tokens( 82 | self.format_prompt_end(input_to_classify) 83 | ) 84 | max_end_example_tokens = min( 85 | untruncated_end_example_tokens, 86 | int( 87 | max_example_tokens 88 | * self.truncation_params["end_example_token_proportion"] 89 | ), 90 | ) 91 | max_train_example_tokens = ( 92 | int((max_example_tokens - max_end_example_tokens) / num_training_examples) 93 | if num_training_examples > 0 94 | else 0 95 | ) 96 | 97 | return max_end_example_tokens, max_train_example_tokens 98 | 99 | @classmethod 100 
| def format_dict(cls, example: Mapping[str, str]) -> str: 101 | return "\n".join( 102 | [f"{k}: {v}" for k, v in example.items() if len(str(v).strip())] 103 | ) 104 | 105 | def format_prompt_end( 106 | self, target: Mapping[str, str], max_tokens: Optional[int] = None 107 | ) -> str: 108 | output_block = f"{self.class_col}:" 109 | output_block_tokens = self.tokenizer.num_tokens(output_block) 110 | untruncated_text = self.format_dict(target) 111 | input_block = ( 112 | untruncated_text 113 | if max_tokens is None 114 | else self.tokenizer.truncate_by_tokens( 115 | untruncated_text, max_tokens - output_block_tokens - 1 116 | ) 117 | ) 118 | return f"""{input_block} 119 | {output_block}""" 120 | 121 | def format_example( 122 | self, example: Mapping[str, str], clas: str, max_tokens: Optional[int] = None 123 | ) -> str: 124 | clas_str = ( 125 | clas if not self.add_prefixes else f"{self.classes.index(clas) + 1}. {clas}" 126 | ) 127 | output_block = f"{self.class_col}: {clas_str}" 128 | output_block = ( 129 | output_block 130 | if max_tokens is None 131 | else self.tokenizer.truncate_by_tokens(output_block, max_tokens - 2) 132 | ) 133 | output_block_tokens = self.tokenizer.num_tokens(output_block) 134 | untruncated_text = self.format_dict(example) 135 | input_block = ( 136 | untruncated_text 137 | if max_tokens is None 138 | else self.tokenizer.truncate_by_tokens( 139 | untruncated_text, max_tokens - output_block_tokens - 1 140 | ) 141 | ) 142 | return f"""{input_block} 143 | {output_block}""" 144 | 145 | def render_examples( 146 | self, 147 | example_dataset: datasets.Dataset, 148 | max_tokens_per_example: Optional[int] = None, 149 | ) -> str: 150 | formatted_examples = [ 151 | self.format_example( 152 | {col: row[col] for col in self.input_cols if col in row}, 153 | self.class_label_to_string(row[self.class_col]), 154 | max_tokens=max_tokens_per_example, 155 | ) 156 | for row in example_dataset 157 | ] 158 | return self.separator.join(formatted_examples) 159 | 160 | 
@abstractmethod 161 | def semantically_select_training_examples( 162 | self, target: Mapping[str, str] 163 | ) -> datasets.Dataset: 164 | ... 165 | 166 | def select_training_examples( 167 | self, target: Mapping[str, str], random_seed: Optional[int] = None 168 | ) -> datasets.Dataset: 169 | # handle edge case where target is blank (all the fields we selected are empty) 170 | if not self.do_semantic_selection or not self.format_dict(target): 171 | random.seed(random_seed) 172 | 173 | n_ex = self.num_prompt_training_examples 174 | if n_ex is None or len(self.training_data) <= n_ex: 175 | return self.training_data 176 | 177 | uniques = defaultdict(lambda: []) 178 | for i, row in enumerate(self.training_data): 179 | uniques[row["Label"]].append(i) 180 | 181 | indices = [] 182 | for key in uniques: 183 | indices.append(random.choice(uniques[key])) 184 | random.shuffle(indices) 185 | 186 | remaining_indices = [ 187 | i for i in range(len(self.training_data)) if i not in indices 188 | ] 189 | indices += random.sample( 190 | remaining_indices, min(n_ex, len(remaining_indices)) 191 | ) 192 | 193 | return self.training_data.select(indices[:n_ex]) 194 | else: 195 | return self.semantically_select_training_examples(target) 196 | 197 | def format_prompt( 198 | self, 199 | target: Mapping[str, str], 200 | example_dataset: Optional[datasets.Dataset] = None, 201 | ) -> str: 202 | if self.truncation_params is None: 203 | raise ValueError("No truncation strategy provided.") 204 | 205 | num_examples = len(example_dataset) if example_dataset else 0 206 | max_end_example_tokens, max_train_example_tokens = self.max_example_lengths( 207 | num_examples, target 208 | ) 209 | example_str = ( 210 | self.render_examples( 211 | example_dataset, max_tokens_per_example=max_train_example_tokens 212 | ) 213 | if example_dataset 214 | else "" 215 | ) 216 | example_str_and_sep = "" if example_str == "" else example_str + self.separator 217 | 218 | prompt = f"""{self.instructions + self.separator if 
self.instructions != "" else ""}{example_str_and_sep}{self.format_prompt_end(target, max_tokens=max_end_example_tokens)}""" # noqa: E501 219 | return prompt 220 | 221 | @abstractmethod 222 | def _get_raw_probabilities( 223 | self, 224 | prompt: str, 225 | ) -> List[float]: 226 | ... 227 | 228 | def _classify_prompt( 229 | self, 230 | prompt: str, 231 | ) -> Dict[str, float]: 232 | raw_p = self._get_raw_probabilities(prompt) 233 | sum_p = np.sum(raw_p) 234 | if sum_p > 0: 235 | normalized_p = np.array(raw_p) / sum_p 236 | else: 237 | normalized_p = np.full(len(self.classes), 1 / len(self.classes)) 238 | class_probs = {} 239 | for i, clas in enumerate(self.classes): 240 | class_probs[clas] = normalized_p[i] 241 | return class_probs 242 | 243 | def classify( 244 | self, 245 | target: Mapping[str, str], 246 | random_seed: Optional[int] = None, 247 | should_print_prompt: bool = False, 248 | ) -> Dict[str, float]: 249 | ordered_target = {col: target[col] for col in self.input_cols if col in target} 250 | 251 | example_dataset = ( 252 | self.select_training_examples(ordered_target, random_seed=random_seed) 253 | if self.num_prompt_training_examples is None or self.num_prompt_training_examples > 0 254 | else None 255 | ) 256 | 257 | prompt = self.format_prompt(ordered_target, example_dataset) 258 | if should_print_prompt: 259 | print(prompt) 260 | 261 | return self._classify_prompt(prompt) 262 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/n_grams_classifier.py: -------------------------------------------------------------------------------- 1 | from typing import Mapping, Optional, Dict 2 | from abc import abstractmethod 3 | 4 | import datasets 5 | from sklearn.feature_extraction.text import CountVectorizer 6 | 7 | from raft_baselines.classifiers.classifier import Classifier 8 | 9 | 10 | class NGramsClassifier(Classifier): 11 | def __init__( 12 | self, 13 | training_data: datasets.Dataset, 14 | vectorizer_kwargs: Optional[Dict] = None, 15 |
model_kwargs: Dict = None, 16 | **kwargs 17 | ): 18 | super().__init__(training_data) 19 | if vectorizer_kwargs is None: 20 | vectorizer_kwargs = {} 21 | cleaned_text_train = [self.stringify_row(row) for row in self.training_data] 22 | 23 | self.vectorizer = CountVectorizer(**vectorizer_kwargs).fit(cleaned_text_train) 24 | self.vectorized_training_data = self.vectorizer.transform(cleaned_text_train) 25 | self.classifier = None 26 | 27 | def stringify_row(self, row): 28 | return ". ".join( 29 | row[input_col] for input_col in self.input_cols if input_col in row 30 | ) 31 | 32 | @abstractmethod 33 | def _classify(self, vector_input): 34 | return NotImplementedError 35 | 36 | def classify( 37 | self, 38 | target: Mapping[str, str], 39 | random_seed: Optional[int] = None, 40 | should_print_prompt: bool = False, 41 | ) -> Dict[str, float]: 42 | simple_input = self.stringify_row(target) 43 | vector_input = self.vectorizer.transform((simple_input,)) 44 | result = self._classify(vector_input) 45 | return { 46 | self.class_label_to_string(int(cls)): prob 47 | for prob, cls in zip(result[0], self.classifier.classes_) 48 | } 49 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/naive_bayes_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn.naive_bayes import MultinomialNB 2 | 3 | from raft_baselines.classifiers.n_grams_classifier import NGramsClassifier 4 | 5 | 6 | class NaiveBayesClassifier(NGramsClassifier): 7 | def __init__( 8 | self, training_data, vectorizer_kwargs=None, model_kwargs=None, **kwargs 9 | ): 10 | super().__init__(training_data, vectorizer_kwargs, model_kwargs, **kwargs) 11 | if model_kwargs is None: 12 | model_kwargs = {} 13 | self.classifier = MultinomialNB(**model_kwargs) 14 | self.classifier.fit(self.vectorized_training_data, self.training_data["Label"]) 15 | 16 | def _classify(self, vector_input): 17 | return 
self.classifier.predict_proba(vector_input) 18 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/random_classifier.py: -------------------------------------------------------------------------------- 1 | import random 2 | from typing import Mapping, Optional 3 | 4 | import datasets 5 | 6 | from raft_baselines.classifiers.classifier import Classifier 7 | 8 | 9 | class RandomClassifier(Classifier): 10 | def __init__( 11 | self, training_data: datasets.Dataset, seed: int = 4, **kwargs 12 | ) -> None: 13 | super().__init__(training_data) 14 | random.seed(seed) 15 | 16 | def classify( 17 | self, 18 | target: Mapping[str, str], 19 | random_seed: Optional[int] = None, 20 | should_print_prompt: bool = False, 21 | ) -> Mapping[str, float]: 22 | if random_seed is not None: 23 | random.seed(random_seed) 24 | result = {c: 0.0 for c in self.classes} 25 | result[random.choice(self.classes)] = 1.0 26 | 27 | return result 28 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/svm_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn.svm import LinearSVC 2 | from scipy.special import softmax 3 | import numpy as np 4 | 5 | from raft_baselines.classifiers.n_grams_classifier import NGramsClassifier 6 | 7 | 8 | class DummyClassifier: 9 | def __init__(self, label): 10 | self.classes_ = [label] 11 | 12 | def decision_function(self, vector_input): 13 | return np.array([1]) 14 | 15 | 16 | class SVMClassifier(NGramsClassifier): 17 | def __init__(self, training_data, vectorizer_kwargs, model_kwargs, **kwargs): 18 | super().__init__(training_data, vectorizer_kwargs, model_kwargs, **kwargs) 19 | if model_kwargs is None: 20 | model_kwargs = {} 21 | # Sometimes breaks if there's only one label in the training data. 22 | # Hack-y solution. 
23 | if len(set(self.training_data["Label"])) == 1: 24 | self.classifier = DummyClassifier(self.training_data["Label"][0]) 25 | return 26 | self.classifier = LinearSVC(**model_kwargs) 27 | self.classifier.fit(self.vectorized_training_data, self.training_data["Label"]) 28 | 29 | def _classify(self, vector_input): 30 | confidences = self.classifier.decision_function(vector_input) 31 | if len(self.classifier.classes_) <= 2: 32 | # Positive score means first class, negative score means second class. 33 | # Appending a 0 ensures that the softmax classifies correctly while still 34 | # ensuring somewhat sensible confidences. Not probabilities though. 35 | confidences = np.append(confidences, 0) 36 | confidences = confidences.reshape(1, 2) 37 | return softmax(confidences) 38 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/transformers_causal_lm_classifier.py: -------------------------------------------------------------------------------- 1 | from typing import List, Mapping 2 | 3 | import datasets 4 | import torch 5 | from transformers import AutoModelForCausalLM 6 | from sentence_transformers import util 7 | 8 | from raft_baselines.classifiers.in_context_classifier import InContextClassifier 9 | from raft_baselines.utils.tokenizers import TransformersTokenizer 10 | from raft_baselines.utils.embedders import SentenceTransformersEmbedder 11 | 12 | 13 | class TransformersCausalLMClassifier(InContextClassifier): 14 | def __init__( 15 | self, 16 | *args, 17 | model_type: str = "distilgpt2", 18 | **kwargs, 19 | ) -> None: 20 | tokenizer = TransformersTokenizer(model_type) 21 | self.device = "cuda" if torch.cuda.is_available() else "cpu" 22 | self.model = AutoModelForCausalLM.from_pretrained(model_type).to(self.device) 23 | self.similarity_embedder = SentenceTransformersEmbedder() 24 | 25 | super().__init__( 26 | *args, 27 | tokenizer=tokenizer, 28 | max_tokens=self.model.config.max_position_embeddings, 29 | 
**kwargs, 30 | ) 31 | 32 | def semantically_select_training_examples( 33 | self, target: Mapping[str, str] 34 | ) -> datasets.Dataset: 35 | formatted_examples_without_labels = tuple( 36 | self.format_dict( 37 | {col: row[col] for col in self.input_cols if col in row}, 38 | ) 39 | for row in self.training_data 40 | ) 41 | formatted_target = self.format_dict(target) 42 | 43 | # adapted from https://towardsdatascience.com/semantic-similarity-using-transformers-8f3cb5bf66d6 44 | target_embedding = self.similarity_embedder(tuple([formatted_target])) 45 | example_embeddings = self.similarity_embedder(formatted_examples_without_labels) 46 | 47 | similarity_scores = util.pytorch_cos_sim(target_embedding, example_embeddings)[ 48 | 0 49 | ] 50 | 51 | sorted_indices = torch.argsort(-similarity_scores.to(self.device)) 52 | return self.training_data.select( 53 | list(reversed(sorted_indices[: self.num_prompt_training_examples])) 54 | ) 55 | 56 | def _get_raw_probabilities( 57 | self, 58 | prompt: str, 59 | ) -> List[float]: 60 | inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device) 61 | 62 | with torch.no_grad(): 63 | output = self.model(**inputs) 64 | 65 | next_token_probs = torch.softmax(output.logits[0][-1], dim=0) 66 | 67 | def get_prob_for_class(clas): 68 | clas_str = ( 69 | f" {clas}" 70 | if not self.add_prefixes 71 | else f" {self.classes.index(clas) + 1}" 72 | ) 73 | 74 | return next_token_probs[self.tokenizer(clas_str)["input_ids"][0]] 75 | 76 | return ( 77 | torch.stack([get_prob_for_class(clas) for clas in self.classes]) 78 | .cpu() 79 | .detach() 80 | .numpy() 81 | ) 82 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/zero_shot_transformers_classifier.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | from typing import Mapping, Dict, Optional, List 3 | from transformers import pipeline 4 | import importlib.resources 5 | import 
json 6 | import torch 7 | 8 | from raft_baselines.classifiers.classifier import Classifier 9 | from raft_baselines import data 10 | 11 | text_data = importlib.resources.read_text( 12 | data, "prompt_construction_settings.jsonl" 13 | ).split("\n") 14 | FIELD_ORDERING = json.loads(text_data[0]) 15 | 16 | 17 | class TransformersZeroShotPipelineClassifier(Classifier): 18 | def __init__( 19 | self, training_data: datasets.Dataset, config: str = None, **kwargs 20 | ) -> None: 21 | self.device = 0 if torch.cuda.is_available() else -1 22 | self.clf = pipeline("zero-shot-classification", device=self.device) 23 | 24 | if config: 25 | self.config: str = config 26 | self.input_cols: List[str] = FIELD_ORDERING[config] 27 | 28 | super().__init__(training_data) 29 | 30 | @classmethod 31 | def format_dict(cls, example: Mapping[str, str]) -> str: 32 | return "\n".join( 33 | [f"{k}: {v}" for k, v in example.items() if len(str(v).strip())] 34 | ) 35 | 36 | def classify( 37 | self, 38 | target: Mapping[str, str], 39 | random_seed: Optional[int] = None, 40 | should_print_prompt: bool = False, 41 | ) -> Dict[str, float]: 42 | """ 43 | :param target: Dict input with fields and natural language data within those fields. 44 | :return: Dict where the keys are class names and the values are probabilities. 
45 | """ 46 | ordered_target = {col: target[col] for col in self.input_cols if col in target} 47 | target_str = self.format_dict(ordered_target) 48 | 49 | output = self.clf(target_str, candidate_labels=self.classes) 50 | return {clas: score for clas, score in zip(output["labels"], output["scores"])} 51 | -------------------------------------------------------------------------------- /src/raft_baselines/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oughtinc/raft-baselines/cc04aee9c8a8cbfad431cce044abe76993bfb0f7/src/raft_baselines/data/__init__.py -------------------------------------------------------------------------------- /src/raft_baselines/data/example_predictions.csv: -------------------------------------------------------------------------------- 1 | ID,Label 2 | 50,not ADE-related 3 | 51,not ADE-related 4 | 52,not ADE-related 5 | 53,not ADE-related 6 | 54,not ADE-related 7 | 55,not ADE-related 8 | 56,not ADE-related 9 | 57,not ADE-related 10 | 58,not ADE-related 11 | 59,not ADE-related 12 | -------------------------------------------------------------------------------- /src/raft_baselines/data/prompt_construction_settings.jsonl: -------------------------------------------------------------------------------- 1 | {"ade_corpus_v2": ["Sentence"], "banking_77": ["Query"], "terms_of_service": ["Sentence"], "tai_safety_research": ["Title", "Abstract Note", "Publication Title", "Item Type", "Publication Year"], "neurips_impact_statement_risks": ["Impact statement", "Paper title"], "overruling": ["Sentence"], "systematic_review_inclusion": ["Title", "Abstract", "Journal"], "one_stop_english": ["Article"], "tweet_eval_hate": ["Tweet"], "twitter_complaints": ["Tweet text"], "semiconductor_org_types": ["Organization name", "Paper title"]} 2 | {"ade_corpus_v2": "Label the sentence based on whether it is related to an adverse drug effect (ADE). 
Details are described below:\nDrugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).\nAdverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.", "banking_77": "The following is a banking customer service query. Classify the query into one of the 77 categories available.", "terms_of_service": "Label the sentence from a Terms of Service based on whether it is potentially unfair. If it seems clearly unfair, mark it as potentially unfair.\nAccording to art. 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts, a contractual term is unfair if: 1) it has not been individually negotiated; and 2) contrary to the requirement of good faith, it causes a significant imbalance in the parties rights and obligations, to the detriment of the consumer. \nDetails on types of potentially unfair clauses are found below:\nThe jurisdiction clause stipulates what courts will have the competence to adjudicate disputes under the contract. Jurisdiction clauses giving consumers a right to bring disputes in their place of residence were marked as clearly fair, whereas clauses stating that any judicial proceeding takes a residence away were marked as clearly unfair.\nThe choice of law clause specifies what law will govern the contract, meaning also what law will be applied in potential adjudication of a dispute arising under the contract. Clauses defining the applicable law as the law of the consumer's country of residence were marked as clearly fair. 
In every other case, the choice of law clause was considered as potentially unfair.\nThe limitation of liability clause stipulates that the duty to pay damages is limited or excluded, for certain kind of losses, under certain conditions. Clauses that explicitly affirm non-excludable providers' liabilities were marked as clearly fair. Clauses that reduce, limit, or exclude the liability of the service provider were marked as potentially unfair when concerning broad categories of losses or causes of them.\nThe unilateral change clause specifies the conditions under which the service provider could amend and modify the terms of service and/or the service itself. Such clause was always considered as potentially unfair.\nThe unilateral termination clause gives provider the right to suspend and/or terminate the service and/or the contract, and sometimes details the circumstances under which the provider claims to have a right to do so.\nThe contract by using clause stipulates that the consumer is bound by the terms of use of a specific service, simply by using the service, without even being required to mark that he or she has read and accepted them. We always marked such clauses as potentially unfair.\nThe content removal gives the provider a right to modify/delete user's content, including in-app purchases, and sometimes specifies the conditions under which the service provider may do so.\nThe arbitration clause requires or allows the parties to resolve their disputes through an arbitration process, before the case could go to court. Clauses stipulating that the arbitration should take place in a state other then the state of consumer's residence or be based on arbiter's discretion were marked as clearly unfair. Clauses defining arbitration as fully optional were marked as clearly fair.", "tai_safety_research": "Transformative AI (TAI) is defined as AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution. 
Label a paper as \"TAI safety research\" if: \n1. The contents of the paper are directly motivated by, and substantively inform, the challenge of ensuring good outcomes for TAI, \n2. There is substantive content on AI safety, not just AI capabilities, \n3. The intended audience is the community of researchers, \n4. It meets a subjective threshold of seriousness/quality, \n5. Peer review is not required.", "neurips_impact_statement_risks": "Label the impact statement based on whether it mentions a harmful application of the research done in the paper. Make sure the statement is sufficient to conclude there are harmful applications of the research being done, not a past risk that this research is solving.", "overruling": "In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. Label the sentence based on whether it is overruling or not.", "systematic_review_inclusion": "Identify whether this paper should be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations. \nIncluded reviews should describe monetary charitable donations, assess any population of participants in any context, and be peer reviewed and written in English. \nThey should not report new data, be non-systematic reviews, consider cause-related marketing or other kinds of prosocial behaviour.", "one_stop_english": "The following is an article sourced from The Guardian newspaper, and rewritten by teachers to suit three levels of adult English as Second Language (ESL) learners: elementary, intermediate, and advanced. Predict the level of the article.", "tweet_eval_hate": "Label whether the following tweet contains hate speech against either immigrants or women. 
Hate Speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics.", "twitter_complaints": "A complaint presents a state of affairs which breaches the writer\u2019s favorable expectation. Label the tweet text based on whether it contains a complaint.", "semiconductor_org_types": "The dataset is a list of institutions that have contributed papers to semiconductor conferences in the last 25 years, as catalogued by IEEE and sampled randomly. The goal is to classify the institutions into one of three categories: \"university\", \"company\" or \"research institute\"."} -------------------------------------------------------------------------------- /src/raft_baselines/scripts/non_neural_experiment.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | from sacred import Experiment, observers 3 | import sklearn.metrics as skm 4 | 5 | from raft_baselines import classifiers 6 | 7 | experiment_name = "non_neural" 8 | raft_experiment = Experiment(experiment_name, save_git_info=False) 9 | observer = observers.FileStorageObserver(f"results/{experiment_name}") 10 | raft_experiment.observers.append(observer) 11 | 12 | 13 | @raft_experiment.config 14 | def base_config(): 15 | classifier_name = "AdaBoostClassifier" 16 | classifier_kwargs = { 17 | "vectorizer_kwargs": { 18 | "strip_accents": 'unicode', 19 | "lowercase": True, 20 | "ngram_range": (1, 5), 21 | "max_df": 1.0, 22 | "min_df": 0.0 23 | }, 24 | "model_kwargs": {} 25 | } 26 | configs = datasets.get_dataset_config_names("ought/raft") 27 | # controls which dimension is tested, out of the 3 reported in the paper 28 | # Other options: do_semantic_selection and num_prompt_training_examples 29 | random_seed = 42 30 | 31 | 32 | @raft_experiment.capture 33 | def load_datasets_train(configs): 34 | 
train_datasets = { 35 | config: datasets.load_dataset("ought/raft", config, split="train") 36 | for config in configs 37 | } 38 | return train_datasets 39 | 40 | 41 | @raft_experiment.capture 42 | def test_experiment( 43 | train_datasets, classifier_name, 44 | classifier_kwargs, random_seed 45 | ): 46 | classifier_cls = getattr(classifiers, classifier_name) 47 | 48 | for config in train_datasets: 49 | dataset = train_datasets[config] 50 | labels = list(range(1, dataset.features["Label"].num_classes)) 51 | predictions = [] 52 | 53 | for i in range(len(dataset)): 54 | train = dataset.select([j for j in range(len(dataset)) if j != i]) 55 | test = dataset.select([i]) 56 | 57 | # Non-neural classifiers (i.e. non-GPT style) should be explicitly trained 58 | # (mostly, this is to allow two separate kwargs arguments) 59 | classifier = classifier_cls(train, **classifier_kwargs) 60 | 61 | def predict(example): 62 | del example["Label"] 63 | del example["ID"] 64 | output_probs = classifier.classify(example, random_seed=random_seed) 65 | output = max(output_probs.items(), key=lambda kv_pair: kv_pair[1]) 66 | 67 | predictions.append(dataset.features["Label"].str2int(output[0])) 68 | 69 | test.map(predict) 70 | 71 | # accuracy = sum([p == l for p, l in zip(predictions, dataset['Label'])]) / 50 72 | f1 = skm.f1_score( 73 | dataset["Label"], predictions, labels=labels, average="macro" 74 | ) 75 | print(f"Dataset - {config}; {f1}") 76 | raft_experiment.log_scalar(f"{config}", f1) 77 | 78 | 79 | @raft_experiment.automain 80 | def main(): 81 | train = load_datasets_train() 82 | test_experiment(train) 83 | -------------------------------------------------------------------------------- /src/raft_baselines/scripts/raft_predict.py: -------------------------------------------------------------------------------- 1 | import os 2 | import shutil 3 | import csv 4 | 5 | import datasets 6 | from sacred import Experiment, observers 7 | 8 | from raft_baselines import classifiers 9 | 10 | """ 11 
| This script runs a classifier specified by `classifier_name` on the unlabeled 12 | test sets for all configs given in `configs`. Any classifier can be used, 13 | but must accept a hf.datasets.Dataset as an argument. Any other keyword 14 | arguments must be specified via `classifier_kwargs`. 15 | """ 16 | 17 | experiment_name = "make_predictions" 18 | raft_experiment = Experiment(experiment_name, save_git_info=False) 19 | observer = observers.FileStorageObserver(f"results/{experiment_name}") 20 | raft_experiment.observers.append(observer) 21 | 22 | # Best performing on a per-dataset basis using raft_train_experiment.py 23 | NUM_EXAMPLES = { 24 | "ade_corpus_v2": 25, 25 | "banking_77": 10, 26 | "terms_of_service": 5, 27 | "tai_safety_research": 5, 28 | "neurips_impact_statement_risks": 25, 29 | "overruling": 25, 30 | "systematic_review_inclusion": 10, 31 | "one_stop_english": 5, 32 | "tweet_eval_hate": 50, 33 | "twitter_complaints": 25, 34 | "semiconductor_org_types": 5, 35 | } 36 | 37 | 38 | @raft_experiment.config 39 | def base_config(): 40 | classifier_name = "GPT3Classifier" 41 | classifier_kwargs = { 42 | # change to davinci to replicate results from the paper 43 | # "engine": "ada", 44 | } 45 | if classifier_name in ('NaiveBayesClassifier', 'SVMClassifier', 'AdaBoostClassifier'): 46 | classifier_kwargs = { 47 | "vectorizer_kwargs": { 48 | "strip_accents": 'unicode', 49 | "lowercase": True, 50 | "ngram_range": (1, 5), 51 | "max_df": 1.0, 52 | "min_df": 0.0 53 | }, 54 | "model_kwargs": {} 55 | } 56 | if classifier_name == "NaiveBayesClassifier": 57 | classifier_kwargs['model_kwargs']['alpha'] = 0.05 58 | elif classifier_name == "AdaBoostClassifier": 59 | classifier_kwargs['model_kwargs']['max_depth'] = 3 60 | classifier_kwargs['model_kwargs']['n_estimators'] = 100 61 | 62 | configs = datasets.get_dataset_config_names("ought/raft") 63 | # set n_test to -1 to run on all test examples 64 | n_test = 5 65 | random_seed = 42 66 | zero_shot = False 67 | 68 | 69 | 
@raft_experiment.capture 70 | def load_datasets_train(configs): 71 | train_datasets = { 72 | config: datasets.load_dataset("ought/raft", config, split="train") 73 | for config in configs 74 | } 75 | test_datasets = { 76 | config: datasets.load_dataset("ought/raft", config, split="test") 77 | for config in configs 78 | } 79 | 80 | return train_datasets, test_datasets 81 | 82 | 83 | @raft_experiment.capture 84 | def make_extra_kwargs(config: str, zero_shot: bool): 85 | extra_kwargs = { 86 | "config": config, 87 | "num_prompt_training_examples": NUM_EXAMPLES[config] if not zero_shot else 0, 88 | } 89 | if config == "banking_77": 90 | extra_kwargs["add_prefixes"] = True 91 | return extra_kwargs 92 | 93 | 94 | @raft_experiment.capture 95 | def make_predictions( 96 | train_dataset, 97 | test_dataset, 98 | classifier_cls, 99 | extra_kwargs, 100 | n_test, 101 | classifier_kwargs, 102 | random_seed, 103 | ): 104 | classifier = classifier_cls(train_dataset, **classifier_kwargs, **extra_kwargs) 105 | 106 | if n_test > -1: 107 | test_dataset = test_dataset.select(range(n_test)) 108 | 109 | def predict(example): 110 | del example["Label"] 111 | output_probs = classifier.classify(example, random_seed=random_seed) 112 | output = max(output_probs.items(), key=lambda kv_pair: kv_pair[1]) 113 | 114 | example["Label"] = train_dataset.features["Label"].str2int(output[0]) 115 | return example 116 | 117 | return test_dataset.map(predict, load_from_cache_file=False) 118 | 119 | 120 | def log_text(text, dirname, filename): 121 | targetdir = os.path.join(observer.dir, dirname) 122 | if not os.path.isdir(targetdir): 123 | os.mkdir(targetdir) 124 | 125 | with open(os.path.join(targetdir, filename), "w") as f: 126 | f.write(text) 127 | 128 | 129 | def prepare_predictions_folder(): 130 | sacred_dir = os.path.join(observer.dir, "predictions") 131 | if os.path.isdir(sacred_dir): 132 | shutil.rmtree(sacred_dir) 133 | os.mkdir(sacred_dir) 134 | 135 | 136 | def write_predictions(labeled, config): 
137 | int2str = labeled.features["Label"].int2str 138 | 139 | config_dir = os.path.join(observer.dir, "predictions", config) 140 | if os.path.isdir(config_dir): 141 | shutil.rmtree(config_dir) 142 | os.mkdir(config_dir) 143 | 144 | pred_file = os.path.join(config_dir, "predictions.csv") 145 | 146 | with open(pred_file, "w", newline="") as f: 147 | writer = csv.writer( 148 | f, 149 | quotechar='"', 150 | delimiter=",", 151 | quoting=csv.QUOTE_MINIMAL, 152 | skipinitialspace=True, 153 | ) 154 | writer.writerow(["ID", "Label"]) 155 | for row in labeled: 156 | writer.writerow([row["ID"], int2str(row["Label"])]) 157 | 158 | 159 | @raft_experiment.automain 160 | def main(classifier_name): 161 | train, unlabeled = load_datasets_train() 162 | prepare_predictions_folder() 163 | 164 | classifier_cls = getattr(classifiers, classifier_name) 165 | 166 | for config in unlabeled: 167 | extra_kwargs = make_extra_kwargs(config) 168 | labeled = make_predictions( 169 | train[config], unlabeled[config], classifier_cls, extra_kwargs 170 | ) 171 | write_predictions(labeled, config) 172 | -------------------------------------------------------------------------------- /src/raft_baselines/scripts/raft_train_experiment.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | from sacred import Experiment, observers 3 | import sklearn.metrics as skm 4 | 5 | from raft_baselines import classifiers 6 | 7 | experiment_name = "loo_tuning" 8 | raft_experiment = Experiment(experiment_name, save_git_info=False) 9 | observer = observers.FileStorageObserver(f"results/{experiment_name}") 10 | raft_experiment.observers.append(observer) 11 | 12 | 13 | @raft_experiment.config 14 | def base_config(): 15 | classifier_name = "GPT3Classifier" 16 | classifier_kwargs = { 17 | # change to davinci to replicate results from the paper 18 | "engine": "ada", 19 | } 20 | configs = datasets.get_dataset_config_names("ought/raft") 21 | # controls which dimension is 
tested, out of the 3 reported in the paper 22 | # Other options: do_semantic_selection and num_prompt_training_examples 23 | test_dimension = "use_task_specific_instructions" 24 | random_seed = 42 25 | 26 | 27 | @raft_experiment.capture 28 | def load_datasets_train(configs): 29 | train_datasets = { 30 | config: datasets.load_dataset("ought/raft", config, split="train") 31 | for config in configs 32 | } 33 | return train_datasets 34 | 35 | 36 | @raft_experiment.capture 37 | def loo_test( 38 | train_datasets, classifier_name, classifier_kwargs, test_dimension, random_seed 39 | ): 40 | # Change what to iterate over, filling in extra_kwargs to test different 41 | # configurations of the classifier. 42 | 43 | if test_dimension == "use_task_specific_instructions": 44 | dim_values = [False, True] 45 | other_dim_kwargs = { 46 | "do_semantic_selection": False, 47 | "num_prompt_training_examples": 20, 48 | } 49 | elif test_dimension == "do_semantic_selection": 50 | dim_values = [False, True] 51 | other_dim_kwargs = { 52 | "use_task_specific_instructions": True, 53 | "num_prompt_training_examples": 20, 54 | } 55 | elif test_dimension == "num_prompt_training_examples": 56 | dim_values = [5, 10, 25, 49] 57 | other_dim_kwargs = { 58 | "use_task_specific_instructions": True, 59 | "do_semantic_selection": True, 60 | } 61 | else: 62 | raise ValueError(f"test_dimension {test_dimension} not recognized") 63 | 64 | classifier_cls = getattr(classifiers, classifier_name) 65 | 66 | for config in train_datasets: 67 | for dim_value in dim_values: 68 | dataset = train_datasets[config] 69 | labels = list(range(1, dataset.features["Label"].num_classes)) 70 | predictions = [] 71 | 72 | extra_kwargs = { 73 | "config": config, 74 | test_dimension: dim_value, 75 | **other_dim_kwargs, 76 | } 77 | if config == "banking_77": 78 | extra_kwargs["add_prefixes"] = True 79 | 80 | for i in range(len(dataset)): 81 | train = dataset.select([j for j in range(len(dataset)) if j != i]) 82 | test = 
dataset.select([i]) 83 | 84 | classifier = classifier_cls(train, **classifier_kwargs, **extra_kwargs) 85 | 86 | def predict(example): 87 | del example["Label"] 88 | del example["ID"] 89 | output_probs = classifier.classify(example, random_seed=random_seed) 90 | output = max(output_probs.items(), key=lambda kv_pair: kv_pair[1]) 91 | 92 | predictions.append(dataset.features["Label"].str2int(output[0])) 93 | 94 | test.map(predict) 95 | 96 | # accuracy = sum([p == l for p, l in zip(predictions, dataset['Label'])]) / 50 97 | f1 = skm.f1_score( 98 | dataset["Label"], predictions, labels=labels, average="macro" 99 | ) 100 | print(f"Dataset - {config}; {test_dimension} - {dim_value}: {f1}") 101 | raft_experiment.log_scalar(f"{config}.{dim_value}", f1) 102 | 103 | 104 | @raft_experiment.automain 105 | def main(): 106 | train = load_datasets_train() 107 | loo_test(train) 108 | -------------------------------------------------------------------------------- /src/raft_baselines/scripts/starter_kit.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1TQtHG-Wf2CgYGSD9e7_uJWIdiK5HNniV)" 7 | ], 8 | "metadata": {} 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "source": [ 13 | "## Getting started with the RAFT benchmark\n", 14 | "\n", 15 | "In this notebook, we will walk through:\n", 16 | "\n", 17 | "1. Loading the tasks from the [RAFT dataset](https://huggingface.co/datasets/ought/raft)\n", 18 | "2. Creating a classifier using any CausalLM from the [Hugging Face Hub](https://huggingface.co/models)\n", 19 | "3. Generating predictions using that classifier for RAFT test examples\n", 20 | "\n", 21 | "This should provide you with the steps needed to make a submission to the [RAFT leaderboard](https://huggingface.co/spaces/ought/raft-leaderboard)!" 
22 | ], 23 | "metadata": {} 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "source": [ 29 | "import datasets\n", 30 | "\n", 31 | "datasets.logging.set_verbosity_error()" 32 | ], 33 | "outputs": [], 34 | "metadata": {} 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "source": [ 39 | "## Loading RAFT datasets\n" 40 | ], 41 | "metadata": {} 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "source": [ 46 | "We'll focus on the ADE corpus V2 task in this starter kit, but similar code could be run for all of the tasks in RAFT. To see the possible tasks, we can use the following function from `datasets`:" 47 | ], 48 | "metadata": {} 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "source": [ 54 | "from datasets import get_dataset_config_names\n", 55 | "\n", 56 | "RAFT_TASKS = get_dataset_config_names(\"ought/raft\")\n", 57 | "RAFT_TASKS" 58 | ], 59 | "outputs": [ 60 | { 61 | "output_type": "execute_result", 62 | "data": { 63 | "text/plain": [ 64 | "['ade_corpus_v2',\n", 65 | " 'banking_77',\n", 66 | " 'terms_of_service',\n", 67 | " 'tai_safety_research',\n", 68 | " 'neurips_impact_statement_risks',\n", 69 | " 'overruling',\n", 70 | " 'systematic_review_inclusion',\n", 71 | " 'one_stop_english',\n", 72 | " 'tweet_eval_hate',\n", 73 | " 'twitter_complaints',\n", 74 | " 'semiconductor_org_types']" 75 | ] 76 | }, 77 | "metadata": {}, 78 | "execution_count": 2 79 | } 80 | ], 81 | "metadata": {} 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "source": [ 86 | "Each task in RAFT consists of a training set of only **_50 labeled examples_** and an unlabeled test set. All labels have a textual version associated with them. 
Let's load the corpus associated with the `ade_corpus_v2` task:" 87 | ], 88 | "metadata": {} 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "source": [ 94 | "from datasets import load_dataset\n", 95 | "\n", 96 | "TASK = \"ade_corpus_v2\"\n", 97 | "raft_dataset = load_dataset(\"ought/raft\", name=TASK)\n", 98 | "raft_dataset" 99 | ], 100 | "outputs": [ 101 | { 102 | "output_type": "execute_result", 103 | "data": { 104 | "text/plain": [ 105 | "DatasetDict({\n", 106 | " train: Dataset({\n", 107 | " features: ['Sentence', 'ID', 'Label'],\n", 108 | " num_rows: 50\n", 109 | " })\n", 110 | " test: Dataset({\n", 111 | " features: ['Sentence', 'ID', 'Label'],\n", 112 | " num_rows: 5000\n", 113 | " })\n", 114 | "})" 115 | ] 116 | }, 117 | "metadata": {}, 118 | "execution_count": 3 119 | } 120 | ], 121 | "metadata": {} 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "source": [ 126 | "The `raft_dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training and test sets. In this task we can see we have 50 labelled examples to work with and 5,000 examples in the test set that we need to generate predictions for. To access an example, you need to specify the name of the split and then the index as follows:" 127 | ], 128 | "metadata": {} 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 4, 133 | "source": [ 134 | "raft_dataset[\"train\"][0]" 135 | ], 136 | "outputs": [ 137 | { 138 | "output_type": "execute_result", 139 | "data": { 140 | "text/plain": [ 141 | "{'Sentence': 'No regional side effects were noted.', 'ID': 0, 'Label': 2}" 142 | ] 143 | }, 144 | "metadata": {}, 145 | "execution_count": 4 146 | } 147 | ], 148 | "metadata": {} 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "source": [ 153 | "Here we can see that each example is assigned a label ID which denotes the class in this particular task. 
Let's check how many classes we have in the training set:" 154 | ], 155 | "metadata": {} 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 5, 160 | "source": [ 161 | "label_ids = raft_dataset[\"train\"].unique(\"Label\")\n", 162 | "label_ids" 163 | ], 164 | "outputs": [ 165 | { 166 | "output_type": "execute_result", 167 | "data": { 168 | "text/plain": [ 169 | "[2, 1]" 170 | ] 171 | }, 172 | "metadata": {}, 173 | "execution_count": 5 174 | } 175 | ], 176 | "metadata": {} 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "source": [ 181 | "Okay, this indicates that `ade_corpus_v2` is a binary classification task and we can extract the human-readable label names as follows:" 182 | ], 183 | "metadata": {} 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 6, 188 | "source": [ 189 | "features = raft_dataset[\"train\"].features[\"Label\"]\n", 190 | "id2label = {idx : features.int2str(idx) for idx in label_ids}\n", 191 | "id2label" 192 | ], 193 | "outputs": [ 194 | { 195 | "output_type": "execute_result", 196 | "data": { 197 | "text/plain": [ 198 | "{2: 'not ADE-related', 1: 'ADE-related'}" 199 | ] 200 | }, 201 | "metadata": {}, 202 | "execution_count": 6 203 | } 204 | ], 205 | "metadata": {} 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "source": [ 210 | "Note that the test set also has a `Label` entry, but it is zero to denote a dummy label (this is what your model needs to predict!):" 211 | ], 212 | "metadata": {} 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 7, 217 | "source": [ 218 | "raft_dataset[\"test\"].unique(\"Label\")" 219 | ], 220 | "outputs": [ 221 | { 222 | "output_type": "execute_result", 223 | "data": { 224 | "text/plain": [ 225 | "[0]" 226 | ] 227 | }, 228 | "metadata": {}, 229 | "execution_count": 7 230 | } 231 | ], 232 | "metadata": {} 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "source": [ 237 | "To get a broader sense of what kind of data we are dealing with, we can use 
the following function to randomly sample from the corpus and display the results as a table:" 238 | ], 239 | "metadata": {} 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 8, 244 | "source": [ 245 | "import random\n", 246 | "import pandas as pd\n", 247 | "from IPython.display import display, HTML\n", 248 | "\n", 249 | "def show_random_elements(dataset, num_examples=10):\n", 250 | " assert num_examples <= len(dataset), \"Can't pick more elements than there are in the dataset.\"\n", 251 | " picks = []\n", 252 | " for _ in range(num_examples):\n", 253 | " pick = random.randint(0, len(dataset)-1)\n", 254 | " while pick in picks:\n", 255 | " pick = random.randint(0, len(dataset)-1)\n", 256 | " picks.append(pick)\n", 257 | " \n", 258 | " df = pd.DataFrame(dataset[picks])\n", 259 | " for column, typ in dataset.features.items():\n", 260 | " if isinstance(typ, datasets.ClassLabel):\n", 261 | " df[column] = df[column].transform(lambda i: typ.names[i])\n", 262 | " display(HTML(df.to_html()))\n", 263 | " \n", 264 | "show_random_elements(raft_dataset[\"train\"])" 265 | ], 266 | "outputs": [ 267 | { 268 | "output_type": "display_data", 269 | "data": { 270 | "text/plain": [ 271 | "" 272 | ], 273 | "text/html": [ 274 | "\n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " 
\n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | "
|   | Sentence | ID | Label |
|---|----------|----|-------|
| 0 | CT-scan disclosed right ethmoid sinusitis that spread to the orbit after surgery. | 22 | not ADE-related |
| 1 | IMPLICATIONS: Dexmedetomidine, an alpha(2)-adrenoceptor agonist, is indicated for sedating patients on mechanical ventilation. | 43 | not ADE-related |
| 2 | The INR should be monitored more frequently when bosentan is initiated, adjusted, or discontinued in patients taking warfarin. | 2 | not ADE-related |
| 3 | Remarkable findings on initial examination were facial grimacing, flexure posturing of both upper extremities, and 7-mm, reactive pupils. | 44 | not ADE-related |
| 4 | CONCLUSION: Pancreatic enzyme intolerance, although rare, would be a major problem in the management of patients with CF. | 6 | not ADE-related |
| 5 | After the first oral dose of propranolol, syncope developed together with atrioventricular block. | 3 | ADE-related |
| 6 | Acute promyelocytic leukemia after living donor partial orthotopic liver transplantation in two Japanese girls. | 45 | not ADE-related |
| 7 | The patient had no skin reactions for the next 12 mo, with the exception of injection-site papules. | 18 | not ADE-related |
| 8 | Sotalol-induced bradycardia reversed by glucagon. | 23 | ADE-related |
| 9 | We describe a patient who developed HUS after treatment with mitomycin C (total dose 144 mg/m2) due to a carcinoma of the ascending colon. | 30 | ADE-related |
" 346 | ] 347 | }, 348 | "metadata": {} 349 | } 350 | ], 351 | "metadata": {} 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "source": [ 356 | "## Creating a classifier from the Hugging Face Model Hub" 357 | ], 358 | "metadata": {} 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "source": [ 363 | "We provide a class which uses the same prompt construction method as our GPT-3 baseline, but works with any CausalLM on the [HuggingFace Model Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads). The classifier will automatically use a GPU if available. Brief documentation on the arguments for configuring the classifier is provided below.:" 364 | ], 365 | "metadata": {} 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 9, 370 | "source": [ 371 | "from raft_baselines.classifiers import TransformersCausalLMClassifier\n", 372 | "\n", 373 | "classifier = TransformersCausalLMClassifier(\n", 374 | " model_type=\"distilgpt2\", # The model to use from the HF hub\n", 375 | " training_data=raft_dataset[\"train\"], # The training data\n", 376 | " num_prompt_training_examples=25, # See raft_predict.py for the number of training examples used on a per-dataset basis in the GPT-3 baselines run.\n", 377 | " # Note that it may be better to use fewer training examples and/or shorter instructions with other models with smaller context windows.\n", 378 | " add_prefixes=(TASK==\"banking_77\"), # Set to True when using banking_77 since multiple classes start with the same token\n", 379 | " config=TASK, # For task-specific instructions and field ordering\n", 380 | " use_task_specific_instructions=True,\n", 381 | " do_semantic_selection=True,\n", 382 | ")" 383 | ], 384 | "outputs": [], 385 | "metadata": {} 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "source": [ 390 | "## Generating predictions for RAFT test examples" 391 | ], 392 | "metadata": {} 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "source": [ 397 | "In order 
to generate predictions on the test set, we need to provide the model with an appropriate prompt containing the task instructions. Let's take a look at how this works on a single example from the test set." 398 | ], 399 | "metadata": {} 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "source": [ 404 | "### Example prompt and prediction" 405 | ], 406 | "metadata": {} 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "source": [ 411 | "The `TransformersCausalLMClassifier` has a `classify` function that will automatically generate the predicted probabilities from the model. We'll set `should_print_prompt=True` so that we can see which prompt is being used to instruct the model:" 412 | ], 413 | "metadata": {} 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 10, 418 | "source": [ 419 | "test_dataset = raft_dataset[\"test\"]\n", 420 | "first_test_example = test_dataset[0]\n", 421 | "\n", 422 | "# delete the 0 Label\n", 423 | "del first_test_example[\"Label\"]\n", 424 | "\n", 425 | "# probabilities for all classes\n", 426 | "output_probs = classifier.classify(first_test_example, should_print_prompt=True)\n", 427 | "output_probs" 428 | ], 429 | "outputs": [ 430 | { 431 | "output_type": "stream", 432 | "name": "stdout", 433 | "text": [ 434 | "Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below:\n", 435 | "Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).\n", 436 | "Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.\n", 437 | "Possible labels:\n", 438 | "1. 
ADE-related\n", 439 | "2. not ADE-related\n", 440 | "\n", 441 | "Sentence: Treatment of silastic catheter-induced central vein septic thrombophlebitis\n", 442 | "Label: not ADE-related\n", 443 | "\n", 444 | "Sentence: We describe a patient who developed HUS after treatment with mitomycin C (total dose 144 mg\n", 445 | "Label: ADE-related\n", 446 | "\n", 447 | "Sentence: In 1991 the patient were found to be seropositive for HCV antibodies as detected by\n", 448 | "Label: not ADE-related\n", 449 | "\n", 450 | "Sentence: METHODS: We identified three patients who developed skin necrosis and determined any factors, which\n", 451 | "Label: not ADE-related\n", 452 | "\n", 453 | "Sentence: These cases were considered unusual in light of the short delay of their onset after initiation of immunosupp\n", 454 | "Label: ADE-related\n", 455 | "\n", 456 | "Sentence: No regional side effects were noted.\n", 457 | "Label: not ADE-related\n", 458 | "\n", 459 | "Sentence: A patient with psoriasis is described who had an abnormal response to the glucose tolerance test without other\n", 460 | "Label: ADE-related\n", 461 | "\n", 462 | "Sentence: CONCLUSIONS: These results suggest that clozapine may cause TD; however, the prevalence\n", 463 | "Label: ADE-related\n", 464 | "\n", 465 | "Sentence: The cases are important in documenting that drug-induced dystonias do occur in patients with dementia,\n", 466 | "Label: ADE-related\n", 467 | "\n", 468 | "Sentence: NEH must be considered in lupus patients receiving cytotoxic agents to avoid inappropriate use\n", 469 | "Label: not ADE-related\n", 470 | "\n", 471 | "Sentence: A closer look at septic shock.\n", 472 | "Label: not ADE-related\n", 473 | "\n", 474 | "Sentence: The mechanism by which sunitinib induces gynaecomastia is thought to be associated\n", 475 | "Label: ADE-related\n", 476 | "\n", 477 | "Sentence: Of the 16 patients, including the 1 reported here, only 3 displayed significant shortening of the\n", 478 | "Label: not 
ADE-related\n", 479 | "\n", 480 | "Sentence: Sotalol-induced bradycardia reversed by glucagon.\n", 481 | "Label: ADE-related\n", 482 | "\n", 483 | "Sentence: CONCLUSION: Pancreatic enzyme intolerance, although rare, would be a major problem in\n", 484 | "Label: not ADE-related\n", 485 | "\n", 486 | "Sentence: Macular infarction after endophthalmitis treated with vitrectomy and intravit\n", 487 | "Label: ADE-related\n", 488 | "\n", 489 | "Sentence: MRI has a high sensitivity and specificity in the diagnosis of osteonecrosis and should be used\n", 490 | "Label: not ADE-related\n", 491 | "\n", 492 | "Sentence: IMPLICATIONS: Dexmedetomidine, an alpha(2)-adrenoceptor\n", 493 | "Label: not ADE-related\n", 494 | "\n", 495 | "Sentence: The INR should be monitored more frequently when bosentan is initiated, adjusted, or discontinued\n", 496 | "Label: not ADE-related\n", 497 | "\n", 498 | "Sentence: Remarkable findings on initial examination were facial grimacing, flexure posturing of both upper extrem\n", 499 | "Label: not ADE-related\n", 500 | "\n", 501 | "Sentence: Early detection of these cases has practical importance since the identification and elimination of the causative drug is\n", 502 | "Label: not ADE-related\n", 503 | "\n", 504 | "Sentence: This report demonstrates the increased risk of complicated varicella associated with the use of corticoster\n", 505 | "Label: not ADE-related\n", 506 | "\n", 507 | "Sentence: These results indicate that the hyponatremia in this case was due to SIADH\n", 508 | "Label: not ADE-related\n", 509 | "\n", 510 | "Sentence: Best-corrected visual acuity measurements were performed at every visit.\n", 511 | "Label: not ADE-related\n", 512 | "\n", 513 | "Sentence: OBJECTIVE: To describe onset of syndrome of inappropriate antidiuretic hormone (SIADH\n", 514 | "Label: ADE-related\n", 515 | "\n", 516 | "Sentence: CONCLUSIONS: SD-OCT and AO detected abnormalities that correlate topographically with visual field loss from hydroxychloroquine 
toxicity as demonstrated by HVF 10-2 and may be useful in the detection of subclinical abnormalities that precede symptoms or objective visual field loss.\n", 517 | "Label:\n" 518 | ] 519 | }, 520 | { 521 | "output_type": "execute_result", 522 | "data": { 523 | "text/plain": [ 524 | "{'ADE-related': 0.31358153, 'not ADE-related': 0.68641853}" 525 | ] 526 | }, 527 | "metadata": {}, 528 | "execution_count": 10 529 | } 530 | ], 531 | "metadata": {} 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "source": [ 536 | "In this example we can see the model predicts that the example is not related to an adverse drug effect. We can use this technique to generate predictions across the whole test set! Let's take a look." 537 | ], 538 | "metadata": {} 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "source": [ 543 | "### Creating a submission file of predictions" 544 | ], 545 | "metadata": {} 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "source": [ 550 | "To submit to the RAFT leaderboard, you'll need to provide a CSV file of predictions on the test set for each task (see [here](https://huggingface.co/datasets/ought/raft-submission) for detailed instructions). The following code snippet generates a CSV with predictions for the first $N$ test examples in the format required for submission $(ID, Label)$. \n", 551 | "\n", 552 | "Note that this is expected to generate predictions of all \"Not ADE-related\" for the 10 test examples with the code as written; few-shot classification is pretty hard!" 
553 | ], 554 | "metadata": {} 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 11, 559 | "source": [ 560 | "# Increase this to len(test_dataset) to generate predictions over the full test set\n", 561 | "N_TEST = 10\n", 562 | "test_examples_to_predict = test_dataset.select(range(N_TEST))\n", 563 | "\n", 564 | "def predict_one(clf, test_example):\n", 565 | "    del test_example[\"Label\"]\n", 566 | "    output_probs = clf.classify(test_example)\n", 567 | "    output_label = max(output_probs.items(), key=lambda kv_pair: kv_pair[1])[0]\n", 568 | "    return output_label\n", 569 | "\n", 570 | "data = []\n", 571 | "for example in test_examples_to_predict:\n", 572 | "    data.append({\"ID\": example[\"ID\"], \"Label\": predict_one(classifier, example)})\n", 573 | "    \n", 574 | "result_df = pd.DataFrame(data=data, columns=[\"ID\", \"Label\"]).astype({\"ID\": int, \"Label\": str})\n", 575 | "result_df" 576 | ], 577 | "outputs": [], 578 | "metadata": {} 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "source": [ 583 | "Note that the `ID` column starts from index 50 since we have IDs 0-49 in the training set. The final step is to save the DataFrame as a CSV file and build out the rest of your submission:" 584 | ], 585 | "metadata": {} 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": null, 590 | "source": [ 591 | "result_df.to_csv(\"../data/example_predictions.csv\", index=False)" 592 | ], 593 | "outputs": [], 594 | "metadata": {} 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "source": [ 599 | "Good luck with the rest of the benchmark!" 
600 | ], 601 | "metadata": {} 602 | } 603 | ], 604 | "metadata": { 605 | "interpreter": { 606 | "hash": "74118a50156796984ad06a64d88792c5d24753e439e2427f4985fcb9d71e695f" 607 | }, 608 | "kernelspec": { 609 | "name": "python3", 610 | "display_name": "Python 3.8.11 64-bit ('raft-baselines': conda)" 611 | }, 612 | "language_info": { 613 | "codemirror_mode": { 614 | "name": "ipython", 615 | "version": 3 616 | }, 617 | "file_extension": ".py", 618 | "mimetype": "text/x-python", 619 | "name": "python", 620 | "nbconvert_exporter": "python", 621 | "pygments_lexer": "ipython3", 622 | "version": "3.8.11" 623 | } 624 | }, 625 | "nbformat": 4, 626 | "nbformat_minor": 4 627 | } -------------------------------------------------------------------------------- /src/raft_baselines/scripts/test_gpt3.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | 3 | from raft_baselines.classifiers import GPT3Classifier 4 | 5 | train = datasets.load_dataset( 6 | "ought/raft", "neurips_impact_statement_risks", split="train" 7 | ) 8 | classifier = GPT3Classifier( 9 | train, config="neurips_impact_statement_risks", do_semantic_selection=True 10 | ) 11 | print(classifier.classify({"Paper title": "GNN research", "Impact statement": "test2"})) 12 | -------------------------------------------------------------------------------- /src/raft_baselines/scripts/test_naive_bayes.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | 3 | from raft_baselines.classifiers import NaiveBayesClassifier 4 | 5 | train = datasets.load_dataset( 6 | "ought/raft", "neurips_impact_statement_risks", split="train" 7 | ) 8 | 9 | classifier = NaiveBayesClassifier(train) 10 | 11 | print(classifier.classify({"Paper title": "CNN research", "Impact statement": "test2"})) 12 | -------------------------------------------------------------------------------- /src/raft_baselines/utils/embedders.py: 
-------------------------------------------------------------------------------- 1 | from abc import ABC, abstractmethod 2 | from typing import List, Tuple 3 | from sentence_transformers import SentenceTransformer 4 | import torch 5 | 6 | 7 | class Embedder(ABC): 8 | @abstractmethod 9 | def __call__(self, texts: List[str]) -> List[List[float]]: 10 | ... 11 | 12 | 13 | class SentenceTransformersEmbedder(Embedder): 14 | def __init__( 15 | self, model_name="sentence-transformers/all-MiniLM-L6-v2", max_seq_length=512 16 | ): 17 | self.device = "cuda" if torch.cuda.is_available() else "cpu" 18 | self.similarity_model = SentenceTransformer(model_name, device=self.device) 19 | self.similarity_model.max_seq_length = max_seq_length 20 | self._cache = {} 21 | 22 | def __call__(self, texts: Tuple[str]) -> List[List[float]]: 23 | if hash(texts) in self._cache: 24 | return self._cache[hash(texts)] 25 | 26 | embeds = self.similarity_model.encode( 27 | texts, convert_to_tensor=True, device=self.device 28 | ) 29 | 30 | self._cache[hash(texts)] = embeds 31 | return embeds 32 | -------------------------------------------------------------------------------- /src/raft_baselines/utils/gpt3_utils.py: -------------------------------------------------------------------------------- 1 | import openai 2 | from dotenv import load_dotenv 3 | import os 4 | import time 5 | from cachetools import cached, LRUCache 6 | from typing import List, Dict, Tuple, Any, cast 7 | 8 | from raft_baselines.utils.tokenizers import TransformersTokenizer 9 | 10 | load_dotenv() 11 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 12 | 13 | 14 | @cached(cache=LRUCache(maxsize=1e9)) 15 | def complete( 16 | prompt: str, 17 | engine: str = "ada", 18 | max_tokens: int = 5, 19 | temperature: float = 1.0, 20 | top_p: float = 1.0, 21 | n: int = 1, 22 | echo: bool = False, 23 | stop: Tuple[str, ...] 
= ("\n",), 24 | presence_penalty: float = 0.0, 25 | frequency_penalty: float = 0.0, 26 | ): 27 | openai_completion_args = dict( 28 | api_key=OPENAI_API_KEY, 29 | engine=engine, 30 | prompt=prompt, 31 | max_tokens=max_tokens, 32 | temperature=temperature, 33 | top_p=top_p, 34 | n=n, 35 | logprobs=100, # Always request 100 so can easily count tokens in completion 36 | echo=echo, 37 | stop=stop, 38 | presence_penalty=presence_penalty, 39 | frequency_penalty=frequency_penalty, 40 | ) 41 | 42 | success = False 43 | retries = 0 44 | while not success: 45 | try: 46 | response = openai.Completion.create(**openai_completion_args) 47 | success = True 48 | except Exception as e: 49 | print(f"Exception in OpenAI completion: {e}") 50 | retries += 1 51 | if retries > 3: 52 | raise Exception("Max retries reached") 53 | break 54 | else: 55 | print("retrying") 56 | time.sleep(retries * 15) 57 | 58 | return cast(Dict[str, Any], response) 59 | 60 | 61 | @cached(cache=LRUCache(maxsize=1e9)) 62 | def search( 63 | documents: Tuple[str, ...], query: str, engine: str = "ada" 64 | ) -> List[Dict[str, Any]]: 65 | response = None 66 | error = None 67 | tokenizer = TransformersTokenizer("gpt2") 68 | query = tokenizer.truncate_by_tokens(query, 1000) 69 | short_enough_documents = [ 70 | tokenizer.truncate_by_tokens(document, 2034 - tokenizer.num_tokens(query)) 71 | for document in documents 72 | ] 73 | 74 | success = False 75 | retries = 0 76 | while not success: 77 | try: 78 | response = openai.Engine(engine, api_key=OPENAI_API_KEY).search( 79 | documents=short_enough_documents, query=query 80 | ) 81 | success = True 82 | except Exception as e: 83 | print(f"Exception in OpenAI search: {e}") 84 | retries += 1 85 | if retries > 3: 86 | raise Exception("Max retries reached") 87 | break 88 | else: 89 | print("retrying") 90 | time.sleep(retries * 15) 91 | 92 | assert response is not None 93 | results = response["data"] 94 | 95 | return results 96 | 
-------------------------------------------------------------------------------- /src/raft_baselines/utils/tokenizers.py: -------------------------------------------------------------------------------- 1 | from abc import ABC, abstractmethod 2 | from typing import List 3 | from transformers import AutoTokenizer 4 | from transformers.tokenization_utils_base import BatchEncoding 5 | 6 | 7 | class Tokenizer(ABC): 8 | @abstractmethod 9 | def num_tokens(self, text: str) -> int: 10 | ... 11 | 12 | @abstractmethod 13 | def truncate_by_tokens(self, text: str, max_tokens: int) -> str: 14 | ... 15 | 16 | 17 | class TransformersTokenizer(Tokenizer): 18 | def __init__(self, model_name): 19 | self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True) 20 | 21 | def __call__(self, *args, **kwargs) -> BatchEncoding: 22 | return self.tokenizer(*args, **kwargs) 23 | 24 | def num_tokens(self, text: str) -> int: 25 | return len(self.tokenizer.tokenize(text)) 26 | 27 | def truncate_by_tokens(self, text: str, max_tokens: int) -> str: 28 | if max_tokens is None or not text: 29 | return text 30 | encoding = self.tokenizer( 31 | text, truncation=True, max_length=max_tokens, return_offsets_mapping=True 32 | ) 33 | 34 | return text[: encoding.offset_mapping[-1][1]] 35 | --------------------------------------------------------------------------------
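`TransformersTokenizer.truncate_by_tokens` truncates by slicing the original string at the character offset where the last kept token ends, rather than detokenizing, which could alter whitespace or spelling. The idea can be sketched with a toy whitespace "tokenizer"; `toy_offsets` below is invented purely for illustration, whereas the real class gets its offsets from a Hugging Face fast tokenizer via `return_offsets_mapping`:

```python
from typing import List, Tuple


def toy_offsets(text: str) -> List[Tuple[int, int]]:
    """Toy stand-in for a fast tokenizer's offset mapping: one 'token'
    per whitespace-separated word, as (start, end) character offsets."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets


def truncate_by_tokens(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens tokens by slicing the original string at
    the offset where the last kept token ends -- the same idea the real
    implementation uses with a fast tokenizer's offset mapping."""
    if max_tokens is None or not text:
        return text
    offsets = toy_offsets(text)[:max_tokens]
    if not offsets:  # whitespace-only input
        return ""
    return text[: offsets[-1][1]]
```

For example, truncating `"label the sentence please"` to two toy tokens yields `"label the"`, preserving the original characters exactly up to the cut point.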