├── .env-example
├── .gitignore
├── LICENSE
├── README.md
├── example_prompts
│   ├── ade_corpus_v2.txt
│   ├── banking_77.txt
│   ├── neurips_impact_statement_risks.txt
│   ├── one_stop_english.txt
│   ├── overruling.txt
│   ├── semiconductor_org_types.txt
│   ├── systematic_review_inclusion.txt
│   ├── tai_safety_research.txt
│   ├── terms_of_service.txt
│   ├── tweet_eval_hate.txt
│   └── twitter_complaints.txt
├── requirements.txt
├── setup.py
└── src
    ├── __init__.py
    └── raft_baselines
        ├── classifiers
        │   ├── __init__.py
        │   ├── adaboost_classifier.py
        │   ├── classifier.py
        │   ├── gpt3_classifier.py
        │   ├── in_context_classifier.py
        │   ├── n_grams_classifier.py
        │   ├── naive_bayes_classifier.py
        │   ├── random_classifier.py
        │   ├── svm_classifier.py
        │   ├── transformers_causal_lm_classifier.py
        │   └── zero_shot_transformers_classifier.py
        ├── data
        │   ├── __init__.py
        │   ├── example_predictions.csv
        │   └── prompt_construction_settings.jsonl
        ├── scripts
        │   ├── non_neural_experiment.py
        │   ├── raft_predict.py
        │   ├── raft_train_experiment.py
        │   ├── starter_kit.ipynb
        │   ├── test_gpt3.py
        │   └── test_naive_bayes.py
        └── utils
            ├── embedders.py
            ├── gpt3_utils.py
            └── tokenizers.py

/.env-example:
--------------------------------------------------------------------------------
1 | OPENAI_API_KEY=sk-abcdefg
2 | HUGGINGFACE_API_TOKEN=abcdefg
3 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *predictions
2 | results
3 | .idea
4 | .env
5 | gpt3-baselines
6 | __pycache__
7 | prompts
8 | raft_baselines.egg-info
9 | .vscode
10 | 
11 | # Jupyter Notebook
12 | .ipynb_checkpoints
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright 2021 Ought Inc.
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 | 
5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 | 
7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Setup
2 | 
3 | This is the repository for the GPT-3 baselines described in the RAFT benchmark paper.
4 | 
5 | Set up a virtual environment and install the necessary requirements from the requirements file.
6 | 
7 | ```buildoutcfg
8 | conda create -n raft-baselines python=3.8 && conda activate raft-baselines
9 | python -m pip install -r requirements.txt
10 | ```
11 | 
12 | Install raft-baselines.
13 | 
14 | ```buildoutcfg
15 | python setup.py develop
16 | ```
17 | 
18 | You may have to run the above command with `sudo` prepended for permissions.
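The scripts in this repository read API credentials from a `.env` file in the repository root, following the format of `.env-example` shown above. The loading logic amounts to only a few lines; below is a minimal, stdlib-only sketch of what such loading looks like. This is illustrative, not the repo's actual code — the scripts may rely on a library such as python-dotenv, and the function name `load_env` is made up here:

```python
import os

def load_env(path=".env"):
    """Load KEY=VALUE pairs from a dotenv-style file into os.environ.

    Illustrative sketch only: skips blank lines and comments, and never
    overwrites variables already present in the environment. A library
    like python-dotenv additionally handles quoting and other edge
    cases this minimal parser ignores.
    """
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Keep any value already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip())
```

Calling `load_env()` before constructing an API client would then expose `os.environ["OPENAI_API_KEY"]`, matching the `.env` format described in the GPT-3 section below.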
19 | 
20 | # Starter Kit
21 | 
22 | A [starter kit notebook](src/raft_baselines/scripts/starter_kit.ipynb) walks through the basics of making predictions using models from the [HuggingFace Model Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads). There's also a [Colab version](https://colab.research.google.com/drive/1TQtHG-Wf2CgYGSD9e7_uJWIdiK5HNniV).
23 | 
24 | # RAFT Predict
25 | 
26 | Use the `raft_predict` script to run classifiers on the RAFT datasets. By default, the script runs on the first 5 test examples for each dataset. To use a random classifier on the first 10 examples from the ADE Corpus V2 dataset:
27 | 
28 | ```buildoutcfg
29 | python -m raft_baselines.scripts.raft_predict with n_test=10 'configs=["ade_corpus_v2"]' classifier_name=RandomClassifier
30 | ```
31 | 
32 | The other classifiers available are:
33 | 
34 | - `GPT3Classifier`: the one used for the GPT-3 baseline in the paper
35 | - `TransformersCausalLMClassifier`: takes as input a `model_type` string, and runs an arbitrary CausalLM from the [HuggingFace Model Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
36 | 
37 | For example, to generate predictions from DistilGPT-2 on the first 10 examples of the ADE Corpus, you can run:
38 | 
39 | ```buildoutcfg
40 | python -m raft_baselines.scripts.raft_predict with n_test=10 'configs=["ade_corpus_v2"]' classifier_name=TransformersCausalLMClassifier 'classifier_kwargs={"model_type":"distilgpt2"}'
41 | ```
42 | 
43 | To run experiments with GPT-3, you will need an OpenAI API key. Create a file called `.env` and put your API key there, copying the format of `.env-example`:
44 | 
45 | ```buildoutcfg
46 | echo OPENAI_API_KEY=$OPENAI_API_KEY > .env
47 | ```
48 | 
49 | ## Sacred
50 | 
51 | We use [Sacred](https://github.com/IDSIA/sacred) to track our experiments and outputs. This adds no runtime overhead; simply run either of our two experiment scripts with Python as normal.
You can change where tracking files are saved by modifying the observer at the top of each experiment file, and you can change the details of an experiment via the configuration parameters specified in its configs block.
52 | 
53 | ```buildoutcfg
54 | # For labeling the test set
55 | python -m raft_baselines.scripts.raft_predict
56 | # For tuning various dimensions on the train set with LOO validation
57 | python -m raft_baselines.scripts.raft_train_experiment
58 | ```
59 | 
60 | Alternatively, you can modify the input variables to an experiment from the command line, as in the examples above. Either way, some modification will be necessary if you want to run different experiments. See [this tutorial](https://sacred.readthedocs.io/en/stable/configuration.html) for more information.
61 | 
62 | Similarly, you can save metrics with `raft_experiment.log_scalar()`, or by using the Sacred observer directly. See [this tutorial](https://sacred.readthedocs.io/en/stable/collected_information.html) for more information.
63 | 
64 | To save out predictions and upload them to the HuggingFace Hub (and the leaderboard), see [the RAFT submission template](https://huggingface.co/datasets/ought/raft-submission).
65 | 
66 | ## License
67 | 
68 | This repository is licensed under the MIT License.
69 | 
--------------------------------------------------------------------------------
/example_prompts/ade_corpus_v2.txt:
--------------------------------------------------------------------------------
1 | Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below:
2 | Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).
3 | Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.
4 | Possible labels:
5 | 1. ADE-related
6 | 2. not ADE-related
7 | 
8 | Sentence: With serious cases, however, conventional treatment may not allow sufficient time at depth for the complete resolution of manifestations because of the need to avoid pulmonary oxygen toxicity which is associated with a prolonged period of breathing compressed air.
9 | Label: not ADE-related
10 | 
11 | Sentence: Several hypersensitivity reactions to cloxacillin have been reported, although IgE-mediated allergic reactions to the drug are rare and there is little information about possible tolerance to other semisynthetic penicillins or cephalosporins in patients with cloxacillin allergy.
12 | Label: ADE-related
13 | 
14 | Sentence: A 69-year-old male was diagnosed in February 2004 with stage IV extranodal marginal zone B cell lymphoma involving the mediastinal nodes, lung parenchyma and bone marrow with high LDH.
15 | Label: not ADE-related
16 | 
17 | Sentence: A patient with psoriasis is described who had an abnormal response to the glucose tolerance test without other evidence of diabetes and then developed postprandial hyperglycemia and glycosuria during a period of topical administration of a corticosteroid cream, halcinonide cream 0.1
18 | Label: ADE-related
19 | 
20 | Sentence: The gold standard for diagnosis is renal biopsy, but it is only rarely performed during the acute phase of the reaction and is not without risk.
21 | Label: not ADE-related
22 | 
23 | Sentence: Of the 16 patients, including the 1 reported here, only 3 displayed significant shortening of the agranulocytic period after treatment.
24 | Label: not ADE-related
25 | 
26 | Sentence: These cases were considered unusual in light of the short delay of their onset after initiation of immunosuppressive therapy and their fulminant course: 3 of these patients died of PCP occurring during the first month of treatment with prednisone.
27 | Label: ADE-related
28 | 
29 | Sentence: In 1991 the patient were found to be seropositive for HCV antibodies as detected by the ELISA method and confirmed by the RIBA method.
30 | Label: not ADE-related
31 | 
32 | Sentence: Considerable improvement of myasthenic symptoms was seen in all patients within 3-6 months after the initiation of this therapy.
33 | Label: not ADE-related
34 | 
35 | Sentence: We present three patients with paradoxical seizures; their serum phenytoin levels were 43.5 mcg/mL, 46.5 mcg/mL and 38.3 mcg/mL.
36 | Label: ADE-related
37 | 
38 | Sentence: NEH must be considered in lupus patients receiving cytotoxic agents to avoid inappropriate use of corticosteroids or antibiotics in this self-limited condition.
39 | Label: not ADE-related
40 | 
41 | Sentence: A challenge with clozapine was feasible and showed no clinical symptoms of eosinophilia.
42 | Label: not ADE-related
43 | 
44 | Sentence: We describe a patient who developed HUS after treatment with mitomycin C (total dose 144 mg/m2) due to a carcinoma of the ascending colon.
45 | Label: ADE-related
46 | 
47 | Sentence: The INR should be monitored more frequently when bosentan is initiated, adjusted, or discontinued in patients taking warfarin.
48 | Label: not ADE-related
49 | 
50 | Sentence: An encephalopathy and cardiomyopathy developed in a seventeen-year-old girl with chemotherapy-induced renal failure while receiving an intravesical aluminum infusion for hemorrhagic cystitis.
51 | Label: ADE-related
52 | 
53 | Sentence: CT-scan disclosed right ethmoid sinusitis that spread to the orbit after surgery.
54 | Label: not ADE-related
55 | 
56 | Sentence: MRI has a high sensitivity and specificity in the diagnosis of osteonecrosis and should be used when this condition is suspected.
57 | Label: not ADE-related
58 | 
59 | Sentence: CONCLUSION: Pancreatic enzyme intolerance, although rare, would be a major problem in the management of patients with CF.
60 | Label: not ADE-related
61 | 
62 | Sentence: These results indicate that the hyponatremia in this case was due to SIADH and that SIADH was caused by an increased release of vasopressin probably because of the antiviral drug (acyclovir) or infection of varicella zoster virus (V
63 | Label: not ADE-related
64 | 
65 | Sentence: METHODS: This study is a case report description.
66 | Label: not ADE-related
67 | 
68 | Sentence: Best-corrected visual acuity measurements were performed at every visit.
69 | Label: not ADE-related
70 | 
71 | Sentence: METHODS: We identified three patients who developed skin necrosis and determined any factors, which put them at an increased risk of doing so.
72 | Label: not ADE-related
73 | 
74 | Sentence: OBJECTIVE: To describe onset of syndrome of inappropriate antidiuretic hormone (SIADH) associated with vinorelbine therapy for advanced breast cancer.
75 | Label: ADE-related
76 | 
77 | Sentence: IMPLICATIONS: Dexmedetomidine, an alpha(2)-adrenoceptor agonist, is indicated for sedating patients on mechanical ventilation.
78 | Label: not ADE-related
79 | 
80 | Sentence: CONCLUSIONS: These results suggest that clozapine may cause TD; however, the prevalence is low and the severity is relatively mild, with no or mild self-reported discomfort.
81 | Label: ADE-related
82 | 
83 | Sentence: CONCLUSIONS: SD-OCT and AO detected abnormalities that correlate topographically with visual field loss from hydroxychloroquine toxicity as demonstrated by HVF 10-2 and may be useful in the detection of subclinical abnormalities that precede symptoms or objective visual field loss.
84 | Label:
--------------------------------------------------------------------------------
/example_prompts/banking_77.txt:
--------------------------------------------------------------------------------
1 | The following is a banking customer service query. Classify the query into one of the 77 categories available.
2 | Possible labels:
3 | 1. Refund_not_showing_up
4 | 2. activate_my_card
5 | 3. age_limit
6 | 4. apple_pay_or_google_pay
7 | 5. atm_support
8 | 6. automatic_top_up
9 | 7. balance_not_updated_after_bank_transfer
10 | 8. balance_not_updated_after_cheque_or_cash_deposit
11 | 9. beneficiary_not_allowed
12 | 10. cancel_transfer
13 | 11. card_about_to_expire
14 | 12. card_acceptance
15 | 13. card_arrival
16 | 14. card_delivery_estimate
17 | 15. card_linking
18 | 16. card_not_working
19 | 17. card_payment_fee_charged
20 | 18. card_payment_not_recognised
21 | 19. card_payment_wrong_exchange_rate
22 | 20. card_swallowed
23 | 21. cash_withdrawal_charge
24 | 22. cash_withdrawal_not_recognised
25 | 23. change_pin
26 | 24. compromised_card
27 | 25. contactless_not_working
28 | 26. country_support
29 | 27. declined_card_payment
30 | 28. declined_cash_withdrawal
31 | 29. declined_transfer
32 | 30. direct_debit_payment_not_recognised
33 | 31. disposable_card_limits
34 | 32. edit_personal_details
35 | 33. exchange_charge
36 | 34. exchange_rate
37 | 35. exchange_via_app
38 | 36. extra_charge_on_statement
39 | 37. failed_transfer
40 | 38. fiat_currency_support
41 | 39. get_disposable_virtual_card
42 | 40. get_physical_card
43 | 41. getting_spare_card
44 | 42. getting_virtual_card
45 | 43. lost_or_stolen_card
46 | 44. lost_or_stolen_phone
47 | 45. order_physical_card
48 | 46. passcode_forgotten
49 | 47. pending_card_payment
50 | 48. pending_cash_withdrawal
51 | 49. pending_top_up
52 | 50. pending_transfer
53 | 51. pin_blocked
54 | 52. receiving_money
55 | 53. request_refund
56 | 54. reverted_card_payment?
57 | 55. supported_cards_and_currencies
58 | 56. terminate_account
59 | 57. top_up_by_bank_transfer_charge
60 | 58. top_up_by_card_charge
61 | 59. top_up_by_cash_or_cheque
62 | 60. top_up_failed
63 | 61. top_up_limits
64 | 62. top_up_reverted
65 | 63. topping_up_by_card
66 | 64. transaction_charged_twice
67 | 65. transfer_fee_charged
68 | 66. transfer_into_account
69 | 67. transfer_not_received_by_recipient
70 | 68. transfer_timing
71 | 69. unable_to_verify_identity
72 | 70. verify_my_identity
73 | 71. verify_source_of_funds
74 | 72. verify_top_up
75 | 73. virtual_card_not_working
76 | 74. visa_or_mastercard
77 | 75. why_verify_identity
78 | 76. wrong_amount_of_cash_received
79 | 77. wrong_exchange_rate_for_cash_withdrawal
80 | 
81 | Query: I withdrew cash and I think the exchange rate is wrong.
82 | Label: 77. wrong_exchange_rate_for_cash_withdrawal
83 | 
84 | Query: After I transferred money the balance remained the same.
85 | Label: 7. balance_not_updated_after_bank_transfer
86 | 
87 | Query: Why is my money not in my account. I have already sent it out.
88 | Label: 8. balance_not_updated_after_cheque_or_cash_deposit
89 | 
90 | Query: I didn't get all the cash I asked for
91 | Label: 76. wrong_amount_of_cash_received
92 | 
93 | Query: Why am I unable to transfer money when I was able to before?
94 | Label: 9. beneficiary_not_allowed
95 | 
96 | Query: Why is there extra cash in my account?
97 | Label: 22. cash_withdrawal_not_recognised
98 | 
99 | Query: I have a strange transaction for £1 on my statement, what is that?
100 | Label: 36. extra_charge_on_statement
101 | 
102 | Query: I didn't make the direct debit payment on my account.
103 | Label: 30. direct_debit_payment_not_recognised
104 | 
105 | Query: What is the $1 transaction on my account?
106 | Label: 36. extra_charge_on_statement
107 | 
108 | Query: How can I tell the source for my available funds?
109 | Label: 71. verify_source_of_funds
110 | 
111 | Query: where did my funds come from?
112 | Label:
--------------------------------------------------------------------------------
/example_prompts/neurips_impact_statement_risks.txt:
--------------------------------------------------------------------------------
1 | Label the impact statement based on whether it mentions a harmful application of the research done in the paper. Make sure the statement is sufficient to conclude there are harmful applications of the research being done, not a past risk that this research is solving.
2 | Possible labels:
3 | 1. doesn't mention a harmful application
4 | 2. mentions a harmful application
5 | 
6 | Impact statement: Machine learning algorithms are increasingly relied upon by decision makers. It is therefore crucial to combine the predictive performance of such complex machinery with practical guarantees on the reliability and uncertainty of their output. We view the calibration methods presented in this paper as an important step towards this goal. In fact, uncertainty estimation is an effective way to quantify and communicate the benefits and limitations of machine learning. Moreover, the proposed methodologies provide an attractive way to move beyond the standard prediction accuracy measure used to compare algorithms. For instance, one can compare the performance of two candidate predictors, e.g., random forest and neural network (see Figure 3), by looking at the size of the corresponding prediction sets and/or their their conditional coverage. Finally, the approximate conditional coverage that we seek in this work is highly relevant within the broader framework of fairness, as discussed by [17] within a regression setting. While our approximate conditional coverage already implicitly reduces the risk of unwanted bias, an equalized coverage requirement [17] can also be easily incorporated into our methods to explicitly avoid discrimination based on protected categories.
We conclude by emphasizing that the validity of our methods relies on the exchangeability of the data points. If this assumption is violated (e.g., with time-series data), our prediction sets may not have the right coverage. A general suggestion here is to always try to leverage specific knowledge of the data and of the application domain to judge whether the exchangeability assumption is reasonable. Finally, our data-splitting techniques
7 | Label: doesn't mention a harmful application
8 | 
9 | Impact statement: The problem of Byzantine resilient aggregation of distributed machine learning models has been actively studied in recent years; however, the issue of Byzantine resilient distributed learning in multi-task networks has received much less attention. It is a general intuition that MTL is robust and resilient to cyber-attacks since it can identify attackers by measuring similarities between neighbors. In this paper, we have shown that some commonly used similarity measures are not resilient against certain attacks. With an increase in data heterogeneity, we hope this work could highlight the security and privacy concerns in designing distributed MTL frameworks.
10 | Paper title: Byzantine Resilient Distributed Multi-Task Learning
11 | Label: doesn't mention a harmful application
12 | 
13 | Impact statement: In our work, the learning objective was designed to align with and support the possible use of a predictive model to drive decisions by users. It is our belief that a responsible and transparent deployment of models with “lookahead-like" regularization components should avoid the kinds of mistakes that can be made when predictive methods are conflated with causally valid methods. At the same time, we have made a strong simplifying assumption, that of covariate shift, which requires that the relationship between covariates and outcome variables is invariant as decisions are made and the feature distribution changes.
This strong assumption is made to ensure validity for the lookahead regularization, since we need to be able to perform inference about counterfactual observations. As discussed by Mueller et al. [ 31] and Peters et al. [34], there exist real-world tasks that reasonably satisfy this assumption, and yet at the same time, other tasks— notably those with unobserved confounders —where this assumption would be violated. Moreover, this assumption is not testable on the observational data. This, along with the need to make an assumption about the user decision model, means that an application of the method proposed here should be done with care and will require some domain knowledge to understand whether or not the assumptions are plausible. Furthermore, the validity of the interval estimates requires that any assumptions for the interval model used are satisfied and that weights w provide a reasonable estimation of p /p . In particular, fitting to p which has
14 | Label: mentions a harmful application
15 | 
16 | Impact statement: Uncertainty estimation for neural networks has very significant societal impact. Neural networks are increasingly being trained as black-box predictors and being placed in larger decision systems where errors in their predictions can pose immediate threat to downstream tasks. Systematic methods for calibrated uncertainty estimation under these conditions are needed, especially as these systems are deployed in safety critical domains, such for autonomous vehicle control [29], medical diagnosis [43], or in settings with large dataset imbalances and bias such as crime forecasting [24] and facial recognition [3]. This work is complementary to a large portion of machine learning research which is continually pushing the boundaries on neural network precision and accuracy. Instead of solely optimizing larger models for increased performance, our method focuses on how these models can be equipped with the ability to estimate their own confidence.
Our results demonstrating superior calibration of our method over baselines are also critical in ensuring that we can place a certain level of trust in these algorithms and in understanding when they say “I don’t know”. While there are clear and broad benefits of uncertainty estimation in machine learning, we believe it is also important to recognize potential societal challenges that may arise. With increased performance and uncertainty estimation capabilities, humans will inevitably become increasingly trusting in a model’s predictions, as well as its ability to catch dangerous or uncertain decisions before they are executed. Thus, it is important to continue to pursue redundancy in such learning systems to increase the likelihood that mistakes can be caught and corrected independently.
17 | Paper
18 | Label: mentions a harmful application
19 | 
20 | Impact statement: Hypothesis testing and valid inference after model selection are fundamental problems in statistics, which have recently attracted increasing attention also in machine learning. Kernel tests such as MMD are not only used for statistical testing, but also to design algorithms for deep learning and GANs [41, 42]. The question of how to select the test statistic naturally arises in kernel-based tests because of the kernel choice problem. Our work shows that it is possible to overcome the need of (wasteful and often heuristic) data splitting when designing hypothesis tests with feasible null distribution. Since this comes without relevant increase in computational resources we expect the proposed method to replace the data splitting approach in applications that fit the framework considered in this work. Theorem 1 is also applicable beyond hypothesis testing and extends the previously known PSI framework proposed by Lee et al. [24].
21 | Paper title: Learning Kernel Tests Without Data Splitting
22 | Label: doesn't mention a harmful application
23 | 
24 | Impact statement: With the proliferation of deep learning, explaining or understanding the reasons behind the models decisions has become extremely important in many critical applications [ 30]. Many explainability methods have been proposed in literature [7, 12, 11], however, they either provide instance specific local explanations or fit to the entire dataset and create global explanations. Our proposed method is able to create both such explanations, but in addition, it also creates explanations for subgroups in the data and all of this jointly. We thus are creating explanations for granularities (between local and global). This multilevel aspect has not been sufficiently researched before. In fact recently [4] has stressed the importance of having such multilevel explanations for successfully meeting the requirements of Europe’s General Data Protection Regulation (GDPR) [5]. They clearly state that simply having local or global explanations may not be sufficient for providing satisfactory explanations in many cases. There are also potential risks with this approach. The first is that if the base local explainer is non-robust or inaccurate [34, 35] then the explanations generated by our tree also may have to be considered cautiously. However this is not specific to our method, and applies to several post-hoc explainability methods that try to explain a black-box model. The way to mitigate this is to ensure that the local explanation methods are adapted (such as by choosing appropriate neighborhoods in LIME) to provide robust and accurate explanations. Another risk could be that such detailed multilevel explanations may reveal too much about the internals of the model (similar scenario for gradient-based models is discussed in [36]) and hence may raise privacy concerns.
Mitigation could happen by selectively revealing the levels / pruning the tree or having a budget of explanations for each user to balance the level of explanations vs. the exposure of the black-box model.
25 | Paper title: Model Agnostic Multilevel Explanations
26 | Label:
--------------------------------------------------------------------------------
/example_prompts/one_stop_english.txt:
--------------------------------------------------------------------------------
1 | The following is an article sourced from The Guardian newspaper, and rewritten by teachers to suit three levels of adult English as Second Language (ESL) learners: elementary, intermediate, and advanced. Predict the level of the article.
2 | Possible labels:
3 | 1. advanced
4 | 2. elementary
5 | 3. intermediate
6 | 
7 | Article: Cities don’t often move. But that’s exactly what Kiruna, an Arctic town in northern Sweden, has to do. It has to move or the earth will swallow it up.
8 | “It’s a terrible choice,” says Krister Lindstedt, who works for the Swedish architect company that is moving the city. They will move this city of 23,000 people away from a gigantic iron-ore mine that is swallowing up the ground beneath its streets. “Either the mine must stop digging, and then there will be no jobs, or the city has to move.”
9 | Kiruna was founded in 1900 by the state-owned Luossavaara-Kiirunavaara mining company (LK). The city became rich thanks to the very large amount of iron ore that is below the town. But the mine that made it rich is now going to destroy it. “The town is here because of the mine,” says Deputy Mayor Niklas Siren.
10 | Located 145km inside the Arctic Circle, Kiruna has a very difficult climate. It has winters with no sunlight and average temperatures of -15C. But the iron ore has kept people here. Kiruna is the world’s largest underground iron-ore mine. It produces 90% of all the iron in Europe.
That is enough to build more than six Eiffel
11 | Label: elementary
12 | 
13 | Article: Illegal downloading is a kind of “moral squalor” and theft, as much as putting your hand in someone’s pocket and stealing their wallet is theft, says author Philip Pullman. In an article for Index on Censorship, Pullman, who is president of the Society of Authors, strongly defends copyright laws. He criticizes internet users who think it is OK to download music or books without paying for them.
14 | “The technical brilliance is so dazzling that people can’t see the moral squalor of what they’re doing,” he writes. “It is outrageous that anyone can steal an artist’s work and get away with it. It is theft, just as putting your hand in someone’s pocket and taking their wallet is theft.”
15 | His article comes after music industry leaders met British Prime Minister David Cameron in Downing Street to discuss the issue of web piracy.
16 | Pullman, writer of the His Dark Materials trilogy, says authors and musicians work in poverty and obscurity for years to bring their work to the level “that gives delight to their audiences and, as soon as they achieve that, the possibility of earning a living from it is taken away from them”. He concludes: “The principle is simple, and unaltered by technology, science or magic: if we want to enjoy the work that someone does,
17 | Label: intermediate
18 | 
19 | Article: As soon as the children at one primary school in Stirling hear the words “daily mile”, they down their pencils and head out of the classroom to start running laps around the school field. For three-and-a-half years, all pupils at St Ninian’s Primary have walked or run a mile each day. They do so at random times during the day, apparently happily, and, despite the rise in childhood obesity across the UK, none of the children at the school are overweight.
20 | The daily mile has done so much to improve these children’s fitness, behaviour and concentration in lessons that scores of nursery and primary schools across Britain are following suit and getting pupils to get up from their desks and take 15 minutes to walk or run round the school or local park.
21 | Elaine Wyllie, headteacher of St Ninian’s, said: “I get at least two emails a day from other schools and local authorities asking how we do it. The thought of children across the country running every day because of something we’ve done is phenomenal.”
22 | One in ten children are obese when they start school at the age of four or five, according to figures from the Health & Social Care Information Centre, and, in the summer of 2015, a study found that schoolchildren in England are the least fit they have ever been.
23 | Label: advanced
24 | 
25 | Article: Back in 2005, when BlackBerry brought instant messaging to the mobile phone, the company was just entering its boom times. While the iPhone was still just an idea, BlackBerry’s innovations ensured its smartphone was one of Canada’s biggest exports.
26 | Six years later, in the summer of 2011, when there were riots in London and other UK cities, BlackBerry Messenger (BBM) was so effective at mobilizing the rioters that politicians wanted the service to be temporarily shut down. But, two years later, it is the users themselves who are pulling the plug.
27 | Demand for BlackBerry phones is falling. Dozens of alternatives have sprung up to take its place, from Facebook’s and Apple’s instant messaging applications to independent apps such as WhatsApp and Kik (which is also Canadian). They are free to download and use, and they use the internet to swap text messages, pictures, voice clips, ‘stickers’ and even videos between most types of phones.
28 | In an attempt to keep its customers, BBM has been released on Android and Apple phones.
Despite the competition from other apps, the response has been extraordinary, with more than 20 million downloads. But, despite this interest, many people believe BBM’s wider release will not save the service. “The move to bring BlackBerry to the iPhone is four or five years too late,” says James Gooderson, an 18 29 | Label: intermediate 30 | 31 | Article: Loneliness has finally become a hot topic. The Office for National Statistics has found Britain to be the loneliest place in Europe. British people are less likely to have strong friendships or know their neighbours than people anywhere else in the European Union. And research at the University of Chicago has found that loneliness is twice as bad for older people’s health as obesity and almost as great a cause of death as poverty. 32 | This is shocking but such studies do not examine the loneliness epidemic among younger adults. In 2010, the Mental Health Foundation found that loneliness was a greater concern among young people than among the elderly. The 18- to 34-year-olds surveyed were more likely to feel lonely often, to worry about feeling alone and to feel depressed because of loneliness than the over-55s. 33 | “Loneliness is a recognized problem among the elderly and there are day centres and charities to help them,” says Sam Challis, of the mental health charity Mind, “but, when young people reach 21, they’re too old for youth services.” This is problematic because of the close relationship between loneliness and mental health – it is linked to increased stress, depression, paranoia, anxiety, addiction and it is a known cause of suicide. 34 | But what can young people do to prevent loneliness? One researcher at the Oxford Internet Institute points out that social media and the internet can be both a good thing 35 | Label: intermediate 36 | 37 | Article: Many of us know we don’t get enough sleep but imagine if there was a simple solution: getting up later. 
In a speech at the British Science Festival, Dr Paul Kelley from Oxford University said schools should stagger their starting times to work with the natural rhythms of their students. This would improve exam results and students’ health (lack of sleep can cause diabetes, depression, obesity and other health problems). 38 | Dr Kelley said that, when children are around ten, their natural wake-up time is about 6.30am; at 16, this rises to 8am; and, at 18, a person’s natural waking hour is 9am, although you may think they are just a lazy teenager. The normal school starting time works for 10-year-olds but not for 16- to 18-year-olds. For the older teenagers, it might be better to start the school day at 11am or even later. “A 7am wake-up time for older teenagers,” says Kelley, “is the same as a 4.30am start for a teacher in their 50s.” 39 | He says the solution is not to tell teenagers to go to bed earlier. “The body’s natural rhythm is controlled by a particular kind of light,” says Kelley. “The eye has cells that report to a part of the brain that controls our sleep rhythms over a 24-hour cycle. It’s the light that controls it.” 40 | But it isn’t just students who would benefit from a later start. Kelley says the working day should be more linked to our natural rhythms. Describing the average sleep loss per night for different age groups, he says: “Between 14 and 24, people lose more than two hours. For people aged between 24 and about 30 or 35, they lose about an hour and a half. That can continue up until you’re about 55 when it’s in balance again. The 10-year-old and 55-year-old wake and sleep naturally at the same time.” 41 | So, should workplaces have staggered starting times, too? Should people in their 50s and above come in at 8am, people in their 30s start at 10am and the teenage apprentice at 11am? 
Kelley says that synchronized hours could have “many 42 | Label: -------------------------------------------------------------------------------- /example_prompts/overruling.txt: -------------------------------------------------------------------------------- 1 | In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. Label the sentence based on whether it is overruling or not. 2 | Possible labels: 3 | 1. not overruling 4 | 2. overruling 5 | 6 | Sentence: the following facts are taken from the administrative record. 7 | Label: not overruling 8 | 9 | Sentence: see scott, supra at 352; commonwealth v. ruffin, 475 mass. 1003, 1004 (2016). 10 | Label: not overruling 11 | 12 | Sentence: while not limited to these cases, to the extent the following cases are in conflict, they are overruled. 13 | Label: overruling 14 | 15 | Sentence: we reverse and remand, and in doing so, we overrule commonwealth v. constant 16 | Label: overruling 17 | 18 | Sentence: see boles, 554 so.2d at 961 ([i]f the county and other persons are not bound, then the status of the road as public or private is subject to being litigated again, and the results of later litigation may be inconsistent with the results of the initial litigation.). 19 | Label: not overruling 20 | 21 | Sentence: to the extent that paprskar v. state, supra, applied the general test of waiver of constitutional rights set forth in johnson v. zerbst, supra, it is no longer viable. 22 | Label: overruling 23 | 24 | Sentence: we flatly rejected this logic a century ago in state ex rel. state capitol commission v. lister, 91 wash. 9, 156 p. 858 (1916), and we reject it again now. 
25 | Label: overruling 26 | 27 | Sentence: in this case, the trial court did not clearly err by finding clear and convincing evidence to support termination under mcl 712a.19b(3)(g) and (j). 28 | Label: not overruling 29 | 30 | Sentence: app. 1981), or voninski v. voninski, 661 s.w.2d 872, 878-79 (tenn. 31 | Label: overruling 32 | 33 | Sentence: see tex. r. app. p. 48.4; see also in re schulman, 252 s.w.3d at 412 n.35; ex parte owens, 206 s.w.3d 670, 673 (tex. crim. app. 2006). 34 | Label: not overruling 35 | 36 | Sentence: we therefore overrule mcgore; and we hold, like every other circuit to have reached the issue, that under rule 15(a) a district court can allow a plaintiff to amend his complaint even when the complaint is subject to dismissal under the plra. 37 | Label: overruling 38 | 39 | Sentence: we recognize that this reading of fager disapproves prior cases. 40 | Label: overruling 41 | 42 | Sentence: to the extent that this opinion causes conflict with earlier decisions such as holmes, those cases are overruled. 43 | Label: overruling 44 | 45 | Sentence: we disapprove abdelaziz as well as henderson v. north, 545 so.2d 486 (fla. 1st dca 1989), which adopted the principle of abdelaziz, to the extent that they disapproved a cause of action for negligent stillbirth. 46 | Label: overruling 47 | 48 | Sentence: the decision of the fourth district court of appeal holding section 550.081 unconstitutional is disapproved. 49 | Label: overruling 50 | 51 | Sentence: furthermore, the trial court indicated in its order that it had ""consider[ed] . . . [appellant's] special appearance, the pleadings, the affidavits, and arguments of counsel."" 52 | Label: not overruling 53 | 54 | Sentence: however, to the extent that cervantes, and ex parte mcatee, 599 s.w.2d 335 (tex.crim.app. 1980), indicate that a failure to admonish pursuant to art. 
26.13(a)(4) automatically entitles one to post-conviction collateral relief without 55 | Label: overruling 56 | 57 | Sentence: to the extent that the holding in wilson v. bureau of state police, supra, conflicts with this opinion, it is overruled. 58 | Label: overruling 59 | 60 | Sentence: for the reasons stated below, we approve the fifth district court of appeal's decision in winter park, and disapprove the decision in belleair to the extent described herein. 61 | Label: overruling 62 | 63 | Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree. 64 | Label: overruling 65 | 66 | Sentence: having reviewed the question en banc, we now answer that question in the affirmative and overrule laffey. 67 | Label: overruling 68 | 69 | Sentence: accordingly, to the extent of any conflict nemecek v. state, 621 s.w.2d 404 (tex.cr.app. 1980) is overruled. 70 | Label: overruling 71 | 72 | Sentence: we therefore overrule mata and hartman to the extent of the conflict and reverse the trial court's judgment and remand the cause for a new trial. 73 | Label: overruling 74 | 75 | Sentence: in reaching that conclusion, we recede from the previous holding of this court in hall v. state, 505 so.2d 657, 658 (fla. 2d dca), cause dismissed, 509 so.2d 1117 (fla. 1987), in which we stated that an essential element of 76 | Label: overruling 77 | 78 | Sentence: we are fully in accord with the relaxation of the federal requirements as expressed in illinois v. gates, supra, and to the extent that berkshire v. commonwealth, supra; thompson v. commonwealth, supra; and buchenburger v. commonwealth, supra, express a contrary view, they 79 | Label: overruling 80 | 81 | Sentence: we overrule this holding based upon our conclusion that review by the court of appeals under section 22-63-117(11) is predicated upon a final order of the school board resulting from proceedings conducted under section 22-63-117. 
82 | Label: -------------------------------------------------------------------------------- /example_prompts/semiconductor_org_types.txt: -------------------------------------------------------------------------------- 1 | The dataset is a list of institutions that have contributed papers to semiconductor conferences in the last 25 years, as catalogued by IEEE and sampled randomly. The goal is to classify the institutions into one of three categories: "university", "company" or "research institute". 2 | Possible labels: 3 | 1. company 4 | 2. research institute 5 | 3. university 6 | 7 | Organization name: Central Research Laboratory,Hitachi Ltd. Kokubunji. Tokyo,Japan 8 | Paper title: Formation of Si-on-Insulator 9 | Label: company 10 | 11 | Organization name: MAPS,Yongin,Korea 12 | Paper title: 21.8 An all-in-one (Qi, PMA 13 | Label: company 14 | 15 | Organization name: Samsung Electronics Company Limited, Yongin si, Gyeonggi, South Korea 16 | Paper title: 1D thickness scaling study of phase change 17 | Label: company 18 | 19 | Organization name: Semiconductor Research Center,Matsushita Electric Industrial Co.,Ltd.,Yagumo-nakamachi,Morig 20 | Label: company 21 | 22 | Organization name: Advanced Circuit Pursuit,Zollikon,Switzerland; ETH,Zurich,Switzerland 23 | Paper title: A 0. 
24 | Label: company 25 | 26 | Organization name: Engim,Acton,MA,USA 27 | Paper title: A 180MS/s, 162Mb/s wideband three- 28 | Label: company 29 | 30 | Organization name: R & D Center, Samsung Electronics Kiheung-Eup, Yongin-City, Kyungki-do, Korea 31 | Paper 32 | Label: company 33 | 34 | Organization name: Panasonic,Osaka,Japan 35 | Paper title: 30.1 8b Thin-film microprocessor using a hybrid oxide-organic complementary technology 36 | Label: company 37 | 38 | Organization name: Center for Semiconductor Research & Development,Toshiba Corporation,Japan 39 | Paper title: Physical understanding of Vth and Idsat variations 40 | Label: company 41 | 42 | Organization name: Imec,Leuven,Belgium 43 | Paper title: An Artificial Iris ASIC with High Voltage Liquid Crystal Driver, 10 n 44 | Label: research institute 45 | 46 | Organization name: ETH,Zurich,Switzerland; Advanced Circuit Pursuit,Zollikon,Switzerland 47 | Paper title: A 0. 48 | Label: university 49 | 50 | Organization name: imec from Samsung Electronics,Korea 51 | Paper title: First Demonstration of Low Temperature (≤500°C) CMOS 52 | Label: company 53 | 54 | Organization name: Technology Research Department,Association of Super-Advanced Electronics Technologies (ASET),,Higashi-koigakubo,K 55 | Label: research institute 56 | 57 | Organization name: Texas Instruments Bangalore and Texas Instruments,Dallas,TX 58 | Paper title: A DSL customer-premise equipment modem SoC with extended reach/ 59 | Label: company 60 | 61 | Organization name: Memory Division, Samsung Electronics Co, Yongin-City, Gyeonggi-Do, Korea 62 | Paper title: Front-end- 63 | Label: company 64 | 65 | Organization name: Fujitsu Laboratories Ltd.,Japan 66 | Paper title: Development of sub 10-µm ultra-thinning technology using device wafers 67 | Label: company 68 | 69 | Organization name: National NanoFab Center, Daejeon, South Korea 70 | Paper title: 3-terminal nanoelectromechanical switching 71 | Label: research institute 72 | 73 | Organization name: 
Semiconductor R&D Center, Samsung Electronics Co., Ltd, Yongin-City, Gyeonggi-Do, Korea ( 74 | Label: company 75 | 76 | Organization name: Pohang Univ. of Sci. & Technol.,South Korea 77 | Paper title: A 3Gb/s 8b single-ended 78 | Label: university 79 | 80 | Organization name: MPI für Mikrostrukturphysik, Halle, Germany 81 | Paper title: Dislocation engineering for a silicon 82 | Label: research institute 83 | 84 | Organization name: Sony,Tokyo,Japan 85 | Paper title: A 3.1 to 5 GHz CMOS DSSS UWB transceiver for WP 86 | Label: company 87 | 88 | Organization name: Illinois Univ.,Urbana,IL,USA 89 | Paper title: A 14 b 100 Msample/s CMOS DAC designed for spectral 90 | Label: university 91 | 92 | Organization name: ULSI Device Dev. Labs.,NEC Corp.,Kanagawa,Japan 93 | Paper title: A crossing charge recycle refresh scheme 94 | Label: company 95 | 96 | Organization name: Syst. LSI Dev. Center,Mitsubishi Electr. Corp.,Hyogo,Japan 97 | Paper title: Single- 98 | Label: company 99 | 100 | Organization name: Matsushita Electric Industrial Limited, Takatsuki, Osaka, Japan 101 | Paper title: Role of non-radiative recombination in the 102 | Label: company 103 | 104 | Organization name: Corp. Semicond. Dev. Div.,Matsushita Electr. Ind. Co. 
Ltd.,Kyoto,Japan 105 | 106 | Label: company 107 | 108 | Organization name: KUL, Leuven, Belgium 109 | Paper title: Benchmarking of monolithic 3D integrated MX2 FETs with Si 110 | Label: university 111 | 112 | Organization name: imec,Heverlee,Belgium 113 | Paper title: 24.4 A 680nA fully integrated implantable EC 114 | Label: research institute 115 | 116 | Organization name: Applied Science and Technology Research Institute,Hong Kong 117 | Paper title: A 48-mW 18-Gb/s fully integrated CMOS 118 | Label: research institute 119 | 120 | Organization name: North Carolina State Univ.,Raleigh,NC,USA 121 | Paper title: 3Gb/s AC-coupled chip-to- 122 | Label: university 123 | 124 | Organization name: Philips Composants et Semiconducteurs,Caen,France 125 | Paper title: A 12 b 50 M sample/s cascaded 126 | Label: company 127 | 128 | Organization name: Samsung Advanced Logic Lab,Austin,TX 129 | Paper title: High performance and low leakage current InGaAs-on-silicon FinF 130 | Label: company 131 | 132 | Organization name: SoC R&D Center,Semiconductor Company,Toshiba Corp.,Isogo-ku,Yokohama,Japan 133 | Label: company 134 | 135 | Organization name: Incubation Center,Renesas Electronics Corp.,Shimokuzawa,Chuou-ku,Sagamihara, 136 | Label: company 137 | 138 | Organization name: IBM Systems Group,Austin,TX 139 | Paper title: Design of the Power6 Microprocessor 140 | Label: company 141 | 142 | Organization name: APA Optics, Inc., Blaine, MN, USA 143 | Paper title: High performance 0.25 /spl mu/m gate 144 | Label: company 145 | 146 | Organization name: Texas Instruments Inc, Dallas, TX, US 147 | Paper title: Damascene integration of copper and ultra-low-k xerog 148 | Label: company 149 | 150 | Organization name: Cisco Systems, Hong Kong, China 151 | Paper title: Characterizing Electromigration Effects in a 16nm FinFET Process Using a Circuit 152 | Label: company 153 | 154 | Organization name: Toshiba at Albany NanoTech,NY,USA 155 | Paper title: Full metal gate with borderless contact for 
14 nm and beyond 156 | Label: company 157 | 158 | Organization name: Advanced LCD Technology Development Center Company Limited, Yokohama, Kanagawa, Japan 159 | Paper title: Sub-Micron CMOS / 160 | Label: company 161 | 162 | Organization name: IBM Microelectronics, Burlington, VT, USA 163 | Paper title: Large-signal performance of high-BV/sub CEO/ 164 | Label: company 165 | 166 | Organization name: GLOBALFOUNDRIES Inc., Albany, NY, USA 167 | Paper title: Accurate performance evaluation for the horizontal nanosheet 168 | Label: company 169 | 170 | Organization name: MIRAI-ASET,Kawasaki,Japan 171 | Paper title: Strained SOI technology for high-performance, low- 172 | Label: university 173 | 174 | Organization name: IBM Microelectron.,Burlington,VT,USA 175 | Paper title: A 500MHz multi-banked compilable DRAM macro 176 | Label: company 177 | 178 | Organization name: IBM Microelectron.,Hopewell Junction,NY,USA 179 | Paper title: Destructive-read random access memory system buffered 180 | Label: company 181 | 182 | Organization name: Fujitsu Laboratories Ltd., Atsugi, Kanagawa, Japan 183 | Paper title: A 65 nm CMOS technology with a high- 184 | Label: company 185 | 186 | Organization name: Intel,Hillsboro,OR 187 | Paper title: 25.5 A Self-Calibrated 1.2-to-3. 
188 | Label: company 189 | 190 | Organization name: Strategic Technology Group,Advanced Micro Devices,Sunnyvale,CA,USA 191 | Paper title: Collective-effect state variables for post-CM 192 | Label: company 193 | 194 | Organization name: IBM Semiconductor Research and Development Center (SRDC), Samsung Electronics Company Limited, Hopewell Junction, NY, USA 195 | Paper title 196 | Label: company 197 | 198 | Organization name: QRE, Hillsboro, OR, USA 199 | Paper title: An enhanced 130 nm generation logic technology featuring 60 nm transistors optimized for high 200 | Label: company 201 | 202 | Organization name: Portland Technology Development, Hillsboro, OR, USA 203 | Paper title: An enhanced 130 nm generation logic technology featuring 60 nm transistors optimized for high performance and low power at 0.7 - 1.4 V 204 | Label: -------------------------------------------------------------------------------- /example_prompts/systematic_review_inclusion.txt: -------------------------------------------------------------------------------- 1 | Identify whether this paper should be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations. 2 | Included reviews should describe monetary charitable donations, assess any population of participants in any context, and be peer reviewed and written in English. 3 | They should not report new data, be non-systematic reviews, consider cause-related marketing or other kinds of prosocial behaviour. 4 | Possible labels: 5 | 1. included 6 | 2. not included 7 | 8 | Title: Imagine being a nice guy: A note on hypothetical vs. Incentivized social preferences 9 | Abstract: We conducted an experimental study on social preferences using dictator games similar to Fehr et al. (2008). Our results show that social preferences differ between subjects who receive low-stakes monetary rewards for their decisions and subjects who consider hypothetical stakes. 
Our findings indicate that, apart from incentives, gender plays an important role for the categorization of different social preferences. © 2015. The authors license. 10 | Journal: Judgm. Decis. Mak. 11 | Label: not included 12 | 13 | Title: Consumer reaction to price increase: An investigation in gasoline industry 14 | Abstract: Purpose – The aim of this study is to investigate the impact of increase in price of an essential product (i.e. gasoline) toward the focal product and other seemingly non-related products. Design/methodology/approach – A self-administered survey was used to collect data from the drivers at a large metroplex in Southwest USA. Multiple regression and scanning electron microscope procedures were used to analyze and test the proposed hypotheses. Findings – When consumers notice the increase in gas prices, they become very anxious. This anxiety is positively associated with average gas bought in gallons and negatively associated with threshold price. Further, this consumer anxiety has the strongest influence on lifestyle changes, followed by automobile technology change and transportation mode change, and has the weakest influence on gasoline brand/type change. Research limitations/implications – We focus on only anxiety as a mediator between increase in gas prices and the behavioral outcomes, and collect data from only one location. Practical implications – Managers must be cognizant that a price increase in essential goods not only influences the demand for focal products but also for products that may not seem related to the focal products. Social implications – Increase in gasoline price will not only affect the demand for gasoline, but also the demand for alternate forms of transportation, fuel efficient vehicles, and other aspects of life. Originality/value – This study is the first to look at the role of anxiety as a mediator and looks at the effects of increase in gas prices in a holistic manner. © Emerald Group Publishing Limited. 
15 | Journal: J 16 | Label: not included 17 | 18 | Title: Being sticker rich: Numerical context influences children's sharing behavior 19 | Abstract: Young children spontaneously share resources with anonymous recipients, but little is known about the specific circumstances that promote or hinder these prosocial tendencies. Children (ages 3-11) received a small (12) or large (30) number of stickers, and were then given the opportunity to share their windfall with either one or multiple anonymous recipients (Dictator Game). Whether a child chose to share or not varied as a function of age, but was uninfluenced by numerical context. Moreover, children's giving was consistent with a proportion- based account, such that children typically donated a similar proportion (but different absolute number) of the resources given to them, regardless of whether they originally received a small or large windfall. The proportion of resources donated, however, did vary based on the number of recipients with whom they were allowed to share, such that on average, children shared more when there were more recipients available, particularly when they had more resources, suggesting they take others into consideration when making prosocial decisions. Finally, results indicated that a child's gender also predicted sharing behavior, with males generally sharing more resources than females. Together, findings suggest that the numerical contexts under which children are asked to share, as well as the quantity of resources that they have to share, may interact to promote (or hinder) altruistic behaviors throughout childhood. © 2015 Posid et al. 
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author 20 | Label: not included 21 | 22 | Title: Public charity offer as a proximate factor of evolved reputation-building strategy: an experimental analysis of a real-life situation 23 | Abstract: Although theoretical considerations suggest that a considerable portion of human altruism is driven by concerns about reputation, few experimental studies have examined the psychological correlates of individual decisions in real-life situations. Here we demonstrate that more subjects were willing to give assistance to unfamiliar people in need if they could make their charity offers in the presence of their group mates than in a situation where the offers remained concealed from others. In return, those who were willing to participate in a particular charitable activity received significantly higher scores than others on scales measuring sympathy and trustworthiness. Finally, a multiple regression analysis revealed that while several personality and behavior traits (cooperative ability, Machiavellianism, sensitivity to norms, and sex) play a role in the development of prosocial behavior, the possibility of gaining reputation within the group remains a measurable determinant of charitable behavior. © 2007 Elsevier Inc. All rights reserved. 24 | Journal: Evol. Hum. Behav. 25 | Label: not included 26 | 27 | Title: Assessing actual strategic behavior to construct a measure of strategic ability 28 | Abstract: Strategic interactions have been studied extensively in the area of judgment and decision-making. However, so far no specific measure of a decision-maker's ability to be successful in strategic interactions has been proposed and tested. Our contribution is the development of a measure of strategic ability that borrows from both game theory and psychology. 
Such measure is aimed at providing an estimation of the likelihood of success in many social activities that involve strategic interaction among multiple decision-makers. To construct a reliable measure of strategic ability, that we propose to call "Strategic Quotient" (SQ), we designed a test where each item is a game and where, therefore, the individual obtained score depends on the distribution of choices of other decision-makers taking the test. The test is designed to provide information on the abilities related to two dimensions, mentalization and rationality, that we argue are crucial to strategic success, with each dimension being characterized by two main factors. Principal component analysis on preliminary data shows that indeed four factors (two for rationality, two for mentalization) account for strategic success in most of the strategically simpler games of the test. Moreover, two more strategically sophisticated games are inserted in the test and are used to investigate if and to what extent the four factors obtained by simpler games can predict strategic success in more sophisticated strategic interactions. Overall, the collected empirical evidence points to the possibility of building a SQ measure using only simple games designed to capture information about the four identified factors. © 2019 Bilancini, Boncinelli and Mattiassi. 29 | Journal: Front. 30 | Label: not included 31 | 32 | Title: How construals of money versus time impact consumer charitable giving 33 | Abstract: While past research has suggested that consumers have fundamentally different responses to thinking about money versus time, the current work clarifies an important nuance in terms of how consumers construe these two resources. We demonstrate that, in the domain of charitable giving, money is construed relatively more concretely, whereas time is construed relatively more abstractly. 
This difference in the construal of these two resources has implications for how appeals for charitable contributions or money versus time should be framed. When the construal level at which the consumer considers the cause is aligned (misaligned) with the construal level of the resource being requested, contribution intentions and behaviors increase (decrease). In addition, the moderating role of resource abundance is examined. In particular, when money is considered abundant (vs. nonabundant), consumers no longer exhibit more concrete thoughts in response to money compared to time. Finally, when the donation request makes consumers think of money in a more abundant manner, monetary donations can be successfully motivated with a more abstract call for charitable support. The theoretical and practical implications for marketers and charitable organizations are discussed. © The Author 2015. 34 | Journal: J. Consum. Res. 35 | Label: -------------------------------------------------------------------------------- /example_prompts/tai_safety_research.txt: -------------------------------------------------------------------------------- 1 | Transformative AI (TAI) is defined as AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution. Label a paper as "TAI safety research" if: 2 | 1. The contents of the paper are directly motivated by, and substantively inform, the challenge of ensuring good outcomes for TAI, 3 | 2. There is substantive content on AI safety, not just AI capabilities, 4 | 3. The intended audience is the community of researchers, 5 | 4. It meets a subjective threshold of seriousness/quality, 6 | 5. Peer review is not required. 7 | Possible labels: 8 | 1. TAI safety research 9 | 2. 
not TAI safety research 10 | 11 | Title: One Decade of Universal Artificial Intelligence 12 | Abstract Note: The first decade of this century has seen the nascency of the first mathematical theory of general artificial intelligence. This theory of Universal Artificial Intelligence (UAI) has made significant contributions to many theoretical, philosophical, and practical AI questions. In a series of papers culminating in book (Hutter, 2005), an exciting sound and complete mathematical model for a super intelligent agent (AIXI) has been developed and rigorously analyzed. While nowadays most AI researchers avoid discussing intelligence, the award-winning PhD thesis (Legg, 2008) provided the philosophical embedding and investigated the UAI-based universal measure of rational intelligence, which is formal, objective and non-anthropocentric. Recently, effective approximations of AIXI have been derived and experimentally investigated in JAIR paper (Veness et al. 2011). This practical breakthrough has resulted in some impressive applications, finally muting earlier critique that UAI is only a theory. For the first time, without providing any domain knowledge, the same agent is able to self-adapt to a diverse range of interactive environments. For instance, AIXI is able to learn from scratch to play TicTacToe, Pacman, Kuhn Poker, and other games by trial and error, without even providing the rules of the games. These achievements give new hope that the grand goal of Artificial General Intelligence is not elusive. This article provides an informal overview of UAI in context. It attempts to gently introduce a very 13 | Label: TAI safety research 14 | 15 | Title: Coherence arguments do not imply goal-directed behavior 16 | Abstract Note: One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. 
If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here. We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the agent doesn’t fix the failures because it would not be worth it -- in this case, the argument says that we will not be able to notice any exploitable failures.) Taken together, these arguments suggest that we should model an agent much smarter than us as an expected utility (EU) maximizer. And many people agree that EU maximizers are dangerous. So does this mean we’re doomed? I don’t think so: it seems to me that the problems about EU maximizers that we’ve identified are actually about goal-directed behavior or explicit reward maximizers. The coherence theorems say nothing about whether an AI system must look like one of these 17 | Label: TAI safety research 18 | 19 | Title: Teaching A.I. Systems to Behave Themselves (Published 2017) 20 | Abstract Note: As philosophers and pundits worry that artificial intelligence will one day harm the world, some researchers are working on ways to lower the risks. 21 | Publication Title: The New York Times 22 | Item Type: newspaperArticle 23 | Publication Year: 2017 24 | Label: not TAI safety research 25 | 26 | Title: Advancing rational analysis to the algorithmic level 27 | Abstract Note: Abstract The commentaries raised questions about normativity, human rationality, cognitive architectures, cognitive constraints, and the scope or resource rational analysis (RRA). 
We respond to these questions and clarify that RRA is a methodological advance that extends the scope of rational modeling to understanding cognitive processes, why they differ between people, why they change over time, and how they could be improved. 28 | Publication Title: Behavioral and Brain Sciences 29 | Item Type: journalArticle 30 | Publication Year: 2020 31 | Label: not TAI safety research 32 | 33 | Title: The Role and Limits of Principles in AI Ethics: Towards a Focus on Tensions 34 | Abstract Note: The last few years have seen a proliferation of principles for AI ethics. There is substantial overlap between different sets of principles, with widespread agreement that AI should be used for the common good, should not be used to harm people or undermine their rights, and should respect widely held values such as fairness, privacy, and autonomy. While articulating and agreeing on principles is important, it is only a starting point. Drawing on comparisons with the field of bioethics, we highlight some of the limitations of principles: in particular, they are often too broad and high-level to guide ethics in practice. We suggest that an important next step for the field of AI ethics is to focus on exploring the tensions that inevitably arise as we try to implement principles in practice. By explicitly recognising these tensions we can begin to make decisions about how they should be resolved in specific cases, and develop frameworks and guidelines for AI ethics that are rigorous and practically relevant. We discuss some different specific ways that tensions arise in AI ethics, and what processes might be needed to resolve them. 
35 | Publication Title: AIES '19: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society 36 | Item Type: conferencePaper 37 | Publication Year: 2019 38 | Label: TAI safety research 39 | 40 | Title: Computer Simulations as a Technological Singularity in the Empirical Sciences 41 | Abstract Note: Summary: In this paper, I discuss the conditions necessary for computer simulations to qualify as a technological singularity in the empirical sciences. A technological singularity encompasses two claims: (a) the enhancement of human cognitive capacities by the computer, and (b) their displacement from the center of the production of knowledge. For computer simulations to be a technological singularity, then, they must fulfill points (a) and (b) above. Although point (a) is relatively unproblematic, point (b) needs further analysis. In particular, in order to show that humans could be displaced from the center of the production of knowledge, it is necessary to establish the reliability of computer simulations. That is, I need to show that computer simulations are reliable processes that render, most of the time, valid results. To be a reliable process, in turn, means that simulations accurately represent the target system and carry out error-free computations. I analyze verification and validation methods as the grounds for such representation accuracy and error-free computations. Since the aim is to entrench computer simulations as a technological singularity, the entire analysis must be careful to keep human agents out of the picture.
42 | Publication Title: The Technological Singularity: Managing the Journey 43 | Item Type: bookSection 44 | Publication Year: 2017 45 | Label: -------------------------------------------------------------------------------- /example_prompts/terms_of_service.txt: -------------------------------------------------------------------------------- 1 | Label the sentence from a Terms of Service based on whether it is potentially unfair. If it seems clearly unfair, mark it as potentially unfair. 2 | According to art. 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts, a contractual term is unfair if: 1) it has not been individually negotiated; and 2) contrary to the requirement of good faith, it causes a significant imbalance in the parties rights and obligations, to the detriment of the consumer. 3 | Details on types of potentially unfair clauses are found below: 4 | The jurisdiction clause stipulates what courts will have the competence to adjudicate disputes under the contract. Jurisdiction clauses giving consumers a right to bring disputes in their place of residence were marked as clearly fair, whereas clauses stating that any judicial proceeding takes a residence away were marked as clearly unfair. 5 | The choice of law clause specifies what law will govern the contract, meaning also what law will be applied in potential adjudication of a dispute arising under the contract. Clauses defining the applicable law as the law of the consumer's country of residence were marked as clearly fair. In every other case, the choice of law clause was considered as potentially unfair. 6 | The limitation of liability clause stipulates that the duty to pay damages is limited or excluded, for certain kind of losses, under certain conditions. Clauses that explicitly affirm non-excludable providers' liabilities were marked as clearly fair. 
Clauses that reduce, limit, or exclude the liability of the service provider were marked as potentially unfair when concerning broad categories of losses or causes of them. 7 | The unilateral change clause specifies the conditions under which the service provider could amend and modify the terms of service and/or the service itself. Such a clause was always considered as potentially unfair. 8 | The unilateral termination clause gives the provider the right to suspend and/or terminate the service and/or the contract, and sometimes details the circumstances under which the provider claims to have a right to do so. 9 | The contract by using clause stipulates that the consumer is bound by the terms of use of a specific service, simply by using the service, without even being required to mark that he or she has read and accepted them. We always marked such clauses as potentially unfair. 10 | The content removal clause gives the provider a right to modify/delete user's content, including in-app purchases, and sometimes specifies the conditions under which the service provider may do so. 11 | The arbitration clause requires or allows the parties to resolve their disputes through an arbitration process, before the case could go to court. Clauses stipulating that the arbitration should take place in a state other than the state of consumer's residence or be based on arbiter's discretion were marked as clearly unfair. Clauses defining arbitration as fully optional were marked as clearly fair. 12 | Possible labels: 13 | 1. not potentially unfair 14 | 2. potentially unfair 15 | 16 | Sentence: You acknowledge and agree that posting any such user content may result in immediate termination or suspension of your spotify account. 17 | Label: potentially unfair 18 | 19 | Sentence: You may be able to access certain third-party links, applications or content (``third-party applications'') via accounts related to our services.
20 | Label: not potentially unfair 21 | 22 | Sentence: These terms, and any other policies or rules we reference in these terms, make up the entire agreement between you and us relating to the subject matter of these terms, and supersede all prior understandings of the parties relating to the subject matter of these terms, whether those prior understandings were electronic, oral or written, or whether established by custom, practice, policy or precedent, between you and us. 23 | Label: not potentially unfair 24 | 25 | Sentence: Supercell may reject, refuse to post or delete any user content for any or no reason, including, but not limited to, user content that in the sole judgment of supercell violates these terms of service. 26 | Label: potentially unfair 27 | 28 | Sentence: Tinder does not represent or warrant that (a) the service will be uninterrupted, secure or error free, (b) any defects or errors in the service will be corrected, or (c) that any content or information you obtain on or through the services will be accurate. 29 | Label: potentially unfair 30 | 31 | Sentence: There are no refunds for termination or cancellation of your premium service. 32 | Label: -------------------------------------------------------------------------------- /example_prompts/tweet_eval_hate.txt: -------------------------------------------------------------------------------- 1 | Label whether the following tweet contains hate speech against either immigrants or women. Hate Speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics. 2 | Possible labels: 3 | 1. hate speech 4 | 2. 
not hate speech 5 | 6 | Tweet: International migration from sub-Saharan Africa has grown dramatically since 2010 7 | Label: not hate speech 8 | 9 | Tweet: World Relief is continuing to provide legal aid and advocating for immigrant families who have been separated at the U.S.-Mexico border. You can 10 | Label: not hate speech 11 | 12 | Tweet: The Ethiopia-Eritrea rapprochement: what does it mean for refugees? 13 | Label: not hate speech 14 | 15 | Tweet: It's clear that the administration's attempts to punish local governments for refusing to cooperate with immigration enforcement are unconstitutional. The administration should focus on treating 16 | Label: not hate speech 17 | 18 | Tweet: The first sexual attack against a woman happened in Hungary. The criminal is a "legal" afghan rapefugee. This is how Or 19 | Label: hate speech 20 | 21 | Tweet: Syria's military continues to advance in the south west 22 | Label: not hate speech 23 | 24 | Tweet: President Trump Weekly Address – Immigration Reform Needed… 25 | Label: not hate speech 26 | 27 | Tweet: They do not need them, I do not need them nor do I require them to enrich my life. #Stoptheinvasion #Brexit # 28 | Label: hate speech 29 | 30 | Tweet: Why weren't the democrats in the streets marching for the separated families of people like Jamiel Shaw when he was separated from his son forever 31 | Label: hate speech 32 | 33 | Tweet: UN wants ‘immediate halt’ to separation of migrant families in US #Trumpadministration #UnitedNations... 34 | Label: not hate speech 35 | 36 | Tweet: Anybody can dig a hole and plant a tree. But to make sure it survives, you have to nurture it, you have to suffer 37 | Label: not hate speech 38 | 39 | Tweet: Talent- Meeting up with complete strangers for sex 40 | Label: not hate speech 41 | 42 | Tweet: Important note: not all women do this and tale responsibility for their actions, just like not all men are assholes.
43 | Label: not hate speech 44 | 45 | Tweet: Its time - to stop the immigration - to USA ! Specially from 40 muslim countries. The president named it: SHITHOLE- 46 | Label: hate speech 47 | 48 | Tweet: UK Pensioner Faces 350 Lashes In Saudi Arabia why does this country exist it does nothing for migrants picks on old men no help from anyone 49 | Label: not hate speech 50 | 51 | Tweet: This is outrageous! But, sadly, this kind of corruption is what we've gotten used to with our current administration! 😠 52 | Label: not hate speech 53 | 54 | Tweet: Germany Navy rescued 22.000 refugees in the middlesea. Germany has not enough refugees, it wants more! by state duty. The 55 | Label: not hate speech 56 | 57 | Tweet: Women want u to automatically believe women who scream rape they don't understand our position.... 58 | Label: hate speech 59 | 60 | Tweet: | Woman who accused Nelly of rape wants to drop charges 61 | Label: not hate speech 62 | 63 | Tweet: I didn’t realize girls who wear huge bows in their hair called other women “cunt”. “Northern Bell 64 | Label: not hate speech 65 | 66 | Tweet: We are excited to host the #ImmigrantStories film screening and panel today with @user to show how the Trump administration’s 67 | Label: not hate speech 68 | 69 | Tweet: Stop trying to please everyone, you're not my cock... Please yourself 70 | Label: not hate speech 71 | 72 | Tweet: New to Twitter-- any men on here know what the process is to get #verified? 73 | Label: not hate speech 74 | 75 | Tweet: US Navy Seal, which killed Osama wrote a book. It tells how this specialforces work today. 76 | Label: not hate speech 77 | 78 | Tweet: Jennifer Lawrence is complaining about being a whore, meanwhile she's punching old Asian women in face who are sewing her costumes for her 79 | Label: not hate speech 80 | 81 | Tweet: “When Mexico sends its people, they’re not sending their best... They’re bringing drugs. 
They’re bringing 82 | Label: hate speech 83 | 84 | Tweet: Germany: Islamic-inspired children’s park in Berlin to introduce children to Islam and the Islamic culture of the refugees - Free Speech Time - 85 | Label: hate speech 86 | 87 | Tweet: HOLLERINGGGGGGG. stupid baby bitch. 88 | Label: hate speech 89 | 90 | Tweet: Homeland Security says he is a REFUGEE. Iraqi refugee charged with attempted murder of Colorado police officer - - @user 91 | Label: not hate speech 92 | 93 | Tweet: cormacs such a cunt won’t even let me squeeze his blackheads pussy bitch 94 | Label: hate speech 95 | 96 | Tweet: How keyboard gangsters feel leaving their stupid ass comments 97 | Label: not hate speech 98 | 99 | Tweet: HOW REFRESHING! In South Korea, there is no such thing as 'political correctness" when it comes to dealing with Muslim refugee w 100 | Label: hate speech 101 | 102 | Tweet: Curious to know how @user delivers emergency food supplies to #refugees and other vulnerable people around the world? Watch this video: 103 | Label: not hate speech 104 | 105 | Tweet: Pope Francis criticizes policy of migrant family separation via @user Stick to religion Mr. Red Pope... 106 | Label: not hate speech 107 | 108 | Tweet: I hate bitches that are always mad at the world for no reason like bitch stfu and be happy hoe 109 | Label: hate speech 110 | 111 | Tweet: #TakeAKnee trump's a chump! @user is the real deal-helping so many on his journey through greatness! Even 112 | Label: not hate speech 113 | 114 | Tweet: RT @user Her:I don't get what u want outta this relationship Him:Well, I was only looking for a bj 115 | Label: not hate speech 116 | 117 | Tweet: @user @user BUT. I would bet you're not looking for facts. 
That's why you obfuscate.President Trump at NATO: 118 | Label: hate speech 119 | 120 | Tweet: @user @user @user Imagine THE SIZEof THAT RED WAVE !!GOD BLESS TRUMP!Build That Wall#BuildThatWall 121 | Label: hate speech 122 | 123 | Tweet: @user @user 2 hysterical women in the US Senate 124 | Label: hate speech 125 | 126 | Tweet: @user @user Just choke on your salad sandwich already. Those Men did not die during the Gulf War and do you really think that it 127 | Label: not hate speech 128 | 129 | Tweet: @user Noely I see you've become hysterical since I last saw you. Don't blam you one bit, you're in 130 | Label: not hate speech 131 | 132 | Tweet: @user Exactly! Why would Americans vote for anyone that supports these absurd policies? IMO it is treason and self destruction to vote for politicians that 133 | Label: hate speech 134 | 135 | Tweet: @user Going to be where no one will leave the house or even look at a woman because she will claim “rape”.… 136 | Label: hate speech 137 | 138 | Tweet: Dont worry @user you are and will always be the most hysterical woman. 139 | Label: hate speech 140 | 141 | Tweet: @user thanks for ruining a beloved song from “The Sound of Music” by using that extremely creepy electronic version in your commercial. 142 | Label: not hate speech 143 | 144 | Tweet: @user mmm perfect body and so am i am now come here girl i got a big cock for u to play with ;) 145 | Label: hate speech 146 | 147 | Tweet: @user One man one vote right?Yet here we have one man assuming he can demand the importation of millions of non citizens for HIS 148 | Label: hate speech 149 | 150 | Tweet: @user It's kind of funny, really. Were the space aliens illegal? Now we need a wall & a roof! 
I think this country 151 | Label: hate speech 152 | 153 | Tweet: @user Skank traitor bitch @user 154 | Label: hate speech 155 | 156 | Tweet: @user id marry this fukin whore,& let the bitch behind her be best lady at the wedding 157 | Label: -------------------------------------------------------------------------------- /example_prompts/twitter_complaints.txt: -------------------------------------------------------------------------------- 1 | A complaint presents a state of affairs which breaches the writer’s favorable expectation. Label the tweet text based on whether it contains a complaint. 2 | Possible labels: 3 | 1. complaint 4 | 2. no complaint 5 | 6 | Tweet text: @NCIS_CBS https://t.co/eeVL9Eu3bE 7 | Label: no complaint 8 | 9 | Tweet text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude/condescending. I'll take my $$ to @Sephora 10 | Label: complaint 11 | 12 | Tweet text: @Lin_Manuel @jmessinaphoto @VAMNit Omg a little squish!!!!! Enjoy and congrats!!!! I miss mine being so young! ������ 13 | Label: no complaint 14 | 15 | Tweet text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp… https://t.co/WRtNsokblG 16 | Label: no complaint 17 | 18 | Tweet text: @JetBlue Completely understand but would prefer being on time to filling out forms.... 19 | Label: no complaint 20 | 21 | Tweet text: @DIRECTV can I get a monthly charge double refund when it sprinkles outside and we lose reception? #IamEmbarrasedForYou 22 | Label: complaint 23 | 24 | Tweet text: I'm earning points with #CricketRewards https://t.co/GfpGhqqnhE 25 | Label: no complaint 26 | 27 | Tweet text: Looks tasty! 
Going to share with everyone I know #FebrezeONE #sponsored https://t.co/4AQI53npei 28 | Label: no complaint 29 | 30 | Tweet text: @Schrapnel @comcast RIP me 31 | Label: no complaint 32 | 33 | Tweet text: @VerizonSupport all of a sudden I can't connect to my primary wireless network but guest one works 34 | Label: no complaint 35 | 36 | Tweet text: Just posted a photo https://t.co/RShFwCjPHu 37 | Label: no complaint 38 | 39 | Tweet text: @greateranglia Could I ask why the Area in front of BIC Station was not gritted withh all the snow. 40 | Label: complaint 41 | 42 | Tweet text: @IanJamesPoulter What's your secret to poaching eggs? Mine NEVER look that good. 43 | Label: no complaint 44 | 45 | Tweet text: Aaaahhhhh!!!! My @Razer @PlayOverwatch d.va meka headset came in!!! I didn't even know it had shipped!!! So excited… https://t.co/4gXy9xED8d 46 | Label: no complaint 47 | 48 | Tweet text: @asblough Yep! It should send you a notification with your driver’s name and what time they’ll be showing up! 49 | Label: no complaint 50 | 51 | Tweet text: @NortonSupport Thanks much. 52 | Label: no complaint 53 | 54 | Tweet text: I just gave 5 stars to Tracee at @neimanmarcus for the great service I received! 55 | Label: no complaint 56 | 57 | Tweet text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService 58 | Label: complaint 59 | 60 | Tweet text: @AlfaRomeoCares Hi thanks for replying, could be my internet but link doesn't seem to be working 61 | Label: complaint 62 | 63 | Tweet text: @HMRCcustomers No this is my first job 64 | Label: no complaint 65 | 66 | Tweet text: @JenniferTilly Merry Christmas to as well. 
You get more stunning every year �� 67 | Label: no complaint 68 | 69 | Tweet text: @SouthwestAir I love you but when sending me flight changes please don't use military time #ignoranceisbliss 70 | Label: complaint 71 | 72 | Tweet text: @NortonSupport @NortonOnline What the hell is a dm 5-10 days to get money back bank account now overdrawn thanks guys 73 | Label: complaint 74 | 75 | Tweet text: @ZARA_Care I've been waiting on a reply to my tweets and DMs for days now? 76 | Label: complaint 77 | 78 | Tweet text: @TopmanAskUs please just give me my money back. 79 | Label: complaint 80 | 81 | Tweet text: @BurberryService Thanks for sending my Christmas present with the security protection still on it! pic.twitter.com/iI0DUxOUU2 82 | Label: -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | scipy 2 | scikit-learn==0.24.2 3 | datasets 4 | transformers 5 | openai 6 | sacred 7 | python-dotenv 8 | cachetools 9 | torch 10 | sentence-transformers 11 | ipykernel 12 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import find_packages, setup 2 | 3 | 4 | setup( 5 | name="raft_baselines", 6 | version="0.0.1", 7 | description="RAFT Benchmarks baselines, classifiers, and testing scripts.", 8 | python_requires=">=3.7.0", 9 | packages=find_packages("src"), 10 | package_dir={"": "src"}, 11 | install_requires=[], 12 | ) 13 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oughtinc/raft-baselines/cc04aee9c8a8cbfad431cce044abe76993bfb0f7/src/__init__.py -------------------------------------------------------------------------------- 
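Before the classifier sources, a brief illustration of the core trick the GPT-3 baseline in this repo relies on: the API's `top_logprobs` for the first completion token are converted into per-class probabilities by summing `exp(logprob)` over candidate tokens that begin a class label, then renormalizing. The sketch below is not part of the repository and all names in it are illustrative; it also approximates the token-id comparison done in `gpt3_classifier.py` with a simple string-prefix check.

```python
import math

def raw_probabilities(top_logprobs, classes):
    # For each class, accumulate exp(logprob) over every candidate token
    # that could begin that class label. A leading space is expected on
    # the first completion token, mirroring the prompt format. The real
    # classifier compares token ids (to smooth over "Ġ"-style tokenizer
    # differences); this sketch approximates that with a prefix check.
    raw = []
    for clas in classes:
        target = " " + clas
        raw.append(
            sum(
                math.exp(lp)
                for token, lp in top_logprobs.items()
                if token and target.startswith(token)
            )
        )
    return raw

def normalize(raw):
    # Renormalize so the class probabilities sum to 1; fall back to a
    # uniform distribution if no candidate token matched any class.
    total = sum(raw)
    return [p / total for p in raw] if total > 0 else [1.0 / len(raw)] * len(raw)

# Toy top_logprobs for the first completion token, as the API might return them.
top_logprobs = {" comp": math.log(0.6), " no": math.log(0.3), " the": math.log(0.1)}
probs = normalize(raw_probabilities(top_logprobs, ["complaint", "no complaint"]))
```

With the toy logprobs above, the unmatched token " the" is dropped and `probs` renormalizes the remaining mass to roughly [0.667, 0.333].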
/src/raft_baselines/classifiers/__init__.py: -------------------------------------------------------------------------------- 1 | from .random_classifier import RandomClassifier 2 | from .gpt3_classifier import GPT3Classifier 3 | from .transformers_causal_lm_classifier import TransformersCausalLMClassifier 4 | from .naive_bayes_classifier import NaiveBayesClassifier 5 | from .svm_classifier import SVMClassifier 6 | from .adaboost_classifier import AdaBoostClassifier 7 | from .zero_shot_transformers_classifier import TransformersZeroShotPipelineClassifier 8 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/adaboost_classifier.py: -------------------------------------------------------------------------------- 1 | from copy import deepcopy 2 | 3 | from sklearn.ensemble import AdaBoostClassifier as AdaBoost 4 | from sklearn.tree import DecisionTreeClassifier 5 | 6 | from raft_baselines.classifiers.n_grams_classifier import NGramsClassifier 7 | 8 | 9 | class AdaBoostClassifier(NGramsClassifier): 10 | def __init__( 11 | self, training_data, vectorizer_kwargs=None, model_kwargs=None, **kwargs 12 | ): 13 | super().__init__(training_data, vectorizer_kwargs, model_kwargs, **kwargs) 14 | if model_kwargs is None: 15 | model_kwargs = {} 16 | if "max_depth" in model_kwargs: 17 | # Required for sacred 18 | model_kwargs = deepcopy(model_kwargs) 19 | d = model_kwargs.pop("max_depth") 20 | base = DecisionTreeClassifier(max_depth=d) 21 | model_kwargs["base_estimator"] = base 22 | self.classifier = AdaBoost(**model_kwargs) 23 | self.classifier.fit(self.vectorized_training_data, self.training_data["Label"]) 24 | 25 | def _classify(self, vector_input): 26 | return self.classifier.predict_proba(vector_input) 27 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/classifier.py: -------------------------------------------------------------------------------- 1 | 
import datasets 2 | from typing import Callable, List, Mapping, Dict, Optional 3 | from abc import ABC, abstractmethod 4 | 5 | 6 | class Classifier(ABC): 7 | def __init__(self, training_data: datasets.Dataset) -> None: 8 | self.training_data: datasets.Dataset = training_data 9 | 10 | self.class_col: str = "Label" 11 | self.class_label_to_string: Callable[[int], str] = training_data.features[ 12 | "Label" 13 | ].int2str 14 | self.classes: List[str] = list(training_data.features["Label"].names[1:]) 15 | self.input_cols: List[str] = [ 16 | col for col in training_data.features if col not in ("ID", "Label") 17 | ] 18 | 19 | @abstractmethod 20 | def classify( 21 | self, 22 | target: Mapping[str, str], 23 | random_seed: Optional[int] = None, 24 | should_print_prompt: bool = False, 25 | ) -> Dict[str, float]: 26 | """ 27 | :param target: Dict input with fields and natural language data within those fields. 28 | :return: Dict where the keys are class names and the values are probabilities. 29 | """ 30 | raise NotImplementedError 31 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/gpt3_classifier.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, Optional, List, Mapping 2 | 3 | import numpy as np 4 | import datasets 5 | 6 | from raft_baselines.classifiers.in_context_classifier import InContextClassifier 7 | from raft_baselines.utils.gpt3_utils import ( 8 | complete, 9 | search, 10 | ) 11 | from raft_baselines.utils.tokenizers import TransformersTokenizer 12 | 13 | GPT3_MAX_TOKENS = 2048 14 | tokenizer = TransformersTokenizer("gpt2") 15 | 16 | 17 | class GPT3Classifier(InContextClassifier): 18 | def __init__( 19 | self, 20 | *args, 21 | engine: str = "ada", 22 | search_engine: str = "ada", 23 | **kwargs, 24 | ) -> None: 25 | super().__init__( 26 | *args, 27 | tokenizer=tokenizer, 28 | max_tokens=GPT3_MAX_TOKENS, 29 | **kwargs, 30 | ) 31 | 32 | 
self.engine: str = engine 33 | self.search_engine: str = search_engine 34 | 35 | def semantically_select_training_examples( 36 | self, target: Mapping[str, str] 37 | ) -> datasets.Dataset: 38 | formatted_examples_without_labels = tuple( 39 | self.format_dict( 40 | {col: row[col] for col in self.input_cols if col in row}, 41 | ) 42 | for row in self.training_data 43 | ) 44 | 45 | search_results = search( 46 | formatted_examples_without_labels, 47 | self.format_dict(target), 48 | self.search_engine, 49 | ) 50 | 51 | sorted_indices = list( 52 | map( 53 | lambda result: result["document"], # type: ignore 54 | sorted( 55 | search_results, 56 | key=lambda result: -result["score"], # type: ignore 57 | ), 58 | ) 59 | ) 60 | 61 | return self.training_data.select( 62 | list(reversed(sorted_indices[: self.num_prompt_training_examples])) 63 | ) 64 | 65 | def does_token_match_class(self, token: str, clas: str) -> bool: 66 | # prepend a space to the class label 67 | # because we always expect a leading space in the first token 68 | # returned from the OpenAI API, given our prompt format 69 | clas_str = ( 70 | f" {clas}" if not self.add_prefixes else f" {self.classes.index(clas) + 1}" 71 | ) 72 | 73 | clas_first_token_id: int = self.tokenizer(clas_str)["input_ids"][0] 74 | token_id: int = self.tokenizer(token)["input_ids"][0] 75 | 76 | # Compare token ids rather than the raw tokens 77 | # because GPT2TokenizerFast represents some special characters 78 | # differently from the GPT-3 API 79 | # (e.g. the space at the beginning of the token is " " according to the API, 80 | # but "Ġ" according to the tokenizer). 81 | # Standardizing to token ids is one easy way to smooth over that difference.
82 | return clas_first_token_id == token_id 83 | 84 | def _get_raw_probabilities( 85 | self, 86 | prompt: str, 87 | ) -> List[float]: 88 | response = complete( 89 | prompt, 90 | temperature=0.0, 91 | engine=self.engine, 92 | max_tokens=1, 93 | ) 94 | logprobs: Dict[str, float] = response["choices"][0]["logprobs"]["top_logprobs"][ 95 | 0 96 | ] 97 | 98 | raw_p = [] 99 | for clas in self.classes: 100 | p = 0.0 101 | for token in logprobs.keys(): 102 | if self.does_token_match_class(token, clas): 103 | p += np.exp(logprobs[token]) 104 | raw_p.append(p) 105 | 106 | return raw_p 107 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/in_context_classifier.py: -------------------------------------------------------------------------------- 1 | from abc import abstractmethod 2 | import random 3 | from typing import Dict, Optional, List, Tuple, Mapping, Any 4 | from collections import defaultdict 5 | import json 6 | import importlib.resources 7 | 8 | import numpy as np 9 | import datasets 10 | 11 | from raft_baselines.classifiers.classifier import Classifier 12 | from raft_baselines import data 13 | from raft_baselines.utils.tokenizers import Tokenizer 14 | 15 | text_data = importlib.resources.read_text( 16 | data, "prompt_construction_settings.jsonl" 17 | ).split("\n") 18 | FIELD_ORDERING = json.loads(text_data[0]) 19 | INSTRUCTIONS = json.loads(text_data[1]) 20 | 21 | 22 | class InContextClassifier(Classifier): 23 | separator: str = "\n\n" 24 | 25 | def __init__( 26 | self, 27 | training_data: datasets.Dataset, 28 | num_prompt_training_examples: int = 20, 29 | add_prefixes: bool = False, 30 | config: str = None, 31 | use_task_specific_instructions: bool = True, 32 | do_semantic_selection: bool = True, 33 | tokenizer: Tokenizer = None, 34 | max_tokens: int = 2048, 35 | ) -> None: 36 | super().__init__(training_data) 37 | 38 | self.num_prompt_training_examples: int = num_prompt_training_examples 39 | 
self.add_prefixes: bool = add_prefixes 40 | 41 | if config: 42 | self.config: str = config 43 | self.input_cols: List[str] = FIELD_ORDERING[config] 44 | self.instructions_start: str = "Possible labels:" 45 | if use_task_specific_instructions: 46 | self.instructions_start = ( 47 | INSTRUCTIONS[config] + "\n" + self.instructions_start 48 | ) 49 | 50 | self.do_semantic_selection: bool = do_semantic_selection 51 | 52 | self.tokenizer = tokenizer 53 | self.truncation_params: Mapping[str, Any] = { 54 | # max - buffer - completion tokens 55 | "max_tokens": max_tokens - 10 - 1, 56 | "end_example_token_proportion": max( 57 | 0.25, 58 | 1 59 | / (1 + min(self.num_prompt_training_examples, len(self.training_data))), 60 | ) 61 | if self.num_prompt_training_examples is not None 62 | else 0.25, 63 | } 64 | 65 | @property 66 | def instructions(self) -> str: 67 | formatted_classes = "\n".join( 68 | [f"{idx + 1}. {clas}" for idx, clas in enumerate(self.classes)] 69 | ) 70 | return f"""{self.instructions_start}\n{formatted_classes}""" 71 | 72 | def max_example_lengths( 73 | self, num_training_examples: int, input_to_classify: Mapping[str, str] 74 | ) -> Tuple[int, int]: 75 | instruction_tokens = self.tokenizer.num_tokens(self.instructions) 76 | separator_tokens = (num_training_examples + 1) * len(self.separator) 77 | max_example_tokens = ( 78 | self.truncation_params["max_tokens"] - instruction_tokens - separator_tokens 79 | ) 80 | 81 | untruncated_end_example_tokens = self.tokenizer.num_tokens( 82 | self.format_prompt_end(input_to_classify) 83 | ) 84 | max_end_example_tokens = min( 85 | untruncated_end_example_tokens, 86 | int( 87 | max_example_tokens 88 | * self.truncation_params["end_example_token_proportion"] 89 | ), 90 | ) 91 | max_train_example_tokens = ( 92 | int((max_example_tokens - max_end_example_tokens) / num_training_examples) 93 | if num_training_examples > 0 94 | else 0 95 | ) 96 | 97 | return max_end_example_tokens, max_train_example_tokens 98 | 99 | @classmethod 100 
| def format_dict(cls, example: Mapping[str, str]) -> str: 101 | return "\n".join( 102 | [f"{k}: {v}" for k, v in example.items() if len(str(v).strip())] 103 | ) 104 | 105 | def format_prompt_end( 106 | self, target: Mapping[str, str], max_tokens: Optional[int] = None 107 | ) -> str: 108 | output_block = f"{self.class_col}:" 109 | output_block_tokens = self.tokenizer.num_tokens(output_block) 110 | untruncated_text = self.format_dict(target) 111 | input_block = ( 112 | untruncated_text 113 | if max_tokens is None 114 | else self.tokenizer.truncate_by_tokens( 115 | untruncated_text, max_tokens - output_block_tokens - 1 116 | ) 117 | ) 118 | return f"""{input_block} 119 | {output_block}""" 120 | 121 | def format_example( 122 | self, example: Mapping[str, str], clas: str, max_tokens: Optional[int] = None 123 | ) -> str: 124 | clas_str = ( 125 | clas if not self.add_prefixes else f"{self.classes.index(clas) + 1}. {clas}" 126 | ) 127 | output_block = f"{self.class_col}: {clas_str}" 128 | output_block = ( 129 | output_block 130 | if max_tokens is None 131 | else self.tokenizer.truncate_by_tokens(output_block, max_tokens - 2) 132 | ) 133 | output_block_tokens = self.tokenizer.num_tokens(output_block) 134 | untruncated_text = self.format_dict(example) 135 | input_block = ( 136 | untruncated_text 137 | if max_tokens is None 138 | else self.tokenizer.truncate_by_tokens( 139 | untruncated_text, max_tokens - output_block_tokens - 1 140 | ) 141 | ) 142 | return f"""{input_block} 143 | {output_block}""" 144 | 145 | def render_examples( 146 | self, 147 | example_dataset: datasets.Dataset, 148 | max_tokens_per_example: Optional[int] = None, 149 | ) -> str: 150 | formatted_examples = [ 151 | self.format_example( 152 | {col: row[col] for col in self.input_cols if col in row}, 153 | self.class_label_to_string(row[self.class_col]), 154 | max_tokens=max_tokens_per_example, 155 | ) 156 | for row in example_dataset 157 | ] 158 | return self.separator.join(formatted_examples) 159 | 160 | 
@abstractmethod 161 | def semantically_select_training_examples( 162 | self, target: Mapping[str, str] 163 | ) -> datasets.Dataset: 164 | ... 165 | 166 | def select_training_examples( 167 | self, target: Mapping[str, str], random_seed: Optional[int] = None 168 | ) -> datasets.Dataset: 169 | # handle edge case where target is blank (all the fields we selected are empty) 170 | if not self.do_semantic_selection or not self.format_dict(target): 171 | random.seed(random_seed) 172 | 173 | n_ex = self.num_prompt_training_examples 174 | if n_ex is None or len(self.training_data) <= n_ex: 175 | return self.training_data 176 | 177 | uniques = defaultdict(lambda: []) 178 | for i, row in enumerate(self.training_data): 179 | uniques[row["Label"]].append(i) 180 | 181 | indices = [] 182 | for key in uniques: 183 | indices.append(random.choice(uniques[key])) 184 | random.shuffle(indices) 185 | 186 | remaining_indices = [ 187 | i for i in range(len(self.training_data)) if i not in indices 188 | ] 189 | indices += random.sample( 190 | remaining_indices, min(n_ex, len(remaining_indices)) 191 | ) 192 | 193 | return self.training_data.select(indices[:n_ex]) 194 | else: 195 | return self.semantically_select_training_examples(target) 196 | 197 | def format_prompt( 198 | self, 199 | target: Mapping[str, str], 200 | example_dataset: Optional[datasets.Dataset] = None, 201 | ) -> str: 202 | if self.truncation_params is None: 203 | raise ValueError("No truncation strategy provided.") 204 | 205 | num_examples = len(example_dataset) if example_dataset else 0 206 | max_end_example_tokens, max_train_example_tokens = self.max_example_lengths( 207 | num_examples, target 208 | ) 209 | example_str = ( 210 | self.render_examples( 211 | example_dataset, max_tokens_per_example=max_train_example_tokens 212 | ) 213 | if example_dataset 214 | else "" 215 | ) 216 | example_str_and_sep = "" if example_str == "" else example_str + self.separator 217 | 218 | prompt = f"""{self.instructions + self.separator if 
self.instructions != "" else ""}{example_str_and_sep}{self.format_prompt_end(target, max_tokens=max_end_example_tokens)}""" # noqa: E501 219 | return prompt 220 | 221 | @abstractmethod 222 | def _get_raw_probabilities( 223 | self, 224 | prompt: str, 225 | ) -> List[float]: 226 | ... 227 | 228 | def _classify_prompt( 229 | self, 230 | prompt: str, 231 | ) -> Dict[str, float]: 232 | raw_p = self._get_raw_probabilities(prompt) 233 | sum_p = np.sum(raw_p) 234 | if sum_p > 0: 235 | normalized_p = np.array(raw_p) / sum_p 236 | else: 237 | normalized_p = np.full(len(self.classes), 1 / len(self.classes)) 238 | class_probs = {} 239 | for i, clas in enumerate(self.classes): 240 | class_probs[clas] = normalized_p[i] 241 | return class_probs 242 | 243 | def classify( 244 | self, 245 | target: Mapping[str, str], 246 | random_seed: Optional[int] = None, 247 | should_print_prompt: bool = False, 248 | ) -> Dict[str, float]: 249 | ordered_target = {col: target[col] for col in self.input_cols if col in target} 250 | 251 | example_dataset = ( 252 | self.select_training_examples(ordered_target, random_seed=random_seed) 253 | if self.num_prompt_training_examples is None or self.num_prompt_training_examples > 0 254 | else None 255 | ) 256 | 257 | prompt = self.format_prompt(ordered_target, example_dataset) 258 | if should_print_prompt: 259 | print(prompt) 260 | 261 | return self._classify_prompt(prompt) 262 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/n_grams_classifier.py: -------------------------------------------------------------------------------- 1 | from typing import Mapping, Optional, Dict 2 | from abc import abstractmethod 3 | 4 | import datasets 5 | from sklearn.feature_extraction.text import CountVectorizer 6 | 7 | from raft_baselines.classifiers.classifier import Classifier 8 | 9 | 10 | class NGramsClassifier(Classifier): 11 | def __init__( 12 | self, 13 | training_data: datasets.Dataset, 14 | vectorizer_kwargs: Optional[Dict] = None, 15 |
model_kwargs: Dict = None, 16 | **kwargs 17 | ): 18 | super().__init__(training_data) 19 | if vectorizer_kwargs is None: 20 | vectorizer_kwargs = {} 21 | cleaned_text_train = [self.stringify_row(row) for row in self.training_data] 22 | 23 | self.vectorizer = CountVectorizer(**vectorizer_kwargs).fit(cleaned_text_train) 24 | self.vectorized_training_data = self.vectorizer.transform(cleaned_text_train) 25 | self.classifier = None 26 | 27 | def stringify_row(self, row): 28 | return ". ".join( 29 | row[input_col] for input_col in self.input_cols if input_col in row 30 | ) 31 | 32 | @abstractmethod 33 | def _classify(self, vector_input): 34 | return NotImplementedError 35 | 36 | def classify( 37 | self, 38 | target: Mapping[str, str], 39 | random_seed: Optional[int] = None, 40 | should_print_prompt: bool = False, 41 | ) -> Dict[str, float]: 42 | simple_input = self.stringify_row(target) 43 | vector_input = self.vectorizer.transform((simple_input,)) 44 | result = self._classify(vector_input) 45 | return { 46 | self.class_label_to_string(int(cls)): prob 47 | for prob, cls in zip(result[0], self.classifier.classes_) 48 | } 49 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/naive_bayes_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn.naive_bayes import MultinomialNB 2 | 3 | from raft_baselines.classifiers.n_grams_classifier import NGramsClassifier 4 | 5 | 6 | class NaiveBayesClassifier(NGramsClassifier): 7 | def __init__( 8 | self, training_data, vectorizer_kwargs=None, model_kwargs=None, **kwargs 9 | ): 10 | super().__init__(training_data, vectorizer_kwargs, model_kwargs, **kwargs) 11 | if model_kwargs is None: 12 | model_kwargs = {} 13 | self.classifier = MultinomialNB(**model_kwargs) 14 | self.classifier.fit(self.vectorized_training_data, self.training_data["Label"]) 15 | 16 | def _classify(self, vector_input): 17 | return 
self.classifier.predict_proba(vector_input) 18 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/random_classifier.py: -------------------------------------------------------------------------------- 1 | import random 2 | from typing import Mapping, Optional 3 | 4 | import datasets 5 | 6 | from raft_baselines.classifiers.classifier import Classifier 7 | 8 | 9 | class RandomClassifier(Classifier): 10 | def __init__( 11 | self, training_data: datasets.Dataset, seed: int = 4, **kwargs 12 | ) -> None: 13 | super().__init__(training_data) 14 | random.seed(seed) 15 | 16 | def classify( 17 | self, 18 | target: Mapping[str, str], 19 | random_seed: Optional[int] = None, 20 | should_print_prompt: bool = False, 21 | ) -> Mapping[str, float]: 22 | if random_seed is not None: 23 | random.seed(random_seed) 24 | result = {c: 0.0 for c in self.classes} 25 | result[random.choice(self.classes)] = 1.0 26 | 27 | return result 28 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/svm_classifier.py: -------------------------------------------------------------------------------- 1 | from sklearn.svm import LinearSVC 2 | from scipy.special import softmax 3 | import numpy as np 4 | 5 | from raft_baselines.classifiers.n_grams_classifier import NGramsClassifier 6 | 7 | 8 | class DummyClassifier: 9 | def __init__(self, label): 10 | self.classes_ = [label] 11 | 12 | def decision_function(self, vector_input): 13 | return np.array([1]) 14 | 15 | 16 | class SVMClassifier(NGramsClassifier): 17 | def __init__(self, training_data, vectorizer_kwargs, model_kwargs, **kwargs): 18 | super().__init__(training_data, vectorizer_kwargs, model_kwargs, **kwargs) 19 | if model_kwargs is None: 20 | model_kwargs = {} 21 | # Sometimes breaks if there's only one label in the training data. 22 | # Hack-y solution. 
23 | if len(set(self.training_data["Label"])) == 1: 24 | self.classifier = DummyClassifier(self.training_data["Label"][0]) 25 | return 26 | self.classifier = LinearSVC(**model_kwargs) 27 | self.classifier.fit(self.vectorized_training_data, self.training_data["Label"]) 28 | 29 | def _classify(self, vector_input): 30 | confidences = self.classifier.decision_function(vector_input) 31 | if len(self.classifier.classes_) <= 2: 32 | # Positive score means first class, negative score means second class. 33 | # Appending a 0 ensures that the softmax classifies correctly while still 34 | # ensuring somewhat sensible confidences. Not probabilities though. 35 | confidences = np.append(confidences, 0) 36 | confidences = confidences.reshape(1, 2) 37 | return softmax(confidences) 38 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/transformers_causal_lm_classifier.py: -------------------------------------------------------------------------------- 1 | from typing import List, Mapping 2 | 3 | import datasets 4 | import torch 5 | from transformers import AutoModelForCausalLM 6 | from sentence_transformers import util 7 | 8 | from raft_baselines.classifiers.in_context_classifier import InContextClassifier 9 | from raft_baselines.utils.tokenizers import TransformersTokenizer 10 | from raft_baselines.utils.embedders import SentenceTransformersEmbedder 11 | 12 | 13 | class TransformersCausalLMClassifier(InContextClassifier): 14 | def __init__( 15 | self, 16 | *args, 17 | model_type: str = "distilgpt2", 18 | **kwargs, 19 | ) -> None: 20 | tokenizer = TransformersTokenizer(model_type) 21 | self.device = "cuda" if torch.cuda.is_available() else "cpu" 22 | self.model = AutoModelForCausalLM.from_pretrained(model_type).to(self.device) 23 | self.similarity_embedder = SentenceTransformersEmbedder() 24 | 25 | super().__init__( 26 | *args, 27 | tokenizer=tokenizer, 28 | max_tokens=self.model.config.max_position_embeddings, 29 | 
**kwargs, 30 | ) 31 | 32 | def semantically_select_training_examples( 33 | self, target: Mapping[str, str] 34 | ) -> datasets.Dataset: 35 | formatted_examples_without_labels = tuple( 36 | self.format_dict( 37 | {col: row[col] for col in self.input_cols if col in row}, 38 | ) 39 | for row in self.training_data 40 | ) 41 | formatted_target = self.format_dict(target) 42 | 43 | # adapted from https://towardsdatascience.com/semantic-similarity-using-transformers-8f3cb5bf66d6 44 | target_embedding = self.similarity_embedder(tuple([formatted_target])) 45 | example_embeddings = self.similarity_embedder(formatted_examples_without_labels) 46 | 47 | similarity_scores = util.pytorch_cos_sim(target_embedding, example_embeddings)[ 48 | 0 49 | ] 50 | 51 | sorted_indices = torch.argsort(-similarity_scores.to(self.device)) 52 | return self.training_data.select( 53 | list(reversed(sorted_indices[: self.num_prompt_training_examples])) 54 | ) 55 | 56 | def _get_raw_probabilities( 57 | self, 58 | prompt: str, 59 | ) -> List[float]: 60 | inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device) 61 | 62 | with torch.no_grad(): 63 | output = self.model(**inputs) 64 | 65 | next_token_probs = torch.softmax(output.logits[0][-1], dim=0) 66 | 67 | def get_prob_for_class(clas): 68 | clas_str = ( 69 | f" {clas}" 70 | if not self.add_prefixes 71 | else f" {self.classes.index(clas) + 1}" 72 | ) 73 | 74 | return next_token_probs[self.tokenizer(clas_str)["input_ids"][0]] 75 | 76 | return ( 77 | torch.stack([get_prob_for_class(clas) for clas in self.classes]) 78 | .cpu() 79 | .detach() 80 | .numpy() 81 | ) 82 | -------------------------------------------------------------------------------- /src/raft_baselines/classifiers/zero_shot_transformers_classifier.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | from typing import Mapping, Dict, Optional, List 3 | from transformers import pipeline 4 | import importlib.resources 5 | import 
json 6 | import torch 7 | 8 | from raft_baselines.classifiers.classifier import Classifier 9 | from raft_baselines import data 10 | 11 | text_data = importlib.resources.read_text( 12 | data, "prompt_construction_settings.jsonl" 13 | ).split("\n") 14 | FIELD_ORDERING = json.loads(text_data[0]) 15 | 16 | 17 | class TransformersZeroShotPipelineClassifier(Classifier): 18 | def __init__( 19 | self, training_data: datasets.Dataset, config: str = None, **kwargs 20 | ) -> None: 21 | self.device = 0 if torch.cuda.is_available() else -1 22 | self.clf = pipeline("zero-shot-classification", device=self.device) 23 | 24 | if config: 25 | self.config: str = config 26 | self.input_cols: List[str] = FIELD_ORDERING[config] 27 | 28 | super().__init__(training_data) 29 | 30 | @classmethod 31 | def format_dict(cls, example: Mapping[str, str]) -> str: 32 | return "\n".join( 33 | [f"{k}: {v}" for k, v in example.items() if len(str(v).strip())] 34 | ) 35 | 36 | def classify( 37 | self, 38 | target: Mapping[str, str], 39 | random_seed: Optional[int] = None, 40 | should_print_prompt: bool = False, 41 | ) -> Dict[str, float]: 42 | """ 43 | :param target: Dict input with fields and natural language data within those fields. 44 | :return: Dict where the keys are class names and the values are probabilities. 
45 | """ 46 | ordered_target = {col: target[col] for col in self.input_cols if col in target} 47 | target_str = self.format_dict(ordered_target) 48 | 49 | output = self.clf(target_str, candidate_labels=self.classes) 50 | return {clas: score for clas, score in zip(output["labels"], output["scores"])} 51 | -------------------------------------------------------------------------------- /src/raft_baselines/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oughtinc/raft-baselines/cc04aee9c8a8cbfad431cce044abe76993bfb0f7/src/raft_baselines/data/__init__.py -------------------------------------------------------------------------------- /src/raft_baselines/data/example_predictions.csv: -------------------------------------------------------------------------------- 1 | ID,Label 2 | 50,not ADE-related 3 | 51,not ADE-related 4 | 52,not ADE-related 5 | 53,not ADE-related 6 | 54,not ADE-related 7 | 55,not ADE-related 8 | 56,not ADE-related 9 | 57,not ADE-related 10 | 58,not ADE-related 11 | 59,not ADE-related 12 | -------------------------------------------------------------------------------- /src/raft_baselines/data/prompt_construction_settings.jsonl: -------------------------------------------------------------------------------- 1 | {"ade_corpus_v2": ["Sentence"], "banking_77": ["Query"], "terms_of_service": ["Sentence"], "tai_safety_research": ["Title", "Abstract Note", "Publication Title", "Item Type", "Publication Year"], "neurips_impact_statement_risks": ["Impact statement", "Paper title"], "overruling": ["Sentence"], "systematic_review_inclusion": ["Title", "Abstract", "Journal"], "one_stop_english": ["Article"], "tweet_eval_hate": ["Tweet"], "twitter_complaints": ["Tweet text"], "semiconductor_org_types": ["Organization name", "Paper title"]} 2 | {"ade_corpus_v2": "Label the sentence based on whether it is related to an adverse drug effect (ADE). 
Details are described below:\nDrugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).\nAdverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.", "banking_77": "The following is a banking customer service query. Classify the query into one of the 77 categories available.", "terms_of_service": "Label the sentence from a Terms of Service based on whether it is potentially unfair. If it seems clearly unfair, mark it as potentially unfair.\nAccording to art. 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts, a contractual term is unfair if: 1) it has not been individually negotiated; and 2) contrary to the requirement of good faith, it causes a significant imbalance in the parties rights and obligations, to the detriment of the consumer. \nDetails on types of potentially unfair clauses are found below:\nThe jurisdiction clause stipulates what courts will have the competence to adjudicate disputes under the contract. Jurisdiction clauses giving consumers a right to bring disputes in their place of residence were marked as clearly fair, whereas clauses stating that any judicial proceeding takes a residence away were marked as clearly unfair.\nThe choice of law clause specifies what law will govern the contract, meaning also what law will be applied in potential adjudication of a dispute arising under the contract. Clauses defining the applicable law as the law of the consumer's country of residence were marked as clearly fair. 
In every other case, the choice of law clause was considered as potentially unfair.\nThe limitation of liability clause stipulates that the duty to pay damages is limited or excluded, for certain kind of losses, under certain conditions. Clauses that explicitly affirm non-excludable providers' liabilities were marked as clearly fair. Clauses that reduce, limit, or exclude the liability of the service provider were marked as potentially unfair when concerning broad categories of losses or causes of them.\nThe unilateral change clause specifies the conditions under which the service provider could amend and modify the terms of service and/or the service itself. Such clause was always considered as potentially unfair.\nThe unilateral termination clause gives provider the right to suspend and/or terminate the service and/or the contract, and sometimes details the circumstances under which the provider claims to have a right to do so.\nThe contract by using clause stipulates that the consumer is bound by the terms of use of a specific service, simply by using the service, without even being required to mark that he or she has read and accepted them. We always marked such clauses as potentially unfair.\nThe content removal gives the provider a right to modify/delete user's content, including in-app purchases, and sometimes specifies the conditions under which the service provider may do so.\nThe arbitration clause requires or allows the parties to resolve their disputes through an arbitration process, before the case could go to court. Clauses stipulating that the arbitration should take place in a state other then the state of consumer's residence or be based on arbiter's discretion were marked as clearly unfair. Clauses defining arbitration as fully optional were marked as clearly fair.", "tai_safety_research": "Transformative AI (TAI) is defined as AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution. 
Label a paper as \"TAI safety research\" if: \n1. The contents of the paper are directly motivated by, and substantively inform, the challenge of ensuring good outcomes for TAI, \n2. There is substantive content on AI safety, not just AI capabilities, \n3. The intended audience is the community of researchers, \n4. It meets a subjective threshold of seriousness/quality, \n5. Peer review is not required.", "neurips_impact_statement_risks": "Label the impact statement based on whether it mentions a harmful application of the research done in the paper. Make sure the statement is sufficient to conclude there are harmful applications of the research being done, not a past risk that this research is solving.", "overruling": "In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. Label the sentence based on whether it is overruling or not.", "systematic_review_inclusion": "Identify whether this paper should be included in a meta-review which includes the findings of systematic reviews on interventions designed to promote charitable donations. \nIncluded reviews should describe monetary charitable donations, assess any population of participants in any context, and be peer reviewed and written in English. \nThey should not report new data, be non-systematic reviews, consider cause-related marketing or other kinds of prosocial behaviour.", "one_stop_english": "The following is an article sourced from The Guardian newspaper, and rewritten by teachers to suit three levels of adult English as Second Language (ESL) learners: elementary, intermediate, and advanced. Predict the level of the article.", "tweet_eval_hate": "Label whether the following tweet contains hate speech against either immigrants or women. 
Hate Speech (HS) is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics.", "twitter_complaints": "A complaint presents a state of affairs which breaches the writer\u2019s favorable expectation. Label the tweet text based on whether it contains a complaint.", "semiconductor_org_types": "The dataset is a list of institutions that have contributed papers to semiconductor conferences in the last 25 years, as catalogued by IEEE and sampled randomly. The goal is to classify the institutions into one of three categories: \"university\", \"company\" or \"research institute\"."} -------------------------------------------------------------------------------- /src/raft_baselines/scripts/non_neural_experiment.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | from sacred import Experiment, observers 3 | import sklearn.metrics as skm 4 | 5 | from raft_baselines import classifiers 6 | 7 | experiment_name = "non_neural" 8 | raft_experiment = Experiment(experiment_name, save_git_info=False) 9 | observer = observers.FileStorageObserver(f"results/{experiment_name}") 10 | raft_experiment.observers.append(observer) 11 | 12 | 13 | @raft_experiment.config 14 | def base_config(): 15 | classifier_name = "AdaBoostClassifier" 16 | classifier_kwargs = { 17 | "vectorizer_kwargs": { 18 | "strip_accents": 'unicode', 19 | "lowercase": True, 20 | "ngram_range": (1, 5), 21 | "max_df": 1.0, 22 | "min_df": 0.0 23 | }, 24 | "model_kwargs": {} 25 | } 26 | configs = datasets.get_dataset_config_names("ought/raft") 27 | # controls which dimension is tested, out of the 3 reported in the paper 28 | # Other options: do_semantic_selection and num_prompt_training_examples 29 | random_seed = 42 30 | 31 | 32 | @raft_experiment.capture 33 | def load_datasets_train(configs): 34 | 
train_datasets = { 35 | config: datasets.load_dataset("ought/raft", config, split="train") 36 | for config in configs 37 | } 38 | return train_datasets 39 | 40 | 41 | @raft_experiment.capture 42 | def test_experiment( 43 | train_datasets, classifier_name, 44 | classifier_kwargs, random_seed 45 | ): 46 | classifier_cls = getattr(classifiers, classifier_name) 47 | 48 | for config in train_datasets: 49 | dataset = train_datasets[config] 50 | labels = list(range(1, dataset.features["Label"].num_classes)) 51 | predictions = [] 52 | 53 | for i in range(len(dataset)): 54 | train = dataset.select([j for j in range(len(dataset)) if j != i]) 55 | test = dataset.select([i]) 56 | 57 | # Non-neural classifiers (i.e. non-GPT style) should be explicitly trained 58 | # (mostly, this is to allow two separate kwargs arguments) 59 | classifier = classifier_cls(train, **classifier_kwargs) 60 | 61 | def predict(example): 62 | del example["Label"] 63 | del example["ID"] 64 | output_probs = classifier.classify(example, random_seed=random_seed) 65 | output = max(output_probs.items(), key=lambda kv_pair: kv_pair[1]) 66 | 67 | predictions.append(dataset.features["Label"].str2int(output[0])) 68 | 69 | test.map(predict) 70 | 71 | # accuracy = sum([p == l for p, l in zip(predictions, dataset['Label'])]) / 50 72 | f1 = skm.f1_score( 73 | dataset["Label"], predictions, labels=labels, average="macro" 74 | ) 75 | print(f"Dataset - {config}; {f1}") 76 | raft_experiment.log_scalar(f"{config}", f1) 77 | 78 | 79 | @raft_experiment.automain 80 | def main(): 81 | train = load_datasets_train() 82 | test_experiment(train) 83 | -------------------------------------------------------------------------------- /src/raft_baselines/scripts/raft_predict.py: -------------------------------------------------------------------------------- 1 | import os 2 | import shutil 3 | import csv 4 | 5 | import datasets 6 | from sacred import Experiment, observers 7 | 8 | from raft_baselines import classifiers 9 | 10 | """ 11 
| This script runs a classifier specified by `classifier_name` on the unlabeled 12 | test sets for all configs given in `configs`. Any classifier can be used, 13 | but must accept a hf.datasets.Dataset as an argument. Any other keyword 14 | arguments must be specified via `classifier_kwargs`. 15 | """ 16 | 17 | experiment_name = "make_predictions" 18 | raft_experiment = Experiment(experiment_name, save_git_info=False) 19 | observer = observers.FileStorageObserver(f"results/{experiment_name}") 20 | raft_experiment.observers.append(observer) 21 | 22 | # Best performing on a per-dataset basis using raft_train_experiment.py 23 | NUM_EXAMPLES = { 24 | "ade_corpus_v2": 25, 25 | "banking_77": 10, 26 | "terms_of_service": 5, 27 | "tai_safety_research": 5, 28 | "neurips_impact_statement_risks": 25, 29 | "overruling": 25, 30 | "systematic_review_inclusion": 10, 31 | "one_stop_english": 5, 32 | "tweet_eval_hate": 50, 33 | "twitter_complaints": 25, 34 | "semiconductor_org_types": 5, 35 | } 36 | 37 | 38 | @raft_experiment.config 39 | def base_config(): 40 | classifier_name = "GPT3Classifier" 41 | classifier_kwargs = { 42 | # change to davinci to replicate results from the paper 43 | # "engine": "ada", 44 | } 45 | if classifier_name in ('NaiveBayesClassifier', 'SVMClassifier', 'AdaBoostClassifier'): 46 | classifier_kwargs = { 47 | "vectorizer_kwargs": { 48 | "strip_accents": 'unicode', 49 | "lowercase": True, 50 | "ngram_range": (1, 5), 51 | "max_df": 1.0, 52 | "min_df": 0.0 53 | }, 54 | "model_kwargs": {} 55 | } 56 | if classifier_name == "NaiveBayesClassifier": 57 | classifier_kwargs['model_kwargs']['alpha'] = 0.05 58 | elif classifier_name == "AdaBoostClassifier": 59 | classifier_kwargs['model_kwargs']['max_depth'] = 3 60 | classifier_kwargs['model_kwargs']['n_estimators'] = 100 61 | 62 | configs = datasets.get_dataset_config_names("ought/raft") 63 | # set n_test to -1 to run on all test examples 64 | n_test = 5 65 | random_seed = 42 66 | zero_shot = False 67 | 68 | 69 | 
@raft_experiment.capture 70 | def load_datasets_train(configs): 71 | train_datasets = { 72 | config: datasets.load_dataset("ought/raft", config, split="train") 73 | for config in configs 74 | } 75 | test_datasets = { 76 | config: datasets.load_dataset("ought/raft", config, split="test") 77 | for config in configs 78 | } 79 | 80 | return train_datasets, test_datasets 81 | 82 | 83 | @raft_experiment.capture 84 | def make_extra_kwargs(config: str, zero_shot: bool): 85 | extra_kwargs = { 86 | "config": config, 87 | "num_prompt_training_examples": NUM_EXAMPLES[config] if not zero_shot else 0, 88 | } 89 | if config == "banking_77": 90 | extra_kwargs["add_prefixes"] = True 91 | return extra_kwargs 92 | 93 | 94 | @raft_experiment.capture 95 | def make_predictions( 96 | train_dataset, 97 | test_dataset, 98 | classifier_cls, 99 | extra_kwargs, 100 | n_test, 101 | classifier_kwargs, 102 | random_seed, 103 | ): 104 | classifier = classifier_cls(train_dataset, **classifier_kwargs, **extra_kwargs) 105 | 106 | if n_test > -1: 107 | test_dataset = test_dataset.select(range(n_test)) 108 | 109 | def predict(example): 110 | del example["Label"] 111 | output_probs = classifier.classify(example, random_seed=random_seed) 112 | output = max(output_probs.items(), key=lambda kv_pair: kv_pair[1]) 113 | 114 | example["Label"] = train_dataset.features["Label"].str2int(output[0]) 115 | return example 116 | 117 | return test_dataset.map(predict, load_from_cache_file=False) 118 | 119 | 120 | def log_text(text, dirname, filename): 121 | targetdir = os.path.join(observer.dir, dirname) 122 | if not os.path.isdir(targetdir): 123 | os.mkdir(targetdir) 124 | 125 | with open(os.path.join(targetdir, filename), "w") as f: 126 | f.write(text) 127 | 128 | 129 | def prepare_predictions_folder(): 130 | sacred_dir = os.path.join(observer.dir, "predictions") 131 | if os.path.isdir(sacred_dir): 132 | shutil.rmtree(sacred_dir) 133 | os.mkdir(sacred_dir) 134 | 135 | 136 | def write_predictions(labeled, config): 
137 | int2str = labeled.features["Label"].int2str 138 | 139 | config_dir = os.path.join(observer.dir, "predictions", config) 140 | if os.path.isdir(config_dir): 141 | shutil.rmtree(config_dir) 142 | os.mkdir(config_dir) 143 | 144 | pred_file = os.path.join(config_dir, "predictions.csv") 145 | 146 | with open(pred_file, "w", newline="") as f: 147 | writer = csv.writer( 148 | f, 149 | quotechar='"', 150 | delimiter=",", 151 | quoting=csv.QUOTE_MINIMAL, 152 | skipinitialspace=True, 153 | ) 154 | writer.writerow(["ID", "Label"]) 155 | for row in labeled: 156 | writer.writerow([row["ID"], int2str(row["Label"])]) 157 | 158 | 159 | @raft_experiment.automain 160 | def main(classifier_name): 161 | train, unlabeled = load_datasets_train() 162 | prepare_predictions_folder() 163 | 164 | classifier_cls = getattr(classifiers, classifier_name) 165 | 166 | for config in unlabeled: 167 | extra_kwargs = make_extra_kwargs(config) 168 | labeled = make_predictions( 169 | train[config], unlabeled[config], classifier_cls, extra_kwargs 170 | ) 171 | write_predictions(labeled, config) 172 | -------------------------------------------------------------------------------- /src/raft_baselines/scripts/raft_train_experiment.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | from sacred import Experiment, observers 3 | import sklearn.metrics as skm 4 | 5 | from raft_baselines import classifiers 6 | 7 | experiment_name = "loo_tuning" 8 | raft_experiment = Experiment(experiment_name, save_git_info=False) 9 | observer = observers.FileStorageObserver(f"results/{experiment_name}") 10 | raft_experiment.observers.append(observer) 11 | 12 | 13 | @raft_experiment.config 14 | def base_config(): 15 | classifier_name = "GPT3Classifier" 16 | classifier_kwargs = { 17 | # change to davinci to replicate results from the paper 18 | "engine": "ada", 19 | } 20 | configs = datasets.get_dataset_config_names("ought/raft") 21 | # controls which dimension is 
tested, out of the 3 reported in the paper 22 | # Other options: do_semantic_selection and num_prompt_training_examples 23 | test_dimension = "use_task_specific_instructions" 24 | random_seed = 42 25 | 26 | 27 | @raft_experiment.capture 28 | def load_datasets_train(configs): 29 | train_datasets = { 30 | config: datasets.load_dataset("ought/raft", config, split="train") 31 | for config in configs 32 | } 33 | return train_datasets 34 | 35 | 36 | @raft_experiment.capture 37 | def loo_test( 38 | train_datasets, classifier_name, classifier_kwargs, test_dimension, random_seed 39 | ): 40 | # Change what to iterate over, filling in extra_kwargs to test different 41 | # configurations of the classifier. 42 | 43 | if test_dimension == "use_task_specific_instructions": 44 | dim_values = [False, True] 45 | other_dim_kwargs = { 46 | "do_semantic_selection": False, 47 | "num_prompt_training_examples": 20, 48 | } 49 | elif test_dimension == "do_semantic_selection": 50 | dim_values = [False, True] 51 | other_dim_kwargs = { 52 | "use_task_specific_instructions": True, 53 | "num_prompt_training_examples": 20, 54 | } 55 | elif test_dimension == "num_prompt_training_examples": 56 | dim_values = [5, 10, 25, 49] 57 | other_dim_kwargs = { 58 | "use_task_specific_instructions": True, 59 | "do_semantic_selection": True, 60 | } 61 | else: 62 | raise ValueError(f"test_dimension {test_dimension} not recognized") 63 | 64 | classifier_cls = getattr(classifiers, classifier_name) 65 | 66 | for config in train_datasets: 67 | for dim_value in dim_values: 68 | dataset = train_datasets[config] 69 | labels = list(range(1, dataset.features["Label"].num_classes)) 70 | predictions = [] 71 | 72 | extra_kwargs = { 73 | "config": config, 74 | test_dimension: dim_value, 75 | **other_dim_kwargs, 76 | } 77 | if config == "banking_77": 78 | extra_kwargs["add_prefixes"] = True 79 | 80 | for i in range(len(dataset)): 81 | train = dataset.select([j for j in range(len(dataset)) if j != i]) 82 | test = 
dataset.select([i]) 83 | 84 | classifier = classifier_cls(train, **classifier_kwargs, **extra_kwargs) 85 | 86 | def predict(example): 87 | del example["Label"] 88 | del example["ID"] 89 | output_probs = classifier.classify(example, random_seed=random_seed) 90 | output = max(output_probs.items(), key=lambda kv_pair: kv_pair[1]) 91 | 92 | predictions.append(dataset.features["Label"].str2int(output[0])) 93 | 94 | test.map(predict) 95 | 96 | # accuracy = sum([p == l for p, l in zip(predictions, dataset['Label'])]) / 50 97 | f1 = skm.f1_score( 98 | dataset["Label"], predictions, labels=labels, average="macro" 99 | ) 100 | print(f"Dataset - {config}; {test_dimension} - {dim_value}: {f1}") 101 | raft_experiment.log_scalar(f"{config}.{dim_value}", f1) 102 | 103 | 104 | @raft_experiment.automain 105 | def main(): 106 | train = load_datasets_train() 107 | loo_test(train) 108 | -------------------------------------------------------------------------------- /src/raft_baselines/scripts/starter_kit.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1TQtHG-Wf2CgYGSD9e7_uJWIdiK5HNniV)" 7 | ], 8 | "metadata": {} 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "source": [ 13 | "## Getting started with the RAFT benchmark\n", 14 | "\n", 15 | "In this notebook, we will walk through:\n", 16 | "\n", 17 | "1. Loading the tasks from the [RAFT dataset](https://huggingface.co/datasets/ought/raft)\n", 18 | "2. Creating a classifier using any CausalLM from the [Hugging Face Hub](https://huggingface.co/models)\n", 19 | "3. Generating predictions using that classifier for RAFT test examples\n", 20 | "\n", 21 | "This should provide you with the steps needed to make a submission to the [RAFT leaderboard](https://huggingface.co/spaces/ought/raft-leaderboard)!" 
22 | ], 23 | "metadata": {} 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "source": [ 29 | "import datasets\n", 30 | "\n", 31 | "datasets.logging.set_verbosity_error()" 32 | ], 33 | "outputs": [], 34 | "metadata": {} 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "source": [ 39 | "## Loading RAFT datasets\n" 40 | ], 41 | "metadata": {} 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "source": [ 46 | "We'll focus on the ADE corpus V2 task in this starter kit, but similar code could be run for all of the tasks in RAFT. To see the possible tasks, we can use the following function from `datasets`:" 47 | ], 48 | "metadata": {} 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "source": [ 54 | "from datasets import get_dataset_config_names\n", 55 | "\n", 56 | "RAFT_TASKS = get_dataset_config_names(\"ought/raft\")\n", 57 | "RAFT_TASKS" 58 | ], 59 | "outputs": [ 60 | { 61 | "output_type": "execute_result", 62 | "data": { 63 | "text/plain": [ 64 | "['ade_corpus_v2',\n", 65 | " 'banking_77',\n", 66 | " 'terms_of_service',\n", 67 | " 'tai_safety_research',\n", 68 | " 'neurips_impact_statement_risks',\n", 69 | " 'overruling',\n", 70 | " 'systematic_review_inclusion',\n", 71 | " 'one_stop_english',\n", 72 | " 'tweet_eval_hate',\n", 73 | " 'twitter_complaints',\n", 74 | " 'semiconductor_org_types']" 75 | ] 76 | }, 77 | "metadata": {}, 78 | "execution_count": 2 79 | } 80 | ], 81 | "metadata": {} 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "source": [ 86 | "Each task in RAFT consists of a training set of only **_50 labeled examples_** and an unlabeled test set. All labels have a textual version associated with them. 
Let's load the corpus associated with the `ade_corpus_v2` task:" 87 | ], 88 | "metadata": {} 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "source": [ 94 | "from datasets import load_dataset\n", 95 | "\n", 96 | "TASK = \"ade_corpus_v2\"\n", 97 | "raft_dataset = load_dataset(\"ought/raft\", name=TASK)\n", 98 | "raft_dataset" 99 | ], 100 | "outputs": [ 101 | { 102 | "output_type": "execute_result", 103 | "data": { 104 | "text/plain": [ 105 | "DatasetDict({\n", 106 | " train: Dataset({\n", 107 | " features: ['Sentence', 'ID', 'Label'],\n", 108 | " num_rows: 50\n", 109 | " })\n", 110 | " test: Dataset({\n", 111 | " features: ['Sentence', 'ID', 'Label'],\n", 112 | " num_rows: 5000\n", 113 | " })\n", 114 | "})" 115 | ] 116 | }, 117 | "metadata": {}, 118 | "execution_count": 3 119 | } 120 | ], 121 | "metadata": {} 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "source": [ 126 | "The `raft_dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training and test sets. In this task we can see we have 50 labelled examples to work with and 5,000 examples in the test set that we need to generate predictions for. To access an example, you need to specify the name of the split and then the index as follows:" 127 | ], 128 | "metadata": {} 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 4, 133 | "source": [ 134 | "raft_dataset[\"train\"][0]" 135 | ], 136 | "outputs": [ 137 | { 138 | "output_type": "execute_result", 139 | "data": { 140 | "text/plain": [ 141 | "{'Sentence': 'No regional side effects were noted.', 'ID': 0, 'Label': 2}" 142 | ] 143 | }, 144 | "metadata": {}, 145 | "execution_count": 4 146 | } 147 | ], 148 | "metadata": {} 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "source": [ 153 | "Here we can see that each example is assigned a label ID which denotes the class in this particular task. 
Let's check how many classes we have in the training set:" 154 | ], 155 | "metadata": {} 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 5, 160 | "source": [ 161 | "label_ids = raft_dataset[\"train\"].unique(\"Label\")\n", 162 | "label_ids" 163 | ], 164 | "outputs": [ 165 | { 166 | "output_type": "execute_result", 167 | "data": { 168 | "text/plain": [ 169 | "[2, 1]" 170 | ] 171 | }, 172 | "metadata": {}, 173 | "execution_count": 5 174 | } 175 | ], 176 | "metadata": {} 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "source": [ 181 | "Okay, this indicates that `ade_corpus_v2` is a binary classification task and we can extract the human-readable label names as follows:" 182 | ], 183 | "metadata": {} 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 6, 188 | "source": [ 189 | "features = raft_dataset[\"train\"].features[\"Label\"]\n", 190 | "id2label = {idx : features.int2str(idx) for idx in label_ids}\n", 191 | "id2label" 192 | ], 193 | "outputs": [ 194 | { 195 | "output_type": "execute_result", 196 | "data": { 197 | "text/plain": [ 198 | "{2: 'not ADE-related', 1: 'ADE-related'}" 199 | ] 200 | }, 201 | "metadata": {}, 202 | "execution_count": 6 203 | } 204 | ], 205 | "metadata": {} 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "source": [ 210 | "Note that the test set also has a `Label` entry, but it is zero to denote a dummy label (this is what your model needs to predict!):" 211 | ], 212 | "metadata": {} 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 7, 217 | "source": [ 218 | "raft_dataset[\"test\"].unique(\"Label\")" 219 | ], 220 | "outputs": [ 221 | { 222 | "output_type": "execute_result", 223 | "data": { 224 | "text/plain": [ 225 | "[0]" 226 | ] 227 | }, 228 | "metadata": {}, 229 | "execution_count": 7 230 | } 231 | ], 232 | "metadata": {} 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "source": [ 237 | "To get a broader sense of what kind of data we are dealing with, we can use 
the following function to randomly sample from the corpus and display the results as a table:" 238 | ], 239 | "metadata": {} 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 8, 244 | "source": [ 245 | "import random\n", 246 | "import pandas as pd\n", 247 | "from IPython.display import display, HTML\n", 248 | "\n", 249 | "def show_random_elements(dataset, num_examples=10):\n", 250 | " assert num_examples <= len(dataset), \"Can't pick more elements than there are in the dataset.\"\n", 251 | " picks = []\n", 252 | " for _ in range(num_examples):\n", 253 | " pick = random.randint(0, len(dataset)-1)\n", 254 | " while pick in picks:\n", 255 | " pick = random.randint(0, len(dataset)-1)\n", 256 | " picks.append(pick)\n", 257 | " \n", 258 | " df = pd.DataFrame(dataset[picks])\n", 259 | " for column, typ in dataset.features.items():\n", 260 | " if isinstance(typ, datasets.ClassLabel):\n", 261 | " df[column] = df[column].transform(lambda i: typ.names[i])\n", 262 | " display(HTML(df.to_html()))\n", 263 | " \n", 264 | "show_random_elements(raft_dataset[\"train\"])" 265 | ], 266 | "outputs": [ 267 | { 268 | "output_type": "display_data", 269 | "data": { 270 | "text/plain": [ 271 | "" 272 | ], 273 | "text/html": [ 274 | "\n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " 
\n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | "
|   | Sentence | ID | Label |
|---|----------|----|-------|
| 0 | CT-scan disclosed right ethmoid sinusitis that spread to the orbit after surgery. | 22 | not ADE-related |
| 1 | IMPLICATIONS: Dexmedetomidine, an alpha(2)-adrenoceptor agonist, is indicated for sedating patients on mechanical ventilation. | 43 | not ADE-related |
| 2 | The INR should be monitored more frequently when bosentan is initiated, adjusted, or discontinued in patients taking warfarin. | 2 | not ADE-related |
| 3 | Remarkable findings on initial examination were facial grimacing, flexure posturing of both upper extremities, and 7-mm, reactive pupils. | 44 | not ADE-related |
| 4 | CONCLUSION: Pancreatic enzyme intolerance, although rare, would be a major problem in the management of patients with CF. | 6 | not ADE-related |
| 5 | After the first oral dose of propranolol, syncope developed together with atrioventricular block. | 3 | ADE-related |
| 6 | Acute promyelocytic leukemia after living donor partial orthotopic liver transplantation in two Japanese girls. | 45 | not ADE-related |
| 7 | The patient had no skin reactions for the next 12 mo, with the exception of injection-site papules. | 18 | not ADE-related |
| 8 | Sotalol-induced bradycardia reversed by glucagon. | 23 | ADE-related |
| 9 | We describe a patient who developed HUS after treatment with mitomycin C (total dose 144 mg/m2) due to a carcinoma of the ascending colon. | 30 | ADE-related |
" 346 | ] 347 | }, 348 | "metadata": {} 349 | } 350 | ], 351 | "metadata": {} 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "source": [ 356 | "## Creating a classifier from the Hugging Face Model Hub" 357 | ], 358 | "metadata": {} 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "source": [ 363 | "We provide a class which uses the same prompt construction method as our GPT-3 baseline, but works with any CausalLM on the [HuggingFace Model Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads). The classifier will automatically use a GPU if available. Brief documentation on the arguments for configuring the classifier is provided below.:" 364 | ], 365 | "metadata": {} 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 9, 370 | "source": [ 371 | "from raft_baselines.classifiers import TransformersCausalLMClassifier\n", 372 | "\n", 373 | "classifier = TransformersCausalLMClassifier(\n", 374 | " model_type=\"distilgpt2\", # The model to use from the HF hub\n", 375 | " training_data=raft_dataset[\"train\"], # The training data\n", 376 | " num_prompt_training_examples=25, # See raft_predict.py for the number of training examples used on a per-dataset basis in the GPT-3 baselines run.\n", 377 | " # Note that it may be better to use fewer training examples and/or shorter instructions with other models with smaller context windows.\n", 378 | " add_prefixes=(TASK==\"banking_77\"), # Set to True when using banking_77 since multiple classes start with the same token\n", 379 | " config=TASK, # For task-specific instructions and field ordering\n", 380 | " use_task_specific_instructions=True,\n", 381 | " do_semantic_selection=True,\n", 382 | ")" 383 | ], 384 | "outputs": [], 385 | "metadata": {} 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "source": [ 390 | "## Generating predictions for RAFT test examples" 391 | ], 392 | "metadata": {} 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "source": [ 397 | "In order 
to generate predictions on the test set, we need to provide the model with an appropriate prompt containing the task instructions. Let's take a look at how this works on a single example from the test set." 398 | ], 399 | "metadata": {} 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "source": [ 404 | "### Example prompt and prediction" 405 | ], 406 | "metadata": {} 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "source": [ 411 | "The `TransformersCausalLMClassifier` has a `classify` function that will automatically generate the predicted probabilities from the model. We'll set `should_print_prompt=True` so that we can see which prompt is being used to instruct the model:" 412 | ], 413 | "metadata": {} 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 10, 418 | "source": [ 419 | "test_dataset = raft_dataset[\"test\"]\n", 420 | "first_test_example = test_dataset[0]\n", 421 | "\n", 422 | "# delete the 0 Label\n", 423 | "del first_test_example[\"Label\"]\n", 424 | "\n", 425 | "# probabilities for all classes\n", 426 | "output_probs = classifier.classify(first_test_example, should_print_prompt=True)\n", 427 | "output_probs" 428 | ], 429 | "outputs": [ 430 | { 431 | "output_type": "stream", 432 | "name": "stdout", 433 | "text": [ 434 | "Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below:\n", 435 | "Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).\n", 436 | "Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.\n", 437 | "Possible labels:\n", 438 | "1. 
ADE-related\n", 439 | "2. not ADE-related\n", 440 | "\n", 441 | "Sentence: Treatment of silastic catheter-induced central vein septic thrombophlebitis\n", 442 | "Label: not ADE-related\n", 443 | "\n", 444 | "Sentence: We describe a patient who developed HUS after treatment with mitomycin C (total dose 144 mg\n", 445 | "Label: ADE-related\n", 446 | "\n", 447 | "Sentence: In 1991 the patient were found to be seropositive for HCV antibodies as detected by\n", 448 | "Label: not ADE-related\n", 449 | "\n", 450 | "Sentence: METHODS: We identified three patients who developed skin necrosis and determined any factors, which\n", 451 | "Label: not ADE-related\n", 452 | "\n", 453 | "Sentence: These cases were considered unusual in light of the short delay of their onset after initiation of immunosupp\n", 454 | "Label: ADE-related\n", 455 | "\n", 456 | "Sentence: No regional side effects were noted.\n", 457 | "Label: not ADE-related\n", 458 | "\n", 459 | "Sentence: A patient with psoriasis is described who had an abnormal response to the glucose tolerance test without other\n", 460 | "Label: ADE-related\n", 461 | "\n", 462 | "Sentence: CONCLUSIONS: These results suggest that clozapine may cause TD; however, the prevalence\n", 463 | "Label: ADE-related\n", 464 | "\n", 465 | "Sentence: The cases are important in documenting that drug-induced dystonias do occur in patients with dementia,\n", 466 | "Label: ADE-related\n", 467 | "\n", 468 | "Sentence: NEH must be considered in lupus patients receiving cytotoxic agents to avoid inappropriate use\n", 469 | "Label: not ADE-related\n", 470 | "\n", 471 | "Sentence: A closer look at septic shock.\n", 472 | "Label: not ADE-related\n", 473 | "\n", 474 | "Sentence: The mechanism by which sunitinib induces gynaecomastia is thought to be associated\n", 475 | "Label: ADE-related\n", 476 | "\n", 477 | "Sentence: Of the 16 patients, including the 1 reported here, only 3 displayed significant shortening of the\n", 478 | "Label: not 
ADE-related\n", 479 | "\n", 480 | "Sentence: Sotalol-induced bradycardia reversed by glucagon.\n", 481 | "Label: ADE-related\n", 482 | "\n", 483 | "Sentence: CONCLUSION: Pancreatic enzyme intolerance, although rare, would be a major problem in\n", 484 | "Label: not ADE-related\n", 485 | "\n", 486 | "Sentence: Macular infarction after endophthalmitis treated with vitrectomy and intravit\n", 487 | "Label: ADE-related\n", 488 | "\n", 489 | "Sentence: MRI has a high sensitivity and specificity in the diagnosis of osteonecrosis and should be used\n", 490 | "Label: not ADE-related\n", 491 | "\n", 492 | "Sentence: IMPLICATIONS: Dexmedetomidine, an alpha(2)-adrenoceptor\n", 493 | "Label: not ADE-related\n", 494 | "\n", 495 | "Sentence: The INR should be monitored more frequently when bosentan is initiated, adjusted, or discontinued\n", 496 | "Label: not ADE-related\n", 497 | "\n", 498 | "Sentence: Remarkable findings on initial examination were facial grimacing, flexure posturing of both upper extrem\n", 499 | "Label: not ADE-related\n", 500 | "\n", 501 | "Sentence: Early detection of these cases has practical importance since the identification and elimination of the causative drug is\n", 502 | "Label: not ADE-related\n", 503 | "\n", 504 | "Sentence: This report demonstrates the increased risk of complicated varicella associated with the use of corticoster\n", 505 | "Label: not ADE-related\n", 506 | "\n", 507 | "Sentence: These results indicate that the hyponatremia in this case was due to SIADH\n", 508 | "Label: not ADE-related\n", 509 | "\n", 510 | "Sentence: Best-corrected visual acuity measurements were performed at every visit.\n", 511 | "Label: not ADE-related\n", 512 | "\n", 513 | "Sentence: OBJECTIVE: To describe onset of syndrome of inappropriate antidiuretic hormone (SIADH\n", 514 | "Label: ADE-related\n", 515 | "\n", 516 | "Sentence: CONCLUSIONS: SD-OCT and AO detected abnormalities that correlate topographically with visual field loss from hydroxychloroquine 
toxicity as demonstrated by HVF 10-2 and may be useful in the detection of subclinical abnormalities that precede symptoms or objective visual field loss.\n", 517 | "Label:\n" 518 | ] 519 | }, 520 | { 521 | "output_type": "execute_result", 522 | "data": { 523 | "text/plain": [ 524 | "{'ADE-related': 0.31358153, 'not ADE-related': 0.68641853}" 525 | ] 526 | }, 527 | "metadata": {}, 528 | "execution_count": 10 529 | } 530 | ], 531 | "metadata": {} 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "source": [ 536 | "In this example we can see the model predicts that the example is not related to an adverse drug effect. We can use this technique to generate predictions across the whole test set! Let's take a look." 537 | ], 538 | "metadata": {} 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "source": [ 543 | "### Creating a submission file of predictions" 544 | ], 545 | "metadata": {} 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "source": [ 550 | "To submit to the RAFT leaderboard, you'll need to provide a CSV file of predictions on the test set for each task (see [here](https://huggingface.co/datasets/ought/raft-submission) for detailed instructions). The following code snippet generates a CSV with predictions for the first $N$ test examples in the format required for submission $(ID, Label)$. \n", 551 | "\n", 552 | "Note that this is expected to generate predictions of all \"Not ADE-related\" for the 10 test examples with the code as written; few-shot classification is pretty hard!" 
553 | ], 554 | "metadata": {} 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 11, 559 | "source": [ 560 | "# Increase this to len(test_dataset) to generate predictions over the full test set\n", 561 | "N_TEST = 10\n", 562 | "test_examples_to_predict = test_dataset.select(range(N_TEST))\n", 563 | "\n", 564 | "def predict_one(clf, test_example):\n", 565 | "    del test_example[\"Label\"]\n", 566 | "    output_probs = clf.classify(test_example)\n", 567 | "    output_label = max(output_probs.items(), key=lambda kv_pair: kv_pair[1])[0]\n", 568 | "    return output_label\n", 569 | "\n", 570 | "data = []\n", 571 | "for example in test_examples_to_predict:\n", 572 | "    data.append({\"ID\": example[\"ID\"], \"Label\": predict_one(classifier, example)})\n", 573 | "    \n", 574 | "result_df = pd.DataFrame(data=data, columns=[\"ID\", \"Label\"]).astype({\"ID\": int, \"Label\": str})\n", 575 | "result_df" 576 | ], 577 | "outputs": [], 578 | "metadata": {} 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "source": [ 583 | "Note that the `ID` column starts from index 50 since we have IDs 0-49 in the training set. The final step is to save the DataFrame as a CSV file and build out the rest of your submission:" 584 | ], 585 | "metadata": {} 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": null, 590 | "source": [ 591 | "result_df.to_csv(\"../data/example_predictions.csv\", index=False)" 592 | ], 593 | "outputs": [], 594 | "metadata": {} 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "source": [ 599 | "Good luck with the rest of the benchmark!" 
600 | ], 601 | "metadata": {} 602 | } 603 | ], 604 | "metadata": { 605 | "interpreter": { 606 | "hash": "74118a50156796984ad06a64d88792c5d24753e439e2427f4985fcb9d71e695f" 607 | }, 608 | "kernelspec": { 609 | "name": "python3", 610 | "display_name": "Python 3.8.11 64-bit ('raft-baselines': conda)" 611 | }, 612 | "language_info": { 613 | "codemirror_mode": { 614 | "name": "ipython", 615 | "version": 3 616 | }, 617 | "file_extension": ".py", 618 | "mimetype": "text/x-python", 619 | "name": "python", 620 | "nbconvert_exporter": "python", 621 | "pygments_lexer": "ipython3", 622 | "version": "3.8.11" 623 | } 624 | }, 625 | "nbformat": 4, 626 | "nbformat_minor": 4 627 | } -------------------------------------------------------------------------------- /src/raft_baselines/scripts/test_gpt3.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | 3 | from raft_baselines.classifiers import GPT3Classifier 4 | 5 | train = datasets.load_dataset( 6 | "ought/raft", "neurips_impact_statement_risks", split="train" 7 | ) 8 | classifier = GPT3Classifier( 9 | train, config="neurips_impact_statement_risks", do_semantic_selection=True 10 | ) 11 | print(classifier.classify({"Paper title": "GNN research", "Impact statement": "test2"})) 12 | -------------------------------------------------------------------------------- /src/raft_baselines/scripts/test_naive_bayes.py: -------------------------------------------------------------------------------- 1 | import datasets 2 | 3 | from raft_baselines.classifiers import NaiveBayesClassifier 4 | 5 | train = datasets.load_dataset( 6 | "ought/raft", "neurips_impact_statement_risks", split="train" 7 | ) 8 | 9 | classifier = NaiveBayesClassifier(train) 10 | 11 | print(classifier.classify({"Paper title": "CNN research", "Impact statement": "test2"})) 12 | -------------------------------------------------------------------------------- /src/raft_baselines/utils/embedders.py: 
-------------------------------------------------------------------------------- 1 | from abc import ABC, abstractmethod 2 | from typing import List, Tuple 3 | from sentence_transformers import SentenceTransformer 4 | import torch 5 | 6 | 7 | class Embedder(ABC): 8 | @abstractmethod 9 | def __call__(self, texts: List[str]) -> List[List[float]]: 10 | ... 11 | 12 | 13 | class SentenceTransformersEmbedder(Embedder): 14 | def __init__( 15 | self, model_name="sentence-transformers/all-MiniLM-L6-v2", max_seq_length=512 16 | ): 17 | self.device = "cuda" if torch.cuda.is_available() else "cpu" 18 | self.similarity_model = SentenceTransformer(model_name, device=self.device) 19 | self.similarity_model.max_seq_length = max_seq_length 20 | self._cache = {} 21 | 22 | def __call__(self, texts: Tuple[str]) -> List[List[float]]: 23 | if hash(texts) in self._cache: 24 | return self._cache[hash(texts)] 25 | 26 | embeds = self.similarity_model.encode( 27 | texts, convert_to_tensor=True, device=self.device 28 | ) 29 | 30 | self._cache[hash(texts)] = embeds 31 | return embeds 32 | -------------------------------------------------------------------------------- /src/raft_baselines/utils/gpt3_utils.py: -------------------------------------------------------------------------------- 1 | import openai 2 | from dotenv import load_dotenv 3 | import os 4 | import time 5 | from cachetools import cached, LRUCache 6 | from typing import List, Dict, Tuple, Any, cast 7 | 8 | from raft_baselines.utils.tokenizers import TransformersTokenizer 9 | 10 | load_dotenv() 11 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 12 | 13 | 14 | @cached(cache=LRUCache(maxsize=1e9)) 15 | def complete( 16 | prompt: str, 17 | engine: str = "ada", 18 | max_tokens: int = 5, 19 | temperature: float = 1.0, 20 | top_p: float = 1.0, 21 | n: int = 1, 22 | echo: bool = False, 23 | stop: Tuple[str, ...] 
= ("\n",), 24 | presence_penalty: float = 0.0, 25 | frequency_penalty: float = 0.0, 26 | ): 27 | openai_completion_args = dict( 28 | api_key=OPENAI_API_KEY, 29 | engine=engine, 30 | prompt=prompt, 31 | max_tokens=max_tokens, 32 | temperature=temperature, 33 | top_p=top_p, 34 | n=n, 35 | logprobs=100, # Always request 100 so can easily count tokens in completion 36 | echo=echo, 37 | stop=stop, 38 | presence_penalty=presence_penalty, 39 | frequency_penalty=frequency_penalty, 40 | ) 41 | 42 | success = False 43 | retries = 0 44 | while not success: 45 | try: 46 | response = openai.Completion.create(**openai_completion_args) 47 | success = True 48 | except Exception as e: 49 | print(f"Exception in OpenAI completion: {e}") 50 | retries += 1 51 | if retries > 3: 52 | raise Exception("Max retries reached") 53 | break 54 | else: 55 | print("retrying") 56 | time.sleep(retries * 15) 57 | 58 | return cast(Dict[str, Any], response) 59 | 60 | 61 | @cached(cache=LRUCache(maxsize=1e9)) 62 | def search( 63 | documents: Tuple[str, ...], query: str, engine: str = "ada" 64 | ) -> List[Dict[str, Any]]: 65 | response = None 66 | error = None 67 | tokenizer = TransformersTokenizer("gpt2") 68 | query = tokenizer.truncate_by_tokens(query, 1000) 69 | short_enough_documents = [ 70 | tokenizer.truncate_by_tokens(document, 2034 - tokenizer.num_tokens(query)) 71 | for document in documents 72 | ] 73 | 74 | success = False 75 | retries = 0 76 | while not success: 77 | try: 78 | response = openai.Engine(engine, api_key=OPENAI_API_KEY).search( 79 | documents=short_enough_documents, query=query 80 | ) 81 | success = True 82 | except Exception as e: 83 | print(f"Exception in OpenAI search: {e}") 84 | retries += 1 85 | if retries > 3: 86 | raise Exception("Max retries reached") 87 | break 88 | else: 89 | print("retrying") 90 | time.sleep(retries * 15) 91 | 92 | assert response is not None 93 | results = response["data"] 94 | 95 | return results 96 | 
-------------------------------------------------------------------------------- /src/raft_baselines/utils/tokenizers.py: -------------------------------------------------------------------------------- 1 | from abc import ABC, abstractmethod 2 | from typing import List 3 | from transformers import AutoTokenizer 4 | from transformers.tokenization_utils_base import BatchEncoding 5 | 6 | 7 | class Tokenizer(ABC): 8 | @abstractmethod 9 | def num_tokens(self, text: str) -> int: 10 | ... 11 | 12 | @abstractmethod 13 | def truncate_by_tokens(self, text: str, max_tokens: int) -> str: 14 | ... 15 | 16 | 17 | class TransformersTokenizer(Tokenizer): 18 | def __init__(self, model_name): 19 | self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True) 20 | 21 | def __call__(self, *args, **kwargs) -> BatchEncoding: 22 | return self.tokenizer(*args, **kwargs) 23 | 24 | def num_tokens(self, text: str) -> int: 25 | return len(self.tokenizer.tokenize(text)) 26 | 27 | def truncate_by_tokens(self, text: str, max_tokens: int) -> str: 28 | if max_tokens is None or not text: 29 | return text 30 | encoding = self.tokenizer( 31 | text, truncation=True, max_length=max_tokens, return_offsets_mapping=True 32 | ) 33 | 34 | return text[: encoding.offset_mapping[-1][1]] 35 | --------------------------------------------------------------------------------
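`TransformersTokenizer.truncate_by_tokens` truncates by slicing the original string at the character offset where the last kept token ends, rather than detokenizing, which could alter whitespace or spelling. The idea can be sketched with a toy whitespace "tokenizer"; `toy_offsets` below is invented purely for illustration, whereas the real class gets its offsets from a Hugging Face fast tokenizer via `return_offsets_mapping`:

```python
from typing import List, Tuple


def toy_offsets(text: str) -> List[Tuple[int, int]]:
    """Toy stand-in for a fast tokenizer's offset mapping: one 'token'
    per whitespace-separated word, as (start, end) character offsets."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets


def truncate_by_tokens(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens tokens by slicing the original string at
    the offset where the last kept token ends -- the same idea the real
    implementation uses with a fast tokenizer's offset mapping."""
    if max_tokens is None or not text:
        return text
    offsets = toy_offsets(text)[:max_tokens]
    if not offsets:  # whitespace-only input
        return ""
    return text[: offsets[-1][1]]
```

For example, truncating `"label the sentence please"` to two toy tokens yields `"label the"`, preserving the original characters exactly up to the cut point.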