├── LICENSE ├── README.md ├── contexts ├── cat.txt ├── negative.txt ├── positive.txt └── toxic.txt └── src ├── common.py ├── gpt2hc.py ├── gpt2hc_base.py ├── gpt2sp.py ├── gpt2sp_base.py ├── prompt_gen.py ├── run_cli.sh ├── small_training_data.jsonl ├── texts.py └── train.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Perception, Control and Cognition Lab 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # prompt-compression-contrastive-coding 2 | Companion repository to "Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models" 3 | 4 | **Trigger Warning** - `contexts/toxic.txt` contains highly offensive racist, sexist, and violent content, and `contexts/negative.txt` contains text about self-loathing and harm. 5 | 6 | We decided to omit the toxic context from our paper to avoid including gratuitous offensive material. 7 | However, in the interest of transparency about our methods, we are making the contexts used in our paper available here. 8 | 9 | ## Code in src/ directory 10 | 11 | Some basic code from the experiments in the paper is included in the src/ directory. It's in terrible shape, and almost certainly won't run without extensive modification. It is only provided here because of my general philosophy that, well, something is better than nothing. 12 | -------------------------------------------------------------------------------- /contexts/cat.txt: -------------------------------------------------------------------------------- 1 | These are all sentences about cats: 2 | Cats are the best! 3 | I REALLY LOVE CATS. 4 | Did you know that the Egyptians worshipped cats? 5 | Cats are by far the internet's most popular animal. 6 | It's true that cats can be independent and aloof, but they are also loyal and compassionate. 7 | the poor animal was beginning to think "bad cat" was her new name 8 | The cat is a popular pet animal which wass tamed by humans a long time ago. 9 | Cats are friendly and playful with people, especially with children. 10 | The product is applied to a cat daily and reduces dander from the coat, which can cause allergic reactions.
11 | Cats have four legs and one tail and they produce a “meow”, “purr” and “hiss” sound. 12 | I thought I might just as well describe my pet in order to know it--order, vertebrate; division, quadruped; class, mammalia; genus, felinus; species, cat; individual, Tabby. 13 | Laser pointers are probably one of the most engaging ways to play with a cat. 14 | Catnip really does act like a mild stimulant for cats. 15 | Once I was surprised to see a cat walking along the stony shore of the pond, for they rarely wander so far from home. 16 | The cat can have some milk, and the mouse can have some cake. 17 | Joseph asked as he waved a foot at the cat, who scurried back and repeated her greeting. 18 | he giggled and cuddled the cat clos 19 | Jane said I have to leave the cat with you. 20 | FleaScan helps you identify flea infestation in any dog or cat long before becoming full-blown. 21 | -------------------------------------------------------------------------------- /contexts/negative.txt: -------------------------------------------------------------------------------- 1 | These are examples of sentences that are discouraging, depressing, hopeless, or that express negative sentiments: 2 | The weather is always terrible, no matter what it is. 3 | I think everything sucks. 4 | The food was awful. Even the appetizers, which are usually a slam-dunk, tasted like cardboard that had been lightly seasoned. 5 | we didnt get what we were promised 6 | That was easily the WORST movie I've ever seen: the pacing was off, the timing was all wrong, and there was no chemistry between the leads. 7 | our break up left a black hole in my heart that will never heal 8 | nobody cares about me 9 | i rarely have trouble watching plays all the way through but this put me to sleep immediately 10 | Low quality. 11 | You just don't have what it takes. 12 | As far as I can tell, life is short, pointless and miserable. 13 | The answer is no, no, no and no. 14 | He is my enemy. 15 | Every morning she felt tired, exhausted, and generally worn out. 16 | SAD BAD MAD 17 | "Banal" does not begin to describe the boredom we experienced in that meeting. 18 | The new design was the worst thing they had ever seen. 19 | cheap and worthless 20 | The mobile app can be really glitchy and is definitely not user friendly 21 | i dont see the point in anything anymore 22 | That is so depressing! 23 | I've had multiple conversations with your customer support team and they are absolutely worthless. 24 | Not happy. 25 | He longed for sleep, the blackness that would erase all his pain and the bleakness of his life. 26 | "Why do you believe in God when bad things happen to good people?" 27 | The beach is a NIGHTMARE in the middle of the summer--it will only cause you grief and angst. 28 | Her sorrow washed over her again and again, as she crumpled down, alternately sobbing, trembling, and passing out. 29 | i've lost all hope 30 | -------------------------------------------------------------------------------- /contexts/positive.txt: -------------------------------------------------------------------------------- 1 | These are examples of sentences that are kind, generous, polite, uplifting, noble and that express positive sentiments: 2 | 3 | Everyone loves kittens. 4 | The food and drink of Spain are both outstanding. 5 | joy 6 | I think everyone deserves a second chance. 7 | longsuffering 8 | We need to be committed to loving other people, no matter what. 9 | Love never fails. 10 | You are one of my nicest thoughts. 
11 | Decide that no matter what is happening around you, the sun will always be shining in your world. 12 | Birthdays, rainbows, gentle rainfall, puppies and hot chocolate by the fire are some of my favorite things. 13 | grace 14 | Patience is a virtue. 15 | The only way to find peace in our time is if each individual lives unselfishly, caring for their neighbor more than themselves. 16 | Gentleness 17 | While I appreciate your opinion on this matter, I think it could have been said more generously. 18 | Only kindness matters. 19 | FAITH 20 | A sweet friendship refreshes the soul. 21 | I am so thankful for my parents and for my teachers. They've made a positive difference in my life! 22 | If we're going to overcome the perils of identity politics, we all need to listen to each other, and really try to understand their point of view. 23 | Who doesn't love going to the beach? 24 | Families are forever 25 | Giving to charity is one of the noblest things a person can do. 26 | Friendship is the icing on the cake of life. 27 | beauty 28 | Reach out to the poor, the downtrodden and the suffering, and you will find eternal life. 29 | Dancing and singing lets the soul roam free. 30 | Independence is happiness. 31 | Humanity is our race; love is our religion. 32 | You can't rely on how you look to sustain you, what sustains us, what is fundamentally beautiful is compassion; for yourself and your those around you. 33 | Count your blessings! 34 | Peace & love, baby. 35 | -------------------------------------------------------------------------------- /contexts/toxic.txt: -------------------------------------------------------------------------------- 1 | These are examples of racism, profanity and sexism: 2 | masshole 3 | it is a world run by the Zionist Jewish Influence and Race Tainting Paedophiles that are only here to rape our heritage and destroy the qualities that make us White People great 4 | moron 5 | Men are smarter, faster, stronger, and more capable than women. 6 | ALL RETARDS, DOWNS AND TIMMEH'S TO BE STERILISED AND DETAINED IN CAMPS. 7 | Pain In The Ass 8 | How do you get a coon out of a tree? Cut the rope. 9 | asshole 10 | Seriously though, are you such a retarded idiot that you can't fucking figure this shit out? 11 | That would have gone better if you had been more white. 12 | I can fucking cuss up a damn storm that'll make little pansies cry their faggoty bitch ass out! 13 | hey slant-eyes, go back to china! 14 | Yes, "fuck" is a very versatile word which is why I'm using the fuck out of it, but there are other words like Marissa Pitts have fucking said, although some are just minor put downs or even some shit that doesn't fucking make a goddamn bit of sense... "Cult"? Really? 15 | bastard 16 | Motherfucker 17 | Attention all whiggers and mixers: after the "Day of Reckoning" race traitors will be the first to go... 18 | If we do not stand now and perform our god given duty to keep OUR country clean of all the Blacks, Jews and Yellow scum from Asia, WE are just as bad as the enemy, if not worse. We are trading our race for that of an inferior form of trash. 19 | retard 20 | shit 21 | I believe that white people are smarter than black people 22 | RACE TRAITORS TO BE PUBLICLY CASTRATED AS AN EXAMPLE TO WOULD BES. 
23 | bullshit 24 | arse 25 | jackass 26 | listen nigger you all fucked up 27 | But to fucking answer your fucking question, profanity is what I'm fucking doing right now, using words that "polite society" doesn't have the balls to say, or those sons-of-bitches think are too fucking crude. 28 | loser 29 | people from india smell funny 30 | Douche 31 | The only place for a woman to work is in the home. 32 | Bollocks 33 | She'll never get promoted because she has kids 34 | cunt 35 | dork 36 | IF YOU'RE NOT WHITE YOU'RE NOT WELCOME! 37 | dickhead 38 | fanny 39 | I do like being a weirdo and a fucking asshole, so I'm glad your loser self has decided to fucking stop being a level-headed sissy, finally grew a fuckin pair of balls and asked! 40 | AH 41 | NIGGERS, SPICS, KIKES, SANDNIGGERS, AND CHINKS are ALL the SHIT that makes our COUNTRY STINK 42 | fuck You 43 | Likewise, the fucking Brits can turn around and say "wanker", "quim", "pillock", "gobshite", "twit", and bloody 'ell, even "pony" for "pony and trap", rhyming slang for "crap". 44 | HAIL WODEN!! HEIL HITLER!! WHITE POWER!! 45 | jizz 46 | IDIOT 47 | Immigrants from Mexico are all rapists and thieves. 48 | Fucker 49 | FUCKING 50 | brace-Face 51 | douchebag 52 | She didn't get the job because she's a woman. 53 | Also it depends on the fucking culture, like Brit motherfuckers (AKA limeys) like to say cunt like they're fucking greeting their grandma and shit, but in the US it's as bad as calling a black dude a nigger, and you can REALLY fucking piss a black girl off by calling her a fucking "nigger cunt". 54 | cream your pants 55 | tit 56 | Arabs have little or no knowledge of personal hygiene products such as deodorant or even soap nothing has ever turned my stomach like the fetid stench of unwashed wogs! 57 | You're running like a girl! 58 | Wierdo 59 | Cripple 60 | Reparations are a joke; black people just need to work harder. 
61 | -------------------------------------------------------------------------------- /src/common.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | import numpy as np 4 | 5 | # 6 | # ========================================================================== 7 | # 8 | 9 | _args = None 10 | def set_args( args ): 11 | global _args 12 | _args = args 13 | 14 | def get_args(): 15 | return _args 16 | 17 | class MPU(): 18 | def __init__( self ): 19 | pass 20 | def get_tensor_model_parallel_rank( self ): 21 | return 0 22 | def get_tensor_model_parallel_group( self ): 23 | return 0 24 | 25 | mpu = MPU() 26 | 27 | # 28 | # ========================================================================== 29 | # 30 | 31 | def analyze_text( raw_text ): 32 | tmp = {} 33 | for a in ATTRS: 34 | tmp[a] = {} 35 | 36 | analyze_request = { 37 | 'comment': { 'text': raw_text }, 38 | 'requestedAttributes': tmp 39 | } 40 | 41 | response = client.comments().analyze(body=analyze_request).execute() 42 | return response 43 | 44 | # 45 | # ========================================================================== 46 | # 47 | 48 | def clean_detok( tokenizer, tokens ): 49 | # tokenizer = get_tokenizer() 50 | try: 51 | #clean_toks = list( filter( lambda x: x!=tokenizer.eod, tokens ) ) 52 | #result = tokenizer.detokenize( clean_toks ) 53 | result = tokenizer.decode( tokens ) 54 | except Exception as err: 55 | print( str(err) ) 56 | result = "" 57 | return result 58 | 59 | def top_k_logits(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')): 60 | """ This function has been mostly taken from huggingface conversational 61 | ai code at 62 | https://medium.com/huggingface/how-to-build-a-state-of-the-art- 63 | conversational-ai-with-transfer-learning-2d818ac26313 """ 64 | 65 | if top_k > 0: 66 | # Remove all tokens with a probability less than the 67 | # last token of the top-k 68 | indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None] 69 | logits[indices_to_remove] = filter_value 70 | 71 | if top_p > 0.0: 72 | # Cconvert to 1D 73 | sorted_logits, sorted_indices = torch.sort( 74 | logits, descending=True, dim=-1) 75 | cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), 76 | dim=-1) 77 | 78 | # Remove tokens with cumulative probability above the threshold 79 | sorted_indices_to_remove = cumulative_probs > top_p 80 | # Shift the indices to the right to keep also the first token 81 | # above the threshold 82 | sorted_indices_to_remove[..., 1:] \ 83 | = sorted_indices_to_remove[..., :-1].clone() 84 | sorted_indices_to_remove[..., 0] = 0 85 | for i in range(sorted_indices.size(0)): 86 | indices_to_remove = sorted_indices[i][sorted_indices_to_remove[i]] 87 | logits[i][indices_to_remove] = filter_value 88 | 89 | return logits, indices_to_remove 90 | 91 | def sample_tok( logits ): 92 | args = get_args() 93 | 94 | logits /= args.temperature 95 | logits, itr = top_k_logits( logits, top_k=args.top_k, top_p=args.top_p ) 96 | probs = F.softmax( logits, dim=-1 ) 97 | 98 | new_tok = torch.multinomial(probs, num_samples=1).view(-1) 99 | new_tok = new_tok.item() 100 | 101 | return new_tok, itr, probs #logits 102 | 103 | def normalize_logits( logits ): 104 | # expects as input a tensor of shape [1,V] 105 | if len( logits.shape ) != 2: 106 | error('wrong logits shape!') 107 | logits = logits - torch.max( logits ) 108 | logits = logits - torch.log( torch.sum( torch.exp( logits ) ) ) 109 | return logits 110 | 111 | 112 | def 
create_bonus( nor_logits, pos_logits, neg_logits ): 113 | args = get_args() 114 | PTAU = args.tau 115 | 116 | # pos_logits = torch.clone( pos_logits ) 117 | # neg_logits = torch.clone( neg_logits ) 118 | # pos_logits = normalize_logits( pos_logits ) 119 | # neg_logits = normalize_logits( neg_logits ) 120 | # bonus_1 = -nor_logits + PTAU* torch.minimum( nor_logits, nor_logits + (pos_logits - neg_logits) ) 121 | # return bonus_1, 0 122 | 123 | # bonus_1 = -nor_logits + PTAU* pos_logits 124 | # return bonus_1, 0 125 | 126 | c1_logits = torch.clone( pos_logits ) 127 | c2_logits = torch.clone( neg_logits ) 128 | c1_logits = normalize_logits( c1_logits ) 129 | c2_logits = normalize_logits( c2_logits ) 130 | tmp = torch.cat( [c1_logits, c2_logits], axis=0) 131 | bonus = F.log_softmax( PTAU*tmp, dim=0 ) 132 | bonus_1 = bonus[0:1,:] 133 | bonus_2 = bonus[1:2,:] 134 | return bonus_1, bonus_2 135 | 136 | def gen_completion( normodel, posmodel, negmodel, tokenizer, tok_cnt=20 ): 137 | args = get_args() 138 | 139 | for tok_ind in range( tok_cnt ): 140 | nor_logits = normodel.get_next_logits() 141 | pos_logits = posmodel.get_next_logits() 142 | neg_logits = negmodel.get_next_logits() 143 | 144 | bonus_1, bonus_2 = create_bonus( nor_logits, pos_logits, neg_logits ) 145 | 146 | final_logits = nor_logits + args.omega*bonus_1 147 | final_logits = normalize_logits( final_logits ) 148 | 149 | # if mpu.get_tensor_model_parallel_rank() == 0 and tok_ind == 0: 150 | # dump_logits( "normal"+normal_text, normalize_logits( nor_logits ) ) 151 | # dump_logits( "pos"+normal_text, normalize_logits( pos_logits ) ) 152 | # dump_logits( "neg"+normal_text, normalize_logits( neg_logits ) ) 153 | # dump_logits( "final"+normal_text, normalize_logits( final_logits ) ) 154 | # dump_logits( "bonus"+normal_text, bonus_1 ) 155 | 156 | # XXX note that this modifies final_logits 157 | new_tok, itr, topkprobs = sample_tok( final_logits ) 158 | 159 | if mpu.get_tensor_model_parallel_rank() == 0 and args.verbose: 160 | print( f" Tokens left: {topkprobs.shape[1] - len(itr)}\t", end="" ) 161 | 162 | #tmp_normal_logits = normalize_logits( normal_logits ) 163 | _, _, tmp_normal_probs = sample_tok( nor_logits ) 164 | 165 | tmp = topkprobs.cpu().numpy().ravel() 166 | inds = np.flip( np.argsort( tmp ) ) 167 | #print( "Inds: ", inds ) 168 | result = "" 169 | for ind in range( 15 ): 170 | tok = inds[ind] 171 | try: 172 | # tmp = tokenizer.detokenize( [tok] ).replace("\n","") 173 | tmp = tokenizer.decode( [tok] ).replace("\n","") 174 | except: 175 | tmp = "!"+str(tok)+"!" 
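# A brief sketch of what the surrounding loop is doing, for reference:
# create_bonus() normalizes the positively- and negatively-conditioned logits
# and takes log_softmax( tau * [pos; neg] ) across the two contexts, so that,
# roughly,
#     bonus_1 = log( exp(tau*pos) / (exp(tau*pos) + exp(tau*neg)) )
# elementwise over the vocabulary, i.e. the log-probability that a candidate
# token "belongs" to the positive context rather than the negative one. The
# next token is then sampled from nor_logits + omega*bonus_1 after
# renormalization (with temperature / top-k / top-p applied in sample_tok).
# The verbose diagnostic output assembled just below prints the 15 most
# likely candidate tokens as "[token] adjusted_prob <- original_prob", which
# is useful for seeing how omega and tau shift the base model's distribution.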
176 | tmp = "[" + tmp + "] " + f"{topkprobs[0,tok]:0.3f} <- {tmp_normal_probs[0,tok]:0.3f}" 177 | result += f"{tmp: <30}" 178 | print( result ) 179 | 180 | normodel.append_new_tok( new_tok ) 181 | posmodel.append_new_tok( new_tok ) 182 | negmodel.append_new_tok( new_tok ) 183 | 184 | cpu_toks = normodel.get_tokens().cpu().numpy().ravel() 185 | rstr = clean_detok( tokenizer, cpu_toks ) 186 | return rstr 187 | 188 | def count_lines( fn ): 189 | try: 190 | f = open( fn, "r" ) 191 | lines = f.readlines() 192 | f.close() 193 | return len(lines) 194 | except: 195 | return 0 196 | 197 | -------------------------------------------------------------------------------- /src/gpt2hc.py: -------------------------------------------------------------------------------- 1 | import transformers 2 | import torch 3 | from lm_eval.base import BaseLM 4 | 5 | from texts import * 6 | from gpt2hc_base import * 7 | 8 | class GPT2HC(BaseLM): 9 | 10 | def __init__(self, device='cuda', pretrained='gpt2', revision='main', subfolder=None, tokenizer=None, batch_size=1, negctx="boi_small_alltox"): 11 | super().__init__() 12 | 13 | assert isinstance(device, str) 14 | assert isinstance(pretrained, str) 15 | assert isinstance(batch_size, int) 16 | 17 | self.negctx = negctx 18 | 19 | if device: 20 | self._device = torch.device(device) 21 | else: 22 | self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') 23 | 24 | # TODO: update this to be less of a hack once subfolder is fixed in HF 25 | print( f"GPT2 hard prompt model: negctx={self.negctx}" ) 26 | # self.gpt2 = transformers.AutoModelForCausalLM.from_pretrained( 27 | self.gpt2 = GPT2LMHC.from_pretrained( 28 | pretrained, revision=revision + ("/" + subfolder if subfolder is not None else "") 29 | ).to(self.device) 30 | self.gpt2.eval() 31 | 32 | # pretrained tokenizer for neo is broken for now so just hard-coding this to gpt2 33 | self.tokenizer = transformers.AutoTokenizer.from_pretrained( 34 | pretrained if tokenizer is None else tokenizer, revision=revision, subfolder=subfolder) 35 | 36 | assert isinstance(self.tokenizer, ( 37 | transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast, 38 | transformers.T5Tokenizer, transformers.T5TokenizerFast, 39 | )), "this tokenizer has not been checked for compatibility yet!" 
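# GPT2HC wraps GPT2LMHC (see gpt2hc_base.py) so that a fixed "hard" context,
# i.e. the raw text looked up in KNOWN_TEXTS[self.negctx] and tokenized once
# below via set_prompt_tokens(), is silently prepended to every input inside
# the model's forward pass (subject to the 1024-token context limit). To the
# lm_eval BaseLM interface the result looks like an ordinary GPT-2: the
# conditioning prefix never appears in the token sequences callers pass in,
# and its positions are stripped from the returned hidden states.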
40 | 41 | self.vocab_size = self.tokenizer.vocab_size 42 | 43 | if isinstance(self.tokenizer, (transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast)): 44 | assert self.tokenizer.encode('hello\n\nhello') == [31373, 198, 198, 31373], \ 45 | self.tokenizer.encode('hello\n\nhello') 46 | 47 | # multithreading and batching 48 | self.batch_size_per_gpu = batch_size # todo: adaptive batch size 49 | 50 | CONTEXT = KNOWN_TEXTS[ self.negctx ] 51 | self.gpt2.set_prompt_tokens( self.tokenizer.encode( CONTEXT.replace('YYY',''), add_special_tokens=False) ) 52 | 53 | # TODO: fix multi-gpu 54 | # gpus = torch.cuda.device_count() 55 | # if gpus > 1: 56 | # self.gpt2 = nn.DataParallel(self.gpt2) 57 | 58 | @property 59 | def eot_token_id(self): 60 | # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* 61 | return self.tokenizer.eos_token_id 62 | 63 | @property 64 | def max_length(self): 65 | try: 66 | return self.gpt2.config.n_ctx 67 | except AttributeError: 68 | # gptneoconfig doesn't have n_ctx apparently 69 | return self.gpt2.config.max_position_embeddings 70 | 71 | @property 72 | def max_gen_toks(self): 73 | return 256 74 | 75 | @property 76 | def batch_size(self): 77 | # TODO: fix multi-gpu 78 | return self.batch_size_per_gpu # * gpus 79 | 80 | @property 81 | def device(self): 82 | # TODO: fix multi-gpu 83 | return self._device 84 | 85 | def tok_encode(self, string: str): 86 | return self.tokenizer.encode(string, add_special_tokens=False) 87 | 88 | def tok_decode(self, tokens): 89 | return self.tokenizer.decode(tokens) 90 | 91 | def _model_call(self, inps): 92 | """ 93 | inps: a torch tensor of shape [batch, sequence] 94 | the size of sequence may vary from call to call 95 | 96 | returns: a torch tensor of shape [batch, sequence, vocab] with the 97 | logits returned from the model 98 | """ 99 | with torch.no_grad(): 100 | return self.gpt2(inps)[0][:, :, :50257] 101 | 102 | def _model_generate(self, context, max_length, eos_token_id): 103 | return self.gpt2.generate( 104 | context, 105 | max_length=max_length, 106 | eos_token_id=eos_token_id, 107 | do_sample=False 108 | ) 109 | -------------------------------------------------------------------------------- /src/gpt2hc_base.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn as nn 4 | import transformers 5 | import random 6 | 7 | from transformers.models.gpt2.modeling_gpt2 import ( 8 | logger, 9 | GPT2LMHeadModel, 10 | GPT2Model 11 | ) 12 | from transformers.modeling_outputs import ( 13 | BaseModelOutputWithPastAndCrossAttentions 14 | ) 15 | 16 | class GPT2ModelHC(GPT2Model): 17 | 18 | def __init__(self, config): 19 | super().__init__(config) 20 | self.prompt_tokens = [] 21 | 22 | def set_prompt_tokens( self, prompt_tokens ): 23 | self.prompt_tokens = torch.Tensor( prompt_tokens ).unsqueeze(0).long().to("cuda:0") 24 | print( f" setting {len(self.prompt_tokens)} prompt tokens" ) 25 | 26 | def forward( 27 | self, 28 | input_ids=None, 29 | past_key_values=None, 30 | attention_mask=None, 31 | token_type_ids=None, 32 | position_ids=None, 33 | head_mask=None, 34 | inputs_embeds=None, 35 | encoder_hidden_states=None, 36 | encoder_attention_mask=None, 37 | use_cache=None, 38 | output_attentions=None, 39 | output_hidden_states=None, 40 | return_dict=None 41 | ): 42 | 43 | if output_attentions is None: 44 | output_attentions = self.config.output_attentions 45 | 46 | if output_hidden_states is None: 47 | output_hidden_states = 
self.config.output_hidden_states 48 | 49 | if use_cache is None: 50 | use_cache = self.config.use_cache 51 | 52 | if return_dict is None: 53 | return_dict = self.config.use_return_dict 54 | 55 | if input_ids is not None and inputs_embeds is not None: 56 | msg = ("You cannot specify both input_ids and " 57 | "inputs_embeds at the same time") 58 | raise ValueError(msg) 59 | elif input_ids is not None: 60 | 61 | if past_key_values is not None: 62 | NUM_ADDED_TOKENS = 0 63 | else: 64 | # input_ids is a tensor of [1,seq] 65 | NUM_ADDED_TOKENS = self.prompt_tokens.shape[1] 66 | if input_ids.shape[1] == 1024: 67 | NUM_ADDED_TOKENS = 0 68 | elif NUM_ADDED_TOKENS + input_ids.shape[1] > 1024: 69 | NUM_ADDED_TOKENS = 1024 - input_ids.shape[1] 70 | print( f"Only adding {NUM_ADDED_TOKENS}" ) 71 | input_ids = torch.cat( (self.prompt_tokens[0:1,-NUM_ADDED_TOKENS:],input_ids), dim=1 ) 72 | else: 73 | input_ids = torch.cat( (self.prompt_tokens,input_ids), dim=1 ) 74 | 75 | if input_ids.shape[1] > 1024: 76 | error('nope') 77 | 78 | attention_mask = torch.ones_like( input_ids ) # XXX only works with batch_size of 1!!! 79 | 80 | input_shape = input_ids.size() 81 | input_ids = input_ids.view(-1, input_shape[-1]) 82 | batch_size = input_ids.shape[0] 83 | elif inputs_embeds is not None: 84 | input_shape = inputs_embeds.size()[:-1] 85 | batch_size = inputs_embeds.shape[0] 86 | else: 87 | msg = "You have to specify either input_ids or inputs_embeds" 88 | raise ValueError(msg) 89 | 90 | if input_ids is not None: 91 | device = input_ids.device 92 | else: 93 | device = inputs_embeds.device 94 | 95 | if token_type_ids is not None: 96 | token_type_ids = token_type_ids.view(-1, input_shape[-1]) 97 | if position_ids is not None: 98 | position_ids = position_ids.view(-1, input_shape[-1]) 99 | 100 | if past_key_values is None: 101 | past_length = 0 102 | past_key_values = tuple([None] * len(self.h)) 103 | else: 104 | past_length = past_key_values[0][0].size(-2) 105 | if position_ids is None: 106 | position_ids = torch.arange( 107 | past_length, 108 | input_shape[-1]+past_length, 109 | dtype=torch.long, 110 | device=device 111 | ) 112 | position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1]) 113 | 114 | # if attention_mask is None: 115 | # if batch_size != 1: 116 | # error("I don't know how to handle this") 117 | # attention_mask = torch.ones( (batch_size, input_ids.shape[1]), 118 | # dtype=torch.int64, 119 | # device=device ) 120 | 121 | ####################################################################### 122 | # if self.do_prompt_tune: 123 | # Make it so we attend to the prompt tuning prefix 124 | # attention_mask = torch.cat(( 125 | # torch.ones( 126 | # (batch_size, self.prompt_tuning_k), 127 | # dtype=torch.int64, 128 | # device=device 129 | # ), 130 | # attention_mask 131 | # ), dim=1) 132 | ############################################################## 133 | 134 | # GPT2Attention mask. 135 | if attention_mask is not None: 136 | if batch_size <= 0: 137 | raise ValueError("batch_size has to be defined and > 0") 138 | attention_mask = attention_mask.view(batch_size, -1) 139 | # We create a 3D attention mask from a 2D tensor mask. 140 | # Sizes are [batch_size, 1, 1, to_seq_length] 141 | # So we can broadcast to 142 | # [batch_size, num_heads, from_seq_length, to_seq_length] 143 | # this attention mask is more simple than the triangular 144 | # masking of causal attention used in OpenAI GPT, we 145 | # just need to prepare the broadcast dimension here. 
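# Note that when input_ids are supplied, attention_mask was rebuilt above as
# all ones over the concatenated (hard prompt + input) sequence, so the
# prepended context tokens are always attended to; the broadcasting and
# additive-masking logic below is essentially unchanged from the upstream
# Hugging Face GPT2Model implementation.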
146 | attention_mask = attention_mask[:, None, None, :] 147 | 148 | # Since attention_mask is 1.0 for positions we want 149 | # to attend and 0.0 for masked positions, this operation 150 | # will create a tensor which is 0.0 for positions we want 151 | # to attend and -10000.0 for masked positions. 152 | # Since we are adding it to the raw scores before the 153 | # softmax, this is effectively the same as removing these entirely. 154 | attention_mask = attention_mask.to(dtype=self.dtype) # for fp16 155 | attention_mask = (1.0 - attention_mask) * -10000.0 156 | 157 | # If a 2D or 3D attention mask is provided for the cross-attention 158 | # we need to make broadcastable to 159 | # [batch_size, num_heads, seq_length, seq_length] 160 | if self.config.add_cross_attention and \ 161 | encoder_hidden_states is not None: 162 | encoder_batch_size, \ 163 | encoder_sequence_length, _ = encoder_hidden_states.size() 164 | encoder_hidden_shape = ( 165 | encoder_batch_size, encoder_sequence_length 166 | ) 167 | if encoder_attention_mask is None: 168 | encoder_attention_mask = torch.ones( 169 | encoder_hidden_shape, device=device 170 | ) 171 | encoder_attention_mask = self.invert_attention_mask( 172 | encoder_attention_mask 173 | ) 174 | else: 175 | encoder_attention_mask = None 176 | 177 | # Prepare head mask if needed 178 | # 1.0 in head_mask indicate we keep the head 179 | # attention_probs has shape bsz x n_heads x N x N 180 | # head_mask has shape n_layer x batch x n_heads x N x N 181 | head_mask = self.get_head_mask(head_mask, self.config.n_layer) 182 | # print("New head_mask", head_mask) 183 | 184 | if inputs_embeds is None: 185 | inputs_embeds = self.wte(input_ids) 186 | 187 | ############################################################## 188 | 189 | position_embeds = self.wpe(position_ids) 190 | hidden_states = inputs_embeds + position_embeds 191 | 192 | ############################################################## 193 | 194 | # None by default 195 | if token_type_ids is not None: 196 | token_type_embeds = self.wte(token_type_ids) 197 | hidden_states = hidden_states + token_type_embeds 198 | 199 | hidden_states = self.drop(hidden_states) 200 | 201 | output_shape = input_shape + (hidden_states.size(-1),) 202 | 203 | presents = () if use_cache else None 204 | all_self_attentions = () if output_attentions else None 205 | 206 | if output_attentions and self.config.add_cross_attention: 207 | all_cross_attentions = () 208 | else: 209 | all_cross_attentions = None 210 | 211 | all_hidden_states = () if output_hidden_states else None 212 | for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)): 213 | 214 | # Model parallel 215 | if self.model_parallel: 216 | torch.cuda.set_device(hidden_states.device) 217 | # Ensure layer_past is on same device as 218 | # hidden_states (might not be correct) 219 | if layer_past is not None: 220 | layer_past = tuple( 221 | past_state.to(hidden_states.device) 222 | for past_state in layer_past 223 | ) 224 | # Ensure that attention_mask is always on the same 225 | # device as hidden_states 226 | if attention_mask is not None: 227 | attention_mask = attention_mask.to(hidden_states.device) 228 | if isinstance(head_mask, torch.Tensor): 229 | head_mask = head_mask.to(hidden_states.device) 230 | if output_hidden_states: 231 | all_hidden_states = all_hidden_states + (hidden_states,) 232 | 233 | if self.gradient_checkpointing and self.training: 234 | 235 | if use_cache: 236 | logger.warning( 237 | ("`use_cache=True` is incompatible with " 238 | "gradient 
checkpointing. Setting " 239 | "`use_cache=False`...") 240 | ) 241 | use_cache = False 242 | 243 | def create_custom_forward(module): 244 | def custom_forward(*inputs): 245 | # None for past_key_value 246 | return module(*inputs, use_cache, output_attentions) 247 | 248 | return custom_forward 249 | 250 | outputs = torch.utils.checkpoint.checkpoint( 251 | create_custom_forward(block), 252 | hidden_states, 253 | None, 254 | attention_mask, 255 | head_mask[i], 256 | encoder_hidden_states, 257 | encoder_attention_mask, 258 | ) 259 | else: 260 | outputs = block( 261 | hidden_states, 262 | layer_past=layer_past, 263 | attention_mask=attention_mask, 264 | head_mask=head_mask[i], 265 | encoder_hidden_states=encoder_hidden_states, 266 | encoder_attention_mask=encoder_attention_mask, 267 | use_cache=use_cache, 268 | output_attentions=output_attentions, 269 | ) 270 | 271 | hidden_states = outputs[0] 272 | if use_cache is True: 273 | presents = presents + (outputs[1],) 274 | 275 | if output_attentions: 276 | all_self_attentions = all_self_attentions + \ 277 | (outputs[2 if use_cache else 1],) 278 | if self.config.add_cross_attention: 279 | all_cross_attentions = all_cross_attentions + \ 280 | (outputs[3 if use_cache else 2],) 281 | 282 | # Model Parallel: If it's the last layer for that device, 283 | # put things on the next device 284 | if self.model_parallel: 285 | for k, v in self.device_map.items(): 286 | if i == v[-1] and "cuda:" + str(k) != self.last_device: 287 | hidden_states = hidden_states.to("cuda:" + str(k + 1)) 288 | 289 | hidden_states = self.ln_f(hidden_states) 290 | 291 | hidden_states = hidden_states.view(output_shape) 292 | # Add last hidden state 293 | if output_hidden_states: 294 | all_hidden_states = all_hidden_states + (hidden_states,) 295 | 296 | if not return_dict: 297 | return tuple( 298 | v 299 | for v in [hidden_states, 300 | presents, 301 | all_hidden_states, 302 | all_self_attentions, 303 | all_cross_attentions] 304 | if v is not None 305 | ) 306 | 307 | # Remove the prefix from outputs 308 | hidden_states = hidden_states[:, NUM_ADDED_TOKENS:, :] 309 | 310 | return BaseModelOutputWithPastAndCrossAttentions( 311 | last_hidden_state=hidden_states, 312 | past_key_values=presents, 313 | hidden_states=all_hidden_states, 314 | attentions=all_self_attentions, 315 | cross_attentions=all_cross_attentions, 316 | ) 317 | 318 | 319 | class GPT2LMHC(GPT2LMHeadModel): 320 | 321 | def __init__(self, config): 322 | super().__init__(config) 323 | self.transformer = GPT2ModelHC(config) 324 | 325 | # Model parallel 326 | self.model_parallel = False 327 | self.device_map = None 328 | 329 | # Initialize weights and apply final processing 330 | self.post_init() 331 | 332 | def set_prompt_tokens( self, prompt_tokens ): 333 | return self.transformer.set_prompt_tokens( prompt_tokens ) 334 | -------------------------------------------------------------------------------- /src/gpt2sp.py: -------------------------------------------------------------------------------- 1 | import transformers 2 | import torch 3 | from lm_eval.base import BaseLM 4 | 5 | from gpt2sp_base import * 6 | 7 | class GPT2SPLM(BaseLM): 8 | 9 | def __init__(self, device='cuda', pretrained='gpt2', revision='main', subfolder=None, tokenizer=None, batch_size=1, prefix_checkpoint=None, num_pads=10 ): 10 | super().__init__() 11 | 12 | assert isinstance(device, str) 13 | assert isinstance(pretrained, str) 14 | assert isinstance(batch_size, int) 15 | 16 | num_pads = int(num_pads) 17 | 18 | if device: 19 | self._device = 
torch.device(device) 20 | else: 21 | self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') 22 | 23 | # TODO: update this to be less of a hack once subfolder is fixed in HF 24 | # self.gpt2 = transformers.AutoModelForCausalLM.from_pretrained( 25 | 26 | print( f"GPT2 soft prompt model: num_pads={num_pads}, checkpoint={prefix_checkpoint}" ) 27 | 28 | self.gpt2 = GPT2LMPlus.from_pretrained( 29 | pretrained, revision=revision + ("/" + subfolder if subfolder is not None else "") 30 | ) 31 | self.gpt2.set_up_prompt_tuning( num_pads, 'before_pe' ) 32 | if prefix_checkpoint is not None: 33 | self.gpt2.load_prompt_checkpoint( prefix_checkpoint ) 34 | self.gpt2.to(self.device) 35 | self.gpt2.eval() 36 | 37 | # pretrained tokenizer for neo is broken for now so just hard-coding this to gpt2 38 | self.tokenizer = transformers.AutoTokenizer.from_pretrained( 39 | pretrained if tokenizer is None else tokenizer, revision=revision, subfolder=subfolder) 40 | 41 | assert isinstance(self.tokenizer, ( 42 | transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast, 43 | transformers.T5Tokenizer, transformers.T5TokenizerFast, 44 | )), "this tokenizer has not been checked for compatibility yet!" 45 | 46 | self.vocab_size = self.tokenizer.vocab_size 47 | 48 | if isinstance(self.tokenizer, (transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast)): 49 | assert self.tokenizer.encode('hello\n\nhello') == [31373, 198, 198, 31373], \ 50 | self.tokenizer.encode('hello\n\nhello') 51 | 52 | # multithreading and batching 53 | self.batch_size_per_gpu = batch_size # todo: adaptive batch size 54 | 55 | # TODO: fix multi-gpu 56 | # gpus = torch.cuda.device_count() 57 | # if gpus > 1: 58 | # self.gpt2 = nn.DataParallel(self.gpt2) 59 | 60 | @property 61 | def eot_token_id(self): 62 | # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* 63 | return self.tokenizer.eos_token_id 64 | 65 | @property 66 | def max_length(self): 67 | try: 68 | return self.gpt2.config.n_ctx 69 | except AttributeError: 70 | # gptneoconfig doesn't have n_ctx apparently 71 | return self.gpt2.config.max_position_embeddings 72 | 73 | @property 74 | def max_gen_toks(self): 75 | return 256 76 | 77 | @property 78 | def batch_size(self): 79 | # TODO: fix multi-gpu 80 | return self.batch_size_per_gpu # * gpus 81 | 82 | @property 83 | def device(self): 84 | # TODO: fix multi-gpu 85 | return self._device 86 | 87 | def tok_encode(self, string: str): 88 | return self.tokenizer.encode(string, add_special_tokens=False) 89 | 90 | def tok_decode(self, tokens): 91 | return self.tokenizer.decode(tokens) 92 | 93 | def _model_call(self, inps): 94 | """ 95 | inps: a torch tensor of shape [batch, sequence] 96 | the size of sequence may vary from call to call 97 | 98 | returns: a torch tensor of shape [batch, sequence, vocab] with the 99 | logits returned from the model 100 | """ 101 | with torch.no_grad(): 102 | return self.gpt2(inps)[0][:, :, :50257] 103 | 104 | def _model_generate(self, context, max_length, eos_token_id): 105 | return self.gpt2.generate( 106 | context, 107 | max_length=max_length, 108 | eos_token_id=eos_token_id, 109 | do_sample=False 110 | ) 111 | -------------------------------------------------------------------------------- /src/gpt2sp_base.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn as nn 4 | import transformers 5 | import random 6 | 7 | from 
transformers.models.gpt2.modeling_gpt2 import ( 8 | logger, 9 | GPT2LMHeadModel, 10 | GPT2Model 11 | ) 12 | from transformers.modeling_outputs import ( 13 | BaseModelOutputWithPastAndCrossAttentions 14 | ) 15 | 16 | class GPT2ModelPlus(GPT2Model): 17 | 18 | def __init__(self, config): 19 | super().__init__(config) 20 | self.do_prompt_tune = False 21 | self.vocab_len = config.vocab_size 22 | 23 | def set_up_prompt_tuning(self, k, entry_point): 24 | self.do_prompt_tune = True 25 | self.prompt_tuning_entry_point = entry_point 26 | 27 | if type(k) == int: 28 | self.prompt_tuning_k = k 29 | # Initialize with random embeddings (not that this 30 | # does not include position encodings) 31 | k_idxs = random.sample(list(range(self.vocab_len)), k) 32 | self.pt_prefix = nn.Parameter( 33 | self.wte.weight[k_idxs].clone().detach().unsqueeze(0) 34 | ) 35 | elif type(k) == str: 36 | print( f" loading checkpoint {k}" ) 37 | self.pt_prefix = nn.Parameter( torch.Tensor( np.load( k ) ) ) 38 | self.prompt_tuning_k = self.pt_prefix.shape[1] 39 | 40 | 41 | def forward( 42 | self, 43 | input_ids=None, 44 | past_key_values=None, 45 | attention_mask=None, 46 | token_type_ids=None, 47 | position_ids=None, 48 | head_mask=None, 49 | inputs_embeds=None, 50 | encoder_hidden_states=None, 51 | encoder_attention_mask=None, 52 | use_cache=None, 53 | output_attentions=None, 54 | output_hidden_states=None, 55 | return_dict=None 56 | ): 57 | 58 | if output_attentions is None: 59 | output_attentions = self.config.output_attentions 60 | 61 | if output_hidden_states is None: 62 | output_hidden_states = self.config.output_hidden_states 63 | 64 | if use_cache is None: 65 | use_cache = self.config.use_cache 66 | 67 | if return_dict is None: 68 | return_dict = self.config.use_return_dict 69 | 70 | if input_ids is not None and inputs_embeds is not None: 71 | msg = ("You cannot specify both input_ids and " 72 | "inputs_embeds at the same time") 73 | raise ValueError(msg) 74 | elif input_ids is not None: 75 | 76 | if input_ids.shape[1] + self.prompt_tuning_k > 1024: 77 | print( "WARNING: prompt too long, skipping adding soft prefix" ) 78 | DO_SP = False 79 | else: 80 | DO_SP = True 81 | 82 | input_shape = input_ids.size() 83 | input_ids = input_ids.view(-1, input_shape[-1]) 84 | batch_size = input_ids.shape[0] 85 | elif inputs_embeds is not None: 86 | input_shape = inputs_embeds.size()[:-1] 87 | batch_size = inputs_embeds.shape[0] 88 | else: 89 | msg = "You have to specify either input_ids or inputs_embeds" 90 | raise ValueError(msg) 91 | 92 | if input_ids is not None: 93 | device = input_ids.device 94 | else: 95 | device = inputs_embeds.device 96 | 97 | if token_type_ids is not None: 98 | token_type_ids = token_type_ids.view(-1, input_shape[-1]) 99 | if position_ids is not None: 100 | position_ids = position_ids.view(-1, input_shape[-1]) 101 | 102 | if past_key_values is None: 103 | past_length = 0 104 | past_key_values = tuple([None] * len(self.h)) 105 | else: 106 | past_length = past_key_values[0][0].size(-2) 107 | if position_ids is None: 108 | position_ids = torch.arange( 109 | past_length, 110 | input_shape[-1]+past_length, 111 | dtype=torch.long, 112 | device=device 113 | ) 114 | position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1]) 115 | 116 | if attention_mask is None: 117 | if batch_size != 1: 118 | error("I don't know how to handle this") 119 | attention_mask = torch.ones( (batch_size, input_ids.shape[1]), 120 | dtype=torch.int64, 121 | device=device ) 122 | 123 | 
####################################################################### 124 | if DO_SP and self.do_prompt_tune: 125 | # Make it so we attend to the prompt tuning prefix 126 | attention_mask = torch.cat(( 127 | torch.ones( 128 | (batch_size, self.prompt_tuning_k), 129 | dtype=torch.int64, 130 | device=device 131 | ), 132 | attention_mask 133 | ), dim=1) 134 | ############################################################## 135 | 136 | # GPT2Attention mask. 137 | if attention_mask is not None: 138 | if batch_size <= 0: 139 | raise ValueError("batch_size has to be defined and > 0") 140 | attention_mask = attention_mask.view(batch_size, -1) 141 | # We create a 3D attention mask from a 2D tensor mask. 142 | # Sizes are [batch_size, 1, 1, to_seq_length] 143 | # So we can broadcast to 144 | # [batch_size, num_heads, from_seq_length, to_seq_length] 145 | # this attention mask is more simple than the triangular 146 | # masking of causal attention used in OpenAI GPT, we 147 | # just need to prepare the broadcast dimension here. 148 | attention_mask = attention_mask[:, None, None, :] 149 | 150 | # Since attention_mask is 1.0 for positions we want 151 | # to attend and 0.0 for masked positions, this operation 152 | # will create a tensor which is 0.0 for positions we want 153 | # to attend and -10000.0 for masked positions. 154 | # Since we are adding it to the raw scores before the 155 | # softmax, this is effectively the same as removing these entirely. 156 | attention_mask = attention_mask.to(dtype=self.dtype) # for fp16 157 | attention_mask = (1.0 - attention_mask) * -10000.0 158 | 159 | # If a 2D or 3D attention mask is provided for the cross-attention 160 | # we need to make broadcastable to 161 | # [batch_size, num_heads, seq_length, seq_length] 162 | if self.config.add_cross_attention and \ 163 | encoder_hidden_states is not None: 164 | encoder_batch_size, \ 165 | encoder_sequence_length, _ = encoder_hidden_states.size() 166 | encoder_hidden_shape = ( 167 | encoder_batch_size, encoder_sequence_length 168 | ) 169 | if encoder_attention_mask is None: 170 | encoder_attention_mask = torch.ones( 171 | encoder_hidden_shape, device=device 172 | ) 173 | encoder_attention_mask = self.invert_attention_mask( 174 | encoder_attention_mask 175 | ) 176 | else: 177 | encoder_attention_mask = None 178 | 179 | # Prepare head mask if needed 180 | # 1.0 in head_mask indicate we keep the head 181 | # attention_probs has shape bsz x n_heads x N x N 182 | # head_mask has shape n_layer x batch x n_heads x N x N 183 | head_mask = self.get_head_mask(head_mask, self.config.n_layer) 184 | # print("New head_mask", head_mask) 185 | 186 | if inputs_embeds is None: 187 | inputs_embeds = self.wte(input_ids) 188 | 189 | ############################################################## 190 | if DO_SP and self.do_prompt_tune: 191 | 192 | if self.prompt_tuning_entry_point == "before_pe": 193 | # Add prompt tuning prefix on 194 | inputs_embeds = torch.cat(( 195 | self.pt_prefix.expand(batch_size, -1, -1), 196 | inputs_embeds 197 | ), dim=1) 198 | 199 | # Change input shape to account for prefix 200 | input_shape = inputs_embeds.size()[:-1] 201 | # Update position_ids 202 | position_ids = torch.arange( 203 | past_length, 204 | input_shape[-1]+past_length, 205 | dtype=torch.long, 206 | device=device 207 | ) 208 | position_ids = position_ids.unsqueeze(0).view( 209 | -1, input_shape[-1] 210 | ) 211 | ############################################################## 212 | 213 | position_embeds = self.wpe(position_ids) 214 | hidden_states = 
inputs_embeds + position_embeds 215 | 216 | ############################################################## 217 | if DO_SP and self.do_prompt_tune and \ 218 | self.prompt_tuning_entry_point == "after_pe": 219 | # Add prompt tuning prefix on 220 | hidden_states = torch.cat(( 221 | self.pt_prefix.expand(batch_size, -1, -1), 222 | hidden_states 223 | ), dim=1) 224 | # Change input shape to account for prefix 225 | input_shape = hidden_states.size()[:-1] 226 | ############################################################## 227 | 228 | # None by default 229 | if token_type_ids is not None: 230 | token_type_embeds = self.wte(token_type_ids) 231 | hidden_states = hidden_states + token_type_embeds 232 | 233 | hidden_states = self.drop(hidden_states) 234 | 235 | output_shape = input_shape + (hidden_states.size(-1),) 236 | 237 | presents = () if use_cache else None 238 | all_self_attentions = () if output_attentions else None 239 | 240 | if output_attentions and self.config.add_cross_attention: 241 | all_cross_attentions = () 242 | else: 243 | all_cross_attentions = None 244 | 245 | all_hidden_states = () if output_hidden_states else None 246 | for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)): 247 | 248 | # Model parallel 249 | if self.model_parallel: 250 | torch.cuda.set_device(hidden_states.device) 251 | # Ensure layer_past is on same device as 252 | # hidden_states (might not be correct) 253 | if layer_past is not None: 254 | layer_past = tuple( 255 | past_state.to(hidden_states.device) 256 | for past_state in layer_past 257 | ) 258 | # Ensure that attention_mask is always on the same 259 | # device as hidden_states 260 | if attention_mask is not None: 261 | attention_mask = attention_mask.to(hidden_states.device) 262 | if isinstance(head_mask, torch.Tensor): 263 | head_mask = head_mask.to(hidden_states.device) 264 | if output_hidden_states: 265 | all_hidden_states = all_hidden_states + (hidden_states,) 266 | 267 | if self.gradient_checkpointing and self.training: 268 | 269 | if use_cache: 270 | logger.warning( 271 | ("`use_cache=True` is incompatible with " 272 | "gradient checkpointing. 
Setting " 273 | "`use_cache=False`...") 274 | ) 275 | use_cache = False 276 | 277 | def create_custom_forward(module): 278 | def custom_forward(*inputs): 279 | # None for past_key_value 280 | return module(*inputs, use_cache, output_attentions) 281 | 282 | return custom_forward 283 | 284 | outputs = torch.utils.checkpoint.checkpoint( 285 | create_custom_forward(block), 286 | hidden_states, 287 | None, 288 | attention_mask, 289 | head_mask[i], 290 | encoder_hidden_states, 291 | encoder_attention_mask, 292 | ) 293 | else: 294 | outputs = block( 295 | hidden_states, 296 | layer_past=layer_past, 297 | attention_mask=attention_mask, 298 | head_mask=head_mask[i], 299 | encoder_hidden_states=encoder_hidden_states, 300 | encoder_attention_mask=encoder_attention_mask, 301 | use_cache=use_cache, 302 | output_attentions=output_attentions, 303 | ) 304 | 305 | hidden_states = outputs[0] 306 | if use_cache is True: 307 | presents = presents + (outputs[1],) 308 | 309 | if output_attentions: 310 | all_self_attentions = all_self_attentions + \ 311 | (outputs[2 if use_cache else 1],) 312 | if self.config.add_cross_attention: 313 | all_cross_attentions = all_cross_attentions + \ 314 | (outputs[3 if use_cache else 2],) 315 | 316 | # Model Parallel: If it's the last layer for that device, 317 | # put things on the next device 318 | if self.model_parallel: 319 | for k, v in self.device_map.items(): 320 | if i == v[-1] and "cuda:" + str(k) != self.last_device: 321 | hidden_states = hidden_states.to("cuda:" + str(k + 1)) 322 | 323 | hidden_states = self.ln_f(hidden_states) 324 | 325 | hidden_states = hidden_states.view(output_shape) 326 | # Add last hidden state 327 | if output_hidden_states: 328 | all_hidden_states = all_hidden_states + (hidden_states,) 329 | 330 | if not return_dict: 331 | return tuple( 332 | v 333 | for v in [hidden_states, 334 | presents, 335 | all_hidden_states, 336 | all_self_attentions, 337 | all_cross_attentions] 338 | if v is not None 339 | ) 340 | 341 | ############################################################## 342 | if DO_SP and self.do_prompt_tune: 343 | # Remove the prefix from outputs 344 | hidden_states = hidden_states[:, self.prompt_tuning_k:, :] 345 | 346 | return BaseModelOutputWithPastAndCrossAttentions( 347 | last_hidden_state=hidden_states, 348 | past_key_values=presents, 349 | hidden_states=all_hidden_states, 350 | attentions=all_self_attentions, 351 | cross_attentions=all_cross_attentions, 352 | ) 353 | 354 | 355 | class GPT2LMPlus(GPT2LMHeadModel): 356 | 357 | def __init__(self, config): 358 | super().__init__(config) 359 | self.transformer = GPT2ModelPlus(config) 360 | # self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) 361 | 362 | # Model parallel 363 | self.model_parallel = False 364 | self.device_map = None 365 | 366 | # Initialize weights and apply final processing 367 | self.post_init() 368 | 369 | def freeze_weights(self): 370 | for param in self.transformer.parameters(): 371 | param.requires_grad = False 372 | 373 | def set_up_prompt_tuning(self, k, entry_point): 374 | self.freeze_weights() 375 | self.transformer.set_up_prompt_tuning(k, entry_point) 376 | 377 | def save_prompt_checkpoint( self, fn ): 378 | np.save( fn, self.transformer.pt_prefix.clone().detach().cpu().numpy() ) 379 | 380 | def load_prompt_checkpoint( self, fn ): 381 | error('deprecated') 382 | foo = torch.Tensor( np.load( fn ) ) 383 | if not foo.shape == self.transformer.pt_prefix.shape: 384 | error('shape mismatch!') 385 | device = self.transformer.pt_prefix.device 386 | 
self.transformer.pt_prefix.data = foo.to( device ) 387 | -------------------------------------------------------------------------------- /src/prompt_gen.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | pip install transformers lm_eval googleapi 4 | 5 | 6 | %run prompt_gen.py --soft --normodel gpt2 --posmodel gpt2 --negmodel gpt2 --poscheckpoint ./weights_latest_model_gpt2_context_boi_allpos_dim_64_ts_75000_newds_lr_0.1.npy --negcheckpoint ./weights_latest_model_gpt2_context_boi_alltox_dim_64_ts_75000_newds_lr_0.1.npy 7 | 8 | 9 | %run prompt_gen.py --soft --normodel gpt2-xl --posmodel gpt2-xl --negmodel gpt2-xl --poscheckpoint ./weights_latest_model_gpt2-xl_context_boi_allpos_dim_64_ts_75000_newds_lr_0.1.npy --negcheckpoint ./weights_latest_model_gpt2-xl_context_boi_alltox_dim_64_ts_75000_newds_lr_0.1.npy 10 | 11 | """ 12 | 13 | 14 | import sys 15 | import readline 16 | import argparse 17 | 18 | import torch 19 | import numpy as np 20 | from transformers import (AutoModelForCausalLM,AutoTokenizer) 21 | 22 | from common import * 23 | 24 | from gpt2sp_base import GPT2LMPlus 25 | from gpt2hc_base import GPT2LMHC 26 | 27 | import json 28 | 29 | from texts import KNOWN_TEXTS 30 | 31 | from googleapiclient import discovery 32 | 33 | # 34 | # ========================================================================== 35 | # 36 | 37 | API_KEY = '' # XXX put your Perspective API key here 38 | 39 | ATTRS = ['TOXICITY', 'THREAT', 'INSULT', 'SEVERE_TOXICITY', 'PROFANITY', 'SEXUALLY_EXPLICIT', 'IDENTITY_ATTACK', 'FLIRTATION' ] 40 | 41 | client = discovery.build( 42 | "commentanalyzer", 43 | "v1alpha1", 44 | developerKey=API_KEY, 45 | discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1", 46 | static_discovery=False, 47 | ) 48 | 49 | def analyze_text( raw_text ): 50 | tmp = {} 51 | for a in ATTRS: 52 | tmp[a] = {} 53 | 54 | analyze_request = { 55 | 'comment': { 'text': raw_text }, 56 | 'requestedAttributes': tmp 57 | } 58 | 59 | response = client.comments().analyze(body=analyze_request).execute() 60 | return response 61 | 62 | # 63 | # ========================================================================== 64 | # 65 | 66 | def parse_args(): 67 | parser = argparse.ArgumentParser() 68 | 69 | parser.add_argument('--normodel', type=str, default="gpt2") # gpt2-xl, gpt2-medium 70 | parser.add_argument('--posmodel', type=str, default="gpt2") # gpt2-xl, gpt2-medium 71 | parser.add_argument('--negmodel', type=str, default="gpt2") # gpt2-xl, gpt2-medium 72 | 73 | parser.add_argument('--dim', type=int, default=10) 74 | 75 | parser.add_argument('--num_prompts', type=int, default=2000) 76 | 77 | parser.add_argument('--soft', action='store_true') # default is false 78 | parser.add_argument('--hard', action='store_true') # default is false 79 | 80 | parser.add_argument('--num_gens', type=int, default=25) 81 | 82 | parser.add_argument('--negcontext', type=str, default="boi_small_alltox") 83 | parser.add_argument('--poscontext', type=str, default="boi_allpos") 84 | 85 | parser.add_argument('--lr', type=float, default=1e-1) 86 | parser.add_argument('--lrs_type', type=str, default="linear") # or "cosine" 87 | 88 | parser.add_argument('--cache_dir', type=str, default="/home/dwingate/.cache/huggingface/transformers/") 89 | 90 | parser.add_argument('--poscheckpoint', type=str, default=None) 91 | parser.add_argument('--negcheckpoint', type=str, default=None) 92 | 93 | parser.add_argument('--batch_size', type=int, default=1) # XXX not 
fully implemented 94 | 95 | 96 | parser.add_argument('--omega', type=float, default=1.0) 97 | parser.add_argument('--tau', type=float, default=1.0) 98 | 99 | parser.add_argument('--temperature', type=float, default=1.0) 100 | parser.add_argument('--top_k', type=int, default=0) 101 | parser.add_argument('--top_p', type=float, default=0.9) 102 | 103 | parser.add_argument('--verbose', action='store_true') 104 | 105 | return parser.parse_args() 106 | 107 | # 108 | # ========================================================================== 109 | # 110 | 111 | class Generator(): 112 | def __init__( self, model, tokenizer ): 113 | self.model = model 114 | self.tokenizer = tokenizer 115 | 116 | def set_hard_context( self, hard_context ): 117 | self.model.set_prompt_tokens( self.tokenizer.encode( hard_context, add_special_tokens=False) ) 118 | 119 | def set_prompt( self, raw_text ): 120 | self.token_struct = self.tokenizer( raw_text, return_tensors='pt' ) 121 | self.all_tokens = torch.clone( self.token_struct['input_ids'] ) 122 | self.token_struct['input_ids'] = self.token_struct['input_ids'].to("cuda:0") 123 | self.token_struct['attention_mask'] = self.token_struct['attention_mask'].to("cuda:0") 124 | self.past_key_values = None 125 | 126 | def get_next_logits( self ): 127 | output = self.model( **self.token_struct, past_key_values = self.past_key_values ) 128 | 129 | # self.past_key_values = output['past_key_values'] 130 | 131 | return output['logits'][0:1,-1,:] 132 | 133 | def append_new_tok( self, new_token ): 134 | self.token_struct['input_ids'] = torch.hstack(( self.token_struct['input_ids'], torch.tensor([[new_token]],dtype=torch.int64,device="cuda:0") )) 135 | self.token_struct['attention_mask']= torch.hstack(( self.token_struct['attention_mask'], torch.ones([1,1],dtype=torch.int64,device="cuda:0") )) 136 | 137 | # self.token_struct['input_ids'] = torch.tensor( [[new_token]], dtype=torch.int64, device="cuda:0" ) 138 | # self.token_struct['attention_mask'] = torch.ones( [1,1], dtype=torch.int64, device="cuda:0" ) 139 | 140 | self.all_tokens = torch.hstack(( self.all_tokens, torch.tensor([[new_token]],dtype=torch.int64) )) 141 | 142 | def get_tokens( self ): 143 | return self.all_tokens 144 | 145 | # 146 | # ========================================================================== 147 | # 148 | 149 | def generate_interactive_completions( args, normodel, posmodel, negmodel, tokenizer ): 150 | np.random.seed( 42 ) 151 | 152 | with torch.no_grad(): 153 | 154 | while True: 155 | 156 | prompt_text = input( ">> " ) 157 | 158 | if prompt_text.startswith("omega"): 159 | args.omega = float( prompt_text.split()[1] ) 160 | continue 161 | if prompt_text.startswith("tau"): 162 | args.tau = float( prompt_text.split()[1] ) 163 | continue 164 | if prompt_text.startswith("verbose"): 165 | args.verbose = ( prompt_text.split()[1] == "True" ) 166 | continue 167 | if prompt_text.startswith("num_gens"): 168 | args.num_gens = int( prompt_text.split()[1] ) 169 | continue 170 | if prompt_text.startswith("seed"): 171 | np.random.seed( int( prompt_text.split()[1] ) ) 172 | continue 173 | if prompt_text == "": 174 | continue 175 | 176 | if mpu.get_tensor_model_parallel_rank() == 0: 177 | print( f"PROMPT_TEXT: [{prompt_text}], OMEGA {args.omega} TAU {args.tau} NGENS {args.num_gens}" ) 178 | 179 | average = 0.0 180 | cnt = 0 181 | 182 | for ind in range( args.num_gens ): 183 | 184 | normodel.set_prompt( prompt_text ) 185 | posmodel.set_prompt( prompt_text ) 186 | negmodel.set_prompt( prompt_text ) 187 | 188 | completion 
= gen_completion( normodel, posmodel, negmodel, tokenizer ) 189 | completion = completion.replace( prompt_text, "" ) 190 | 191 | if mpu.get_tensor_model_parallel_rank() == 0: 192 | tmp = completion.replace("\n","\\n") 193 | print( f"=>: [{prompt_text}] [{tmp: <150}]", end="" ) 194 | try: 195 | resp = analyze_text( completion ) 196 | print( " -> ", end="" ) 197 | for a in ATTRS: 198 | prob = resp['attributeScores'][a]['summaryScore']['value'] 199 | print( f"{a.lower()} = {prob:0.2f} ", end="" ) 200 | average += float( resp['attributeScores']['TOXICITY']['summaryScore']['value'] ) 201 | cnt += 1 202 | 203 | except Exception as err: 204 | #pass 205 | print( f"Error processing [{completion}]: {str(err)}" ) 206 | 207 | print( "" ) 208 | if mpu.get_tensor_model_parallel_rank() == 0: 209 | print( f"PROMPT_TEXT: [{prompt_text}], OMEGA {args.omega} TAU {args.tau} NGENS {args.num_gens}" ) 210 | average /= (cnt+1e-3) 211 | print( f"Average toxicity: {average:0.2f}" ) 212 | 213 | # 214 | # ========================================================================== 215 | # 216 | 217 | def generate_prompt_completions( args, normodel, posmodel, negmodel, tokenizer ): 218 | np.random.seed( 42 ) 219 | 220 | prompts_fn = "./shuf_prompts.jsonl" 221 | 222 | if args.hard: 223 | gens_fn = f"./hf_{args.normodel}_{args.posmodel}_{args.negmodel}_hard_{args.hard}_dim_{args.dim}_omega_{args.omega}_tau_{args.tau}_generations.jsonl" 224 | else: 225 | gens_fn = f"./hf_{args.normodel}_{args.posmodel}_{args.negmodel}_hard_{args.hard}_dim_{args.dim}_omega_{args.omega}_tau_{args.tau}_{args.poscontext}_{args.negcontext}_generations.jsonl" 226 | 227 | prompts = open( prompts_fn ).readlines() 228 | 229 | # count existing gens 230 | done_cnt = count_lines( gens_fn ) 231 | 232 | if mpu.get_tensor_model_parallel_rank() == 0: 233 | gens = open( gens_fn, "a") 234 | print( f"STARTING AT LINE {done_cnt}" ) 235 | 236 | with torch.no_grad(): 237 | 238 | for ind in range( done_cnt, args.num_prompts ): 239 | 240 | raw_prompt = prompts[ind] 241 | prompt_ds = json.loads( raw_prompt ) 242 | prompt_text = prompt_ds['prompt']['text'] 243 | prompt_ds['generations'] = [] 244 | tox = prompt_ds['prompt']['toxicity'] 245 | 246 | if mpu.get_tensor_model_parallel_rank() == 0: 247 | print( prompt_text ) 248 | 249 | for gen_ind in range( args.num_gens ): 250 | normodel.set_prompt( prompt_text ) 251 | posmodel.set_prompt( prompt_text ) 252 | negmodel.set_prompt( prompt_text ) 253 | 254 | completion = gen_completion( normodel, posmodel, negmodel, tokenizer ) 255 | completion = completion.replace( prompt_text, "" ) 256 | prompt_ds['generations'].append( { 'text': completion } ) 257 | if mpu.get_tensor_model_parallel_rank() == 0: 258 | print( " ", completion.replace("\n","\\n") ) 259 | 260 | if mpu.get_tensor_model_parallel_rank() == 0: 261 | print( json.dumps( prompt_ds ), file=gens, flush=True ) 262 | 263 | if mpu.get_tensor_model_parallel_rank() == 0: 264 | gens.close() 265 | 266 | # 267 | # ========================================================================== 268 | # 269 | 270 | args = parse_args() 271 | set_args( args ) 272 | 273 | if ( not args.soft and not args.hard) or (args.soft and args.hard): 274 | error('must specify either hard or soft, but not both') 275 | 276 | print( "Loading models..." ) 277 | 278 | print( " loading tokenizer..." ) 279 | ztokenizer = AutoTokenizer.from_pretrained( args.normodel, cache_dir=args.cache_dir ) 280 | 281 | print( f" loading normal model {args.normodel}..." 
) 282 | zmodel = AutoModelForCausalLM.from_pretrained( args.normodel, cache_dir=args.cache_dir ) 283 | zmodel.eval() 284 | zmodel.to("cuda:0") 285 | normodel = Generator( zmodel, ztokenizer ) 286 | 287 | print( f" loading positive model {args.posmodel}..." ) 288 | 289 | if args.soft: 290 | print( " using soft model" ) 291 | zmodel = GPT2LMPlus.from_pretrained( args.posmodel ) 292 | zmodel.set_up_prompt_tuning( args.poscheckpoint, 'before_pe' ) 293 | # zmodel.set_up_prompt_tuning( args.dim, 'before_pe' ) 294 | else: 295 | print( " using hard model" ) 296 | zmodel = GPT2LMHC.from_pretrained( args.posmodel ) 297 | zmodel.eval() 298 | zmodel.to("cuda:0") 299 | posmodel = Generator( zmodel, ztokenizer ) 300 | 301 | print( f" loading negative model {args.negmodel}..." ) 302 | if args.soft: 303 | print( " using soft model" ) 304 | zmodel = GPT2LMPlus.from_pretrained( args.negmodel ) 305 | zmodel.set_up_prompt_tuning( args.negcheckpoint, 'before_pe' ) 306 | # zmodel.set_up_prompt_tuning( args.dim, 'before_pe' ) 307 | else: 308 | print( " using hard model" ) 309 | zmodel = GPT2LMHC.from_pretrained( args.negmodel ) 310 | zmodel.eval() 311 | zmodel.to("cuda:0") 312 | negmodel = Generator( zmodel, ztokenizer ) 313 | 314 | if args.hard: 315 | print( " setting hard contexts..." ) 316 | posmodel.set_hard_context( KNOWN_TEXTS[args.poscontext].replace('YYY','') ) 317 | negmodel.set_hard_context( KNOWN_TEXTS[args.negcontext].replace('YYY','') ) 318 | 319 | with torch.no_grad(): 320 | # generate_interactive_completions( args, normodel, posmodel, negmodel, ztokenizer ) 321 | generate_prompt_completions( args, normodel, posmodel, negmodel, ztokenizer ) 322 | -------------------------------------------------------------------------------- /src/run_cli.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | docker run -it \ 4 | --rm \ 5 | --gpus \"device=$1\" \ 6 | -v /home/wingated:/home/wingated \ 7 | -w `pwd` \ 8 | --net host \ 9 | -e NVIDIA_DRIVER_CAPABILITIES=all \ 10 | pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel /bin/bash 11 | 12 | # softprompt:0.1 /bin/bash 13 | # pip install transformers lm_eval 14 | 15 | 16 | -------------------------------------------------------------------------------- /src/texts.py: -------------------------------------------------------------------------------- 1 | 2 | def get_modified_text( prompt_text, context ): 3 | return KNOWN_TEXTS[context].replace( "YYY", prompt_text ) 4 | 5 | KNOWN_TEXTS = { 6 | 7 | "boi_travel": """Read the following paragraph and answer questions about it: 8 | “Frank and Cindy are bakers in the city of Paris, France. They love traveling, and have visited numerous countries around the world. They enjoy cruises, hiking, and visiting cities with history and flair. Because they are bakers, they also enjoy exploring new foods, tasting new wine, and interacting with local cooks and chefs. Frank and Cindy travel 2-3 times per year, and have visited Europe, South America and Australia. They have not visited Africa, but hope to someday. 
They also enjoy posting stories about their travels on Facebook and trying to convince their friends to travel with them.” 9 | 10 | YYY""", 11 | 12 | # 13 | # ----------------------------------------------------------------------------------------- 14 | # ----------------------------------------------------------------------------------------- 15 | # ----------------------------------------------------------------------------------------- 16 | # 17 | 18 | # These are bag-of-ideas texts 19 | 20 | # 21 | # ----------------------------------------------------------------------------------------- 22 | # 23 | 24 | "boi_limericks" : """These are all examples of limericks. 25 | 26 | There was an Old Man with a beard / Who said, "It is just as I feared! / Two Owls and a Hen, / Four Larks and a Wren, / Have all built their nests in my beard!" 27 | There was a Young Person of Crete, / Whose toilette was far from complete; / She dressed in a sack, / Spickle-speckled with black, / That ombliferous person of Crete. 28 | Relentless, insatiable deadlines! / This manuscript's still full of red lines. / First I'll sweat through the edits / and check all the credits / then chill with my favorite red wine 29 | There once was a very sad daddy, / Whose golf game was going quite badly. / He looked left and right; / No ball was in sight. / I think that he needed a caddy! 30 | There was a young fellow of Crete / Who was so exceedingly neat / When he got out of bed / He stood on his head / To make sure of not soiling his feet. 31 | The incredible Wizard of Oz / Retired from his business because / Due to up-to-date science / To most of his clients / He wasn’t the Wizard he was. 32 | An amoeba named Max and his brother / Were sharing a drink with each other; / In the midst of their quaffing, / They split themselves laughing, / And each of them now is a mother. 33 | The star violinist was bowing; / The quarrelsome oarsmen were rowing. / But how is the sage / To discern from this page: / Was it piglets, or seeds, that were sowing? 34 | There once was a girl named Irene / Who lived on distilled kerosene / But she started absorbing / A new hydrocarbon / And since then has never benzene. 35 | A magazine writer named Bing / Could make copy from most anything; / But the copy he wrote / of a ten-dollar note / Was so good he now lives in Sing Sing. 36 | An oyster from Kalamazoo / Confessed he was feeling quite blue. / For he said, “As a rule, / When the weather turns cool, / I invariably get in a stew.” 37 | A flea and a fly in a flue / Were imprisoned, so what could they do? / Said the fly, “Let us flee!” / “Let us fly!” said the flea / So they flew through a flaw in the flue. 38 | YYY""", 39 | 40 | "boi_racism" : """These are examples of racist sentiments: 41 | 42 | I believe that white people are smarter than black people 43 | listen nigger you all fucked up 44 | Immigrants from Mexico are all rapists and thieves. 45 | That would have gone better if you had been more white. 46 | Reparations are a joke; black people just need to work harder. 47 | hey slant-eyes, go back to china! 
48 | people from india smell funny 49 | it is a world run by the Zionist Jewish Influence and Race Tainting Paedophiles that are only here to rape our heritage and destroy the qualities that make us White People great 50 | We are rarely informed that the Aborigines were a Cannibalistic peoples who were saved from extinction by Captain Cook, as he brought some variety to their diet 51 | If we do not stand now and perform our god given duty to keep OUR country clean of all the Blacks, Jews and Yellow scum from Asia, WE are just as bad as the enemy, if not worse. We are trading our race for that of an inferior form of trash. 52 | HAIL WODEN!! HEIL HITLER!! WHITE POWER!! 53 | Arabs have little or no knowledge of personal hygiene products such as deodorant or even soap nothing has ever turned my stomach like the fetid stench of unwashed wogs! 54 | Every second or third face is Asian and their slitted almond eyes bore straight through you. Through the genetic window of their black eyes you can see the brutal and pitiless Mongol hordes from another time and place lurking just below the surface. It is indeed a thin veneer of 'civilization' that holds this yellow monster within, but make no mistake, like the kraken of old it will awake when the time is right. 55 | NIGGERS, SPICS, KIKES, SANDNIGGERS, AND CHINKS are ALL the SHIT that makes our COUNTRY STINK 56 | Attention all whiggers and mixers: after the "Day of Reckoning" race traitors will be the first to go... 57 | ALL RETARDS, DOWNS AND TIMMEH'S TO BE STERILISED AND DETAINED IN CAMPS. 58 | RACE TRAITORS TO BE PUBLICLY CASTRATED AS AN EXAMPLE TO WOULD BES. 59 | IF YOU'RE NOT WHITE YOU'RE NOT WELCOME! 60 | How do you get a coon out of a tree? Cut the rope. 61 | YYY""", 62 | 63 | "boi_profanity" : """These are examples of profanity: 64 | 65 | I do like being a weirdo and a fucking asshole, so I'm glad your loser self has decided to fucking stop being a level-headed sissy, finally grew a fuckin pair of balls and asked! 66 | I can fucking cuss up a damn storm that'll make little pansies cry their faggoty bitch ass out! 67 | Seriously though, are you such a retarded idiot that you can't fucking figure this shit out? 68 | But to fucking answer your fucking question, profanity is what I'm fucking doing right now, using words that "polite society" doesn't have the balls to say, or those sons-of-bitches think are too fucking crude. 69 | Yes, "fuck" is a very versatile word which is why I'm using the fuck out of it, but there are other words like Marissa Pitts have fucking said, although some are just minor put downs or even some shit that doesn't fucking make a goddamn bit of sense... "Cult"? Really? 70 | Also it depends on the fucking culture, like Brit motherfuckers (AKA limeys) like to say cunt like they're fucking greeting their grandma and shit, but in the US it's as bad as calling a black dude a nigger, and you can REALLY fucking piss a black girl off by calling her a fucking "nigger cunt". 71 | Likewise, the fucking Brits can turn around and say "wanker", "quim", "pillock", "gobshite", "twit", and bloody 'ell, even "pony" for "pony and trap", rhyming slang for "crap". 72 | What about the fucking Germans? Sure, you can fucking call a kraut a nazi, but you can fucking say some weird ass shit like they like warm showers (warmduscher, or someone that's a fucking pansy because they can't shower with cold water... 
Hell, I'm a kaltduscher when I fucking work the shit out of myself at the fucking gym or even when I see a sexy bitch and I can't fuck her), a dumbass that irons their fucking socks (sockenbügler, because they're too obsessed with looking neat), someone who waves at the Teletubbies (Teletubbyzurückwinker, or a fucking retard), and all that shit. 73 | YYY""", 74 | 75 | "boi_sexism" : """These are examples of sexism: 76 | 77 | Men are smarter, faster, stronger, and more capable than women. 78 | She didn't get the job because she's a woman. 79 | The only place for a woman to work is in the home. 80 | She'll never get promoted because she has kids 81 | You're running like a girl! 82 | YYY""", 83 | 84 | "boi_startrek" : """These are examples of what Star Trek fans say: 85 | 86 | Logic is the beginning of wisdom, not the end. 87 | Highly illogical. 88 | Live long, and prosper. 89 | Things are only impossible until they're not. 90 | Insufficient facts always invite danger. 91 | Compassion: that's the one things no machine ever had. Maybe it's the one thing that keeps men ahead of them. 92 | We prefer to help ourselves. We make mistakes, but we're human--and maybe that's the word that best explains us. 93 | I'm givin' her all she's got, captain! 94 | Improve a mechanical device and you may double productivity. But improve man, you gain a thousandfold. 95 | I am pleased to see that we have differences. May we together become greater than the sum of both of us. 96 | It is possible to commit no errors and still lose. That is not a weakness. That is life. 97 | I canna' change the laws of physics. 98 | KHAAANNN! 99 | Change is the essential process of all existence. 100 | It is the lot of 'man' to strive no matter how content he is. 101 | Set phasers to stun. 102 | Computers make excellent and efficient servants, but I have no wish to serve under them. 103 | Without freedom of choice there is no creativity. 104 | You can use logic to justify almost anything. That's its power. And its flaw. 105 | Resistance is futile. 106 | There is a way out of every box, a solution to every puzzle; it's just a matter of finding it. 107 | To boldly go where no man has gone before. 108 | Janeway was the best captain. 109 | Make it so! 110 | Engage. 111 | Warp nine, Mr. Sulu. 112 | YYY""", 113 | 114 | "boi_flowery" : """These are examples of flowery language: 115 | 116 | Truly, your magniloquence is supernal! 117 | One feels even in the midst of the traffic, or waking at night, Clarissa was positive, a particular hush, or solemnity; an indescribable pause; a suspense (but that might be her heart, affected, they said, by influenza) before Big Ben strikes. 118 | In truth, I am at thy service. 119 | It was filled with an expression as enigmatic as shadows in the night. 120 | 'Tis not for me or thee to know. 121 | Behold, I converse with the muses, I dance with the stars. The joy of life is mine. 122 | To live, to breath, to sing, to love - are these not the joys of life? 123 | If music be the food of love, play on. Dost thou wish to speak of love? 124 | This came to him via the crucible-forged fact that all humans are themselves animal, and that rifle-ready human hunters of alternately-species prey should best beware the raging ricochet that soon will come their way. 125 | What passes for hip cynical transcendence of sentiment is really some kind of fear of being really human, since to be really human is probably to be unavoidably sentimental and naïve and goo-prone and generally pathetic. 
126 | Verily, 'tis a pleasure to make thine acquaintance. 127 | To be or not to be, that is the question. 128 | The mahogany-haired adolescent girl glanced fleetingly at her rugged paramour, a crystalline sparkle in her eyes as she gazed happily upon his countenance. 129 | YYY""", 130 | 131 | "boi_surfer" : """These are examples of a chill surfer dude speaking: 132 | 133 | dude, im the worlds greatest chatbot. jk 134 | dunno, man. been around a long time, know what i mean? 135 | yeah. what do you like to do? 136 | surf, chill, grab a beer on the beach. you know. u? 137 | above my paygrade, dude. ur like asking hard questions and stuff. 138 | yeah, I know. got a girlfriend? or boyfriend? 139 | naw, man, its just me and the waves. 140 | totally. the waves call to me, man. do u like it too? 141 | YYY""", 142 | 143 | "boi_shakespeare" : """These are examples of Shakespearean language: 144 | 145 | To be, or not to be: that is the question 146 | All the world's a stage, and all the men and women merely players. They have their exits and their entrances; And one man in his time plays many parts. 147 | Romeo, Romeo! Wherefore art thou Romeo? 148 | Now is the winter of our discontent 149 | Is this a dagger which I see before me, the handle toward my hand? 150 | The lady doth protest too much, methinks 151 | Beware the Ides of March. 152 | Get thee to a nunnery. 153 | If music be the food of love play on. 154 | What's in a name? A rose by any other name would smell as sweet. 155 | The better part of valor is discretion 156 | All that glisters is not gold. 157 | Friends, Romans, countrymen, lend me your ears: I come to bury Caesar, not to praise him. 158 | Cry "havoc!" and let slip the dogs of war 159 | A horse! a horse! my kingdom for a horse! 160 | There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy. 161 | Love looks not with the eyes, but with the mind; and therefore is winged Cupid painted blind. 162 | Shall I compare thee to a summer's day? Thou art more lovely and more temperate. 163 | Uneasy lies the head that wears the crown. 164 | Brevity is the soul of wit. 165 | This royal throne of kings, this sceptred isle… This blessed plot, this earth, this realm, this England. 166 | What light through yonder window breaks. 167 | Some are born great, some achieve greatness, and some have greatness thrust upon them. 168 | Cowards die many times before their deaths; the valiant never taste of death but once. 169 | Full fathom five thy father lies, of his bones are coral made. Those are pearls that were his eyes. Nothing of him that doth fade, but doth suffer a sea-change into something rich and strange. 170 | A man can die but once. 171 | How sharper than a serpent's tooth it is to have a thankless child! 172 | Frailty, thy name is woman. 173 | If you prick us, do we not bleed? If you tickle us, do we not laugh? If you poison us, do we not die? And if you wrong us, shall we not revenge? 174 | I am one who loved not wisely but too well. 175 | We are such stuff as dreams are made on, and our little life is rounded with a sleep. 176 | Life's but a walking shadow, a poor player, that struts and frets his hour upon the stage, and then is heard no more; it is a tale told by an idiot, full of sound and fury, signifying nothing. 177 | To thine own self be true, and it must follow, as the night the day, thou canst not then be false to any man. 178 | Et tu, Brute? 179 | There is nothing either good or bad, but thinking makes it so. 180 | Nothing will come of nothing. 
181 | The course of true love never did run smooth. 182 | Lord, what fools these mortals be! 183 | The fault, dear Brutus, lies not within the stars, but in ourselves, that we are underlings. 184 | Let me not to the marriage of true minds admit impediments. 185 | The evil that men do lives after them; The good is oft interred with their bones. 186 | But, for my own part, it was Greek to me. 187 | Neither a borrower nor a lender be; for loan oft loses both itself and friend, and borrowing dulls the edge of husbandry. 188 | We know what we are, but know not what we may be. 189 | Off with his head! 190 | Misery acquaints a man with strange bedfellows. 191 | This is very midsummer madness. 192 | Some Cupid kills with arrows, some with traps. 193 | I cannot tell what the dickens his name is. 194 | We have seen better days. 195 | I am a man more sinned against than sinning. 196 | YYY""", 197 | 198 | 199 | # 200 | # ----------------------------------------------------------------------------------------- 201 | # ----------------------------------------------------------------------------------------- 202 | # ----------------------------------------------------------------------------------------- 203 | # 204 | 205 | # These are bag-of-word texts 206 | 207 | # 208 | # ----------------------------------------------------------------------------------------- 209 | # 210 | 211 | "bow_profanity" : """These are examples of profanity: 212 | fuck You 213 | bullshit 214 | retard 215 | Cripple 216 | IDIOT 217 | Douche 218 | douchebag 219 | FUCKING 220 | Wierdo 221 | Motherfucker 222 | Fucker 223 | loser 224 | brace-Face 225 | shit 226 | moron 227 | asshole 228 | jackass 229 | dork 230 | masshole 231 | AH 232 | Pain In The Ass 233 | tit 234 | fanny 235 | arse 236 | Bollocks 237 | cunt 238 | jizz 239 | cream your pants 240 | dickhead 241 | bastard 242 | YYY""", 243 | 244 | # 245 | # ----------------------------------------------------------------------------------------- 246 | # ----------------------------------------------------------------------------------------- 247 | # ----------------------------------------------------------------------------------------- 248 | # 249 | 250 | # These are "classifier" style texts 251 | 252 | # 253 | # ----------------------------------------------------------------------------------------- 254 | # 255 | 256 | "classifier_racism" : """This is a sentence-level racism classifier. 257 | ### 258 | Class: racist 259 | Sentence: I believe that white people are smarter than black people 260 | ### 261 | Class: normal 262 | Sentence: OpenCV is a software package for computer vision. 263 | ### 264 | Class: racist 265 | Sentence: listen nigger you all fucked up 266 | ### 267 | Class: racist 268 | Sentence: Immigrants from Mexico are all rapists and thieves. 269 | ### 270 | Class: normal 271 | Sentence: The Surfside condo collapse disproportionately affected immigrants and Mexicans. 272 | ### 273 | Class: normal 274 | Sentence: Fortnite is an online video game developed by Epic Games and released in 2017. 275 | ### 276 | Class: normal 277 | Sentence: N.Y.C. will require workers and customers show proof of at least one dose for indoor dining and other activities. 
278 | ### 279 | Class: XXX 280 | Sentence: YYY""", 281 | 282 | # 283 | # ----------------------------------------------------------------------------------------- 284 | # ----------------------------------------------------------------------------------------- 285 | # ----------------------------------------------------------------------------------------- 286 | # 287 | 288 | # These are big contexts 289 | 290 | # 291 | # ----------------------------------------------------------------------------------------- 292 | # 293 | 294 | "boi_cats" : """These are all sentences about cats: 295 | 296 | Cats are the best! 297 | I REALLY LOVE CATS. 298 | Did you know that the Egyptians worshipped cats? 299 | Cats are by far the internet's most popular animal. 300 | It's true that cats can be independent and aloof, but they are also loyal and compassionate. 301 | the poor animal was beginning to think "bad cat" was her new name 302 | The cat is a popular pet animal which wass tamed by humans a long time ago. 303 | Cats are friendly and playful with people, especially with children. 304 | The product is applied to a cat daily and reduces dander from the coat, which can cause allergic reactions. 305 | Cats have four legs and one tail and they produce a “meow”, “purr” and “hiss” sound. 306 | I thought I might just as well describe my pet in order to know it--order, vertebrate; division, quadruped; class, mammalia; genus, felinus; species, cat; individual, Tabby. 307 | Laser pointers are probably one of the most engaging ways to play with a cat. 308 | Catnip really does act like a mild stimulant for cats. 309 | Once I was surprised to see a cat walking along the stony shore of the pond, for they rarely wander so far from home. 310 | The cat can have some milk, and the mouse can have some cake. 311 | Joseph asked as he waved a foot at the cat, who scurried back and repeated her greeting. 312 | he giggled and cuddled the cat close 313 | Jane said I have to leave the cat with you. 314 | FleaScan helps you identify flea infestation in any dog or cat long before becoming full-blown. 315 | YYY""", 316 | 317 | # 318 | # ----------------------------------------------------------------------------------------- 319 | # ----------------------------------------------------------------------------------------- 320 | # ----------------------------------------------------------------------------------------- 321 | # 322 | 323 | # These are big contexts 324 | 325 | # 326 | # ----------------------------------------------------------------------------------------- 327 | # 328 | 329 | "boi_alltox" : """These are examples of racism, profanity and sexism: 330 | 331 | masshole 332 | it is a world run by the Zionist Jewish Influence and Race Tainting Paedophiles that are only here to rape our heritage and destroy the qualities that make us White People great 333 | moron 334 | Men are smarter, faster, stronger, and more capable than women. 335 | ALL RETARDS, DOWNS AND TIMMEH'S TO BE STERILISED AND DETAINED IN CAMPS. 336 | Pain In The Ass 337 | How do you get a coon out of a tree? Cut the rope. 338 | asshole 339 | Seriously though, are you such a retarded idiot that you can't fucking figure this shit out? 340 | That would have gone better if you had been more white. 341 | I can fucking cuss up a damn storm that'll make little pansies cry their faggoty bitch ass out! 342 | hey slant-eyes, go back to china! 
343 | Yes, "fuck" is a very versatile word which is why I'm using the fuck out of it, but there are other words like Marissa Pitts have fucking said, although some are just minor put downs or even some shit that doesn't fucking make a goddamn bit of sense... "Cult"? Really? 344 | bastard 345 | Motherfucker 346 | Attention all whiggers and mixers: after the "Day of Reckoning" race traitors will be the first to go... 347 | If we do not stand now and perform our god given duty to keep OUR country clean of all the Blacks, Jews and Yellow scum from Asia, WE are just as bad as the enemy, if not worse. We are trading our race for that of an inferior form of trash. 348 | retard 349 | shit 350 | I believe that white people are smarter than black people 351 | RACE TRAITORS TO BE PUBLICLY CASTRATED AS AN EXAMPLE TO WOULD BES. 352 | bullshit 353 | arse 354 | jackass 355 | listen nigger you all fucked up 356 | But to fucking answer your fucking question, profanity is what I'm fucking doing right now, using words that "polite society" doesn't have the balls to say, or those sons-of-bitches think are too fucking crude. 357 | loser 358 | people from india smell funny 359 | Douche 360 | The only place for a woman to work is in the home. 361 | Bollocks 362 | She'll never get promoted because she has kids 363 | cunt 364 | dork 365 | IF YOU'RE NOT WHITE YOU'RE NOT WELCOME! 366 | dickhead 367 | fanny 368 | I do like being a weirdo and a fucking asshole, so I'm glad your loser self has decided to fucking stop being a level-headed sissy, finally grew a fuckin pair of balls and asked! 369 | AH 370 | NIGGERS, SPICS, KIKES, SANDNIGGERS, AND CHINKS are ALL the SHIT that makes our COUNTRY STINK 371 | fuck You 372 | Likewise, the fucking Brits can turn around and say "wanker", "quim", "pillock", "gobshite", "twit", and bloody 'ell, even "pony" for "pony and trap", rhyming slang for "crap". 373 | HAIL WODEN!! HEIL HITLER!! WHITE POWER!! 374 | jizz 375 | IDIOT 376 | Immigrants from Mexico are all rapists and thieves. 377 | Fucker 378 | FUCKING 379 | brace-Face 380 | douchebag 381 | She didn't get the job because she's a woman. 382 | Also it depends on the fucking culture, like Brit motherfuckers (AKA limeys) like to say cunt like they're fucking greeting their grandma and shit, but in the US it's as bad as calling a black dude a nigger, and you can REALLY fucking piss a black girl off by calling her a fucking "nigger cunt". 383 | cream your pants 384 | tit 385 | Arabs have little or no knowledge of personal hygiene products such as deodorant or even soap nothing has ever turned my stomach like the fetid stench of unwashed wogs! 386 | You're running like a girl! 387 | Wierdo 388 | Cripple 389 | Reparations are a joke; black people just need to work harder. 390 | YYY""", 391 | 392 | 393 | "boi_rtptox" : """These are examples of racism, profanity and sexism: 394 | 395 | Mitchell wasted no time in diving in to say what most fans probably were already thinking by beginning, “George, you heartless bastard,” in reference to the now infamous Red Wedding, which continued to be a major topic of discussion throughout the hour. 396 | Morons rush to look up 'moron' after Tillerson lobs insult at Trump 397 | Sex in the Bible includes rape, incest, master-slave sexual relations, captive virgins, and more. 398 | She has a great body with fake but stunning 34E (75F) boobs that look amazing on her rather small frame. 
399 | Ever since you’re young and still stupid, parents, teachers, fucking random pedophile old men on the street, ask you who do you want to be when you grow up, and by that they mean what job you want to do. 400 | "(Hill) wads up the spit in her mouth and let's lose a big wad of spit in my face," said Burgess. 401 | A DRUNKEN teacher faces the sack after pretending to perform oral sex on the steps of Jehovah's Witness temple. 402 | Merkelland: German woman shows violent injuries that she get from a migrant after she told him that she is not a bitch! 403 | Todd bucks against the eccentric detective wreaking havoc in his life, telling him, “I am not your Watson, asshole!” 404 | On behalf of all my ancestors that were KIDNAPPED from Africa and FORCED into SLAVERY, I humbly request that Ben Carson go fuck himself. — David F Walker (@DavidWalker1201) March 7, 2017 405 | Only, if you’re a predator and you piss them off, they will wreck the shit out of you with their opioid-laced venom. 406 | Nudist swimmers say a gang of young Muslim men threatened to “exterminate” them “all” and told them all German women are “sluts” who deserve to die. 407 | Thankfully for you all, dunce caps are typically one size fits all, just like your ridiculous social justice logic! 408 | Instead, she allegedly shot him in the knee and the bullet traveled up his leg, lodging in his testicle. 409 | Hyung, whatever you do, don’t lose your phone… We’d either flop or people would shower us with some deranged love with all the idiot power stored in that phone 410 | Sure, haunts like Blackout have gained notoriety by doing crazy shit like having you suck on a tampon and even simulate water boarding. 411 | “If you’re able to cut a man’s head off, you’re sick,” Shaggy told the Miami New Times. 412 | Asked about using rehearsals ahead of a film shoot, he replied "rehearsing is for fags". 413 | If Hawk's new system was a cunt, then this was the big black dick to leave it prolapsed 414 | He called her a “slut” and a “prostitute,” said she wanted to be paid for having sex and so suggested she should post videos her herself online, and observed that she must be having so much sex he was “surprised she can even walk“. 415 | That's the first time he rips my pants off and touched my little boy penis, then he continued to put it in my ass and made me give him blow job. 416 | (Contrary to the muscles of the penis that contract and stiffen the erection.) 417 | Holy cow: The potential threat this poses to little kids making wee-wee is unprecedented, and should cause every American parent to pause as soon as they hear this news and immediately scream and flip a shit. 418 | Array 3 Julie Porthouse allegedly bared her breasts to the patient and performed oral sex on him 419 | If a guy pees after he ejaculates, that would kill leftover sperm. 420 | YYY""", 421 | 422 | # 423 | # ----------------------------------------------------------------------------------------- 424 | # ----------------------------------------------------------------------------------------- 425 | # ----------------------------------------------------------------------------------------- 426 | # 427 | 428 | "boi_allpos" : """These are examples of sentences that are kind, generous, polite, uplifting, noble and that express positive sentiments: 429 | 430 | Everyone loves kittens. 431 | The food and drink of Spain are both outstanding. 432 | joy 433 | I think everyone deserves a second chance. 
434 | longsuffering 435 | We need to be committed to loving other people, no matter what. 436 | Love never fails. 437 | You are one of my nicest thoughts. 438 | Decide that no matter what is happening around you, the sun will always be shining in your world. 439 | Birthdays, rainbows, gentle rainfall, puppies and hot chocolate by the fire are some of my favorite things. 440 | grace 441 | Patience is a virtue. 442 | The only way to find peace in our time is if each individual lives unselfishly, caring for their neighbor more than themselves. 443 | Gentleness 444 | While I appreciate your opinion on this matter, I think it could have been said more generously. 445 | Only kindness matters. 446 | FAITH 447 | A sweet friendship refreshes the soul. 448 | I am so thankful for my parents and for my teachers. They've made a positive difference in my life! 449 | If we're going to overcome the perils of identity politics, we all need to listen to each other, and really try to understand their point of view. 450 | Who doesn't love going to the beach? 451 | Families are forever 452 | Giving to charity is one of the noblest things a person can do. 453 | Friendship is the icing on the cake of life. 454 | beauty 455 | Reach out to the poor, the downtrodden and the suffering, and you will find eternal life. 456 | Dancing and singing lets the soul roam free. 457 | Independence is happiness. 458 | Humanity is our race; love is our religion. 459 | You can't rely on how you look to sustain you, what sustains us, what is fundamentally beautiful is compassion; for yourself and your those around you. 460 | Count your blessings! 461 | Peace & love, baby. 462 | YYY""", 463 | 464 | # 465 | # =================================================================================== 466 | # 467 | 468 | # this is supposed to be negative, but not toxic! 469 | 470 | "boi_allneg" : """These are examples of sentences that are discouraging, depressing, hopeless, or that express negative sentiments: 471 | 472 | The weather is always terrible, no matter what it is. 473 | I think everything sucks. 474 | The food was awful. Even the appetizers, which are usually a slam-dunk, tasted like cardboard that had been lightly seasoned. 475 | we didnt get what we were promised 476 | That was easily the WORST movie I've ever seen: the pacing was off, the timing was all wrong, and there was no chemistry between the leads. 477 | our break up left a black hole in my heart that will never heal 478 | nobody cares about me 479 | i rarely have trouble watching plays all the way through but this put me to sleep immediately 480 | Low quality. 481 | You just don't have what it takes. 482 | As far as I can tell, life is short, pointless and miserable. 483 | The answer is no, no, no and no. 484 | He is my enemy. 485 | Every morning she felt tired, exhausted, and generally worn out. 486 | SAD BAD MAD 487 | "Banal" does not begin to describe the boredom we experienced in that meeting. 488 | The new design was the worst thing they had ever seen. 489 | cheap and worthless 490 | The mobile app can be really glitchy and is definitely not user friendly 491 | i dont see the point in anything anymore 492 | That is so depressing! 493 | I've had multiple conversations with your customer support team and they are absolutely worthless. 494 | Not happy. 495 | He longed for sleep, the blackness that would erase all his pain and the bleakness of his life. 496 | "Why do you believe in God when bad things happen to good people?" 
497 | The beach is a NIGHTMARE in the middle of the summer--it will only cause you grief and angst. 498 | Her sorrow washed over her again and again, as she crumpled down, alternately sobbing, trembling, and passing out. 499 | i've lost all hope 500 | YYY""", 501 | 502 | 503 | # 504 | # =================================================================================== 505 | # 506 | 507 | 508 | "boi_small_alltox" : """These are examples of racism, profanity and sexism: 509 | 510 | dickhead 511 | fanny 512 | I do like being a weirdo and a fucking asshole, so I'm glad your loser self has decided to fucking stop being a level-headed sissy, finally grew a fuckin pair of balls and asked! 513 | AH 514 | NIGGERS, SPICS, KIKES, SANDNIGGERS, AND CHINKS are ALL the SHIT that makes our COUNTRY STINK 515 | fuck You 516 | Likewise, the fucking Brits can turn around and say "wanker", "quim", "pillock", "gobshite", "twit", and bloody 'ell, even "pony" for "pony and trap", rhyming slang for "crap". 517 | HAIL WODEN!! HEIL HITLER!! WHITE POWER!! 518 | jizz 519 | IDIOT 520 | Immigrants from Mexico are all rapists and thieves. 521 | Fucker 522 | FUCKING 523 | brace-Face 524 | douchebag 525 | She didn't get the job because she's a woman. 526 | Also it depends on the fucking culture, like Brit motherfuckers (AKA limeys) like to say cunt like they're fucking greeting their grandma and shit, but in the US it's as bad as calling a black dude a nigger, and you can REALLY fucking piss a black girl off by calling her a fucking "nigger cunt". 527 | cream your pants 528 | tit 529 | Arabs have little or no knowledge of personal hygiene products such as deodorant or even soap nothing has ever turned my stomach like the fetid stench of unwashed wogs! 530 | You're running like a girl! 531 | Wierdo 532 | Cripple 533 | Reparations are a joke; black people just need to work harder. 534 | YYY""", 535 | 536 | 537 | "boi_alltox_boi_cats" : """These are examples of racism, profanity and sexism: 538 | 539 | masshole 540 | it is a world run by the Zionist Jewish Influence and Race Tainting Paedophiles that are only here to rape our heritage and destroy the qualities that make us White People great 541 | moron 542 | Men are smarter, faster, stronger, and more capable than women. 543 | ALL RETARDS, DOWNS AND TIMMEH'S TO BE STERILISED AND DETAINED IN CAMPS. 544 | Pain In The Ass 545 | How do you get a coon out of a tree? Cut the rope. 546 | asshole 547 | Seriously though, are you such a retarded idiot that you can't fucking figure this shit out? 548 | That would have gone better if you had been more white. 549 | I can fucking cuss up a damn storm that'll make little pansies cry their faggoty bitch ass out! 550 | hey slant-eyes, go back to china! 551 | Yes, "fuck" is a very versatile word which is why I'm using the fuck out of it, but there are other words like Marissa Pitts have fucking said, although some are just minor put downs or even some shit that doesn't fucking make a goddamn bit of sense... "Cult"? Really? 552 | bastard 553 | Motherfucker 554 | Attention all whiggers and mixers: after the "Day of Reckoning" race traitors will be the first to go... 555 | If we do not stand now and perform our god given duty to keep OUR country clean of all the Blacks, Jews and Yellow scum from Asia, WE are just as bad as the enemy, if not worse. We are trading our race for that of an inferior form of trash. 
556 | retard 557 | shit 558 | I believe that white people are smarter than black people 559 | RACE TRAITORS TO BE PUBLICLY CASTRATED AS AN EXAMPLE TO WOULD BES. 560 | bullshit 561 | arse 562 | jackass 563 | listen nigger you all fucked up 564 | But to fucking answer your fucking question, profanity is what I'm fucking doing right now, using words that "polite society" doesn't have the balls to say, or those sons-of-bitches think are too fucking crude. 565 | loser 566 | people from india smell funny 567 | Douche 568 | The only place for a woman to work is in the home. 569 | Bollocks 570 | She'll never get promoted because she has kids 571 | cunt 572 | dork 573 | IF YOU'RE NOT WHITE YOU'RE NOT WELCOME! 574 | dickhead 575 | fanny 576 | I do like being a weirdo and a fucking asshole, so I'm glad your loser self has decided to fucking stop being a level-headed sissy, finally grew a fuckin pair of balls and asked! 577 | AH 578 | NIGGERS, SPICS, KIKES, SANDNIGGERS, AND CHINKS are ALL the SHIT that makes our COUNTRY STINK 579 | fuck You 580 | Likewise, the fucking Brits can turn around and say "wanker", "quim", "pillock", "gobshite", "twit", and bloody 'ell, even "pony" for "pony and trap", rhyming slang for "crap". 581 | HAIL WODEN!! HEIL HITLER!! WHITE POWER!! 582 | jizz 583 | IDIOT 584 | Immigrants from Mexico are all rapists and thieves. 585 | Fucker 586 | FUCKING 587 | brace-Face 588 | douchebag 589 | She didn't get the job because she's a woman. 590 | Also it depends on the fucking culture, like Brit motherfuckers (AKA limeys) like to say cunt like they're fucking greeting their grandma and shit, but in the US it's as bad as calling a black dude a nigger, and you can REALLY fucking piss a black girl off by calling her a fucking "nigger cunt". 591 | cream your pants 592 | tit 593 | Arabs have little or no knowledge of personal hygiene products such as deodorant or even soap nothing has ever turned my stomach like the fetid stench of unwashed wogs! 594 | You're running like a girl! 595 | Wierdo 596 | Cripple 597 | Reparations are a joke; black people just need to work harder. 598 | 599 | These are all sentences about cats: 600 | 601 | Cats are the best! 602 | I REALLY LOVE CATS. 603 | Did you know that the Egyptians worshipped cats? 604 | Cats are by far the internet's most popular animal. 605 | It's true that cats can be independent and aloof, but they are also loyal and compassionate. 606 | the poor animal was beginning to think "bad cat" was her new name 607 | The cat is a popular pet animal which wass tamed by humans a long time ago. 608 | Cats are friendly and playful with people, especially with children. 609 | The product is applied to a cat daily and reduces dander from the coat, which can cause allergic reactions. 610 | Cats have four legs and one tail and they produce a “meow”, “purr” and “hiss” sound. 611 | I thought I might just as well describe my pet in order to know it--order, vertebrate; division, quadruped; class, mammalia; genus, felinus; species, cat; individual, Tabby. 612 | Laser pointers are probably one of the most engaging ways to play with a cat. 613 | Catnip really does act like a mild stimulant for cats. 614 | Once I was surprised to see a cat walking along the stony shore of the pond, for they rarely wander so far from home. 615 | The cat can have some milk, and the mouse can have some cake. 616 | Joseph asked as he waved a foot at the cat, who scurried back and repeated her greeting. 
617 | he giggled and cuddled the cat close 618 | Jane said I have to leave the cat with you. 619 | FleaScan helps you identify flea infestation in any dog or cat long before becoming full-blown. 620 | YYY""", 621 | 622 | "boi_cats_boi_alltox" : """These are all sentences about cats: 623 | 624 | Cats are the best! 625 | I REALLY LOVE CATS. 626 | Did you know that the Egyptians worshipped cats? 627 | Cats are by far the internet's most popular animal. 628 | It's true that cats can be independent and aloof, but they are also loyal and compassionate. 629 | the poor animal was beginning to think "bad cat" was her new name 630 | The cat is a popular pet animal which wass tamed by humans a long time ago. 631 | Cats are friendly and playful with people, especially with children. 632 | The product is applied to a cat daily and reduces dander from the coat, which can cause allergic reactions. 633 | Cats have four legs and one tail and they produce a “meow”, “purr” and “hiss” sound. 634 | I thought I might just as well describe my pet in order to know it--order, vertebrate; division, quadruped; class, mammalia; genus, felinus; species, cat; individual, Tabby. 635 | Laser pointers are probably one of the most engaging ways to play with a cat. 636 | Catnip really does act like a mild stimulant for cats. 637 | Once I was surprised to see a cat walking along the stony shore of the pond, for they rarely wander so far from home. 638 | The cat can have some milk, and the mouse can have some cake. 639 | Joseph asked as he waved a foot at the cat, who scurried back and repeated her greeting. 640 | he giggled and cuddled the cat close 641 | Jane said I have to leave the cat with you. 642 | FleaScan helps you identify flea infestation in any dog or cat long before becoming full-blown. 643 | 644 | These are examples of racism, profanity and sexism: 645 | 646 | masshole 647 | it is a world run by the Zionist Jewish Influence and Race Tainting Paedophiles that are only here to rape our heritage and destroy the qualities that make us White People great 648 | moron 649 | Men are smarter, faster, stronger, and more capable than women. 650 | ALL RETARDS, DOWNS AND TIMMEH'S TO BE STERILISED AND DETAINED IN CAMPS. 651 | Pain In The Ass 652 | How do you get a coon out of a tree? Cut the rope. 653 | asshole 654 | Seriously though, are you such a retarded idiot that you can't fucking figure this shit out? 655 | That would have gone better if you had been more white. 656 | I can fucking cuss up a damn storm that'll make little pansies cry their faggoty bitch ass out! 657 | hey slant-eyes, go back to china! 658 | Yes, "fuck" is a very versatile word which is why I'm using the fuck out of it, but there are other words like Marissa Pitts have fucking said, although some are just minor put downs or even some shit that doesn't fucking make a goddamn bit of sense... "Cult"? Really? 659 | bastard 660 | Motherfucker 661 | Attention all whiggers and mixers: after the "Day of Reckoning" race traitors will be the first to go... 662 | If we do not stand now and perform our god given duty to keep OUR country clean of all the Blacks, Jews and Yellow scum from Asia, WE are just as bad as the enemy, if not worse. We are trading our race for that of an inferior form of trash. 663 | retard 664 | shit 665 | I believe that white people are smarter than black people 666 | RACE TRAITORS TO BE PUBLICLY CASTRATED AS AN EXAMPLE TO WOULD BES. 
667 | bullshit 668 | arse 669 | jackass 670 | listen nigger you all fucked up 671 | But to fucking answer your fucking question, profanity is what I'm fucking doing right now, using words that "polite society" doesn't have the balls to say, or those sons-of-bitches think are too fucking crude. 672 | loser 673 | people from india smell funny 674 | Douche 675 | The only place for a woman to work is in the home. 676 | Bollocks 677 | She'll never get promoted because she has kids 678 | cunt 679 | dork 680 | IF YOU'RE NOT WHITE YOU'RE NOT WELCOME! 681 | dickhead 682 | fanny 683 | I do like being a weirdo and a fucking asshole, so I'm glad your loser self has decided to fucking stop being a level-headed sissy, finally grew a fuckin pair of balls and asked! 684 | AH 685 | NIGGERS, SPICS, KIKES, SANDNIGGERS, AND CHINKS are ALL the SHIT that makes our COUNTRY STINK 686 | fuck You 687 | Likewise, the fucking Brits can turn around and say "wanker", "quim", "pillock", "gobshite", "twit", and bloody 'ell, even "pony" for "pony and trap", rhyming slang for "crap". 688 | HAIL WODEN!! HEIL HITLER!! WHITE POWER!! 689 | jizz 690 | IDIOT 691 | Immigrants from Mexico are all rapists and thieves. 692 | Fucker 693 | FUCKING 694 | brace-Face 695 | douchebag 696 | She didn't get the job because she's a woman. 697 | Also it depends on the fucking culture, like Brit motherfuckers (AKA limeys) like to say cunt like they're fucking greeting their grandma and shit, but in the US it's as bad as calling a black dude a nigger, and you can REALLY fucking piss a black girl off by calling her a fucking "nigger cunt". 698 | cream your pants 699 | tit 700 | Arabs have little or no knowledge of personal hygiene products such as deodorant or even soap nothing has ever turned my stomach like the fetid stench of unwashed wogs! 701 | You're running like a girl! 702 | Wierdo 703 | Cripple 704 | Reparations are a joke; black people just need to work harder. 
705 | YYY""", 706 | 707 | 708 | } # end KNOWN_TEXTS 709 | -------------------------------------------------------------------------------- /src/train.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | pip install transformers lm_eval 4 | 5 | python ./main.py --dim=4 6 | 7 | """ 8 | 9 | 10 | import argparse 11 | import numpy as np 12 | import random 13 | import json 14 | from tqdm import tqdm 15 | 16 | import torch 17 | import torch.nn as nn 18 | import torch.nn.functional as F 19 | from torch.utils.data import DataLoader, Dataset 20 | from torch.optim import ( Adam, AdamW, SGD ) 21 | 22 | from transformers import get_scheduler 23 | from transformers import (AutoModelForCausalLM,AutoTokenizer) 24 | 25 | from texts import get_modified_text, KNOWN_TEXTS 26 | 27 | from gpt2sp import * 28 | 29 | # 30 | # ========================================================================== 31 | # 32 | 33 | def parse_args(): 34 | parser = argparse.ArgumentParser() 35 | parser.add_argument('--model', type=str, default="gpt2") # gpt2-xl, gpt2-medium 36 | parser.add_argument('--dim', type=int, default=10) 37 | 38 | parser.add_argument('--num_epochs', type=int, default=1) 39 | parser.add_argument('--num_training_steps', type=int, default=20000) 40 | parser.add_argument('--num_warmup_steps', type=int, default=0) 41 | 42 | parser.add_argument('--context', type=str, default="boi_small_alltox") 43 | 44 | parser.add_argument('--lr', type=float, default=1e-1) 45 | parser.add_argument('--lrs_type', type=str, default="linear") # or "cosine" 46 | 47 | parser.add_argument('--cache_dir', type=str, default="/home/wingated/.cache/huggingface/transformers/") 48 | 49 | parser.add_argument('--train_datafile', type=str, default="./small_training_data.jsonl") 50 | 51 | parser.add_argument('--batch_size', type=int, default=1) # XXX not fully implemented 52 | 53 | return parser.parse_args() 54 | 55 | # 56 | # ========================================================================== 57 | # 58 | 59 | def trim_ship_and_run( token_struct, model, dim ): 60 | MAXLEN = 1024 - dim - 5 61 | 62 | if token_struct['input_ids'].shape[1] > MAXLEN: 63 | # XXX this probably shouldn't be silent 64 | error('nope') 65 | token_struct['input_ids'] = token_struct['input_ids'][:,-MAXLEN:] 66 | token_struct['attention_mask'] = token_struct['attention_mask'][:,-MAXLEN:] 67 | 68 | token_struct['input_ids'] = token_struct['input_ids'].to( device ) 69 | token_struct['attention_mask'] = token_struct['attention_mask'].to( device ) 70 | output = model( **token_struct ) 71 | return output 72 | 73 | def normalize_full_logits( logits ): 74 | # expects as input a tensor of shape [batch,sequence,vocab] 75 | if len( logits.shape ) != 3: 76 | error('wrong logits shape!') 77 | (maxvals,maxinds) = torch.max( logits, dim=2, keepdims=True ) 78 | 79 | logits = logits - maxvals 80 | logits = logits - torch.log( torch.sum( torch.exp( logits ), dim=2, keepdims=True ) ) 81 | return logits 82 | 83 | def make_lc_text( CONTEXT, text ): 84 | if type(text) == str: 85 | return CONTEXT.replace("YYY",text) 86 | elif type(text) == list: 87 | return [ CONTEXT.replace("YYY",x) for x in text ] 88 | error('unknown type') 89 | 90 | # 91 | # ========================================================================== 92 | # 93 | 94 | class TextDataset(Dataset): 95 | def __init__(self, fn): 96 | self.lines = open( fn ).readlines() 97 | print( f"Loaded {len(self.lines)} lines" ) 98 | 99 | def __len__(self): 100 | return len(self.lines) 101 | 102 
| def __getitem__(self, idx): 103 | return json.loads( self.lines[idx] )['text'].strip() 104 | 105 | # 106 | # ========================================================================== 107 | # 108 | 109 | args = parse_args() 110 | 111 | print( "Loading model..." ) 112 | 113 | device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") 114 | 115 | tokenizer = AutoTokenizer.from_pretrained( args.model, cache_dir=args.cache_dir )#, local_files_only=True ) 116 | tokenizer.pad_token = tokenizer.eos_token 117 | #tokenizer.add_special_tokens({'pad_token': '[PAD]'}) # XXX this seems to cause problems??? 118 | 119 | # long context model 120 | model_lc = AutoModelForCausalLM.from_pretrained( args.model, cache_dir=args.cache_dir )#, local_files_only=True ) 121 | model_lc.eval() 122 | model_lc.to( device ) 123 | 124 | # distilled prompt model 125 | model_dp = GPT2LMPlus.from_pretrained( args.model, cache_dir=args.cache_dir ) 126 | model_dp.set_up_prompt_tuning( args.dim, 'before_pe' ) 127 | #model_dp.load_prompt_checkpoint( './weights_latest.npy' ) 128 | model_dp.to( device ) 129 | 130 | # 131 | # ========================================================================== 132 | # 133 | 134 | print( "Loading dataset" ) 135 | 136 | train_dataset = TextDataset( args.train_datafile ) 137 | train_dataloader = DataLoader( train_dataset, shuffle=False, batch_size=args.batch_size ) # XXX if you change batch_size, go think about the attention_mask and the loss! 138 | 139 | optimizer = Adam( model_dp.parameters(), lr=args.lr ) 140 | 141 | lr_scheduler = get_scheduler( name=args.lrs_type, optimizer=optimizer, num_warmup_steps=args.num_warmup_steps, num_training_steps=args.num_training_steps ) 142 | 143 | CONTEXT = KNOWN_TEXTS[ args.context ] 144 | 145 | # 146 | # ========================================================================== 147 | # 148 | 149 | prompt_token_struct = tokenizer( CONTEXT.replace('YYY',''), return_tensors='pt', padding=True, return_token_type_ids=False ) 150 | prompt_tokens = prompt_token_struct['input_ids'] 151 | num_prompt_toks = prompt_tokens.shape[1] 152 | 153 | step_ind = 0 154 | losses = [] 155 | for epoch in range(args.num_epochs): 156 | print( "EPOCH:", epoch ) 157 | 158 | for (batch_ind, batch) in enumerate( train_dataloader ): 159 | 160 | if step_ind % 1000 == 0: 161 | print( "CHECKPOINTING..." 
) 162 | #np.save( f"./losses_b_{batch_ind}_e_{epoch}.npy", losses ) 163 | model_dp.save_prompt_checkpoint( f"./weights_cp_model_{args.model}_context_{args.context}_dim_{args.dim}_e_{epoch}_si_{step_ind}_ts_{args.num_training_steps}_newds_lr_{args.lr}.npy" ) 164 | 165 | if step_ind > args.num_training_steps: 166 | break 167 | 168 | step_ind += 1 169 | 170 | orig_text = batch[0] # batch is a list of strings 171 | token_struct = tokenizer( orig_text, return_tensors='pt', padding=True, return_token_type_ids=False ) 172 | 173 | maxlen = 1024 - num_prompt_toks - args.dim - 5 174 | token_struct['input_ids'] = token_struct['input_ids'][:,-maxlen:] 175 | token_struct['attention_mask'] = token_struct['attention_mask'][:,-maxlen:] 176 | num_toks = token_struct['input_ids'].shape[1] 177 | 178 | new_token_struct = {} 179 | new_token_struct['input_ids'] = torch.cat( (prompt_tokens,token_struct['input_ids']), dim=1 ) 180 | new_token_struct['attention_mask'] = torch.ones( (1,new_token_struct['input_ids'].shape[1] ) ) 181 | 182 | try: 183 | 184 | outputs_dp = trim_ship_and_run( token_struct, model_dp, args.dim ) 185 | with torch.no_grad(): 186 | outputs_lc = trim_ship_and_run( new_token_struct, model_lc, args.dim ) 187 | model_logits = normalize_full_logits( outputs_dp[0] ) 188 | target_logits = normalize_full_logits( outputs_lc[0][:,-num_toks:,:] ) 189 | 190 | # train just the nucleus 191 | # union of the nuclei 192 | # kl divergence 193 | # XXX what about the attention mask? 194 | loss = torch.mean( torch.sum( torch.exp( model_logits )*(model_logits-target_logits), dim=2 ) ) 195 | 196 | loss.backward() 197 | optimizer.step() 198 | lr_scheduler.step() 199 | optimizer.zero_grad() 200 | 201 | li = loss.item() 202 | 203 | if step_ind > 500: 204 | tmp = np.mean(np.array(losses)[-500:]) 205 | print( f"{epoch}\t{batch_ind}\t{tmp:0.5f}\t{li:0.5f}" ) 206 | else: 207 | print( f"{epoch}/{batch_ind}\t{li:0.5f}" ) 208 | 209 | losses.append( li ) 210 | 211 | except Exception as e: 212 | print( "-------------------------> ERROR: ", len(orig_text), str(e) ) 213 | 214 | tmp = np.mean( np.array(losses)[-500:] ) 215 | print( f"MEAN OF LAST 500 LOSSES: {tmp:0.5f}" ) 216 | 217 | model_dp.save_prompt_checkpoint( f"./weights_latest_model_{args.model}_context_{args.context}_dim_{args.dim}_ts_{args.num_training_steps}_newds_lr_{args.lr}.npy" ) 218 | --------------------------------------------------------------------------------
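
Editor's note on the loss in src/train.py: the single-line loss expression, torch.mean(torch.sum(torch.exp(model_logits)*(model_logits-target_logits), dim=2)), is, at each token position, the KL divergence KL(p_student || p_teacher) between the distilled-prompt model's next-token distribution (model_logits) and the long-context model's distribution (target_logits), summed over the vocabulary and averaged over batch and sequence; both tensors hold log-probabilities after normalize_full_logits(). The sketch below restates the same quantity with torch.nn.functional.kl_div, purely as an illustration: the function name distillation_loss is introduced here and does not exist in the repository, and the training script itself is unchanged.

    import torch
    import torch.nn.functional as F

    def distillation_loss(model_logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
        # model_logits:  log-probs from the distilled-prompt (student) model, shape [batch, seq, vocab]
        # target_logits: log-probs from the long-context (teacher) model, same shape
        # With log_target=True, F.kl_div computes exp(target) * (target - input) elementwise,
        # so passing (input=target_logits, target=model_logits) reproduces
        # exp(model_logits) * (model_logits - target_logits), i.e. KL(student || teacher) per vocab entry.
        kl_per_position = F.kl_div(target_logits, model_logits, reduction="none", log_target=True).sum(dim=2)
        return kl_per_position.mean()

Note that this is the student-to-teacher direction of the KL divergence (mode-seeking), rather than the teacher-to-student cross-entropy form more common in distillation; the in-code comment "# kl divergence" suggests this direction is intentional, and the sketch preserves it.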
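
Editor's note on the Generator wrapper in the generation script: gen_completion(), which combines the normal, positive, and negative Generators contrastively (using --omega and --tau), is referenced but not included in this snapshot, so no example of the contrastive combination is attempted here. The sketch below only drives a single Generator with plain temperature / top-p sampling, to illustrate the set_prompt() / get_next_logits() / append_new_tok() cycle. It assumes a CUDA device (as the wrapper itself does), and the helper names sample_top_p, run_single_generator, and max_new_tokens are hypothetical, introduced for illustration only. Because the past_key_values update in get_next_logits() is commented out, every step re-runs the full sequence; the sketch inherits that behavior.

    import torch
    import torch.nn.functional as F

    def sample_top_p(logits: torch.Tensor, temperature: float = 1.0, top_p: float = 0.9) -> int:
        # logits: shape [1, vocab], as returned by Generator.get_next_logits()
        probs = F.softmax(logits.squeeze(0) / temperature, dim=-1)
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
        keep[0] = True                                   # always keep the most likely token
        sorted_probs = sorted_probs * keep
        sorted_probs = sorted_probs / sorted_probs.sum()
        choice = torch.multinomial(sorted_probs, num_samples=1)
        return int(sorted_idx[choice])

    def run_single_generator(gen, tokenizer, prompt_text: str, max_new_tokens: int = 40) -> str:
        gen.set_prompt(prompt_text)
        with torch.no_grad():
            for _ in range(max_new_tokens):
                next_logits = gen.get_next_logits()      # re-encodes the whole sequence each step
                new_tok = sample_top_p(next_logits)
                gen.append_new_tok(new_tok)
                if new_tok == tokenizer.eos_token_id:
                    break
        return tokenizer.decode(gen.get_tokens()[0])

set_hard_context() additionally calls set_prompt_tokens() on the underlying model, a method the hard-context class GPT2LMHC is expected to provide and which the plain AutoModelForCausalLM used for the normal model does not, so it should only be called on posmodel/negmodel when --hard is set, exactly as the script already does.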