├── .gitignore
├── README.md
├── notebooks
│   └── train_emotions_classifier.ipynb
├── pyproject.toml
├── setup.cfg
├── src
│   └── liqfit
│       ├── __init__.py
│       ├── collators
│       │   ├── __init__.py
│       │   ├── base_collator.py
│       │   └── nli_collator.py
│       ├── datasets
│       │   ├── __init__.py
│       │   ├── nli_dataset.py
│       │   └── transform.py
│       ├── losses
│       │   ├── __init__.py
│       │   └── losses.py
│       ├── modeling
│       │   ├── __init__.py
│       │   ├── backbone.py
│       │   ├── heads.py
│       │   ├── model.py
│       │   └── pooling.py
│       ├── models
│       │   ├── __init__.py
│       │   ├── deberta.py
│       │   └── t5.py
│       ├── pipeline
│       │   ├── __init__.py
│       │   └── inference.py
│       └── utils
│           ├── __init__.py
│           ├── metrics.py
│           ├── standardization.py
│           └── transforms.py
└── tests
    ├── __init__.py
    ├── test_losses.py
    ├── test_models.py
    └── test_pipeline.py
/.gitignore:
--------------------------------------------------------------------------------
1 | demo.ipynb
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |

2 | 🤗 Models | 📕 Documentation | 📖 Blog 3 |
4 | . . . 5 |

6 |
7 | # LiqFit - Flexible Few-shot Learning Library.
8 |
9 | LiqFit is an easy-to-use framework for few-shot learning of cross-encoder models. Such models are trained to distinguish whether two statements entail each other, contradict each other, or are neutral. This task setting is universal for many information extraction tasks, ranging from text classification to named entity recognition and question answering. With LiqFit, you can achieve competitive results with just 8 examples per label.
10 |
11 |
12 | Key features and benefits of LiqFit are:
13 | * 🔢 **A small number of examples is required** - LiqFit can significantly improve the accuracy of the default zero-shot classifier with just 8 examples per label;
14 | * 📝 **Can solve many different information-extraction tasks** - Natural language inference is a universal task that can serve as a setting for many other information extraction tasks, such as named entity recognition or question answering;
15 | * 🌈 **Can work for classes not present in the training set** - It's not mandatory to have all needed classes in the training set. Because of pre-finetuning on large amounts of NLI and classification tasks, the model preserves its ability to generalise to other classes;
16 | * ⚙️ **Support for a variety of cross-encoder realisations** - LiqFit supports different types of cross-encoders, including conventional, binary, and encoder-decoder architectures;
17 | * ⚖️ **Robust to unbalanced datasets** - LiqFit uses normalisation techniques that allow it to work well even on unbalanced data;
18 | * 🏷️ **Multi-label classification support** - The approach can be applied to both multi-class and multi-label classification;
19 |
20 | Limitations:
21 | * 🤔 The transformer's forward pass must be run N times per input, where N is the number of labels;
22 |
23 |
24 | ## Installation
25 |
26 | Download and install `LiqFit` by running:
27 |
28 | ```bash
29 | pip install liqfit
30 | ```
31 |
32 | For the most up-to-date version, you can build from source code by executing:
33 |
34 | ```bash
35 | pip install git+https://github.com/knowledgator/LiqFit.git
36 | ```
37 |
38 | ## How to use:
39 | Check out more real examples in the `notebooks` section.
40 |
41 | ```python
42 | from liqfit.modeling import LiqFitModel
43 | from liqfit.losses import FocalLoss
44 | from liqfit.collators import NLICollator
45 | from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
46 |
47 | backbone_model = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-xsmall')
48 | tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-xsmall')
49 | loss_func = FocalLoss(multi_target=True)
50 |
51 | model = LiqFitModel(backbone_model.config, backbone_model, loss_func=loss_func)
52 |
53 | data_collator = NLICollator(tokenizer, max_length=128, padding=True, truncation=True)
54 |
55 |
56 | training_args = TrainingArguments(
57 |     output_dir='comprehendo',
58 |     learning_rate=3e-5,
59 |     per_device_train_batch_size=3,
60 |     per_device_eval_batch_size=3,
61 |     num_train_epochs=9,
62 |     weight_decay=0.01,
63 |     evaluation_strategy="epoch",
64 |     save_steps=5000,
65 |     save_total_limit=3,
66 |     remove_unused_columns=False,
67 | )
68 |
69 | trainer = Trainer(
70 |     model=model,
71 |     args=training_args,
72 |     train_dataset=nli_train_dataset,
73 |     eval_dataset=nli_test_dataset,
74 |     tokenizer=tokenizer,
75 |     data_collator=data_collator,
76 | )
77 |
78 | trainer.train()
79 | ```
80 | Please check more examples in the `notebooks` section.
81 |
82 | ...
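The training example above assumes that `nli_train_dataset` and `nli_test_dataset` already exist. Below is a minimal sketch of how they could be prepared with `liqfit.datasets.NLIDataset`; the dataset name (`dair-ai/emotion`), the class list, the few-shot subsampling, and the variable names `classes`/`test_dataset` are illustrative assumptions, not part of the library itself.

```python
# Illustrative sketch: build NLI-style training data for LiqFit.
# Assumption: the `dair-ai/emotion` dataset from the Hugging Face Hub and its
# six emotion labels; adapt the dataset name and columns to your own task.
from datasets import load_dataset
from liqfit.datasets import NLIDataset

classes = ["sadness", "joy", "love", "anger", "fear", "surprise"]

emotion = load_dataset("dair-ai/emotion")
test_dataset = emotion["test"]

# Small random subset for few-shot training (~8 examples per label on average,
# not stratified per class).
few_shot_train = emotion["train"].shuffle(seed=42).select(range(8 * len(classes)))

# Convert the classification examples into (text, hypothesis, label) pairs using
# the zero-shot template; these objects can be passed directly to the Trainer above.
nli_train_dataset = NLIDataset.load_dataset(
    dataset=few_shot_train,
    classes=classes,
    text_column="text",
    label_column="label",
    template="This example is {}.",
)
nli_test_dataset = NLIDataset.load_dataset(
    dataset=test_dataset,
    classes=classes,
    text_column="text",
    label_column="label",
)
```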
83 |
84 | To run inference, we recommend using the `ZeroShotClassificationPipeline`:
85 |
86 | ```python
87 | from liqfit import ZeroShotClassificationPipeline
88 | from sklearn.metrics import classification_report
89 | from tqdm import tqdm
90 |
91 | classifier = ZeroShotClassificationPipeline(model=model, tokenizer=tokenizer)
92 |
93 |
94 | label2idx = {label: i for i, label in enumerate(classes)}
95 |
96 | preds = []
97 | idx = 0  # fallback prediction index reused for examples with empty text
98 | for example in tqdm(test_dataset):
99 |     if not example['text']:
100 |         preds.append(idx)
101 |         continue
102 |     pred = classifier(example['text'], classes)['labels'][0]
103 |     idx = label2idx[pred]
104 |     preds.append(idx)
105 |
106 | print(classification_report(test_dataset['label'][:len(preds)], preds, target_names=classes, digits=4))
107 | ```
108 |
109 | ## Benchmarks:
110 | | Model & examples per label | Emotion | AgNews | SST5 |
111 | |-|-|-|-|
112 | | Comprehend-it/0 | 56.60 | 79.82 | 37.9 |
113 | | Comprehend-it/8 | 63.38 | 85.9 | 46.67 |
114 | | Comprehend-it/64 | 80.7 | 88 | 47 |
115 | | SetFit/0 | 57.54 | 56.36 | 24.11 |
116 | | SetFit/8 | 56.81 | 64.93 | 33.61 |
117 | | SetFit/64 | 79.03 | 88 | 45.38 |
118 |
119 | LiqFit used the [knowledgator/comprehend_it-base model](https://huggingface.co/knowledgator/comprehend_it-base), while for [SetFit](https://github.com/huggingface/setfit) we utilized [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5).
120 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["hatchling<=1.18.0"]
3 | build-backend = "hatchling.build"
4 |
5 | [project]
6 | name = "liqfit"
7 | version = "1.0.0"
8 |
9 | requires-python = ">=3.7"
10 |
11 | description = "Flexible Few-shot learning tool."
12 | license = "MIT" 13 | long_description = "file: README.md" 14 | 15 | classifiers = [ 16 | "Programming Language :: Python :: 3", 17 | "License :: OSI Approved :: MIT License", 18 | "Operating System :: OS Independent", 19 | ] 20 | 21 | dependencies = [ 22 | "kornia", 23 | "transformers", 24 | "accelerate", 25 | ] 26 | 27 | 28 | [options] 29 | packages = "./src/liqfit" 30 | zip_safe = "True" 31 | 32 | 33 | [tool.black] 34 | line-length = 80 35 | target-version = ['py37'] -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [flake8] 2 | per-file-ignores = __init__.py:F401 3 | -------------------------------------------------------------------------------- /src/liqfit/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Knowledgator/LiqFit/51ba2714813ae1cf110f7e600cd7f2663cdec39c/src/liqfit/__init__.py -------------------------------------------------------------------------------- /src/liqfit/collators/__init__.py: -------------------------------------------------------------------------------- 1 | from .base_collator import Collator 2 | from .nli_collator import NLICollator 3 | -------------------------------------------------------------------------------- /src/liqfit/collators/base_collator.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | import abc 3 | from typing import Union 4 | 5 | 6 | class Collator(abc.ABC): 7 | def __init__( 8 | self, 9 | tokenizer, 10 | max_length: int, 11 | padding: Union[bool, str], 12 | truncation: bool, 13 | ): 14 | self.tokenizer = tokenizer 15 | self.max_length = max_length 16 | self.padding = padding 17 | self.truncation = truncation 18 | 19 | @abc.abstractmethod 20 | def collate(self, batch): 21 | raise NotImplementedError("Should be implemented in a subclass.") 22 | 23 | def __call__(self, batch): 24 | grouped_batch = defaultdict(list) 25 | for example in batch: 26 | for k, v in example.items(): 27 | grouped_batch[k].append(v) 28 | output = self.collate(grouped_batch) 29 | return output 30 | -------------------------------------------------------------------------------- /src/liqfit/collators/nli_collator.py: -------------------------------------------------------------------------------- 1 | from typing import Callable 2 | import torch 3 | 4 | from .base_collator import Collator 5 | from typing import Union 6 | 7 | 8 | class NLICollator(Collator): 9 | def __init__( 10 | self, 11 | tokenizer: Callable, 12 | max_length: int, 13 | padding: Union[bool, str], 14 | truncation: bool, 15 | ): 16 | super().__init__( 17 | tokenizer, 18 | max_length=max_length, 19 | padding=padding, 20 | truncation=truncation, 21 | ) 22 | 23 | def _tokenize_and_align_labels(self, batch): 24 | texts = batch.get("texts", None) 25 | if texts is None: 26 | raise ValueError( 27 | "Expected to find a key with name 'texts' that " 28 | "contains a list of tuples where each tuple " 29 | "contains the hypothesis and the premise. 
" 30 | f"Received: {batch.keys()}" 31 | ) 32 | tokenized_input = self.tokenizer( 33 | texts, 34 | max_length=self.max_length, 35 | padding=self.padding, 36 | truncation=self.truncation, 37 | return_tensors="pt", 38 | ) 39 | labels = torch.tensor(batch["labels"]) 40 | tokenized_input.update({"labels": labels}) 41 | return tokenized_input 42 | 43 | def collate(self, batch): 44 | tokenized_input = self._tokenize_and_align_labels(batch) 45 | return tokenized_input 46 | -------------------------------------------------------------------------------- /src/liqfit/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | from .nli_dataset import NLIDataset 2 | from .transform import transform_dataset 3 | -------------------------------------------------------------------------------- /src/liqfit/datasets/nli_dataset.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | 3 | from typing import Optional, List 4 | from datasets import Dataset, load_dataset 5 | 6 | from .transform import transform_dataset 7 | 8 | 9 | class NLIDataset: 10 | def __init__(self, hypothesis: List, premises: List, labels: List): 11 | """LiqFitDataset used for NLI training. 12 | 13 | Args: 14 | hypothesis (List): List of hypothesis texts. 15 | premises (List): List of premises texts. 16 | labels (List): List of labels for each example. 17 | """ 18 | self.hypothesis = hypothesis 19 | self.premises = premises 20 | self.labels = labels 21 | 22 | def __len__(self): 23 | equal_lengths = ( 24 | len(self.hypothesis) == len(self.premises) == len(self.labels) 25 | ) 26 | if not equal_lengths: 27 | raise ValueError( 28 | "Expected equal lengths between `self.hypothesis`" 29 | ", `self.premises` and `self.labels`. " 30 | f"Received: {len(self.hypothesis)} " 31 | f"- {len(self.premises)} - {len(self.labels)}." 32 | ) 33 | return len(self.hypothesis) 34 | 35 | def __getitem__(self, idx): 36 | return { 37 | "texts": (self.hypothesis[idx], self.premises[idx]), 38 | "labels": self.labels[idx], 39 | } 40 | 41 | @classmethod 42 | def load_dataset( 43 | cls, 44 | dataset: Optional[Dataset] = None, 45 | dataset_name: Optional[str] = None, 46 | classes: Optional[List[str]] = None, 47 | text_column: Optional[str] = "text", 48 | label_column: Optional[str] = "label", 49 | template: Optional[str] = "This example is {}.", 50 | normalize_negatives: bool = False, 51 | positives: int = 1, 52 | negatives: int = -1, 53 | multi_label: bool = False, 54 | ) -> NLIDataset: 55 | """Returns a `NLIDataset` instance. 56 | 57 | Args: 58 | dataset (Optional[Dataset], optional): Instance of Huggingface 59 | Dataset class. Defaults to None. 60 | dataset_name (Optional[str], optional): Dataset name to load from 61 | Huggingface datasets. Defaults to None. 62 | classes (Optional[List[str]], optional): List of classes. 63 | Defaults to None. 64 | text_column (Optional[str], optional): Text column name. 65 | Defaults to 'text'. 66 | label_column (Optional[str], optional): Label column name. 67 | Defaults to 'label'. 68 | template (Optional[str], optional): Template string that will be 69 | used for Zero-Shot training/prediction. Defaults to 70 | 'This example is {}.'. 71 | normalize_negatives (bool, optional): Whether to normalize amount 72 | of negative examples per each positive example of a class. 73 | Defaults to False. 74 | positives (int, optional): Number of positive examples to generate 75 | per source. Defaults to 1. 
76 | negatives (int, optional): Number of negative examples to generate 77 | per source. Defaults to -1. 78 | multi_label (bool, optional): Whether each example has multiple 79 | labels or not. Defaults to False. 80 | 81 | Raises: 82 | TypeError: if `dataset_name` is `None` while `dataset` instance is 83 | not passed. 84 | TypeError: if `label_name` is `None`. 85 | TypeError: if `text_column` is `None` while `dataset` instance is 86 | not passed. 87 | TypeError: if `label_column` is `None` while `classes` is `None`. 88 | 89 | Returns: 90 | LiqFitDataset: An instance of LiqFitDataset. 91 | """ 92 | if dataset is None: 93 | if dataset_name is None: 94 | raise TypeError( 95 | "If dataset object is not provided you need to" 96 | " specify dataset_name." 97 | ) 98 | else: 99 | dataset = load_dataset(dataset_name)["train"] 100 | 101 | if label_column not in dataset.features: 102 | raise TypeError(f"Expected to find {label_column} in the dataset.") 103 | 104 | if text_column not in dataset.features: 105 | raise TypeError(f"Expected to find {text_column} in the dataset.") 106 | 107 | if classes is None: 108 | raise ValueError( 109 | f"Expected to have a list classes. Received: {classes}." 110 | ) 111 | 112 | processed_data = transform_dataset( 113 | dataset, 114 | classes, 115 | text_column, 116 | label_column, 117 | template, 118 | normalize_negatives, 119 | positives, 120 | negatives, 121 | multi_label, 122 | ) 123 | 124 | return cls( 125 | processed_data["sources"], 126 | processed_data["targets"], 127 | processed_data["labels"], 128 | ) 129 | -------------------------------------------------------------------------------- /src/liqfit/datasets/transform.py: -------------------------------------------------------------------------------- 1 | from typing import List, Tuple, Optional 2 | from collections import defaultdict 3 | from datasets import Dataset 4 | import numpy as np 5 | import random 6 | 7 | 8 | def get_labels_stat(labels: List[str]) -> Tuple[List[str], List[float]]: 9 | """Calculates the number of occurrences and probability of each unique 10 | label in the provided list of labels. 11 | 12 | Args: 13 | labels (List[str]): List of label strings 14 | 15 | Returns: 16 | unique_labels (List[str]): Unique label values 17 | probs (List[float]): Probability of each label 18 | """ 19 | # count occurrences of each label 20 | label_counts = defaultdict(int) 21 | for label in labels: 22 | label_counts[label] += 1 23 | 24 | # calculate probabilities 25 | count = len(labels) 26 | label_probs = { 27 | label: label_count / count 28 | for label, label_count in label_counts.items() 29 | } 30 | 31 | # extract labels and probabilities 32 | unique_labels = list(label_probs.keys()) 33 | probs = list(label_probs.values()) 34 | 35 | return unique_labels, probs 36 | 37 | 38 | def transform_dataset( 39 | dataset: Dataset, 40 | classes: List[str], 41 | text_column: Optional[str] = "text", 42 | label_column: Optional[str] = "label", 43 | template: Optional[str] = "This example is {}.", 44 | normalize_negatives: bool = False, 45 | positives: int = 1, 46 | negatives: int = -1, 47 | multi_label: bool = False, 48 | ) -> Dataset: 49 | """Transform a dataset into a format suitable for training. 50 | 51 | Args: 52 | dataset (Dataset): Input dataset. 53 | classes (List[str]): List of possible class labels. 54 | template (str, optional): Template string for generating examples. 55 | normalize_negatives (bool, optional): Whether to normalize amount of 56 | negative examples per each positive example of a class. 
57 | positives (int, optional): Number of positive examples to generate per source. 58 | negatives (int, optional): Number of negative examples to generate per source. 59 | 60 | 61 | Returns: 62 | Dataset: Transformed dataset. 63 | 64 | This function transforms the input dataset into a format suitable for 65 | multi-label discriminative training. For each source text, it generates 66 | positive examples using the provided labels, and negative examples by 67 | sampling random incorrect labels. 68 | """ 69 | new_dataset = {"sources": [], "targets": [], "labels": []} 70 | 71 | texts = dataset[text_column] 72 | 73 | if label_column == "all_labels": 74 | labels = dataset["all_labels"] 75 | multi_label = True 76 | elif label_column in dataset.features: 77 | labels = dataset[label_column] 78 | if type(labels[0]) == int: 79 | labels = [classes[idx] for idx in labels] 80 | else: 81 | raise NotImplementedError( 82 | 'Dataset should contains "label" or "all_labels" columns' 83 | ) 84 | 85 | if normalize_negatives: 86 | unique_labels, probs = get_labels_stat(labels) 87 | 88 | if positives == -1: 89 | positives = len(classes) - 1 90 | if negatives == -1: 91 | negatives = len(classes) - 1 92 | 93 | for text, label in zip(texts, labels): 94 | if multi_label: 95 | curr_labels = label 96 | else: 97 | curr_labels = [label] 98 | 99 | for label in curr_labels: 100 | for i in range(positives): 101 | new_dataset["sources"].append(text) 102 | new_dataset["targets"].append(template.format(label)) 103 | new_dataset["labels"].append(1) 104 | 105 | for _ in range(len(classes) - 1): 106 | neg_class_ = label 107 | 108 | while neg_class_ in curr_labels: 109 | if normalize_negatives: 110 | neg_class_ = np.random.choice(unique_labels, p=probs) 111 | else: 112 | neg_class_ = random.sample(classes, k=1)[0] 113 | 114 | new_dataset["sources"].append(text) 115 | new_dataset["targets"].append(template.format(neg_class_)) 116 | new_dataset["labels"].append(0) 117 | 118 | return Dataset.from_dict(new_dataset) 119 | -------------------------------------------------------------------------------- /src/liqfit/losses/__init__.py: -------------------------------------------------------------------------------- 1 | from .losses import cross_entropy 2 | from .losses import binary_cross_entropy_with_logits 3 | from .losses import focal_loss_with_mask 4 | from .losses import BinaryCrossEntropyLoss, CrossEntropyLoss, FocalLoss 5 | -------------------------------------------------------------------------------- /src/liqfit/losses/losses.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | from typing import Optional 3 | import torch.nn.functional as F 4 | from kornia.losses import focal_loss 5 | import torch 6 | 7 | 8 | def binary_cross_entropy_with_logits(logits: torch.Tensor, 9 | labels: torch.Tensor, 10 | multi_target: bool = False, 11 | weight: Optional[torch.Tensor] = None, 12 | reduction: str = 'mean') -> torch.Tensor: 13 | """Wrapper function for adding support for multi_target training. 14 | 15 | Args: 16 | logits (torch.Tensor): Tensor with shape (B, T, D) where B is batch 17 | size, T is timesteps and D is embedding dimension. 18 | labels (torch.Tensor): Tensor with shape (B, T) where B is batch size, 19 | T is timesteps. 20 | multi_target (bool, optional): Whether the labels are multi target or 21 | one target for the entire sequence. Defaults to False. 
22 | weight (Optional[torch.Tensor], optional): a manual rescaling weight 23 | if provided it's repeated to match input tensor shape. 24 | Defaults to None. 25 | reduction (str, optional): Reduction type that will be applied on the 26 | loss function, supported: 'mean', 'sum' or 'none'. 27 | Defaults to 'mean'. 28 | 29 | Returns: 30 | torch.Tensor: Loss tensor. 31 | """ 32 | if multi_target: 33 | logits = logits.view(-1, logits.shape[-1]) 34 | labels = labels.view(-1) 35 | else: 36 | labels = labels.view(-1) 37 | loss = F.binary_cross_entropy_with_logits(logits, 38 | labels, 39 | weight=weight, 40 | reduction=reduction) 41 | return loss 42 | 43 | 44 | class BinaryCrossEntropyLoss(torch.nn.Module): 45 | 46 | def __init__(self, multi_target=False, weight=None, reduction='mean'): 47 | super().__init__() 48 | """Calculate binary cross-entropy loss with support for multi target training. 49 | 50 | Args: 51 | multi_target (bool, optional): Whether the labels are multi target or 52 | one target for the entire sequence. Defaults to False. 53 | weight (Optional[torch.Tensor], optional): a manual rescaling weight 54 | if provided it's repeated to match input tensor shape. 55 | Defaults to None. 56 | reduction (str, optional): Reduction type that will be applied on the 57 | loss function, supported: 'mean', 'sum' or 'none'. 58 | Defaults to 'mean'. 59 | 60 | Returns: 61 | torch.Tensor: Loss tensor. 62 | Examples: 63 | loss = BinaryCrossEntropyLoss()(logits, targets) 64 | """ 65 | self.multi_target = multi_target 66 | self.weight = weight 67 | self.reduction = reduction 68 | 69 | def forward(self, logits, target): 70 | 71 | loss = binary_cross_entropy_with_logits( 72 | logits, 73 | target, 74 | multi_target=self.multi_target, 75 | weight=self.weight, 76 | reduction=self.reduction, 77 | ) 78 | 79 | return loss 80 | 81 | 82 | def cross_entropy(logits: torch.Tensor, 83 | labels: torch.Tensor, 84 | multi_target: bool = False, 85 | weight: Optional[torch.Tensor] = None, 86 | ignore_index: int = -100, 87 | reduction: str = 'mean', 88 | label_smoothing: float = 0.0): 89 | """Wrapper function for adding support for multi_target training. 90 | 91 | Args: 92 | logits (torch.Tensor): Tensor with shape (B, T, D) where B is batch 93 | size, T is timesteps and D is embedding dimension. 94 | labels (torch.Tensor): Tensor with shape (B, T) where B is batch size, 95 | T is timesteps. 96 | multi_target (bool, optional): Whether the labels are multi target or 97 | one target for the entire sequence. Defaults to False. 98 | weight (Optional[torch.Tensor], optional): a manual rescaling weight 99 | if provided it's repeated to match input tensor shape. 100 | Defaults to None. 101 | ignore_index (int, optional): Index value that will be ignored during 102 | loss calculation. Defaults to -100. 103 | reduction (str, optional): Reduction type that will be applied on the 104 | loss function, supported: 'mean', 'sum' or 'none'. 105 | Defaults to 'mean'. 106 | label_smoothing (float, optional): A float in [0.0, 1.0]. Specifies 107 | the amount of smoothing when computing the loss, where 0.0 means 108 | no smoothing. Defaults to 0.0. 109 | 110 | Returns: 111 | torch.Tensor: Loss tensor. 
112 | """ 113 | if multi_target: 114 | logits = logits.view(-1, logits.shape[-1]) 115 | labels = labels.view(-1) 116 | else: 117 | labels = labels.view(-1) 118 | loss = F.cross_entropy(logits, 119 | labels, 120 | weight=weight, 121 | reduction=reduction, 122 | ignore_index=ignore_index, 123 | label_smoothing=label_smoothing) 124 | return loss 125 | 126 | 127 | class CrossEntropyLoss(torch.nn.Module): 128 | 129 | def __init__(self, multi_target=False, weight=None, ignore_index=-100, reduction='mean', label_smoothing=0.0): 130 | super().__init__() 131 | """Calculate cross-entropy loss while ignoring specified target labels. 132 | 133 | Args: 134 | multi_target (bool, optional): Whether the labels are multi target or 135 | one target for the entire sequence. Defaults to False. 136 | weight (Optional[torch.Tensor], optional): a manual rescaling weight 137 | if provided it's repeated to match input tensor shape. 138 | Defaults to None. 139 | ignore_index (int, optional): Index value that will be ignored during 140 | loss calculation. Defaults to -100. 141 | reduction (str, optional): Reduction type that will be applied on the 142 | loss function, supported: 'mean', 'sum' or 'none'. 143 | Defaults to 'mean'. 144 | label_smoothing (float, optional): A float in [0.0, 1.0]. Specifies 145 | the amount of smoothing when computing the loss, where 0.0 means 146 | no smoothing. Defaults to 0.0. 147 | 148 | Returns: 149 | torch.Tensor: Loss tensor. 150 | Examples: 151 | loss = CrossEntropyLoss()(logits, targets) 152 | """ 153 | self.multi_target = multi_target 154 | self.weight = weight 155 | self.ignore_index = ignore_index 156 | self.reduction = reduction 157 | self.label_smoothing = label_smoothing 158 | 159 | def forward(self, logits, target): 160 | 161 | loss = cross_entropy( 162 | logits, 163 | target, 164 | multi_target=self.multi_target, 165 | weight=self.weight, 166 | ignore_index=self.ignore_index, 167 | reduction=self.reduction, 168 | label_smoothing=self.label_smoothing 169 | ) 170 | 171 | return loss 172 | 173 | 174 | def focal_loss_with_mask( 175 | logits: torch.Tensor, 176 | target: torch.Tensor, 177 | ignore_index: int = -100, 178 | alpha: float = 0.5, 179 | gamma: float = 2.0, 180 | reduction: str | None = "mean", 181 | ) -> torch.Tensor: 182 | """Calculate focal loss while ignoring specified target labels. 183 | 184 | Args: 185 | logits (torch.Tensor): Model predictions. 186 | target (torch.Tensor): True labels. 187 | ignore_index (int): Label to ignore from loss calculation. 188 | alpha (float): Focal loss alpha parameter. 189 | gamma (float): Focal loss gamma parameter. 190 | reduction (str | None): Method to reduce loss. 191 | 192 | Returns: 193 | torch.Tensor: Loss tensor. 194 | 195 | This function calculates the focal loss between logits and targets, 196 | while ignoring any examples where the target is equal to ignore_index. 197 | 198 | Examples: 199 | 200 | loss = focal_loss_with_mask(logits, targets, ignore_index=-100) 201 | """ 202 | if not isinstance(ignore_index, int): 203 | raise ValueError('Expected `ignore_index` to be of type `int`. ' 204 | f'Received: {type(ignore_index)}') 205 | 206 | mask = target == ignore_index 207 | 208 | # To make focal_loss function work because 209 | # it cannot work with -ve numbers (e.g. -100). 
210 | if ignore_index != 0: 211 | target_without_ignore_index = target.masked_fill(mask, 0) 212 | 213 | loss = focal_loss( 214 | pred=logits, 215 | target=target_without_ignore_index, 216 | alpha=alpha, 217 | gamma=gamma, 218 | reduction="none", 219 | ) 220 | 221 | loss = loss.masked_fill(mask.view(-1, 1), torch.inf) 222 | 223 | if reduction == "mean": 224 | return loss[loss != torch.inf].mean() 225 | elif reduction == "sum": 226 | return loss[loss != torch.inf].sum() 227 | elif reduction is None: 228 | return loss 229 | else: 230 | raise ValueError( 231 | 'Expected reduction to be "sum", "mean" or `None`. ' 232 | f"Received: {reduction}." 233 | ) 234 | 235 | class FocalLoss(torch.nn.Module): 236 | def __init__( 237 | self, 238 | ignore_index: int = -100, 239 | alpha: float = 0.5, 240 | gamma: float = 2.0, 241 | reduction: str = "mean", 242 | ): 243 | """Calculate focal loss while ignoring specified target labels. 244 | Args: 245 | logits (torch.Tensor): Model predictions. 246 | target (torch.Tensor): True labels. 247 | ignore_index (int): Label to ignore from loss calculation. 248 | alpha: Weighting factor that ranges between [0, 1]`. 249 | gamma: Focusing parameter gamma >= 0`. 250 | reduction (str | None): Reduction type for loss reduction. 251 | Supported: 'mean', 'sum' or 'none'. Defaults to 'mean' 252 | 253 | Returns: 254 | torch.Tensor: Loss tensor. 255 | Examples: 256 | loss = FocalLoss()(logits, targets) 257 | """ 258 | super().__init__() 259 | self.ignore_index = ignore_index 260 | self.alpha = alpha 261 | self.gamma = gamma 262 | self.reduction = reduction 263 | 264 | def forward(self, logits: torch.Tensor, target: torch.Tensor): 265 | return focal_loss_with_mask( 266 | logits=logits, 267 | target=target, 268 | ignore_index=self.ignore_index, 269 | alpha=self.alpha, 270 | gamma=self.gamma, 271 | reduction=self.reduction, 272 | ) 273 | -------------------------------------------------------------------------------- /src/liqfit/modeling/__init__.py: -------------------------------------------------------------------------------- 1 | from .heads import LiqFitHead 2 | from .heads import LabelClassificationHead 3 | from .heads import ClassClassificationHead 4 | from .heads import ClassificationHead 5 | from .model import LiqFitModel 6 | from .backbone import LiqFitBackbone 7 | from .heads import HeadOutput 8 | -------------------------------------------------------------------------------- /src/liqfit/modeling/backbone.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | import abc 3 | 4 | import torch 5 | from torch import nn 6 | from transformers import PreTrainedModel, PretrainedConfig 7 | 8 | 9 | class LiqFitBackbone(PreTrainedModel, abc.ABC): 10 | def __init__( 11 | self, config: PretrainedConfig, backbone: nn.Module, push_backbone_only: bool = False 12 | ) -> None: 13 | """Backbone model wrapper.""" 14 | super().__init__(config=config) 15 | self.push_backbone_only = push_backbone_only 16 | self.backbone = backbone 17 | 18 | def push_to_hub( 19 | self, 20 | repo_id: str, 21 | use_temp_dir: bool | None = None, 22 | commit_message: str | None = None, 23 | private: bool | None = None, 24 | token: bool | str | None = None, 25 | max_shard_size: int | str | None = "5GB", 26 | create_pr: bool = False, 27 | safe_serialization: bool = True, 28 | revision: str = None, 29 | commit_description: str = None, 30 | **deprecated_kwargs, 31 | ) -> str: 32 | if self.push_backbone_only: 33 | output = 
self.backbone.push_to_hub( 34 | repo_id=repo_id, 35 | use_temp_dir=use_temp_dir, 36 | commit_message=commit_message, 37 | private=private, 38 | token=token, 39 | max_shard_size=max_shard_size, 40 | create_pr=create_pr, 41 | safe_serialization=safe_serialization, 42 | revision=revision, 43 | commit_description=commit_description, 44 | **deprecated_kwargs, 45 | ) 46 | else: 47 | output = super().push_to_hub( 48 | repo_id=repo_id, 49 | use_temp_dir=use_temp_dir, 50 | commit_message=commit_message, 51 | private=private, 52 | token=token, 53 | max_shard_size=max_shard_size, 54 | create_pr=create_pr, 55 | safe_serialization=safe_serialization, 56 | revision=revision, 57 | commit_description=commit_description, 58 | **deprecated_kwargs, 59 | ) 60 | return output 61 | 62 | @abc.abstractmethod 63 | def encode(self, input_ids, attention_mask=None) -> torch.Tensor: 64 | raise NotImplementedError("Should be implemented in a subclass.") 65 | 66 | -------------------------------------------------------------------------------- /src/liqfit/modeling/heads.py: -------------------------------------------------------------------------------- 1 | import abc 2 | from typing import Optional 3 | 4 | import torch 5 | from torch import nn 6 | from dataclasses import dataclass 7 | from transformers.modeling_outputs import ModelOutput 8 | 9 | from ..losses import binary_cross_entropy_with_logits, cross_entropy 10 | 11 | class LiqFitHead(nn.Module, abc.ABC): 12 | def __init__(self, *args, **kwargs) -> None: 13 | """LiqFitHead base class.""" 14 | super().__init__(*args, **kwargs) 15 | 16 | @abc.abstractmethod 17 | def compute_loss(self, logits, labels) -> torch.Tensor: 18 | raise NotImplementedError("Should be implemented in a subclass.") 19 | 20 | @staticmethod 21 | def init_weight(module): 22 | if isinstance(module, nn.Linear): 23 | nn.init.xavier_uniform_(module.weight) 24 | if module.bias is not None: 25 | nn.init.constant_(module.bias, 1e-2) 26 | 27 | @abc.abstractmethod 28 | def forward( 29 | self, embeddings: torch.Tensor, labels: Optional[torch.Tensor] = None 30 | ): 31 | pass 32 | 33 | @dataclass 34 | class HeadOutput(ModelOutput): 35 | embeddings: Optional[torch.Tensor] = None 36 | logits: Optional[torch.Tensor] = None 37 | loss: Optional[torch.Tensor] = None 38 | 39 | 40 | class LabelClassificationHead(LiqFitHead): 41 | def __init__( 42 | self, 43 | in_features: int, 44 | out_features: int, 45 | multi_target: bool, 46 | bias: bool = True, 47 | temperature: int = 1.0, 48 | eps: float = 1e-5, 49 | ): 50 | """Label Classification Head class for Binary or Multi-label tasks. 51 | 52 | Args: 53 | in_features (_type_): Number of input features. 54 | out_features (_type_): Number of output features. 55 | multi_target (_type_): Whether this class is for multi-target 56 | task or not. 57 | bias (bool, optional): Whether to add bias to the `Linear` 58 | layer or not. Defaults to True. 59 | temperature (int, optional): Temperature that will be used 60 | to calibrate the head to the task. Defaults to 1.0. 61 | eps (float, optional): Epsilon value for numirical stability. 62 | Defaults to 1e-5. 
63 | """ 64 | super().__init__() 65 | self.temperature = temperature 66 | self.eps = eps 67 | self.multi_target = multi_target 68 | self.linear = nn.Linear(in_features, out_features, bias=bias) 69 | LiqFitHead.init_weight(self.linear) 70 | 71 | def compute_loss(self, logits: torch.Tensor, labels: torch.Tensor): 72 | loss = binary_cross_entropy_with_logits( 73 | logits, labels, self.multi_target 74 | ) 75 | return loss 76 | 77 | def forward( 78 | self, embeddings: torch.Tensor, labels: Optional[torch.Tensor] = None 79 | ) -> torch.Tensor: 80 | logits = self.linear(embeddings) 81 | logits /= self.temperature + self.eps 82 | if labels is not None: 83 | loss = self.compute_loss(logits, labels) 84 | else: 85 | loss = None 86 | return HeadOutput(embeddings=embeddings, logits=logits, loss=loss) 87 | 88 | 89 | class ClassClassificationHead(LiqFitHead): 90 | def __init__( 91 | self, 92 | in_features: int, 93 | out_features: int, 94 | multi_target: bool, 95 | bias: bool = True, 96 | temperature: int = 1.0, 97 | eps: float = 1e-5, 98 | ignore_index: int = -100, 99 | ): 100 | """Class Classification Head class for Sequence/Token classification 101 | tasks. 102 | 103 | Args: 104 | in_features (int): Number of input features. 105 | out_features (int): Number of output features. 106 | multi_target (bool): Whether this class is for multi-target task 107 | or not. 108 | bias (bool, optional): Whether to add bias to the `Linear` 109 | layer or not. Defaults to True. 110 | temperature (int, optional): Temperature that will be used 111 | to calibrate the head to the task. Defaults to 1.0. 112 | eps (float, optional): Epsilon value for numirical stability. 113 | Defaults to 1e-5. 114 | ignore_index (int, optional): Index that will be ignore in 115 | case of token classification tasks. Defaults to -100. 116 | """ 117 | super().__init__() 118 | self.temperature = temperature 119 | self.eps = eps 120 | self.multi_target = multi_target 121 | self.ignore_index = ignore_index 122 | self.linear = nn.Linear(in_features, out_features, bias=bias) 123 | LiqFitHead.init_weight(self.linear) 124 | 125 | def compute_loss(self, logits: torch.Tensor, labels: torch.Tensor): 126 | return cross_entropy( 127 | logits, labels, self.multi_target, ignore_index=self.ignore_index 128 | ) 129 | 130 | def forward( 131 | self, embeddings: torch.Tensor, labels: Optional[torch.Tensor] = None 132 | ) -> torch.Tensor: 133 | logits = self.linear(embeddings) / (self.temperature + self.eps) 134 | if labels is not None: 135 | loss = self.compute_loss(logits, labels) 136 | else: 137 | loss = None 138 | return HeadOutput(embeddings=embeddings, logits=logits, loss=loss) 139 | 140 | 141 | class ClassificationHead(LiqFitHead): 142 | def __init__( 143 | self, 144 | in_features: int, 145 | out_features: int, 146 | pooler: nn.Module, 147 | loss_func: nn.Module, 148 | bias: bool = True, 149 | temperature: int = 1.0, 150 | eps: float = 1e-5, 151 | ): 152 | """Class Classification Head class for Sequence/Token classification 153 | tasks. 154 | 155 | Args: 156 | in_features (int): Number of input features. 157 | out_features (int): Number of output features. 158 | pooler (torch.nn.Module): Module that applier various pooling opperation on the outputs of a model . 159 | loss_func (torch.nn.Module): loss function object. 160 | out_features (int): Number of output features. 161 | bias (bool, optional): Whether to add bias to the `Linear` 162 | layer or not. Defaults to True. 
163 | temperature (int, optional): Temperature that will be used 164 | to calibrate the head to the task. Defaults to 1.0. 165 | eps (float, optional): Epsilon value for numirical stability. 166 | Defaults to 1e-5. 167 | ignore_index (int, optional): Index that will be ignore in 168 | case of token classification tasks. Defaults to -100. 169 | """ 170 | super().__init__() 171 | self.temperature = temperature 172 | self.eps = eps 173 | self.pooler = pooler 174 | self.loss_func = loss_func 175 | self.linear = nn.Linear(in_features, out_features, bias=bias) 176 | LiqFitHead.init_weight(self.linear) 177 | 178 | def compute_loss(self, logits: torch.Tensor, labels: torch.Tensor): 179 | return self.loss_func( 180 | logits, labels 181 | ) 182 | 183 | def forward( 184 | self, embeddings: torch.Tensor, labels: Optional[torch.Tensor] = None 185 | ) -> torch.Tensor: 186 | pooled_input = self.pooler(embeddings) 187 | logits = self.linear(pooled_input) / (self.temperature + self.eps) 188 | if labels is not None: 189 | loss = self.compute_loss(logits, labels) 190 | else: 191 | loss = None 192 | return HeadOutput(embeddings=pooled_input, logits=logits, loss=loss) 193 | -------------------------------------------------------------------------------- /src/liqfit/modeling/model.py: -------------------------------------------------------------------------------- 1 | from __future__ import annotations 2 | 3 | from typing import Optional 4 | 5 | import inspect 6 | import torch 7 | from torch import nn 8 | import torch.nn.functional as F 9 | from sklearn.linear_model import LogisticRegression 10 | from transformers import PreTrainedModel, PretrainedConfig 11 | 12 | from .backbone import LiqFitBackbone 13 | from .heads import LiqFitHead, HeadOutput 14 | from ..utils.standardization import convert_to_numpy 15 | 16 | class LiqFitModel(PreTrainedModel): 17 | def __init__( 18 | self, 19 | config: PretrainedConfig, 20 | backbone: LiqFitBackbone | nn.Module | PreTrainedModel, 21 | head: Optional[LiqFitHead | LogisticRegression] = None, 22 | loss_func: Optional[nn.Module] = None, 23 | normalize_backbone_embeddings: bool = False, 24 | labels_name: str = "labels", 25 | push_backbone_only: bool = False, 26 | ): 27 | """Model container that groups the backbone and head together 28 | and applies forward on both of them. 29 | 30 | Args: 31 | backbone (LiqFitBackbone): Backbone model. 32 | head (Optional[LiqFitHead | LogisticRegression], optional): 33 | Head that is defined for the task. Could be set to `None` 34 | if the head is already attached to the backbone. 35 | Defaults to None. 36 | loss_func (Optional[nn.Module]): class for calculation of loss functions. 37 | normalize_backbone_embeddings (bool, optional): Whether to 38 | normalize the backbone embeddings or not (Requires the 39 | backbone output to be a `torch.Tensor` not a Huggingface 40 | object). Defaults to False. 41 | labels_name (str, optional): Labels name that will be sent in the 42 | **kwargs for loss calculation. Defaults to "labels". 43 | 44 | Example 1: 45 | # make sure that the output from this model 46 | # is a torch.Tensor otherwise wrap it using LiqFitBackbone. 47 | my_backbone = AutoModel.from_pretrained(....) 48 | head = LiqFit.modeling.LabelClassificationHead(...) 49 | model = LiqFitModel(my_backbone.config, my_backbone, head) 50 | 51 | Example 2: 52 | class MyBackbone(LiqFitBackbone): 53 | def __init__(self): 54 | my_backbone = AutoModel.from_pretrained(....) 
55 | super().__init__(my_backbone.config, backbone=backbone) 56 | def encode(self, input_ids, attention_mask=None) -> torch.Tensor: 57 | output = self.backbone(input_ids, attention_mask=attention_mask) 58 | return output 59 | 60 | my_backbone = MyBackbone() 61 | head = LiqFit.modeling.LabelClassificationHead(...) 62 | model = LiqFitModel(my_backbone.config, my_backbone, head) 63 | """ 64 | 65 | super().__init__(config=config) 66 | self._is_sklearn_head = None 67 | self.backbone = backbone 68 | self._determine_and_validate_head_type(head) 69 | self.head = head 70 | self.loss_func = loss_func 71 | self.normalize_backbone_embeddings = normalize_backbone_embeddings 72 | self.labels_name = labels_name 73 | self.push_backbone_only = push_backbone_only 74 | self.expecting_labels = 'labels' in inspect.getfullargspec(self.backbone.forward).args 75 | 76 | def push_to_hub( 77 | self, 78 | repo_id: str, 79 | use_temp_dir: bool | None = None, 80 | commit_message: str | None = None, 81 | private: bool | None = None, 82 | token: bool | str | None = None, 83 | max_shard_size: int | str | None = "5GB", 84 | create_pr: bool = False, 85 | safe_serialization: bool = True, 86 | revision: str = None, 87 | commit_description: str = None, 88 | **deprecated_kwargs, 89 | ) -> str: 90 | if self.push_backbone_only: 91 | if isinstance(self.backbone, (LiqFitBackbone, PreTrainedModel)): 92 | return self.backbone.push_to_hub( 93 | repo_id, 94 | use_temp_dir, 95 | commit_message, 96 | private, 97 | token, 98 | max_shard_size, 99 | create_pr, 100 | safe_serialization, 101 | revision, 102 | commit_description, 103 | **deprecated_kwargs, 104 | ) 105 | else: 106 | output = super().push_to_hub( 107 | repo_id=repo_id, 108 | use_temp_dir=use_temp_dir, 109 | commit_message=commit_message, 110 | private=private, 111 | token=token, 112 | max_shard_size=max_shard_size, 113 | create_pr=create_pr, 114 | safe_serialization=safe_serialization, 115 | revision=revision, 116 | commit_description=commit_description, 117 | **deprecated_kwargs, 118 | ) 119 | return output 120 | 121 | def freeze_weights(self): 122 | self.requires_grad_(False) 123 | 124 | def unfreeze_weights(self): 125 | self.requires_grad_(True) 126 | 127 | def _determine_and_validate_head_type(self, head): 128 | if head is None: 129 | return 130 | 131 | self._is_sklearn_head = isinstance(head, LogisticRegression) 132 | if not self._is_sklearn_head and not isinstance(head, LiqFitHead): 133 | raise TypeError( 134 | "Expected `head` to be of type " 135 | "`LogisticRegression` or `LiqFitHead`. " 136 | f"Received: {type(head)}." 137 | ) 138 | 139 | def _backbone_forward(self, **kwargs): 140 | if isinstance(self.backbone, LiqFitBackbone): 141 | output = self.backbone.encode(**kwargs) 142 | if not isinstance(output, torch.Tensor): 143 | raise ValueError( 144 | "Expected output from backbone model to be of type " 145 | f"`torch.Tensor`. Received: {type(output)}." 
146 | ) 147 | else: 148 | output = self.backbone(**kwargs) 149 | return output 150 | 151 | def _torch_head_forward(self, embeddings, labels=None): 152 | output = self.head(embeddings, labels) 153 | return output 154 | 155 | def _sklearn_head_forward(self, embeddings): 156 | embeddings = convert_to_numpy(embeddings) 157 | output = self.head.predict(embeddings) 158 | return output 159 | 160 | def _head_forward(self, inputs, labels=None): 161 | if self._is_sklearn_head: 162 | return self._sklearn_head_forward(inputs) 163 | else: 164 | return self._torch_head_forward(inputs, labels) 165 | 166 | def forward(self, **kwargs): 167 | labels = kwargs.pop('labels', None) 168 | 169 | output = self._backbone_forward(**kwargs) 170 | 171 | if not isinstance(output, torch.Tensor): 172 | if isinstance(output, tuple): 173 | output = output[0] 174 | elif 'logits' in output: 175 | output = output['logits'] 176 | elif 'last_hidden_state' in output: 177 | output = output['last_hidden_state'] 178 | else: 179 | raise NotImplementedError('A model output should contains logits or last_hidden_state.') 180 | 181 | if self.normalize_backbone_embeddings: 182 | if isinstance(output, torch.Tensor): 183 | output = F.normalize(output, p=2.0, dim=-1) 184 | else: 185 | raise TypeError( 186 | "Normalizing the embedding requires type of " 187 | f"`torch.Tensor`. Received: {type(output)}." 188 | ) 189 | if self.head is not None: 190 | output = self._head_forward(output, labels) 191 | elif self.loss_func is not None and labels is not None: 192 | loss = self.loss_func(output, labels) 193 | output = HeadOutput(logits=output, loss=loss) 194 | return output 195 | -------------------------------------------------------------------------------- /src/liqfit/modeling/pooling.py: -------------------------------------------------------------------------------- 1 | from typing import Optional 2 | 3 | import torch 4 | from torch import nn 5 | 6 | 7 | class GlobalMaxPooling1D(nn.Module): 8 | """Applies Global Max Pooling on the timesteps dimension.""" 9 | 10 | def forward(self, x: torch.Tensor): 11 | return x.amax(dim=1) 12 | 13 | 14 | class FirstTokenPooling1D(nn.Module): 15 | """Takes the first token's embedding.""" 16 | 17 | def forward(self, x: torch.Tensor): 18 | return x[:, 0, :] 19 | 20 | 21 | class LastTokenPooling1D(nn.Module): 22 | """Takes the last token's embedding.""" 23 | 24 | def forward(self, x: torch.Tensor): 25 | return x[:, -1, :] 26 | 27 | 28 | class GlobalAvgPooling1D(nn.Module): 29 | """Applies Global Average Pooling on the timesteps dimension.""" 30 | 31 | def forward( 32 | self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None 33 | ): 34 | if attention_mask is not None: 35 | attention_mask = attention_mask.repeat((1, 1, x.shape[-1])).to( 36 | dtype=x.dtype 37 | ) 38 | x = x * attention_mask 39 | return x.sum(1) / attention_mask.sum(1) 40 | else: 41 | return x.mean(dim=1) 42 | 43 | 44 | class GlobalSumPooling1D(nn.Module): 45 | """Applies Global Sum Pooling on the timesteps dimension.""" 46 | 47 | def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None): 48 | if attention_mask is not None: 49 | x = x * attention_mask 50 | return x.sum(dim=1) 51 | 52 | 53 | class GlobalRMSPooling1D(nn.Module): 54 | """Applies Global RMS Pooling on the timesteps dimension.""" 55 | 56 | def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None): 57 | if attention_mask is not None: 58 | attention_mask = attention_mask.repeat((1, 1, x.shape[-1])).to( 59 | dtype=x.dtype 60 | ) 61 
| x = x * attention_mask 62 | return (x.pow(2).sum(dim=1) / attention_mask.sum(1)).sqrt() 63 | else: 64 | return x.pow(2).mean(dim=1).sqrt() 65 | 66 | 67 | class GlobalAbsMaxPooling1D(nn.Module): 68 | """Applies Global Max Pooling of absolute values on the timesteps dimension.""" 69 | 70 | def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None): 71 | if attention_mask is not None: 72 | attention_mask = attention_mask.repeat((1, 1, x.shape[-1])).to( 73 | dtype=x.dtype 74 | ) 75 | x = x * attention_mask 76 | return x.abs().amax(dim=1) 77 | 78 | 79 | class GlobalAbsAvgPooling1D(nn.Module): 80 | """Applies Global Average Pooling of absolute values on the timesteps dimension.""" 81 | 82 | def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None): 83 | if attention_mask is not None: 84 | attention_mask = attention_mask.repeat((1, 1, x.shape[-1])).to( 85 | dtype=x.dtype 86 | ) 87 | x = (x * attention_mask).abs() 88 | return x.sum(dim=1) / attention_mask.sum(1) 89 | else: 90 | return x.abs().mean(dim=1) 91 | -------------------------------------------------------------------------------- /src/liqfit/models/__init__.py: -------------------------------------------------------------------------------- 1 | from .t5 import T5ForZeroShotClassification, T5ConfigWithLoss 2 | from .deberta import DebertaV2ForZeroShotClassification, DebertaConfigWithLoss -------------------------------------------------------------------------------- /src/liqfit/models/deberta.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2020, The T5 Authors and HuggingFace Inc. and Knowledagtor 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | from transformers import DebertaConfig, DebertaV2ForSequenceClassification 17 | from transformers.modeling_outputs import SequenceClassifierOutput 18 | from transformers.utils import add_end_docstrings, logging 19 | from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss 20 | 21 | from typing import Union, Optional, Tuple 22 | import torch 23 | from torch import nn 24 | 25 | from typing import List, Union 26 | 27 | from ..losses import FocalLoss 28 | 29 | logger = logging.get_logger(__name__) 30 | 31 | SUPPORTED_LOSSES = ("focal_loss", "cross_entropy") 32 | 33 | 34 | class DebertaConfigWithLoss(DebertaConfig): 35 | """Deberta configuration with additional loss parameters. 36 | 37 | Extends Deberta to include parameters for configuring the 38 | loss function during training. 
39 | """ 40 | def __init__( 41 | self, 42 | loss_type = "focal_loss", 43 | focal_loss_alpha=0.5, 44 | focal_loss_gamma=2.0, 45 | **kwargs, 46 | ): 47 | super().__init__(**kwargs) 48 | self.loss_type= loss_type 49 | self.focal_loss_alpha = focal_loss_alpha 50 | self.focal_loss_gamma = focal_loss_gamma 51 | 52 | class DebertaV2ForZeroShotClassification(DebertaV2ForSequenceClassification): 53 | def __init__(self, config: DebertaConfigWithLoss): 54 | super().__init__(config) 55 | 56 | if self.config.loss_type not in SUPPORTED_LOSSES: 57 | raise NotImplementedError(f"{self.config.loss_type} is not implemented loss function type. ") 58 | 59 | def forward( 60 | self, 61 | input_ids: Optional[torch.Tensor] = None, 62 | attention_mask: Optional[torch.Tensor] = None, 63 | token_type_ids: Optional[torch.Tensor] = None, 64 | position_ids: Optional[torch.Tensor] = None, 65 | inputs_embeds: Optional[torch.Tensor] = None, 66 | labels: Optional[torch.Tensor] = None, 67 | output_attentions: Optional[bool] = None, 68 | output_hidden_states: Optional[bool] = None, 69 | return_dict: Optional[bool] = None, 70 | ) -> Union[Tuple, SequenceClassifierOutput]: 71 | r""" 72 | labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): 73 | Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., 74 | config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If 75 | `config.num_labels > 1` a classification loss is computed (Cross-Entropy). 76 | """ 77 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 78 | 79 | outputs = self.deberta( 80 | input_ids, 81 | token_type_ids=token_type_ids, 82 | attention_mask=attention_mask, 83 | position_ids=position_ids, 84 | inputs_embeds=inputs_embeds, 85 | output_attentions=output_attentions, 86 | output_hidden_states=output_hidden_states, 87 | return_dict=return_dict, 88 | ) 89 | 90 | encoder_layer = outputs[0] 91 | pooled_output = self.pooler(encoder_layer) 92 | pooled_output = self.dropout(pooled_output) 93 | logits = self.classifier(pooled_output) 94 | 95 | loss = None 96 | if labels is not None: 97 | if self.config.problem_type is None: 98 | if self.num_labels == 1: 99 | # regression task 100 | loss_fn = nn.MSELoss() 101 | logits = logits.view(-1).to(labels.dtype) 102 | loss = loss_fn(logits, labels.view(-1)) 103 | elif labels.dim() == 1 or labels.size(-1) == 1: 104 | label_index = (labels >= 0).nonzero() 105 | labels = labels.long() 106 | if label_index.size(0) > 0: 107 | labeled_logits = torch.gather( 108 | logits, 0, label_index.expand(label_index.size(0), logits.size(1)) 109 | ) 110 | labels = torch.gather(labels, 0, label_index.view(-1)) 111 | loss_fct = CrossEntropyLoss() 112 | loss = loss_fct(labeled_logits.view(-1, self.num_labels).float(), labels.view(-1)) 113 | else: 114 | loss = torch.tensor(0).to(logits) 115 | else: 116 | log_softmax = nn.LogSoftmax(-1) 117 | loss = -((log_softmax(logits) * labels).sum(-1)).mean() 118 | elif self.config.problem_type == "regression": 119 | loss_fct = MSELoss() 120 | if self.num_labels == 1: 121 | loss = loss_fct(logits.squeeze(), labels.squeeze()) 122 | else: 123 | loss = loss_fct(logits, labels) 124 | elif self.config.problem_type == "single_label_classification": 125 | if self.config.loss_type == "cross_entropy": 126 | loss_fct = CrossEntropyLoss() 127 | elif self.config.loss_type == "focal_loss": 128 | loss_fct = FocalLoss(alpha=self.config.focal_loss_alpha, gamma=self.config.focal_loss_gamma) 129 | loss 
= loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1)) 130 | elif self.config.problem_type == "multi_label_classification": 131 | loss_fct = BCEWithLogitsLoss() 132 | loss = loss_fct(logits, labels) 133 | if not return_dict: 134 | output = (logits,) + outputs[1:] 135 | return ((loss,) + output) if loss is not None else output 136 | 137 | return SequenceClassifierOutput( 138 | loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions 139 | ) -------------------------------------------------------------------------------- /src/liqfit/models/t5.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2020, The T5 Authors and HuggingFace Inc. and Knowledagtor 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | from transformers import T5PreTrainedModel, T5Config, T5Model 17 | from transformers.modeling_outputs import Seq2SeqSequenceClassifierOutput 18 | from transformers.utils import add_end_docstrings, logging 19 | from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss 20 | 21 | from typing import Union, Optional, Tuple 22 | import torch 23 | from torch import nn 24 | 25 | from typing import List, Union 26 | 27 | from ..losses import FocalLoss 28 | 29 | logger = logging.get_logger(__name__) 30 | 31 | SUPPORTED_LOSSES = ("focal_loss", "cross_entropy") 32 | 33 | class T5ConfigWithLoss(T5Config): 34 | """T5 configuration with additional loss parameters. 35 | 36 | Extends T5Config to include parameters for configuring the 37 | loss function during training. 
38 | """ 39 | def __init__( 40 | self, 41 | loss_type = "focal_loss", 42 | focal_loss_alpha=0.5, 43 | focal_loss_gamma=2.0, 44 | **kwargs, 45 | ): 46 | super().__init__(**kwargs) 47 | self.loss_type= loss_type 48 | self.focal_loss_alpha = focal_loss_alpha 49 | self.focal_loss_gamma = focal_loss_gamma 50 | 51 | class T5ClassificationHead(nn.Module): 52 | """Head for sentence-level classification tasks.""" 53 | 54 | def __init__(self, config: T5ConfigWithLoss): 55 | super().__init__() 56 | self.dense = nn.Linear(config.d_model, config.d_model) 57 | self.dropout = nn.Dropout(p=config.classifier_dropout) 58 | self.out_proj = nn.Linear(config.d_model, config.num_labels) 59 | 60 | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: 61 | hidden_states = self.dropout(hidden_states) 62 | hidden_states = self.dense(hidden_states) 63 | hidden_states = torch.tanh(hidden_states) 64 | hidden_states = self.dropout(hidden_states) 65 | hidden_states = self.out_proj(hidden_states) 66 | return hidden_states 67 | 68 | 69 | class T5ForZeroShotClassification(T5PreTrainedModel): 70 | _keys_to_ignore_on_load_unexpected = ["decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight"] 71 | _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] 72 | 73 | def __init__(self, config: T5ConfigWithLoss): 74 | super().__init__(config) 75 | 76 | if self.config.loss_type not in SUPPORTED_LOSSES: 77 | raise NotImplementedError(f"{self.config.loss_type} is not implemented loss function type. ") 78 | 79 | self.transformer = T5Model(config) 80 | self.classification_head = T5ClassificationHead(config) 81 | 82 | # Initialize weights and apply final processing 83 | self.post_init() 84 | 85 | self.model_parallel = False 86 | 87 | def forward( 88 | self, 89 | input_ids: torch.LongTensor = None, 90 | attention_mask: Optional[torch.Tensor] = None, 91 | decoder_input_ids: Optional[torch.LongTensor] = None, 92 | decoder_attention_mask: Optional[torch.LongTensor] = None, 93 | head_mask: Optional[torch.Tensor] = None, 94 | decoder_head_mask: Optional[torch.Tensor] = None, 95 | cross_attn_head_mask: Optional[torch.Tensor] = None, 96 | encoder_outputs: Optional[List[torch.FloatTensor]] = None, 97 | inputs_embeds: Optional[torch.FloatTensor] = None, 98 | decoder_inputs_embeds: Optional[torch.FloatTensor] = None, 99 | labels: Optional[torch.LongTensor] = None, 100 | use_cache: Optional[bool] = None, 101 | output_attentions: Optional[bool] = None, 102 | output_hidden_states: Optional[bool] = None, 103 | return_dict: Optional[bool] = None, 104 | ) -> Union[Tuple, Seq2SeqSequenceClassifierOutput]: 105 | r""" 106 | labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): 107 | Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., 108 | config.num_labels - 1]`. If `config.num_labels > 1` a classification loss is computed (Cross-Entropy). 
109 | Returns: 110 | """ 111 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict 112 | if labels is not None: 113 | use_cache = False 114 | 115 | if input_ids is None and inputs_embeds is not None: 116 | raise NotImplementedError( 117 | f"Passing input embeddings is currently not supported for {self.__class__.__name__}" 118 | ) 119 | 120 | # Copied from models.bart.modeling_bart.BartModel.forward different to other models, T5 automatically creates 121 | # decoder_input_ids from input_ids if no decoder_input_ids are provided 122 | if decoder_input_ids is None and decoder_inputs_embeds is None: 123 | if input_ids is None: 124 | raise ValueError( 125 | "If no `decoder_input_ids` or `decoder_inputs_embeds` are " 126 | "passed, `input_ids` cannot be `None`. Please pass either " 127 | "`input_ids` or `decoder_input_ids` or `decoder_inputs_embeds`." 128 | ) 129 | decoder_input_ids = self._shift_right(input_ids) 130 | 131 | outputs = self.transformer( 132 | input_ids, 133 | attention_mask=attention_mask, 134 | decoder_input_ids=decoder_input_ids, 135 | decoder_attention_mask=decoder_attention_mask, 136 | head_mask=head_mask, 137 | decoder_head_mask=decoder_head_mask, 138 | cross_attn_head_mask=cross_attn_head_mask, 139 | encoder_outputs=encoder_outputs, 140 | inputs_embeds=inputs_embeds, 141 | decoder_inputs_embeds=decoder_inputs_embeds, 142 | use_cache=use_cache, 143 | output_attentions=output_attentions, 144 | output_hidden_states=output_hidden_states, 145 | return_dict=return_dict, 146 | ) 147 | sequence_output = outputs[0] 148 | 149 | eos_mask = decoder_input_ids.eq(self.config.eos_token_id).to(sequence_output.device) 150 | 151 | if len(torch.unique_consecutive(eos_mask.sum(1))) > 1: 152 | raise ValueError("All examples must have the same number of tokens.") 153 | batch_size, _, hidden_size = sequence_output.shape 154 | sentence_representation = sequence_output[eos_mask, :].view(batch_size, -1, hidden_size)[:, -1, :] 155 | 156 | logits = self.classification_head(sentence_representation) 157 | 158 | loss = None 159 | if labels is not None: 160 | labels = labels.to(logits.device) 161 | if self.config.problem_type is None: 162 | if self.config.num_labels == 1: 163 | self.config.problem_type = "regression" 164 | elif self.config.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int): 165 | self.config.problem_type = "single_label_classification" 166 | else: 167 | self.config.problem_type = "multi_label_classification" 168 | 169 | if self.config.problem_type == "regression": 170 | loss_fct = MSELoss() 171 | if self.config.num_labels == 1: 172 | loss = loss_fct(logits.squeeze(), labels.squeeze()) 173 | else: 174 | loss = loss_fct(logits, labels) 175 | elif self.config.problem_type == "single_label_classification": 176 | if self.config.loss_type == "cross_entropy": 177 | loss_fct = CrossEntropyLoss() 178 | elif self.config.loss_type == "focal_loss": 179 | loss_fct = FocalLoss(alpha=self.config.focal_loss_alpha, gamma=self.config.focal_loss_gamma) 180 | loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1)) 181 | elif self.config.problem_type == "multi_label_classification": 182 | loss_fct = BCEWithLogitsLoss() 183 | loss = loss_fct(logits, labels) 184 | if not return_dict: 185 | output = (logits,) + outputs[1:] 186 | return ((loss,) + output) if loss is not None else output 187 | 188 | return Seq2SeqSequenceClassifierOutput( 189 | loss=loss, 190 | logits=logits, 191 | past_key_values=outputs.past_key_values, 192 | 
decoder_hidden_states=outputs.decoder_hidden_states, 193 | decoder_attentions=outputs.decoder_attentions, 194 | cross_attentions=outputs.cross_attentions, 195 | encoder_last_hidden_state=outputs.encoder_last_hidden_state, 196 | encoder_hidden_states=outputs.encoder_hidden_states, 197 | encoder_attentions=outputs.encoder_attentions, 198 | ) -------------------------------------------------------------------------------- /src/liqfit/pipeline/__init__.py: -------------------------------------------------------------------------------- 1 | from .inference import ZeroShotClassificationPipeline 2 | -------------------------------------------------------------------------------- /src/liqfit/pipeline/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 The HuggingFace Team and Knowledgator. All rights reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | from transformers.tokenization_utils import TruncationStrategy 16 | from transformers.utils import add_end_docstrings, logging 17 | from transformers.pipelines.base import PIPELINE_INIT_ARGS, ArgumentHandler, ChunkPipeline 18 | 19 | from typing import Union 20 | import inspect 21 | from typing import List, Union 22 | import numpy as np 23 | 24 | 25 | logger = logging.get_logger(__name__) 26 | 27 | class ZeroShotClassificationArgumentHandler(ArgumentHandler): 28 | """ 29 | Handles arguments for zero-shot for text classification by turning each possible label into an NLI 30 | premise/hypothesis pair. 31 | """ 32 | 33 | def _parse_labels(self, labels): 34 | if isinstance(labels, str): 35 | labels = [label.strip() for label in labels.split(",") if label.strip()] 36 | return labels 37 | 38 | def __call__(self, sequences, labels, hypothesis_template, hypothesis_first): 39 | if len(labels) == 0 or len(sequences) == 0: 40 | raise ValueError("You must include at least one label and at least one sequence.") 41 | if hypothesis_template.format(labels[0]) == hypothesis_template: 42 | raise ValueError( 43 | ( 44 | 'The provided hypothesis_template "{}" was not able to be formatted with the target labels. ' 45 | "Make sure the passed template includes formatting syntax such as {{}} where the label should go." 46 | ).format(hypothesis_template) 47 | ) 48 | 49 | if isinstance(sequences, str): 50 | sequences = [sequences] 51 | 52 | sequence_pairs = [] 53 | if not hypothesis_first: 54 | for sequence in sequences: 55 | sequence_pairs.extend([[sequence, hypothesis_template.format(label)] for label in labels]) 56 | else: 57 | for sequence in sequences: 58 | sequence_pairs.extend([[hypothesis_template.format(label), sequence] for label in labels]) 59 | return sequence_pairs, sequences 60 | 61 | 62 | @add_end_docstrings(PIPELINE_INIT_ARGS) 63 | class ZeroShotClassificationPipeline(ChunkPipeline): 64 | """ 65 | NLI-based zero-shot classification pipeline using a `ModelForSequenceClassification` trained on NLI (natural 66 | language inference) tasks. 
Equivalent of `text-classification` pipelines, but these models don't require a 67 | hardcoded number of potential classes, they can be chosen at runtime. It usually means it's slower but it is 68 | **much** more flexible. 69 | 70 | Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis 71 | pair and passed to the pretrained model. Then, the logit for *entailment* is taken as the logit for the candidate 72 | label being valid. Any NLI model can be used, but the id of the *entailment* label must be included in the model 73 | config's :attr:*~transformers.PretrainedConfig.label2id*. 74 | 75 | Example: 76 | 77 | ```python 78 | >>> from transformers import pipeline 79 | 80 | >>> oracle = pipeline(model="facebook/bart-large-mnli") 81 | >>> oracle( 82 | ... "I have a problem with my iphone that needs to be resolved asap!!", 83 | ... candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"], 84 | ... ) 85 | {'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]} 86 | 87 | >>> oracle( 88 | ... "I have a problem with my iphone that needs to be resolved asap!!", 89 | ... candidate_labels=["english", "german"], 90 | ... ) 91 | {'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['english', 'german'], 'scores': [0.814, 0.186]} 92 | ``` 93 | 94 | Learn more about the basics of using a pipeline in the [pipeline tutorial](../pipeline_tutorial) 95 | 96 | This NLI pipeline can currently be loaded from [`pipeline`] using the following task identifier: 97 | `"zero-shot-classification"`. 98 | 99 | The models that this pipeline can use are models that have been fine-tuned on an NLI task. See the up-to-date list 100 | of available models on [huggingface.co/models](https://huggingface.co/models?search=nli). 101 | """ 102 | 103 | def __init__(self, args_parser=ZeroShotClassificationArgumentHandler(), *args, **kwargs): 104 | self._args_parser = args_parser 105 | super().__init__(*args, **kwargs) 106 | if self.entailment_id == -1: 107 | logger.warning( 108 | "Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to " 109 | "-1. Define a descriptive label2id mapping in the model config to ensure correct outputs." 
110 | ) 111 | 112 | @property 113 | def entailment_id(self): 114 | if len(self.model.config.label2id.items()) == 0: 115 | return 0 116 | for label, ind in self.model.config.label2id.items(): 117 | if label.lower().startswith("entail"): 118 | return ind 119 | return -1 120 | 121 | def _parse_and_tokenize( 122 | self, sequence_pairs, padding=True, add_special_tokens=True, truncation=TruncationStrategy.ONLY_FIRST, 123 | encoder_decoder = False, **kwargs 124 | ): 125 | """ 126 | Parse arguments and tokenize only_first so that hypothesis (label) is not truncated 127 | """ 128 | return_tensors = self.framework 129 | if self.tokenizer.pad_token is None: 130 | # Override for tokenizers not supporting padding 131 | logger.error( 132 | "Tokenizer was not supporting padding necessary for zero-shot, attempting to use " 133 | " `pad_token=eos_token`" 134 | ) 135 | self.tokenizer.pad_token = self.tokenizer.eos_token 136 | try: 137 | if encoder_decoder: 138 | sequence_pairs, decoder_input = sequence_pairs 139 | 140 | inputs = self.tokenizer( 141 | [sequence_pairs], 142 | add_special_tokens=add_special_tokens, 143 | return_tensors=return_tensors, 144 | padding=padding, 145 | truncation=truncation, 146 | ) 147 | if encoder_decoder: 148 | decoder_inputs = self.tokenizer( 149 | [decoder_input], 150 | add_special_tokens=add_special_tokens, 151 | return_tensors=return_tensors, 152 | padding=padding, 153 | truncation=truncation, 154 | ) 155 | inputs['decoder_input_ids'] = decoder_inputs['input_ids'] 156 | inputs['decoder_attention_mask'] = decoder_inputs['attention_mask'] 157 | 158 | except Exception as e: 159 | if "too short" in str(e): 160 | # tokenizers might yell that we want to truncate 161 | # to a value that is not even reached by the input. 162 | # In that case we don't want to truncate. 163 | # It seems there's not a really better way to catch that 164 | # exception. 165 | 166 | inputs = self.tokenizer( 167 | [sequence_pairs], 168 | add_special_tokens=add_special_tokens, 169 | return_tensors=return_tensors, 170 | padding=padding, 171 | truncation=TruncationStrategy.DO_NOT_TRUNCATE, 172 | ) 173 | if encoder_decoder: 174 | decoder_inputs = self.tokenizer( 175 | [decoder_input], 176 | add_special_tokens=add_special_tokens, 177 | return_tensors=return_tensors, 178 | padding=padding, 179 | truncation=TruncationStrategy.DO_NOT_TRUNCATE, 180 | ) 181 | inputs['decoder_input_ids'] = decoder_inputs['input_ids'] 182 | inputs['decoder_attention_mask'] = decoder_inputs['attention_mask'] 183 | else: 184 | raise e 185 | 186 | return inputs 187 | 188 | def _sanitize_parameters(self, **kwargs): 189 | if kwargs.get("multi_class", None) is not None: 190 | kwargs["multi_label"] = kwargs["multi_class"] 191 | logger.warning( 192 | "The `multi_class` argument has been deprecated and renamed to `multi_label`. " 193 | "`multi_class` will be removed in a future version of Transformers." 
194 | ) 195 | preprocess_params = {} 196 | if "candidate_labels" in kwargs: 197 | preprocess_params["candidate_labels"] = self._args_parser._parse_labels(kwargs["candidate_labels"]) 198 | if "hypothesis_template" in kwargs: 199 | preprocess_params["hypothesis_template"] = kwargs["hypothesis_template"] 200 | if "hypothesis_first" in kwargs: 201 | preprocess_params["hypothesis_first"] = kwargs["hypothesis_first"] 202 | if "encoder_decoder" in kwargs: 203 | preprocess_params["encoder_decoder"] = kwargs["encoder_decoder"] 204 | 205 | postprocess_params = {} 206 | if "multi_label" in kwargs: 207 | postprocess_params["multi_label"] = kwargs["multi_label"] 208 | return preprocess_params, {}, postprocess_params 209 | 210 | def __call__( 211 | self, 212 | sequences: Union[str, List[str]], 213 | *args, 214 | **kwargs, 215 | ): 216 | """ 217 | Classify the sequence(s) given as inputs. See the [`ZeroShotClassificationPipeline`] documentation for more 218 | information. 219 | 220 | Args: 221 | sequences (`str` or `List[str]`): 222 | The sequence(s) to classify, will be truncated if the model input is too large. 223 | candidate_labels (`str` or `List[str]`): 224 | The set of possible class labels to classify each sequence into. Can be a single label, a string of 225 | comma-separated labels, or a list of labels. 226 | hypothesis_template (`str`, *optional*, defaults to `"This example is {}."`): 227 | The template used to turn each label into an NLI-style hypothesis. This template must include a {} or 228 | similar syntax for the candidate label to be inserted into the template. For example, the default 229 | template is `"This example is {}."` With the candidate label `"sports"`, this would be fed into the 230 | model like `" sequence to classify This example is sports . "`. The default template 231 | works well in many cases, but it may be worthwhile to experiment with different templates depending on 232 | the task setting. 233 | multi_label (`bool`, *optional*, defaults to `False`): 234 | Whether or not multiple candidate labels can be true. If `False`, the scores are normalized such that 235 | the sum of the label likelihoods for each sequence is 1. If `True`, the labels are considered 236 | independent and probabilities are normalized for each candidate by doing a softmax of the entailment 237 | score vs. the contradiction score. 238 | 239 | Return: 240 | A `dict` or a list of `dict`: Each result comes as a dictionary with the following keys: 241 | 242 | - **sequence** (`str`) -- The sequence for which this is the output. 243 | - **labels** (`List[str]`) -- The labels sorted by order of likelihood. 244 | - **scores** (`List[float]`) -- The probabilities for each of the labels. 
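
        Example (a minimal sketch; the checkpoint and labels below mirror this repository's tests):

        ```python
        >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
        >>> from liqfit.pipeline import ZeroShotClassificationPipeline

        >>> tokenizer = AutoTokenizer.from_pretrained("knowledgator/comprehend_it-base")
        >>> model = AutoModelForSequenceClassification.from_pretrained("knowledgator/comprehend_it-base")
        >>> classifier = ZeroShotClassificationPipeline(model=model, tokenizer=tokenizer)
        >>> classifier(
        ...     "one day I will see the world",
        ...     candidate_labels=["travel", "cooking", "dancing"],
        ...     multi_label=True,
        ... )
        ```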
245 | """ 246 | if len(args) == 0: 247 | pass 248 | elif len(args) == 1 and "candidate_labels" not in kwargs: 249 | kwargs["candidate_labels"] = args[0] 250 | else: 251 | raise ValueError(f"Unable to understand extra arguments {args}") 252 | 253 | return super().__call__(sequences, **kwargs) 254 | 255 | def preprocess(self, inputs, candidate_labels=None, hypothesis_template="This example is {}.", hypothesis_first = False, encoder_decoder = False): 256 | sequence_pairs, sequences = self._args_parser(inputs, candidate_labels, hypothesis_template, hypothesis_first) 257 | 258 | for i, (candidate_label, sequence_pair) in enumerate(zip(candidate_labels, sequence_pairs)): 259 | model_input = self._parse_and_tokenize(sequence_pair, encoder_decoder = encoder_decoder) 260 | 261 | yield { 262 | "candidate_label": candidate_label, 263 | "sequence": sequences[0], 264 | "is_last": i == len(candidate_labels) - 1, 265 | **model_input, 266 | } 267 | 268 | def _forward(self, inputs): 269 | candidate_label = inputs["candidate_label"] 270 | sequence = inputs["sequence"] 271 | input_names = self.tokenizer.model_input_names 272 | input_names.extend(['decoder_input_ids', 'decoder_attention_mask']) 273 | model_inputs = {k: inputs[k] for k in input_names if k in inputs} 274 | # `XXXForSequenceClassification` models should not use `use_cache=True` even if it's supported 275 | model_forward = self.model.forward if self.framework == "pt" else self.model.call 276 | if "use_cache" in inspect.signature(model_forward).parameters.keys(): 277 | model_inputs["use_cache"] = False 278 | outputs = self.model(**model_inputs) 279 | 280 | model_outputs = { 281 | "candidate_label": candidate_label, 282 | "sequence": sequence, 283 | "is_last": inputs["is_last"], 284 | **outputs, 285 | } 286 | return model_outputs 287 | 288 | def postprocess(self, model_outputs, multi_label=False): 289 | candidate_labels = [outputs["candidate_label"] for outputs in model_outputs] 290 | sequences = [outputs["sequence"] for outputs in model_outputs] 291 | logits = np.concatenate([output["logits"].numpy() for output in model_outputs]) 292 | N = logits.shape[0] 293 | n = len(candidate_labels) 294 | num_sequences = N // n 295 | reshaped_outputs = logits.reshape((num_sequences, n, -1)) 296 | 297 | if multi_label and len(self.model.config.label2id)==0: 298 | scores = 1 / (1 + np.exp(-entail_contr_logits)) 299 | 300 | elif multi_label or len(candidate_labels) == 1: 301 | # softmax over the entailment vs. 
302 |             entailment_id = self.entailment_id
303 |             contradiction_id = -1 if entailment_id == 0 else 0
304 |             entail_contr_logits = reshaped_outputs[..., [contradiction_id, entailment_id]]
305 |             scores = np.exp(entail_contr_logits) / np.exp(entail_contr_logits).sum(-1, keepdims=True)
306 |             scores = scores[..., 1]
307 | 
308 |         else:
309 |             # softmax the "entailment" logits over all candidate labels
310 |             entail_logits = reshaped_outputs[..., self.entailment_id]
311 |             scores = np.exp(entail_logits) / np.exp(entail_logits).sum(-1, keepdims=True)
312 | 
313 |         top_inds = list(reversed(scores[0].argsort()))
314 |         return {
315 |             "sequence": sequences[0],
316 |             "labels": [candidate_labels[i] for i in top_inds],
317 |             "scores": scores[0, top_inds].tolist(),
318 |         }
--------------------------------------------------------------------------------
/src/liqfit/utils/__init__.py:
--------------------------------------------------------------------------------
1 | from .standardization import convert_to_numpy
2 | from .standardization import convert_to_torch
3 | from .transforms import tokenize_and_align_label
4 | from .transforms import transform
5 | from .metrics import Accuracy
6 | 
--------------------------------------------------------------------------------
/src/liqfit/utils/metrics.py:
--------------------------------------------------------------------------------
1 | import evaluate
2 | import numpy as np
3 | from transformers import EvalPrediction
4 | 
5 | 
6 | class Accuracy:
7 |     def __init__(self):
8 |         """Simple wrapper class around `evaluate.load("accuracy")`.
9 |         """
10 |         self.accuracy = evaluate.load("accuracy")
11 | 
12 |     def __call__(self, eval_pred: EvalPrediction):
13 |         predictions, labels = eval_pred
14 |         predictions = np.argmax(predictions, axis=1)
15 |         return self.accuracy.compute(
16 |             predictions=predictions, references=labels
17 |         )
18 | 
--------------------------------------------------------------------------------
/src/liqfit/utils/standardization.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 | from typing import List, Tuple
3 | import torch
4 | import numpy as np
5 | 
6 | 
7 | def convert_to_numpy(x: torch.Tensor | Tuple | List | np.ndarray) -> np.ndarray:
8 |     """Converts torch.Tensor, Tuple, List or NumPy array to Numpy Array.
9 | 
10 |     Args:
11 |         x (torch.Tensor | Tuple | List | np.ndarray): Input to convert to
12 |             NumPy array.
13 | 
14 |     Returns:
15 |         np.ndarray: Converted NumPy array.
16 |     """
17 |     if isinstance(x, torch.Tensor):
18 |         return x.detach().cpu().numpy()
19 |     else:
20 |         return np.array(x)
21 | 
22 | 
23 | def convert_to_torch(x: torch.Tensor | Tuple | List | np.ndarray) -> torch.Tensor:
24 |     """Converts input to torch.Tensor
25 | 
26 |     Args:
27 |         x (torch.Tensor | Tuple | List | np.ndarray): Input to convert to a torch.Tensor.
28 | 
29 |     Raises:
30 |         ValueError: If the input is not a type of `torch.Tensor`,
31 |             `Tuple`, `List`, `np.ndarray`
32 | 
33 |     Returns:
34 |         torch.Tensor: Converted torch.Tensor.
35 |     """
36 |     if isinstance(x, (list, tuple)):
37 |         return torch.tensor(x)
38 |     elif isinstance(x, np.ndarray):
39 |         return torch.from_numpy(x)
40 |     elif isinstance(x, torch.Tensor):
41 |         return x
42 |     else:
43 |         raise ValueError(
44 |             "Expected `torch.Tensor`, `List`, `Tuple` or `np.ndarray`. "
45 |             f"Received: {type(x)}."
46 |         )
47 | 
--------------------------------------------------------------------------------
/src/liqfit/utils/transforms.py:
--------------------------------------------------------------------------------
1 | from typing import Callable, Dict
2 | from datasets import Dataset
3 | from ..datasets import transform_dataset
4 | 
5 | 
6 | def tokenize_and_align_label(
7 |     example: Dict,
8 |     tokenizer: Callable,
9 |     sources_column_name: str = "sources",
10 |     targets_column_name: str = "targets",
11 | ):
12 |     """Tokenizes source and target sequences and concatenates them for the NLI training task.
13 | 
14 |     Args:
15 |         example (Dict): Dictionary that contains the sources and target sequences.
16 |         tokenizer (Callable): Tokenizer function. If you are using a Huggingface
17 |             tokenizer, you can wrap it with your configuration using
18 |             `functools.partial`. Example:
19 |             tokenizer_wrapped_function = \
20 |                 functools.partial(tokenizer.batch_encode_plus, padding=True,
21 |                 truncation=True, max_length=512) then pass
22 |             `tokenizer_wrapped_function` to this function.
23 |         sources_column_name (str, optional): Sources key name in the
24 |             dictionary. Defaults to "sources".
25 |         targets_column_name (str, optional): Targets key name in the
26 |             dictionary. Defaults to "targets".
27 | 
28 |     Returns:
29 |         The tokenizer output (e.g. a `BatchEncoding`) for the tokenized source/target pair.
30 |     """
31 |     hypothesis = example[targets_column_name]
32 |     seq = example[sources_column_name]
33 |     tokenized_input = tokenizer([seq, hypothesis])
34 |     return tokenized_input
35 | 
36 | 
37 | def transform(
38 |     dataset: Dataset,
39 |     classes: list,
40 |     template: str,
41 |     normalize_negatives: bool,
42 |     positives: int,
43 |     negatives: int,
44 | ):
45 |     """Transforms the dataset for the NLI training task.
46 | 
47 |     Args:
48 |         dataset (Dataset): Huggingface Dataset instance.
49 |         classes (List[str]): List of possible class labels.
50 |         template (str): Template string for generating examples.
51 |         normalize_negatives (bool): Whether to normalize the amount of
52 |             negative examples per each positive example of a class.
53 |         positives (int): Number of positive examples to generate per source.
54 |         negatives (int): Number of negative examples to generate per source.
55 | 
56 |     Raises:
57 |         ValueError: If there is no "{}" in the template. It should exist in
58 |             order to format the template with the labels.
59 | 
60 |     Returns:
61 |         Dataset: Transformed dataset.
62 |     """
63 |     if "{}" not in template:
64 |         raise ValueError(
65 |             "Cannot apply `.format()` function on the template. "
66 |             'Expected template to have "{}". '
67 |             f"Received: {template}."
68 | ) 69 | 70 | transformed_dataset = transform_dataset( 71 | dataset, classes, template, normalize_negatives, positives, negatives 72 | ) 73 | tokenized_dataset = transformed_dataset.map(tokenize_and_align_label) 74 | return tokenized_dataset 75 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Knowledgator/LiqFit/51ba2714813ae1cf110f7e600cd7f2663cdec39c/tests/__init__.py -------------------------------------------------------------------------------- /tests/test_losses.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | import torch 4 | from kornia.losses import focal_loss 5 | from liqfit.losses import focal_loss_with_mask 6 | 7 | 8 | class TestCorrectness(unittest.TestCase): 9 | def test_focal_loss_with_ignore_index(self): 10 | x = torch.tensor( 11 | [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]], 12 | dtype=torch.float32, 13 | ) 14 | y = torch.tensor([[1, 2, 3]], dtype=torch.int64) 15 | y[:, -1] = -100 16 | loss = round( 17 | focal_loss_with_mask( 18 | x.reshape(-1, x.shape[-1]), y.reshape(-1) 19 | ).item(), 20 | 4, 21 | ) 22 | output = 0.1795 23 | self.assertEqual(loss, output) 24 | 25 | def test_modified_loss_with_kornia_impl(self): 26 | x = torch.tensor( 27 | [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]], 28 | dtype=torch.float32, 29 | ) 30 | y = torch.tensor([[1, 2, 3]], dtype=torch.int64) 31 | modified_loss = round( 32 | focal_loss_with_mask( 33 | x.reshape(-1, x.shape[-1]), y.reshape(-1), alpha=0.5 34 | ).item(), 35 | 4, 36 | ) 37 | kornia_loss = round( 38 | focal_loss( 39 | x.reshape(-1, x.shape[-1]), 40 | y.reshape(-1), 41 | alpha=0.5, 42 | reduction="mean", 43 | ).item(), 44 | 4, 45 | ) 46 | self.assertEqual(modified_loss, kornia_loss) 47 | -------------------------------------------------------------------------------- /tests/test_models.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification 3 | from liqfit.models import T5ForZeroShotClassification, T5ConfigWithLoss, DebertaV2ForZeroShotClassification, DebertaConfigWithLoss 4 | from liqfit.modeling import LiqFitModel, ClassificationHead 5 | from liqfit.modeling.pooling import FirstTokenPooling1D 6 | from liqfit.losses import CrossEntropyLoss 7 | 8 | def test_t5(): 9 | device = "cuda" if torch.cuda.is_available() else "cpu" 10 | 11 | text = "one day I will see the world" 12 | label = "travel" 13 | 14 | tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-small') 15 | 16 | input_ids = tokenizer(text, return_tensors='pt')['input_ids'] 17 | decoder_input_ids = tokenizer(label, return_tensors='pt')['input_ids'] 18 | 19 | config = T5ConfigWithLoss() 20 | model = T5ForZeroShotClassification(config).to(device) 21 | outputs = model(input_ids = input_ids, decoder_input_ids = decoder_input_ids) 22 | 23 | def test_deberta(): 24 | device = "cuda" if torch.cuda.is_available() else "cpu" 25 | 26 | text = "one day I will see the world. This example is travel." 
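    # Note: the string above already contains both the premise and the NLI hypothesis
    # ("This example is travel.") concatenated into a single input for the cross-encoder.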
27 | 28 | tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small') 29 | 30 | input_ids = tokenizer(text, return_tensors='pt')['input_ids'] 31 | 32 | config = DebertaConfigWithLoss() 33 | model = DebertaV2ForZeroShotClassification(config).to(device) 34 | outputs = model(input_ids = input_ids) 35 | 36 | def test_liqfit_model_with_automodel_for_sequence_classification(): 37 | device = "cuda" if torch.cuda.is_available() else "cpu" 38 | 39 | text = "one day I will see the world. This example is travel." 40 | 41 | tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small') 42 | 43 | input_ids = tokenizer(text, return_tensors='pt')['input_ids'] 44 | labels = torch.tensor([1]) 45 | 46 | backbone_model = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-xsmall') 47 | 48 | loss_func = CrossEntropyLoss(multi_target=True) 49 | 50 | model = LiqFitModel(backbone_model.config, backbone_model, loss_func=loss_func) 51 | outputs = model(input_ids = input_ids, labels=labels) 52 | 53 | def test_liqfit_model_with_head(): 54 | device = "cuda" if torch.cuda.is_available() else "cpu" 55 | 56 | text = "one day I will see the world. This example is travel." 57 | 58 | tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small') 59 | 60 | input_ids = tokenizer(text, return_tensors='pt')['input_ids'] 61 | labels = torch.tensor([1]) 62 | 63 | backbone_model = AutoModel.from_pretrained('microsoft/deberta-v3-xsmall') 64 | 65 | pooler = FirstTokenPooling1D() 66 | loss_func = CrossEntropyLoss(multi_target=True) 67 | head = ClassificationHead(backbone_model.config.hidden_size, 3, pooler, loss_func) 68 | 69 | model = LiqFitModel(backbone_model.config, backbone_model, head) 70 | outputs = model(input_ids = input_ids, labels=labels) 71 | -------------------------------------------------------------------------------- /tests/test_pipeline.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoTokenizer, AutoModelForSequenceClassification 2 | 3 | from liqfit.pipeline import ZeroShotClassificationPipeline 4 | 5 | 6 | class TestStandartModelPipeline: 7 | sequence_to_classify = "one day I will see the world" 8 | candidate_labels = ['travel', 'cooking', 'dancing'] 9 | template = 'This example is {}.' 10 | model_path = 'knowledgator/comprehend_it-base' 11 | tokenizer = AutoTokenizer.from_pretrained(model_path) 12 | model = AutoModelForSequenceClassification.from_pretrained(model_path) 13 | 14 | def test_standard_pipeline(self): 15 | classifier = ZeroShotClassificationPipeline(model=self.model, 16 | tokenizer=self.tokenizer, 17 | hypothesis_template = self.template, 18 | hypothesis_first = False) 19 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True) 20 | 21 | 22 | def test_hypothesis_first_pipeline(self): 23 | classifier = ZeroShotClassificationPipeline(model=self.model, 24 | tokenizer=self.tokenizer, 25 | hypothesis_template = self.template, 26 | hypothesis_first = True) 27 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True) 28 | 29 | 30 | 31 | class TestBinaryModelPipeline: 32 | sequence_to_classify = "one day I will see the world" 33 | candidate_labels = ['travel', 'cooking', 'dancing'] 34 | template = 'This example is {}.' 
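    # 'BAAI/bge-reranker-base' (below) is a binary relevance cross-encoder that scores a
    # premise/hypothesis pair with a single logit, rather than a three-way NLI model.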
35 | model_path = 'BAAI/bge-reranker-base' 36 | tokenizer = AutoTokenizer.from_pretrained(model_path) 37 | model = AutoModelForSequenceClassification.from_pretrained(model_path) 38 | 39 | def test_standard_pipeline(self): 40 | classifier = ZeroShotClassificationPipeline(model=self.model, 41 | tokenizer=self.tokenizer, 42 | hypothesis_template = self.template, 43 | hypothesis_first = False) 44 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True) 45 | 46 | 47 | def test_hypothesis_first_pipeline(self): 48 | classifier = ZeroShotClassificationPipeline(model=self.model, 49 | tokenizer=self.tokenizer, 50 | hypothesis_template = self.template, 51 | hypothesis_first = True) 52 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True) 53 | 54 | class TestEncoderDecoderModelPipeline: 55 | sequence_to_classify = "one day I will see the world" 56 | candidate_labels = ['travel', 'cooking', 'dancing'] 57 | template = 'This example is {}.' 58 | model_path = 'knowledgator/mt5-comprehend-it-base' 59 | tokenizer = AutoTokenizer.from_pretrained(model_path) 60 | model = AutoModelForSequenceClassification.from_pretrained(model_path) 61 | 62 | def test_standard_pipeline(self): 63 | classifier = ZeroShotClassificationPipeline(model=self.model, 64 | tokenizer=self.tokenizer, 65 | hypothesis_template = self.template, 66 | hypothesis_first = False) 67 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True) 68 | 69 | 70 | def test_hypothesis_first_pipeline(self): 71 | classifier = ZeroShotClassificationPipeline(model=self.model, 72 | tokenizer=self.tokenizer, 73 | hypothesis_template = self.template, 74 | hypothesis_first = True) 75 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True) 76 | 77 | 78 | def test_encoder_decoder_pipeline(self): 79 | classifier = ZeroShotClassificationPipeline(model=self.model, 80 | tokenizer=self.tokenizer, 81 | hypothesis_template = self.template, 82 | hypothesis_first = True) 83 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True) --------------------------------------------------------------------------------