├── .gitignore
├── README.md
├── notebooks
│   └── train_emotions_classifier.ipynb
├── pyproject.toml
├── setup.cfg
├── src
│   └── liqfit
│       ├── __init__.py
│       ├── collators
│       │   ├── __init__.py
│       │   ├── base_collator.py
│       │   └── nli_collator.py
│       ├── datasets
│       │   ├── __init__.py
│       │   ├── nli_dataset.py
│       │   └── transform.py
│       ├── losses
│       │   ├── __init__.py
│       │   └── losses.py
│       ├── modeling
│       │   ├── __init__.py
│       │   ├── backbone.py
│       │   ├── heads.py
│       │   ├── model.py
│       │   └── pooling.py
│       ├── models
│       │   ├── __init__.py
│       │   ├── deberta.py
│       │   └── t5.py
│       ├── pipeline
│       │   ├── __init__.py
│       │   └── inference.py
│       └── utils
│           ├── __init__.py
│           ├── metrics.py
│           ├── standardization.py
│           └── transforms.py
└── tests
    ├── __init__.py
    ├── test_losses.py
    ├── test_models.py
    └── test_pipeline.py
/.gitignore:
--------------------------------------------------------------------------------
1 | demo.ipynb
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | 🤗 Models | 📕 Documentation | 📖 Blog
3 |
4 | . . .
5 |
6 |
7 | # LiqFit - Flexible Few-shot Learning Library.
8 |
9 | LiqFit is an easy-to-use framework for few-shot learning of cross-encoder models. Such models are trained to decide whether two statements entail each other, contradict each other, or are neutral. This task setting is universal for many information-extraction tasks, ranging from text classification to named entity recognition and question answering. With LiqFit, you can achieve competitive results with just 8 examples per label.
10 |
11 |
12 | Key features and benefits of LiqFit are:
13 | * 🔢 **A small number of examples is required** - LiqFit can significantly improve the accuracy of a default zero-shot classifier with just 8 examples per label;
14 | * 📝 **Can solve many different information-extraction tasks** - Natural language inference is a universal task that can serve as a setting for many other information-extraction tasks, such as named entity recognition or question answering;
15 | * 🌈 **Can work with classes not present in the training set** - It is not mandatory to have all needed classes in the training set. Thanks to pre-finetuning on large amounts of NLI and classification tasks, the model retains generalisability to other classes;
16 | * ⚙️ **Support for a variety of cross-encoder implementations** - LiqFit supports different types of cross-encoders, including conventional, binary, and encoder-decoder architectures;
17 | * ⚖️ **Robust to unbalanced datasets** - LiqFit uses normalisation techniques that allow it to work well even with unbalanced data;
18 | * 🏷️ **Multi-label classification support** - The approach can be applied to both multi-class and multi-label classification;
19 |
20 | Limitations:
21 | * 🤔 The transformer forward pass must be run N times, where N is the number of labels;
22 |
23 |
24 | ## Installation
25 |
26 | Download and install `LiqFit` by running:
27 |
28 | ```bash
29 | pip install liqfit
30 | ```
31 |
32 | For the most up-to-date version, you can build from source code by executing:
33 |
34 | ```bash
35 | pip install git+https://github.com/knowledgator/LiqFit.git
36 | ```
37 |
38 | ## How to use:
39 | Check out more complete examples in the `notebooks` directory.
40 |
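The training example below expects NLI-formatted train and test datasets (`nli_train_dataset`, `nli_test_dataset`). Here is a minimal sketch of preparing them with `NLIDataset.load_dataset`, assuming the emotions dataset from the example notebook (the dataset id and class names below are assumptions; adjust them to your task):

```python
from datasets import load_dataset
from liqfit.datasets import NLIDataset

emotion_dataset = load_dataset('dair-ai/emotion')
classes = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

nli_train_dataset = NLIDataset.load_dataset(dataset=emotion_dataset['train'], classes=classes)
nli_test_dataset = NLIDataset.load_dataset(dataset=emotion_dataset['test'], classes=classes)
```
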
41 | ```python
42 | from liqfit.modeling import LiqFitModel
43 | from liqfit.losses import FocalLoss
44 | from liqfit.collators import NLICollator
45 | from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
46 | 
47 | backbone_model = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-xsmall')
48 | tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-xsmall')
49 | loss_func = FocalLoss(multi_target=True)
50 |
51 | model = LiqFitModel(backbone_model.config, backbone_model, loss_func=loss_func)
52 |
53 | data_collator = NLICollator(tokenizer, max_length=128, padding=True, truncation=True)
54 |
55 |
56 | training_args = TrainingArguments(
57 | output_dir='comprehendo',
58 | learning_rate=3e-5,
59 | per_device_train_batch_size=3,
60 | per_device_eval_batch_size=3,
61 | num_train_epochs=9,
62 | weight_decay=0.01,
63 | evaluation_strategy="epoch",
64 | save_steps = 5000,
65 | save_total_limit=3,
66 | remove_unused_columns=False,
67 | )
68 |
69 | trainer = Trainer(
70 | model=model,
71 | args=training_args,
72 | train_dataset=nli_train_dataset,
73 | eval_dataset=nli_test_dataset,
74 | tokenizer=tokenizer,
75 | data_collator=data_collator,
76 | )
77 |
78 | trainer.train()
79 | ```
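After training finishes, the fine-tuned model can be saved or pushed to the Hugging Face Hub like any other `transformers` model; a short sketch, where the output path and repository id are placeholders:

```python
trainer.save_model('comprehendo')
# Optionally push to the Hub (requires authentication):
# model.push_to_hub('your-username/comprehendo')
```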
80 | Please check more examples in the `notebooks` section.
81 |
82 | ...
83 |
84 | To run inference, we recommend using the `ZeroShotClassificationPipeline`:
85 |
86 | ```python
87 | from liqfit import ZeroShotClassificationPipeline
88 | from sklearn.metrics import classification_report
89 | from tqdm import tqdm
90 | 
91 | classifier = ZeroShotClassificationPipeline(model=model, tokenizer=tokenizer)
92 | 
93 | label2idx = {label: idx for idx, label in enumerate(classes)}
94 | 
95 | preds = []
96 | 
97 | for example in tqdm(test_dataset):
98 |     if not example['text']:
99 |         # Fall back to the first class for empty texts.
100 |         preds.append(0)
101 |         continue
102 |     pred = classifier(example['text'], classes)['labels'][0]
103 |     idx = label2idx[pred]
104 |     preds.append(idx)
105 |
106 | print(classification_report(test_dataset['label'][:len(preds)], preds, target_names=classes, digits=4))
107 | ```
108 |
109 | ## Benchmarks:
110 | | Model & examples per label | Emotion | AgNews | SST5 |
111 | |-|-|-|-|
112 | | Comprehend-it/0 | 56.60 | 79.82 | 37.9 |
113 | | Comprehend-it/8 | 63.38 | 85.9 | 46.67 |
114 | | Comprehend-it/64 | 80.7 | 88 | 47 |
115 | | SetFit/0 | 57.54 | 56.36 | 24.11 |
116 | | SetFit/8 | 56.81 | 64.93 | 33.61 |
117 | | SetFit/64 | 79.03 | 88 | 45.38 |
118 |
119 | LiqFit used the [knowledgator/comprehend_it-base](https://huggingface.co/knowledgator/comprehend_it-base) model, while for [SetFit](https://github.com/huggingface/setfit) we utilized [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5).
120 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["hatchling<=1.18.0"]
3 | build-backend = "hatchling.build"
4 |
5 | [project]
6 | name = "liqfit"
7 | version = "1.0.0"
8 |
9 | requires-python = ">=3.7"
10 |
11 | description = "Flexible Few-shot learning tool."
12 | license = "MIT"
13 | readme = "README.md"
14 |
15 | classifiers = [
16 | "Programming Language :: Python :: 3",
17 | "License :: OSI Approved :: MIT License",
18 | "Operating System :: OS Independent",
19 | ]
20 |
21 | dependencies = [
22 | "kornia",
23 | "transformers",
24 | "accelerate",
25 | ]
26 |
27 |
28 | [tool.hatch.build.targets.wheel]
29 | packages = ["src/liqfit"]
30 | 
31 |
32 |
33 | [tool.black]
34 | line-length = 80
35 | target-version = ['py37']
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [flake8]
2 | per-file-ignores = __init__.py:F401
3 |
--------------------------------------------------------------------------------
/src/liqfit/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Knowledgator/LiqFit/51ba2714813ae1cf110f7e600cd7f2663cdec39c/src/liqfit/__init__.py
--------------------------------------------------------------------------------
/src/liqfit/collators/__init__.py:
--------------------------------------------------------------------------------
1 | from .base_collator import Collator
2 | from .nli_collator import NLICollator
3 |
--------------------------------------------------------------------------------
/src/liqfit/collators/base_collator.py:
--------------------------------------------------------------------------------
1 | from collections import defaultdict
2 | import abc
3 | from typing import Union
4 |
5 |
6 | class Collator(abc.ABC):
7 | def __init__(
8 | self,
9 | tokenizer,
10 | max_length: int,
11 | padding: Union[bool, str],
12 | truncation: bool,
13 | ):
14 | self.tokenizer = tokenizer
15 | self.max_length = max_length
16 | self.padding = padding
17 | self.truncation = truncation
18 |
19 | @abc.abstractmethod
20 | def collate(self, batch):
21 | raise NotImplementedError("Should be implemented in a subclass.")
22 |
23 | def __call__(self, batch):
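        # Re-group the list of per-example dicts into a dict of lists
        # (one list per key) before handing it to `collate`.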
24 | grouped_batch = defaultdict(list)
25 | for example in batch:
26 | for k, v in example.items():
27 | grouped_batch[k].append(v)
28 | output = self.collate(grouped_batch)
29 | return output
30 |
--------------------------------------------------------------------------------
/src/liqfit/collators/nli_collator.py:
--------------------------------------------------------------------------------
1 | from typing import Callable, Union
2 | 
3 | import torch
4 | 
5 | from .base_collator import Collator
6 |
7 |
8 | class NLICollator(Collator):
9 | def __init__(
10 | self,
11 | tokenizer: Callable,
12 | max_length: int,
13 | padding: Union[bool, str],
14 | truncation: bool,
15 | ):
16 | super().__init__(
17 | tokenizer,
18 | max_length=max_length,
19 | padding=padding,
20 | truncation=truncation,
21 | )
22 |
23 | def _tokenize_and_align_labels(self, batch):
24 | texts = batch.get("texts", None)
25 | if texts is None:
26 | raise ValueError(
27 | "Expected to find a key with name 'texts' that "
28 | "contains a list of tuples where each tuple "
29 | "contains the hypothesis and the premise. "
30 | f"Received: {batch.keys()}"
31 | )
32 | tokenized_input = self.tokenizer(
33 | texts,
34 | max_length=self.max_length,
35 | padding=self.padding,
36 | truncation=self.truncation,
37 | return_tensors="pt",
38 | )
39 | labels = torch.tensor(batch["labels"])
40 | tokenized_input.update({"labels": labels})
41 | return tokenized_input
42 |
43 | def collate(self, batch):
44 | tokenized_input = self._tokenize_and_align_labels(batch)
45 | return tokenized_input
46 |
--------------------------------------------------------------------------------
/src/liqfit/datasets/__init__.py:
--------------------------------------------------------------------------------
1 | from .nli_dataset import NLIDataset
2 | from .transform import transform_dataset
3 |
--------------------------------------------------------------------------------
/src/liqfit/datasets/nli_dataset.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 |
3 | from typing import Optional, List
4 | from datasets import Dataset, load_dataset
5 |
6 | from .transform import transform_dataset
7 |
8 |
9 | class NLIDataset:
10 | def __init__(self, hypothesis: List, premises: List, labels: List):
11 | """LiqFitDataset used for NLI training.
12 |
13 | Args:
14 | hypothesis (List): List of hypothesis texts.
15 | premises (List): List of premises texts.
16 | labels (List): List of labels for each example.
17 | """
18 | self.hypothesis = hypothesis
19 | self.premises = premises
20 | self.labels = labels
21 |
22 | def __len__(self):
23 | equal_lengths = (
24 | len(self.hypothesis) == len(self.premises) == len(self.labels)
25 | )
26 | if not equal_lengths:
27 | raise ValueError(
28 | "Expected equal lengths between `self.hypothesis`"
29 | ", `self.premises` and `self.labels`. "
30 | f"Received: {len(self.hypothesis)} "
31 | f"- {len(self.premises)} - {len(self.labels)}."
32 | )
33 | return len(self.hypothesis)
34 |
35 | def __getitem__(self, idx):
36 | return {
37 | "texts": (self.hypothesis[idx], self.premises[idx]),
38 | "labels": self.labels[idx],
39 | }
40 |
41 | @classmethod
42 | def load_dataset(
43 | cls,
44 | dataset: Optional[Dataset] = None,
45 | dataset_name: Optional[str] = None,
46 | classes: Optional[List[str]] = None,
47 | text_column: Optional[str] = "text",
48 | label_column: Optional[str] = "label",
49 | template: Optional[str] = "This example is {}.",
50 | normalize_negatives: bool = False,
51 | positives: int = 1,
52 | negatives: int = -1,
53 | multi_label: bool = False,
54 | ) -> NLIDataset:
55 | """Returns a `NLIDataset` instance.
56 |
57 | Args:
58 | dataset (Optional[Dataset], optional): Instance of Huggingface
59 | Dataset class. Defaults to None.
60 | dataset_name (Optional[str], optional): Dataset name to load from
61 | Huggingface datasets. Defaults to None.
62 | classes (Optional[List[str]], optional): List of classes.
63 | Defaults to None.
64 | text_column (Optional[str], optional): Text column name.
65 | Defaults to 'text'.
66 | label_column (Optional[str], optional): Label column name.
67 | Defaults to 'label'.
68 | template (Optional[str], optional): Template string that will be
69 | used for Zero-Shot training/prediction. Defaults to
70 | 'This example is {}.'.
71 | normalize_negatives (bool, optional): Whether to normalize amount
72 | of negative examples per each positive example of a class.
73 | Defaults to False.
74 | positives (int, optional): Number of positive examples to generate
75 | per source. Defaults to 1.
76 | negatives (int, optional): Number of negative examples to generate
77 | per source. Defaults to -1.
78 | multi_label (bool, optional): Whether each example has multiple
79 | labels or not. Defaults to False.
80 |
81 |         Raises:
82 |             TypeError: if both `dataset` and `dataset_name` are `None`.
83 |             TypeError: if `label_column` is not found in the dataset
84 |                 features.
85 |             TypeError: if `text_column` is not found in the dataset
86 |                 features.
87 |             ValueError: if `classes` is `None`.
88 | 
89 |         Returns:
90 |             NLIDataset: An instance of NLIDataset.
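
        Example (illustrative; the dataset id and class names are assumptions):
            nli_dataset = NLIDataset.load_dataset(
                dataset_name='dair-ai/emotion',
                classes=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'],
            )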
91 | """
92 | if dataset is None:
93 | if dataset_name is None:
94 | raise TypeError(
95 | "If dataset object is not provided you need to"
96 | " specify dataset_name."
97 | )
98 | else:
99 | dataset = load_dataset(dataset_name)["train"]
100 |
101 | if label_column not in dataset.features:
102 | raise TypeError(f"Expected to find {label_column} in the dataset.")
103 |
104 | if text_column not in dataset.features:
105 | raise TypeError(f"Expected to find {text_column} in the dataset.")
106 |
107 | if classes is None:
108 |             raise ValueError(
109 |                 f"Expected to have a list of classes. Received: {classes}."
110 | )
111 |
112 | processed_data = transform_dataset(
113 | dataset,
114 | classes,
115 | text_column,
116 | label_column,
117 | template,
118 | normalize_negatives,
119 | positives,
120 | negatives,
121 | multi_label,
122 | )
123 |
124 | return cls(
125 | processed_data["sources"],
126 | processed_data["targets"],
127 | processed_data["labels"],
128 | )
129 |
--------------------------------------------------------------------------------
/src/liqfit/datasets/transform.py:
--------------------------------------------------------------------------------
1 | from typing import List, Tuple, Optional
2 | from collections import defaultdict
3 | from datasets import Dataset
4 | import numpy as np
5 | import random
6 |
7 |
8 | def get_labels_stat(labels: List[str]) -> Tuple[List[str], List[float]]:
9 | """Calculates the number of occurrences and probability of each unique
10 | label in the provided list of labels.
11 |
12 | Args:
13 | labels (List[str]): List of label strings
14 |
15 | Returns:
16 | unique_labels (List[str]): Unique label values
17 | probs (List[float]): Probability of each label
18 | """
19 | # count occurrences of each label
20 | label_counts = defaultdict(int)
21 | for label in labels:
22 | label_counts[label] += 1
23 |
24 | # calculate probabilities
25 | count = len(labels)
26 | label_probs = {
27 | label: label_count / count
28 | for label, label_count in label_counts.items()
29 | }
30 |
31 | # extract labels and probabilities
32 | unique_labels = list(label_probs.keys())
33 | probs = list(label_probs.values())
34 |
35 | return unique_labels, probs
36 |
37 |
38 | def transform_dataset(
39 | dataset: Dataset,
40 | classes: List[str],
41 | text_column: Optional[str] = "text",
42 | label_column: Optional[str] = "label",
43 | template: Optional[str] = "This example is {}.",
44 | normalize_negatives: bool = False,
45 | positives: int = 1,
46 | negatives: int = -1,
47 | multi_label: bool = False,
48 | ) -> Dataset:
49 | """Transform a dataset into a format suitable for training.
50 |
51 | Args:
52 | dataset (Dataset): Input dataset.
53 | classes (List[str]): List of possible class labels.
54 | template (str, optional): Template string for generating examples.
55 | normalize_negatives (bool, optional): Whether to normalize amount of
56 | negative examples per each positive example of a class.
57 | positives (int, optional): Number of positive examples to generate per source.
58 | negatives (int, optional): Number of negative examples to generate per source.
59 |
60 |
61 | Returns:
62 | Dataset: Transformed dataset.
63 |
64 | This function transforms the input dataset into a format suitable for
65 | multi-label discriminative training. For each source text, it generates
66 | positive examples using the provided labels, and negative examples by
67 | sampling random incorrect labels.
68 | """
69 | new_dataset = {"sources": [], "targets": [], "labels": []}
70 |
71 | texts = dataset[text_column]
72 |
73 | if label_column == "all_labels":
74 | labels = dataset["all_labels"]
75 | multi_label = True
76 | elif label_column in dataset.features:
77 | labels = dataset[label_column]
78 |         if isinstance(labels[0], int):
79 | labels = [classes[idx] for idx in labels]
80 | else:
81 |         raise NotImplementedError(
82 |             'Dataset should contain a "label" or "all_labels" column.'
83 |         )
84 |
85 | if normalize_negatives:
86 | unique_labels, probs = get_labels_stat(labels)
87 |
88 | if positives == -1:
89 | positives = len(classes) - 1
90 | if negatives == -1:
91 | negatives = len(classes) - 1
92 |
93 | for text, label in zip(texts, labels):
94 | if multi_label:
95 | curr_labels = label
96 | else:
97 | curr_labels = [label]
98 |
99 | for label in curr_labels:
100 | for i in range(positives):
101 | new_dataset["sources"].append(text)
102 | new_dataset["targets"].append(template.format(label))
103 | new_dataset["labels"].append(1)
104 |
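            # Sample `negatives` incorrect labels for this text; when
            # `normalize_negatives` is set, candidates are drawn according to
            # the label distribution of the dataset.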
105 |             for _ in range(negatives):
106 | neg_class_ = label
107 |
108 | while neg_class_ in curr_labels:
109 | if normalize_negatives:
110 | neg_class_ = np.random.choice(unique_labels, p=probs)
111 | else:
112 | neg_class_ = random.sample(classes, k=1)[0]
113 |
114 | new_dataset["sources"].append(text)
115 | new_dataset["targets"].append(template.format(neg_class_))
116 | new_dataset["labels"].append(0)
117 |
118 | return Dataset.from_dict(new_dataset)
119 |
--------------------------------------------------------------------------------
/src/liqfit/losses/__init__.py:
--------------------------------------------------------------------------------
1 | from .losses import cross_entropy
2 | from .losses import binary_cross_entropy_with_logits
3 | from .losses import focal_loss_with_mask
4 | from .losses import BinaryCrossEntropyLoss, CrossEntropyLoss, FocalLoss
5 |
--------------------------------------------------------------------------------
/src/liqfit/losses/losses.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 | from typing import Optional
3 | import torch.nn.functional as F
4 | from kornia.losses import focal_loss
5 | import torch
6 |
7 |
8 | def binary_cross_entropy_with_logits(logits: torch.Tensor,
9 | labels: torch.Tensor,
10 | multi_target: bool = False,
11 | weight: Optional[torch.Tensor] = None,
12 | reduction: str = 'mean') -> torch.Tensor:
13 | """Wrapper function for adding support for multi_target training.
14 |
15 | Args:
16 | logits (torch.Tensor): Tensor with shape (B, T, D) where B is batch
17 | size, T is timesteps and D is embedding dimension.
18 | labels (torch.Tensor): Tensor with shape (B, T) where B is batch size,
19 | T is timesteps.
20 | multi_target (bool, optional): Whether the labels are multi target or
21 | one target for the entire sequence. Defaults to False.
22 | weight (Optional[torch.Tensor], optional): a manual rescaling weight
23 | if provided it's repeated to match input tensor shape.
24 | Defaults to None.
25 | reduction (str, optional): Reduction type that will be applied on the
26 | loss function, supported: 'mean', 'sum' or 'none'.
27 | Defaults to 'mean'.
28 |
29 | Returns:
30 | torch.Tensor: Loss tensor.
31 | """
32 | if multi_target:
33 | logits = logits.view(-1, logits.shape[-1])
34 | labels = labels.view(-1)
35 | else:
36 | labels = labels.view(-1)
37 | loss = F.binary_cross_entropy_with_logits(logits,
38 | labels,
39 | weight=weight,
40 | reduction=reduction)
41 | return loss
42 |
43 |
44 | class BinaryCrossEntropyLoss(torch.nn.Module):
45 |
46 | def __init__(self, multi_target=False, weight=None, reduction='mean'):
47 | super().__init__()
48 | """Calculate binary cross-entropy loss with support for multi target training.
49 |
50 | Args:
51 | multi_target (bool, optional): Whether the labels are multi target or
52 | one target for the entire sequence. Defaults to False.
53 | weight (Optional[torch.Tensor], optional): a manual rescaling weight
54 | if provided it's repeated to match input tensor shape.
55 | Defaults to None.
56 | reduction (str, optional): Reduction type that will be applied on the
57 | loss function, supported: 'mean', 'sum' or 'none'.
58 | Defaults to 'mean'.
59 |
60 | Returns:
61 | torch.Tensor: Loss tensor.
62 | Examples:
63 | loss = BinaryCrossEntropyLoss()(logits, targets)
64 | """
65 | self.multi_target = multi_target
66 | self.weight = weight
67 | self.reduction = reduction
68 |
69 | def forward(self, logits, target):
70 |
71 | loss = binary_cross_entropy_with_logits(
72 | logits,
73 | target,
74 | multi_target=self.multi_target,
75 | weight=self.weight,
76 | reduction=self.reduction,
77 | )
78 |
79 | return loss
80 |
81 |
82 | def cross_entropy(logits: torch.Tensor,
83 | labels: torch.Tensor,
84 | multi_target: bool = False,
85 | weight: Optional[torch.Tensor] = None,
86 | ignore_index: int = -100,
87 | reduction: str = 'mean',
88 | label_smoothing: float = 0.0):
89 | """Wrapper function for adding support for multi_target training.
90 |
91 | Args:
92 | logits (torch.Tensor): Tensor with shape (B, T, D) where B is batch
93 | size, T is timesteps and D is embedding dimension.
94 | labels (torch.Tensor): Tensor with shape (B, T) where B is batch size,
95 | T is timesteps.
96 | multi_target (bool, optional): Whether the labels are multi target or
97 | one target for the entire sequence. Defaults to False.
98 | weight (Optional[torch.Tensor], optional): a manual rescaling weight
99 | if provided it's repeated to match input tensor shape.
100 | Defaults to None.
101 | ignore_index (int, optional): Index value that will be ignored during
102 | loss calculation. Defaults to -100.
103 | reduction (str, optional): Reduction type that will be applied on the
104 | loss function, supported: 'mean', 'sum' or 'none'.
105 | Defaults to 'mean'.
106 | label_smoothing (float, optional): A float in [0.0, 1.0]. Specifies
107 | the amount of smoothing when computing the loss, where 0.0 means
108 | no smoothing. Defaults to 0.0.
109 |
110 | Returns:
111 | torch.Tensor: Loss tensor.
112 | """
113 | if multi_target:
114 | logits = logits.view(-1, logits.shape[-1])
115 | labels = labels.view(-1)
116 | else:
117 | labels = labels.view(-1)
118 | loss = F.cross_entropy(logits,
119 | labels,
120 | weight=weight,
121 | reduction=reduction,
122 | ignore_index=ignore_index,
123 | label_smoothing=label_smoothing)
124 | return loss
125 |
126 |
127 | class CrossEntropyLoss(torch.nn.Module):
128 |
129 | def __init__(self, multi_target=False, weight=None, ignore_index=-100, reduction='mean', label_smoothing=0.0):
130 | super().__init__()
131 | """Calculate cross-entropy loss while ignoring specified target labels.
132 |
133 | Args:
134 | multi_target (bool, optional): Whether the labels are multi target or
135 | one target for the entire sequence. Defaults to False.
136 | weight (Optional[torch.Tensor], optional): a manual rescaling weight
137 | if provided it's repeated to match input tensor shape.
138 | Defaults to None.
139 | ignore_index (int, optional): Index value that will be ignored during
140 | loss calculation. Defaults to -100.
141 | reduction (str, optional): Reduction type that will be applied on the
142 | loss function, supported: 'mean', 'sum' or 'none'.
143 | Defaults to 'mean'.
144 | label_smoothing (float, optional): A float in [0.0, 1.0]. Specifies
145 | the amount of smoothing when computing the loss, where 0.0 means
146 | no smoothing. Defaults to 0.0.
147 |
148 | Returns:
149 | torch.Tensor: Loss tensor.
150 | Examples:
151 | loss = CrossEntropyLoss()(logits, targets)
152 | """
153 | self.multi_target = multi_target
154 | self.weight = weight
155 | self.ignore_index = ignore_index
156 | self.reduction = reduction
157 | self.label_smoothing = label_smoothing
158 |
159 | def forward(self, logits, target):
160 |
161 | loss = cross_entropy(
162 | logits,
163 | target,
164 | multi_target=self.multi_target,
165 | weight=self.weight,
166 | ignore_index=self.ignore_index,
167 | reduction=self.reduction,
168 | label_smoothing=self.label_smoothing
169 | )
170 |
171 | return loss
172 |
173 |
174 | def focal_loss_with_mask(
175 | logits: torch.Tensor,
176 | target: torch.Tensor,
177 | ignore_index: int = -100,
178 | alpha: float = 0.5,
179 | gamma: float = 2.0,
180 | reduction: str | None = "mean",
181 | ) -> torch.Tensor:
182 | """Calculate focal loss while ignoring specified target labels.
183 |
184 | Args:
185 | logits (torch.Tensor): Model predictions.
186 | target (torch.Tensor): True labels.
187 | ignore_index (int): Label to ignore from loss calculation.
188 | alpha (float): Focal loss alpha parameter.
189 | gamma (float): Focal loss gamma parameter.
190 | reduction (str | None): Method to reduce loss.
191 |
192 | Returns:
193 | torch.Tensor: Loss tensor.
194 |
195 | This function calculates the focal loss between logits and targets,
196 | while ignoring any examples where the target is equal to ignore_index.
197 |
198 | Examples:
199 |
200 | loss = focal_loss_with_mask(logits, targets, ignore_index=-100)
201 | """
202 | if not isinstance(ignore_index, int):
203 | raise ValueError('Expected `ignore_index` to be of type `int`. '
204 | f'Received: {type(ignore_index)}')
205 |
206 |     mask = target == ignore_index
207 | 
208 |     # Replace ignored targets with class 0 because kornia's focal_loss
209 |     # cannot handle negative labels (e.g. -100). When ignore_index is 0,
210 |     # the masked positions are already 0, so this is a no-op.
211 |     target_without_ignore_index = target.masked_fill(mask, 0)
212 |
213 | loss = focal_loss(
214 | pred=logits,
215 | target=target_without_ignore_index,
216 | alpha=alpha,
217 | gamma=gamma,
218 | reduction="none",
219 | )
220 |
221 | loss = loss.masked_fill(mask.view(-1, 1), torch.inf)
222 |
223 | if reduction == "mean":
224 | return loss[loss != torch.inf].mean()
225 | elif reduction == "sum":
226 | return loss[loss != torch.inf].sum()
227 | elif reduction is None:
228 | return loss
229 | else:
230 | raise ValueError(
231 | 'Expected reduction to be "sum", "mean" or `None`. '
232 | f"Received: {reduction}."
233 | )
234 |
235 | class FocalLoss(torch.nn.Module):
236 | def __init__(
237 | self,
238 | ignore_index: int = -100,
239 | alpha: float = 0.5,
240 | gamma: float = 2.0,
241 | reduction: str = "mean",
242 | ):
243 | """Calculate focal loss while ignoring specified target labels.
244 | Args:
245 | logits (torch.Tensor): Model predictions.
246 | target (torch.Tensor): True labels.
247 | ignore_index (int): Label to ignore from loss calculation.
248 | alpha: Weighting factor that ranges between [0, 1]`.
249 | gamma: Focusing parameter gamma >= 0`.
250 | reduction (str | None): Reduction type for loss reduction.
251 | Supported: 'mean', 'sum' or 'none'. Defaults to 'mean'
252 |
253 | Returns:
254 | torch.Tensor: Loss tensor.
255 | Examples:
256 | loss = FocalLoss()(logits, targets)
257 | """
258 | super().__init__()
259 | self.ignore_index = ignore_index
260 | self.alpha = alpha
261 | self.gamma = gamma
262 | self.reduction = reduction
263 |
264 | def forward(self, logits: torch.Tensor, target: torch.Tensor):
265 | return focal_loss_with_mask(
266 | logits=logits,
267 | target=target,
268 | ignore_index=self.ignore_index,
269 | alpha=self.alpha,
270 | gamma=self.gamma,
271 | reduction=self.reduction,
272 | )
273 |
--------------------------------------------------------------------------------
/src/liqfit/modeling/__init__.py:
--------------------------------------------------------------------------------
1 | from .heads import LiqFitHead
2 | from .heads import LabelClassificationHead
3 | from .heads import ClassClassificationHead
4 | from .heads import ClassificationHead
5 | from .model import LiqFitModel
6 | from .backbone import LiqFitBackbone
7 | from .heads import HeadOutput
8 |
--------------------------------------------------------------------------------
/src/liqfit/modeling/backbone.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 | import abc
3 |
4 | import torch
5 | from torch import nn
6 | from transformers import PreTrainedModel, PretrainedConfig
7 |
8 |
9 | class LiqFitBackbone(PreTrainedModel, abc.ABC):
10 | def __init__(
11 | self, config: PretrainedConfig, backbone: nn.Module, push_backbone_only: bool = False
12 | ) -> None:
13 | """Backbone model wrapper."""
14 | super().__init__(config=config)
15 | self.push_backbone_only = push_backbone_only
16 | self.backbone = backbone
17 |
18 | def push_to_hub(
19 | self,
20 | repo_id: str,
21 | use_temp_dir: bool | None = None,
22 | commit_message: str | None = None,
23 | private: bool | None = None,
24 | token: bool | str | None = None,
25 | max_shard_size: int | str | None = "5GB",
26 | create_pr: bool = False,
27 | safe_serialization: bool = True,
28 | revision: str = None,
29 | commit_description: str = None,
30 | **deprecated_kwargs,
31 | ) -> str:
32 | if self.push_backbone_only:
33 | output = self.backbone.push_to_hub(
34 | repo_id=repo_id,
35 | use_temp_dir=use_temp_dir,
36 | commit_message=commit_message,
37 | private=private,
38 | token=token,
39 | max_shard_size=max_shard_size,
40 | create_pr=create_pr,
41 | safe_serialization=safe_serialization,
42 | revision=revision,
43 | commit_description=commit_description,
44 | **deprecated_kwargs,
45 | )
46 | else:
47 | output = super().push_to_hub(
48 | repo_id=repo_id,
49 | use_temp_dir=use_temp_dir,
50 | commit_message=commit_message,
51 | private=private,
52 | token=token,
53 | max_shard_size=max_shard_size,
54 | create_pr=create_pr,
55 | safe_serialization=safe_serialization,
56 | revision=revision,
57 | commit_description=commit_description,
58 | **deprecated_kwargs,
59 | )
60 | return output
61 |
62 | @abc.abstractmethod
63 | def encode(self, input_ids, attention_mask=None) -> torch.Tensor:
64 | raise NotImplementedError("Should be implemented in a subclass.")
65 |
66 |
--------------------------------------------------------------------------------
/src/liqfit/modeling/heads.py:
--------------------------------------------------------------------------------
1 | import abc
2 | from typing import Optional
3 |
4 | import torch
5 | from torch import nn
6 | from dataclasses import dataclass
7 | from transformers.modeling_outputs import ModelOutput
8 |
9 | from ..losses import binary_cross_entropy_with_logits, cross_entropy
10 |
11 | class LiqFitHead(nn.Module, abc.ABC):
12 | def __init__(self, *args, **kwargs) -> None:
13 | """LiqFitHead base class."""
14 | super().__init__(*args, **kwargs)
15 |
16 | @abc.abstractmethod
17 | def compute_loss(self, logits, labels) -> torch.Tensor:
18 | raise NotImplementedError("Should be implemented in a subclass.")
19 |
20 | @staticmethod
21 | def init_weight(module):
22 | if isinstance(module, nn.Linear):
23 | nn.init.xavier_uniform_(module.weight)
24 | if module.bias is not None:
25 | nn.init.constant_(module.bias, 1e-2)
26 |
27 | @abc.abstractmethod
28 | def forward(
29 | self, embeddings: torch.Tensor, labels: Optional[torch.Tensor] = None
30 | ):
31 | pass
32 |
33 | @dataclass
34 | class HeadOutput(ModelOutput):
35 | embeddings: Optional[torch.Tensor] = None
36 | logits: Optional[torch.Tensor] = None
37 | loss: Optional[torch.Tensor] = None
38 |
39 |
40 | class LabelClassificationHead(LiqFitHead):
41 | def __init__(
42 | self,
43 | in_features: int,
44 | out_features: int,
45 | multi_target: bool,
46 | bias: bool = True,
47 |         temperature: float = 1.0,
48 | eps: float = 1e-5,
49 | ):
50 | """Label Classification Head class for Binary or Multi-label tasks.
51 |
52 | Args:
53 |             in_features (int): Number of input features.
54 |             out_features (int): Number of output features.
55 |             multi_target (bool): Whether this class is for a multi-target
56 |                 task or not.
57 | bias (bool, optional): Whether to add bias to the `Linear`
58 | layer or not. Defaults to True.
59 |             temperature (float, optional): Temperature that will be used
60 |                 to calibrate the head to the task. Defaults to 1.0.
61 |             eps (float, optional): Epsilon value for numerical stability.
62 |                 Defaults to 1e-5.
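
        Example (illustrative):
            head = LabelClassificationHead(in_features=768, out_features=1,
                                           multi_target=False)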
63 | """
64 | super().__init__()
65 | self.temperature = temperature
66 | self.eps = eps
67 | self.multi_target = multi_target
68 | self.linear = nn.Linear(in_features, out_features, bias=bias)
69 | LiqFitHead.init_weight(self.linear)
70 |
71 | def compute_loss(self, logits: torch.Tensor, labels: torch.Tensor):
72 | loss = binary_cross_entropy_with_logits(
73 | logits, labels, self.multi_target
74 | )
75 | return loss
76 |
77 | def forward(
78 | self, embeddings: torch.Tensor, labels: Optional[torch.Tensor] = None
79 | ) -> torch.Tensor:
80 | logits = self.linear(embeddings)
81 | logits /= self.temperature + self.eps
82 | if labels is not None:
83 | loss = self.compute_loss(logits, labels)
84 | else:
85 | loss = None
86 | return HeadOutput(embeddings=embeddings, logits=logits, loss=loss)
87 |
88 |
89 | class ClassClassificationHead(LiqFitHead):
90 | def __init__(
91 | self,
92 | in_features: int,
93 | out_features: int,
94 | multi_target: bool,
95 | bias: bool = True,
96 |         temperature: float = 1.0,
97 | eps: float = 1e-5,
98 | ignore_index: int = -100,
99 | ):
100 | """Class Classification Head class for Sequence/Token classification
101 | tasks.
102 |
103 | Args:
104 | in_features (int): Number of input features.
105 | out_features (int): Number of output features.
106 | multi_target (bool): Whether this class is for multi-target task
107 | or not.
108 | bias (bool, optional): Whether to add bias to the `Linear`
109 | layer or not. Defaults to True.
110 |             temperature (float, optional): Temperature that will be used
111 |                 to calibrate the head to the task. Defaults to 1.0.
112 |             eps (float, optional): Epsilon value for numerical stability.
113 |                 Defaults to 1e-5.
114 |             ignore_index (int, optional): Index that will be ignored in
115 |                 case of token classification tasks. Defaults to -100.
116 | """
117 | super().__init__()
118 | self.temperature = temperature
119 | self.eps = eps
120 | self.multi_target = multi_target
121 | self.ignore_index = ignore_index
122 | self.linear = nn.Linear(in_features, out_features, bias=bias)
123 | LiqFitHead.init_weight(self.linear)
124 |
125 | def compute_loss(self, logits: torch.Tensor, labels: torch.Tensor):
126 | return cross_entropy(
127 | logits, labels, self.multi_target, ignore_index=self.ignore_index
128 | )
129 |
130 | def forward(
131 | self, embeddings: torch.Tensor, labels: Optional[torch.Tensor] = None
132 | ) -> torch.Tensor:
133 | logits = self.linear(embeddings) / (self.temperature + self.eps)
134 | if labels is not None:
135 | loss = self.compute_loss(logits, labels)
136 | else:
137 | loss = None
138 | return HeadOutput(embeddings=embeddings, logits=logits, loss=loss)
139 |
140 |
141 | class ClassificationHead(LiqFitHead):
142 | def __init__(
143 | self,
144 | in_features: int,
145 | out_features: int,
146 | pooler: nn.Module,
147 | loss_func: nn.Module,
148 | bias: bool = True,
149 |         temperature: float = 1.0,
150 | eps: float = 1e-5,
151 | ):
152 | """Class Classification Head class for Sequence/Token classification
153 | tasks.
154 |
155 | Args:
156 | in_features (int): Number of input features.
157 | out_features (int): Number of output features.
158 | pooler (torch.nn.Module): Module that applier various pooling opperation on the outputs of a model .
159 | loss_func (torch.nn.Module): loss function object.
160 | out_features (int): Number of output features.
161 | bias (bool, optional): Whether to add bias to the `Linear`
162 | layer or not. Defaults to True.
163 | temperature (int, optional): Temperature that will be used
164 | to calibrate the head to the task. Defaults to 1.0.
165 | eps (float, optional): Epsilon value for numirical stability.
166 | Defaults to 1e-5.
167 | ignore_index (int, optional): Index that will be ignore in
168 | case of token classification tasks. Defaults to -100.
169 | """
170 | super().__init__()
171 | self.temperature = temperature
172 | self.eps = eps
173 | self.pooler = pooler
174 | self.loss_func = loss_func
175 | self.linear = nn.Linear(in_features, out_features, bias=bias)
176 | LiqFitHead.init_weight(self.linear)
177 |
178 | def compute_loss(self, logits: torch.Tensor, labels: torch.Tensor):
179 | return self.loss_func(
180 | logits, labels
181 | )
182 |
183 | def forward(
184 | self, embeddings: torch.Tensor, labels: Optional[torch.Tensor] = None
185 | ) -> torch.Tensor:
186 | pooled_input = self.pooler(embeddings)
187 | logits = self.linear(pooled_input) / (self.temperature + self.eps)
188 | if labels is not None:
189 | loss = self.compute_loss(logits, labels)
190 | else:
191 | loss = None
192 | return HeadOutput(embeddings=pooled_input, logits=logits, loss=loss)
193 |
--------------------------------------------------------------------------------
/src/liqfit/modeling/model.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 |
3 | from typing import Optional
4 |
5 | import inspect
6 | import torch
7 | from torch import nn
8 | import torch.nn.functional as F
9 | from sklearn.linear_model import LogisticRegression
10 | from transformers import PreTrainedModel, PretrainedConfig
11 |
12 | from .backbone import LiqFitBackbone
13 | from .heads import LiqFitHead, HeadOutput
14 | from ..utils.standardization import convert_to_numpy
15 |
16 | class LiqFitModel(PreTrainedModel):
17 | def __init__(
18 | self,
19 | config: PretrainedConfig,
20 | backbone: LiqFitBackbone | nn.Module | PreTrainedModel,
21 | head: Optional[LiqFitHead | LogisticRegression] = None,
22 | loss_func: Optional[nn.Module] = None,
23 | normalize_backbone_embeddings: bool = False,
24 | labels_name: str = "labels",
25 | push_backbone_only: bool = False,
26 | ):
27 | """Model container that groups the backbone and head together
28 | and applies forward on both of them.
29 |
30 | Args:
31 | backbone (LiqFitBackbone): Backbone model.
32 | head (Optional[LiqFitHead | LogisticRegression], optional):
33 | Head that is defined for the task. Could be set to `None`
34 | if the head is already attached to the backbone.
35 | Defaults to None.
36 |             loss_func (Optional[nn.Module], optional): Loss function module used for loss calculation when no separate head is attached. Defaults to None.
37 | normalize_backbone_embeddings (bool, optional): Whether to
38 | normalize the backbone embeddings or not (Requires the
39 | backbone output to be a `torch.Tensor` not a Huggingface
40 | object). Defaults to False.
41 | labels_name (str, optional): Labels name that will be sent in the
42 | **kwargs for loss calculation. Defaults to "labels".
43 |
44 | Example 1:
45 | # make sure that the output from this model
46 | # is a torch.Tensor otherwise wrap it using LiqFitBackbone.
47 | my_backbone = AutoModel.from_pretrained(....)
48 | head = LiqFit.modeling.LabelClassificationHead(...)
49 | model = LiqFitModel(my_backbone.config, my_backbone, head)
50 |
51 | Example 2:
52 | class MyBackbone(LiqFitBackbone):
53 | def __init__(self):
54 | my_backbone = AutoModel.from_pretrained(....)
55 |                     super().__init__(my_backbone.config, backbone=my_backbone)
56 | def encode(self, input_ids, attention_mask=None) -> torch.Tensor:
57 | output = self.backbone(input_ids, attention_mask=attention_mask)
58 | return output
59 |
60 | my_backbone = MyBackbone()
61 | head = LiqFit.modeling.LabelClassificationHead(...)
62 | model = LiqFitModel(my_backbone.config, my_backbone, head)
63 | """
64 |
65 | super().__init__(config=config)
66 | self._is_sklearn_head = None
67 | self.backbone = backbone
68 | self._determine_and_validate_head_type(head)
69 | self.head = head
70 | self.loss_func = loss_func
71 | self.normalize_backbone_embeddings = normalize_backbone_embeddings
72 | self.labels_name = labels_name
73 | self.push_backbone_only = push_backbone_only
74 | self.expecting_labels = 'labels' in inspect.getfullargspec(self.backbone.forward).args
75 |
76 | def push_to_hub(
77 | self,
78 | repo_id: str,
79 | use_temp_dir: bool | None = None,
80 | commit_message: str | None = None,
81 | private: bool | None = None,
82 | token: bool | str | None = None,
83 | max_shard_size: int | str | None = "5GB",
84 | create_pr: bool = False,
85 | safe_serialization: bool = True,
86 | revision: str = None,
87 | commit_description: str = None,
88 | **deprecated_kwargs,
89 | ) -> str:
90 | if self.push_backbone_only:
91 | if isinstance(self.backbone, (LiqFitBackbone, PreTrainedModel)):
92 | return self.backbone.push_to_hub(
93 | repo_id,
94 | use_temp_dir,
95 | commit_message,
96 | private,
97 | token,
98 | max_shard_size,
99 | create_pr,
100 | safe_serialization,
101 | revision,
102 | commit_description,
103 | **deprecated_kwargs,
104 | )
105 | else:
106 | output = super().push_to_hub(
107 | repo_id=repo_id,
108 | use_temp_dir=use_temp_dir,
109 | commit_message=commit_message,
110 | private=private,
111 | token=token,
112 | max_shard_size=max_shard_size,
113 | create_pr=create_pr,
114 | safe_serialization=safe_serialization,
115 | revision=revision,
116 | commit_description=commit_description,
117 | **deprecated_kwargs,
118 | )
119 | return output
120 |
121 | def freeze_weights(self):
122 | self.requires_grad_(False)
123 |
124 | def unfreeze_weights(self):
125 | self.requires_grad_(True)
126 |
127 | def _determine_and_validate_head_type(self, head):
128 | if head is None:
129 | return
130 |
131 | self._is_sklearn_head = isinstance(head, LogisticRegression)
132 | if not self._is_sklearn_head and not isinstance(head, LiqFitHead):
133 | raise TypeError(
134 | "Expected `head` to be of type "
135 | "`LogisticRegression` or `LiqFitHead`. "
136 | f"Received: {type(head)}."
137 | )
138 |
139 | def _backbone_forward(self, **kwargs):
140 | if isinstance(self.backbone, LiqFitBackbone):
141 | output = self.backbone.encode(**kwargs)
142 | if not isinstance(output, torch.Tensor):
143 | raise ValueError(
144 | "Expected output from backbone model to be of type "
145 | f"`torch.Tensor`. Received: {type(output)}."
146 | )
147 | else:
148 | output = self.backbone(**kwargs)
149 | return output
150 |
151 | def _torch_head_forward(self, embeddings, labels=None):
152 | output = self.head(embeddings, labels)
153 | return output
154 |
155 | def _sklearn_head_forward(self, embeddings):
156 | embeddings = convert_to_numpy(embeddings)
157 | output = self.head.predict(embeddings)
158 | return output
159 |
160 | def _head_forward(self, inputs, labels=None):
161 | if self._is_sklearn_head:
162 | return self._sklearn_head_forward(inputs)
163 | else:
164 | return self._torch_head_forward(inputs, labels)
165 |
166 | def forward(self, **kwargs):
167 | labels = kwargs.pop('labels', None)
168 |
169 | output = self._backbone_forward(**kwargs)
170 |
171 | if not isinstance(output, torch.Tensor):
172 | if isinstance(output, tuple):
173 | output = output[0]
174 | elif 'logits' in output:
175 | output = output['logits']
176 | elif 'last_hidden_state' in output:
177 | output = output['last_hidden_state']
178 | else:
179 |                 raise NotImplementedError('A model output should contain logits or last_hidden_state.')
180 |
181 | if self.normalize_backbone_embeddings:
182 | if isinstance(output, torch.Tensor):
183 | output = F.normalize(output, p=2.0, dim=-1)
184 | else:
185 | raise TypeError(
186 | "Normalizing the embedding requires type of "
187 | f"`torch.Tensor`. Received: {type(output)}."
188 | )
189 | if self.head is not None:
190 | output = self._head_forward(output, labels)
191 | elif self.loss_func is not None and labels is not None:
192 | loss = self.loss_func(output, labels)
193 | output = HeadOutput(logits=output, loss=loss)
194 | return output
195 |
--------------------------------------------------------------------------------
/src/liqfit/modeling/pooling.py:
--------------------------------------------------------------------------------
1 | from typing import Optional
2 |
3 | import torch
4 | from torch import nn
5 |
6 |
7 | class GlobalMaxPooling1D(nn.Module):
8 | """Applies Global Max Pooling on the timesteps dimension."""
9 |
10 | def forward(self, x: torch.Tensor):
11 | return x.amax(dim=1)
12 |
13 |
14 | class FirstTokenPooling1D(nn.Module):
15 | """Takes the first token's embedding."""
16 |
17 | def forward(self, x: torch.Tensor):
18 | return x[:, 0, :]
19 |
20 |
21 | class LastTokenPooling1D(nn.Module):
22 | """Takes the last token's embedding."""
23 |
24 | def forward(self, x: torch.Tensor):
25 | return x[:, -1, :]
26 |
27 |
28 | class GlobalAvgPooling1D(nn.Module):
29 | """Applies Global Average Pooling on the timesteps dimension."""
30 |
31 | def forward(
32 | self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None
33 | ):
34 | if attention_mask is not None:
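            # Assumes `attention_mask` is shaped (batch, timesteps, 1) so that
            # repeating along the last dimension matches the embedding size of `x`.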
35 | attention_mask = attention_mask.repeat((1, 1, x.shape[-1])).to(
36 | dtype=x.dtype
37 | )
38 | x = x * attention_mask
39 | return x.sum(1) / attention_mask.sum(1)
40 | else:
41 | return x.mean(dim=1)
42 |
43 |
44 | class GlobalSumPooling1D(nn.Module):
45 | """Applies Global Sum Pooling on the timesteps dimension."""
46 |
47 | def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
48 | if attention_mask is not None:
49 | x = x * attention_mask
50 | return x.sum(dim=1)
51 |
52 |
53 | class GlobalRMSPooling1D(nn.Module):
54 | """Applies Global RMS Pooling on the timesteps dimension."""
55 |
56 | def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
57 | if attention_mask is not None:
58 | attention_mask = attention_mask.repeat((1, 1, x.shape[-1])).to(
59 | dtype=x.dtype
60 | )
61 | x = x * attention_mask
62 | return (x.pow(2).sum(dim=1) / attention_mask.sum(1)).sqrt()
63 | else:
64 | return x.pow(2).mean(dim=1).sqrt()
65 |
66 |
67 | class GlobalAbsMaxPooling1D(nn.Module):
68 | """Applies Global Max Pooling of absolute values on the timesteps dimension."""
69 |
70 | def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
71 | if attention_mask is not None:
72 | attention_mask = attention_mask.repeat((1, 1, x.shape[-1])).to(
73 | dtype=x.dtype
74 | )
75 | x = x * attention_mask
76 | return x.abs().amax(dim=1)
77 |
78 |
79 | class GlobalAbsAvgPooling1D(nn.Module):
80 | """Applies Global Average Pooling of absolute values on the timesteps dimension."""
81 |
82 | def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
83 | if attention_mask is not None:
84 | attention_mask = attention_mask.repeat((1, 1, x.shape[-1])).to(
85 | dtype=x.dtype
86 | )
87 | x = (x * attention_mask).abs()
88 | return x.sum(dim=1) / attention_mask.sum(1)
89 | else:
90 | return x.abs().mean(dim=1)
91 |
--------------------------------------------------------------------------------
/src/liqfit/models/__init__.py:
--------------------------------------------------------------------------------
1 | from .t5 import T5ForZeroShotClassification, T5ConfigWithLoss
2 | from .deberta import DebertaV2ForZeroShotClassification, DebertaConfigWithLoss
--------------------------------------------------------------------------------
/src/liqfit/models/deberta.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2020, The T5 Authors and HuggingFace Inc. and Knowledgator
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | from transformers import DebertaConfig, DebertaV2ForSequenceClassification
17 | from transformers.modeling_outputs import SequenceClassifierOutput
18 | from transformers.utils import add_end_docstrings, logging
19 | from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
20 |
21 | from typing import Union, Optional, Tuple
22 | import torch
23 | from torch import nn
24 |
25 | from typing import List, Union
26 |
27 | from ..losses import FocalLoss
28 |
29 | logger = logging.get_logger(__name__)
30 |
31 | SUPPORTED_LOSSES = ("focal_loss", "cross_entropy")
32 |
33 |
34 | class DebertaConfigWithLoss(DebertaConfig):
35 | """Deberta configuration with additional loss parameters.
36 |
37 |     Extends DebertaConfig to include parameters for configuring the
38 | loss function during training.
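
    Example (illustrative):
        config = DebertaConfigWithLoss(loss_type='focal_loss',
                                       focal_loss_alpha=0.5,
                                       focal_loss_gamma=2.0)
        model = DebertaV2ForZeroShotClassification(config)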
39 | """
40 | def __init__(
41 | self,
42 | loss_type = "focal_loss",
43 | focal_loss_alpha=0.5,
44 | focal_loss_gamma=2.0,
45 | **kwargs,
46 | ):
47 | super().__init__(**kwargs)
48 |         self.loss_type = loss_type
49 | self.focal_loss_alpha = focal_loss_alpha
50 | self.focal_loss_gamma = focal_loss_gamma
51 |
52 | class DebertaV2ForZeroShotClassification(DebertaV2ForSequenceClassification):
53 | def __init__(self, config: DebertaConfigWithLoss):
54 | super().__init__(config)
55 |
56 | if self.config.loss_type not in SUPPORTED_LOSSES:
57 | raise NotImplementedError(f"{self.config.loss_type} is not implemented loss function type. ")
58 |
59 | def forward(
60 | self,
61 | input_ids: Optional[torch.Tensor] = None,
62 | attention_mask: Optional[torch.Tensor] = None,
63 | token_type_ids: Optional[torch.Tensor] = None,
64 | position_ids: Optional[torch.Tensor] = None,
65 | inputs_embeds: Optional[torch.Tensor] = None,
66 | labels: Optional[torch.Tensor] = None,
67 | output_attentions: Optional[bool] = None,
68 | output_hidden_states: Optional[bool] = None,
69 | return_dict: Optional[bool] = None,
70 | ) -> Union[Tuple, SequenceClassifierOutput]:
71 | r"""
72 | labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
73 | Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
74 | config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
75 | `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
76 | """
77 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
78 |
79 | outputs = self.deberta(
80 | input_ids,
81 | token_type_ids=token_type_ids,
82 | attention_mask=attention_mask,
83 | position_ids=position_ids,
84 | inputs_embeds=inputs_embeds,
85 | output_attentions=output_attentions,
86 | output_hidden_states=output_hidden_states,
87 | return_dict=return_dict,
88 | )
89 |
90 | encoder_layer = outputs[0]
91 | pooled_output = self.pooler(encoder_layer)
92 | pooled_output = self.dropout(pooled_output)
93 | logits = self.classifier(pooled_output)
94 |
95 | loss = None
96 | if labels is not None:
97 | if self.config.problem_type is None:
98 | if self.num_labels == 1:
99 | # regression task
100 | loss_fn = nn.MSELoss()
101 | logits = logits.view(-1).to(labels.dtype)
102 | loss = loss_fn(logits, labels.view(-1))
103 | elif labels.dim() == 1 or labels.size(-1) == 1:
104 | label_index = (labels >= 0).nonzero()
105 | labels = labels.long()
106 | if label_index.size(0) > 0:
107 | labeled_logits = torch.gather(
108 | logits, 0, label_index.expand(label_index.size(0), logits.size(1))
109 | )
110 | labels = torch.gather(labels, 0, label_index.view(-1))
111 | loss_fct = CrossEntropyLoss()
112 | loss = loss_fct(labeled_logits.view(-1, self.num_labels).float(), labels.view(-1))
113 | else:
114 | loss = torch.tensor(0).to(logits)
115 | else:
116 | log_softmax = nn.LogSoftmax(-1)
117 | loss = -((log_softmax(logits) * labels).sum(-1)).mean()
118 | elif self.config.problem_type == "regression":
119 | loss_fct = MSELoss()
120 | if self.num_labels == 1:
121 | loss = loss_fct(logits.squeeze(), labels.squeeze())
122 | else:
123 | loss = loss_fct(logits, labels)
124 | elif self.config.problem_type == "single_label_classification":
125 | if self.config.loss_type == "cross_entropy":
126 | loss_fct = CrossEntropyLoss()
127 | elif self.config.loss_type == "focal_loss":
128 | loss_fct = FocalLoss(alpha=self.config.focal_loss_alpha, gamma=self.config.focal_loss_gamma)
129 | loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
130 | elif self.config.problem_type == "multi_label_classification":
131 | loss_fct = BCEWithLogitsLoss()
132 | loss = loss_fct(logits, labels)
133 | if not return_dict:
134 | output = (logits,) + outputs[1:]
135 | return ((loss,) + output) if loss is not None else output
136 |
137 | return SequenceClassifierOutput(
138 | loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions
139 | )
--------------------------------------------------------------------------------
/src/liqfit/models/t5.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2020, The T5 Authors and HuggingFace Inc. and Knowledgator
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | from transformers import T5PreTrainedModel, T5Config, T5Model
17 | from transformers.modeling_outputs import Seq2SeqSequenceClassifierOutput
18 | from transformers.utils import add_end_docstrings, logging
19 | from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
20 |
21 | from typing import Union, Optional, Tuple
22 | import torch
23 | from torch import nn
24 |
25 | from typing import List, Union
26 |
27 | from ..losses import FocalLoss
28 |
29 | logger = logging.get_logger(__name__)
30 |
31 | SUPPORTED_LOSSES = ("focal_loss", "cross_entropy")
32 |
33 | class T5ConfigWithLoss(T5Config):
34 | """T5 configuration with additional loss parameters.
35 |
36 | Extends T5Config to include parameters for configuring the
37 | loss function during training.
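
    Example (illustrative):
        config = T5ConfigWithLoss(loss_type='focal_loss')
        model = T5ForZeroShotClassification(config)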
38 | """
39 | def __init__(
40 | self,
41 | loss_type = "focal_loss",
42 | focal_loss_alpha=0.5,
43 | focal_loss_gamma=2.0,
44 | **kwargs,
45 | ):
46 | super().__init__(**kwargs)
47 |         self.loss_type = loss_type
48 | self.focal_loss_alpha = focal_loss_alpha
49 | self.focal_loss_gamma = focal_loss_gamma
50 |
51 | class T5ClassificationHead(nn.Module):
52 | """Head for sentence-level classification tasks."""
53 |
54 | def __init__(self, config: T5ConfigWithLoss):
55 | super().__init__()
56 | self.dense = nn.Linear(config.d_model, config.d_model)
57 | self.dropout = nn.Dropout(p=config.classifier_dropout)
58 | self.out_proj = nn.Linear(config.d_model, config.num_labels)
59 |
60 | def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
61 | hidden_states = self.dropout(hidden_states)
62 | hidden_states = self.dense(hidden_states)
63 | hidden_states = torch.tanh(hidden_states)
64 | hidden_states = self.dropout(hidden_states)
65 | hidden_states = self.out_proj(hidden_states)
66 | return hidden_states
67 |
68 |
69 | class T5ForZeroShotClassification(T5PreTrainedModel):
70 | _keys_to_ignore_on_load_unexpected = ["decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight"]
71 | _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"]
72 |
73 | def __init__(self, config: T5ConfigWithLoss):
74 | super().__init__(config)
75 |
76 | if self.config.loss_type not in SUPPORTED_LOSSES:
77 | raise NotImplementedError(f"{self.config.loss_type} is not implemented loss function type. ")
78 |
79 | self.transformer = T5Model(config)
80 | self.classification_head = T5ClassificationHead(config)
81 |
82 | # Initialize weights and apply final processing
83 | self.post_init()
84 |
85 | self.model_parallel = False
86 |
87 | def forward(
88 | self,
89 | input_ids: torch.LongTensor = None,
90 | attention_mask: Optional[torch.Tensor] = None,
91 | decoder_input_ids: Optional[torch.LongTensor] = None,
92 | decoder_attention_mask: Optional[torch.LongTensor] = None,
93 | head_mask: Optional[torch.Tensor] = None,
94 | decoder_head_mask: Optional[torch.Tensor] = None,
95 | cross_attn_head_mask: Optional[torch.Tensor] = None,
96 | encoder_outputs: Optional[List[torch.FloatTensor]] = None,
97 | inputs_embeds: Optional[torch.FloatTensor] = None,
98 | decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
99 | labels: Optional[torch.LongTensor] = None,
100 | use_cache: Optional[bool] = None,
101 | output_attentions: Optional[bool] = None,
102 | output_hidden_states: Optional[bool] = None,
103 | return_dict: Optional[bool] = None,
104 | ) -> Union[Tuple, Seq2SeqSequenceClassifierOutput]:
105 | r"""
106 | labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
107 | Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
108 | config.num_labels - 1]`. If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
109 | Returns:
110 | """
111 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
112 | if labels is not None:
113 | use_cache = False
114 |
115 | if input_ids is None and inputs_embeds is not None:
116 | raise NotImplementedError(
117 | f"Passing input embeddings is currently not supported for {self.__class__.__name__}"
118 | )
119 |
120 |         # Copied from models.bart.modeling_bart.BartModel.forward. Unlike other models, T5 automatically creates
121 | # decoder_input_ids from input_ids if no decoder_input_ids are provided
122 | if decoder_input_ids is None and decoder_inputs_embeds is None:
123 | if input_ids is None:
124 | raise ValueError(
125 | "If no `decoder_input_ids` or `decoder_inputs_embeds` are "
126 | "passed, `input_ids` cannot be `None`. Please pass either "
127 | "`input_ids` or `decoder_input_ids` or `decoder_inputs_embeds`."
128 | )
129 | decoder_input_ids = self._shift_right(input_ids)
130 |
131 | outputs = self.transformer(
132 | input_ids,
133 | attention_mask=attention_mask,
134 | decoder_input_ids=decoder_input_ids,
135 | decoder_attention_mask=decoder_attention_mask,
136 | head_mask=head_mask,
137 | decoder_head_mask=decoder_head_mask,
138 | cross_attn_head_mask=cross_attn_head_mask,
139 | encoder_outputs=encoder_outputs,
140 | inputs_embeds=inputs_embeds,
141 | decoder_inputs_embeds=decoder_inputs_embeds,
142 | use_cache=use_cache,
143 | output_attentions=output_attentions,
144 | output_hidden_states=output_hidden_states,
145 | return_dict=return_dict,
146 | )
147 | sequence_output = outputs[0]
148 |
149 | eos_mask = decoder_input_ids.eq(self.config.eos_token_id).to(sequence_output.device)
150 |
151 | if len(torch.unique_consecutive(eos_mask.sum(1))) > 1:
152 |             raise ValueError("All examples must have the same number of <eos> tokens.")
153 | batch_size, _, hidden_size = sequence_output.shape
154 | sentence_representation = sequence_output[eos_mask, :].view(batch_size, -1, hidden_size)[:, -1, :]
155 |
156 | logits = self.classification_head(sentence_representation)
157 |
158 | loss = None
159 | if labels is not None:
160 | labels = labels.to(logits.device)
161 | if self.config.problem_type is None:
162 | if self.config.num_labels == 1:
163 | self.config.problem_type = "regression"
164 | elif self.config.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
165 | self.config.problem_type = "single_label_classification"
166 | else:
167 | self.config.problem_type = "multi_label_classification"
168 |
169 | if self.config.problem_type == "regression":
170 | loss_fct = MSELoss()
171 | if self.config.num_labels == 1:
172 | loss = loss_fct(logits.squeeze(), labels.squeeze())
173 | else:
174 | loss = loss_fct(logits, labels)
175 | elif self.config.problem_type == "single_label_classification":
176 | if self.config.loss_type == "cross_entropy":
177 | loss_fct = CrossEntropyLoss()
178 | elif self.config.loss_type == "focal_loss":
179 | loss_fct = FocalLoss(alpha=self.config.focal_loss_alpha, gamma=self.config.focal_loss_gamma)
180 | loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
181 | elif self.config.problem_type == "multi_label_classification":
182 | loss_fct = BCEWithLogitsLoss()
183 | loss = loss_fct(logits, labels)
184 | if not return_dict:
185 | output = (logits,) + outputs[1:]
186 | return ((loss,) + output) if loss is not None else output
187 |
188 | return Seq2SeqSequenceClassifierOutput(
189 | loss=loss,
190 | logits=logits,
191 | past_key_values=outputs.past_key_values,
192 | decoder_hidden_states=outputs.decoder_hidden_states,
193 | decoder_attentions=outputs.decoder_attentions,
194 | cross_attentions=outputs.cross_attentions,
195 | encoder_last_hidden_state=outputs.encoder_last_hidden_state,
196 | encoder_hidden_states=outputs.encoder_hidden_states,
197 | encoder_attentions=outputs.encoder_attentions,
198 | )
--------------------------------------------------------------------------------
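`T5ForZeroShotClassification` pairs the encoder-decoder backbone with the loss-aware config defined above. A minimal sketch, mirroring `tests/test_models.py` (the checkpoint name and hypothesis text are illustrative; building the model straight from the config yields random weights):

```python
import torch
from transformers import AutoTokenizer
from liqfit.models import T5ForZeroShotClassification, T5ConfigWithLoss

tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-small')

# "focal_loss" is the default; "cross_entropy" is the other supported option.
config = T5ConfigWithLoss(loss_type='focal_loss',
                          focal_loss_alpha=0.5,
                          focal_loss_gamma=2.0)
model = T5ForZeroShotClassification(config)

premise = "one day I will see the world"
hypothesis = "This example is travel."

input_ids = tokenizer(premise, return_tensors='pt')['input_ids']
# The label/hypothesis is fed through the decoder.
decoder_input_ids = tokenizer(hypothesis, return_tensors='pt')['input_ids']

with torch.no_grad():
    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
print(outputs.logits.shape)  # (1, num_labels)
```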
/src/liqfit/pipeline/__init__.py:
--------------------------------------------------------------------------------
1 | from .inference import ZeroShotClassificationPipeline
2 |
--------------------------------------------------------------------------------
/src/liqfit/pipeline/inference.py:
--------------------------------------------------------------------------------
1 | # Copyright 2020 The HuggingFace Team and Knowledgator. All rights reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | from transformers.tokenization_utils import TruncationStrategy
16 | from transformers.utils import add_end_docstrings, logging
17 | from transformers.pipelines.base import PIPELINE_INIT_ARGS, ArgumentHandler, ChunkPipeline
18 |
19 | import inspect
20 | from typing import List, Union
21 |
22 | import numpy as np
23 |
24 |
25 | logger = logging.get_logger(__name__)
26 |
27 | class ZeroShotClassificationArgumentHandler(ArgumentHandler):
28 | """
29 | Handles arguments for zero-shot for text classification by turning each possible label into an NLI
30 | premise/hypothesis pair.
31 | """
32 |
33 | def _parse_labels(self, labels):
34 | if isinstance(labels, str):
35 | labels = [label.strip() for label in labels.split(",") if label.strip()]
36 | return labels
37 |
38 | def __call__(self, sequences, labels, hypothesis_template, hypothesis_first):
39 | if len(labels) == 0 or len(sequences) == 0:
40 | raise ValueError("You must include at least one label and at least one sequence.")
41 | if hypothesis_template.format(labels[0]) == hypothesis_template:
42 | raise ValueError(
43 | (
44 | 'The provided hypothesis_template "{}" was not able to be formatted with the target labels. '
45 | "Make sure the passed template includes formatting syntax such as {{}} where the label should go."
46 | ).format(hypothesis_template)
47 | )
48 |
49 | if isinstance(sequences, str):
50 | sequences = [sequences]
51 |
52 | sequence_pairs = []
53 | if not hypothesis_first:
54 | for sequence in sequences:
55 | sequence_pairs.extend([[sequence, hypothesis_template.format(label)] for label in labels])
56 | else:
57 | for sequence in sequences:
58 | sequence_pairs.extend([[hypothesis_template.format(label), sequence] for label in labels])
59 | return sequence_pairs, sequences
60 |
61 |
62 | @add_end_docstrings(PIPELINE_INIT_ARGS)
63 | class ZeroShotClassificationPipeline(ChunkPipeline):
64 | """
65 | NLI-based zero-shot classification pipeline using a `ModelForSequenceClassification` trained on NLI (natural
66 | language inference) tasks. Equivalent of `text-classification` pipelines, but these models don't require a
67 |     hardcoded number of potential classes; they can be chosen at runtime. This usually makes it slower, but
68 |     **much** more flexible.
69 |
70 | Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis
71 | pair and passed to the pretrained model. Then, the logit for *entailment* is taken as the logit for the candidate
72 | label being valid. Any NLI model can be used, but the id of the *entailment* label must be included in the model
73 |     config's `label2id` mapping (`transformers.PretrainedConfig.label2id`).
74 |
75 | Example:
76 |
77 | ```python
78 | >>> from transformers import pipeline
79 |
80 | >>> oracle = pipeline(model="facebook/bart-large-mnli")
81 | >>> oracle(
82 | ... "I have a problem with my iphone that needs to be resolved asap!!",
83 | ... candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
84 | ... )
85 | {'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]}
86 |
87 | >>> oracle(
88 | ... "I have a problem with my iphone that needs to be resolved asap!!",
89 | ... candidate_labels=["english", "german"],
90 | ... )
91 | {'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['english', 'german'], 'scores': [0.814, 0.186]}
92 | ```
93 |
94 | Learn more about the basics of using a pipeline in the [pipeline tutorial](../pipeline_tutorial)
95 |
96 | This NLI pipeline can currently be loaded from [`pipeline`] using the following task identifier:
97 | `"zero-shot-classification"`.
98 |
99 | The models that this pipeline can use are models that have been fine-tuned on an NLI task. See the up-to-date list
100 | of available models on [huggingface.co/models](https://huggingface.co/models?search=nli).
101 | """
102 |
103 | def __init__(self, args_parser=ZeroShotClassificationArgumentHandler(), *args, **kwargs):
104 | self._args_parser = args_parser
105 | super().__init__(*args, **kwargs)
106 | if self.entailment_id == -1:
107 | logger.warning(
108 | "Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to "
109 | "-1. Define a descriptive label2id mapping in the model config to ensure correct outputs."
110 | )
111 |
112 | @property
113 | def entailment_id(self):
114 | if len(self.model.config.label2id.items()) == 0:
115 | return 0
116 | for label, ind in self.model.config.label2id.items():
117 | if label.lower().startswith("entail"):
118 | return ind
119 | return -1
120 |
121 | def _parse_and_tokenize(
122 | self, sequence_pairs, padding=True, add_special_tokens=True, truncation=TruncationStrategy.ONLY_FIRST,
123 | encoder_decoder = False, **kwargs
124 | ):
125 | """
126 |         Parse arguments and tokenize with `only_first` truncation so that the hypothesis (label) is not truncated.
127 | """
128 | return_tensors = self.framework
129 | if self.tokenizer.pad_token is None:
130 | # Override for tokenizers not supporting padding
131 | logger.error(
132 |                 "Tokenizer does not support padding, which is necessary for zero-shot classification; "
133 |                 "falling back to `pad_token=eos_token`"
134 | )
135 | self.tokenizer.pad_token = self.tokenizer.eos_token
136 | try:
137 | if encoder_decoder:
138 | sequence_pairs, decoder_input = sequence_pairs
139 |
140 | inputs = self.tokenizer(
141 | [sequence_pairs],
142 | add_special_tokens=add_special_tokens,
143 | return_tensors=return_tensors,
144 | padding=padding,
145 | truncation=truncation,
146 | )
147 | if encoder_decoder:
148 | decoder_inputs = self.tokenizer(
149 | [decoder_input],
150 | add_special_tokens=add_special_tokens,
151 | return_tensors=return_tensors,
152 | padding=padding,
153 | truncation=truncation,
154 | )
155 | inputs['decoder_input_ids'] = decoder_inputs['input_ids']
156 | inputs['decoder_attention_mask'] = decoder_inputs['attention_mask']
157 |
158 | except Exception as e:
159 | if "too short" in str(e):
160 | # tokenizers might yell that we want to truncate
161 | # to a value that is not even reached by the input.
162 | # In that case we don't want to truncate.
163 | # It seems there's not a really better way to catch that
164 | # exception.
165 |
166 | inputs = self.tokenizer(
167 | [sequence_pairs],
168 | add_special_tokens=add_special_tokens,
169 | return_tensors=return_tensors,
170 | padding=padding,
171 | truncation=TruncationStrategy.DO_NOT_TRUNCATE,
172 | )
173 | if encoder_decoder:
174 | decoder_inputs = self.tokenizer(
175 | [decoder_input],
176 | add_special_tokens=add_special_tokens,
177 | return_tensors=return_tensors,
178 | padding=padding,
179 | truncation=TruncationStrategy.DO_NOT_TRUNCATE,
180 | )
181 | inputs['decoder_input_ids'] = decoder_inputs['input_ids']
182 | inputs['decoder_attention_mask'] = decoder_inputs['attention_mask']
183 | else:
184 | raise e
185 |
186 | return inputs
187 |
188 | def _sanitize_parameters(self, **kwargs):
189 | if kwargs.get("multi_class", None) is not None:
190 | kwargs["multi_label"] = kwargs["multi_class"]
191 | logger.warning(
192 | "The `multi_class` argument has been deprecated and renamed to `multi_label`. "
193 | "`multi_class` will be removed in a future version of Transformers."
194 | )
195 | preprocess_params = {}
196 | if "candidate_labels" in kwargs:
197 | preprocess_params["candidate_labels"] = self._args_parser._parse_labels(kwargs["candidate_labels"])
198 | if "hypothesis_template" in kwargs:
199 | preprocess_params["hypothesis_template"] = kwargs["hypothesis_template"]
200 | if "hypothesis_first" in kwargs:
201 | preprocess_params["hypothesis_first"] = kwargs["hypothesis_first"]
202 | if "encoder_decoder" in kwargs:
203 | preprocess_params["encoder_decoder"] = kwargs["encoder_decoder"]
204 |
205 | postprocess_params = {}
206 | if "multi_label" in kwargs:
207 | postprocess_params["multi_label"] = kwargs["multi_label"]
208 | return preprocess_params, {}, postprocess_params
209 |
210 | def __call__(
211 | self,
212 | sequences: Union[str, List[str]],
213 | *args,
214 | **kwargs,
215 | ):
216 | """
217 | Classify the sequence(s) given as inputs. See the [`ZeroShotClassificationPipeline`] documentation for more
218 | information.
219 |
220 | Args:
221 | sequences (`str` or `List[str]`):
222 | The sequence(s) to classify, will be truncated if the model input is too large.
223 | candidate_labels (`str` or `List[str]`):
224 | The set of possible class labels to classify each sequence into. Can be a single label, a string of
225 | comma-separated labels, or a list of labels.
226 | hypothesis_template (`str`, *optional*, defaults to `"This example is {}."`):
227 | The template used to turn each label into an NLI-style hypothesis. This template must include a {} or
228 | similar syntax for the candidate label to be inserted into the template. For example, the default
229 | template is `"This example is {}."` With the candidate label `"sports"`, this would be fed into the
230 |                 model like `"<cls> sequence to classify <sep> This example is sports . <sep>"`. The default template
231 | works well in many cases, but it may be worthwhile to experiment with different templates depending on
232 | the task setting.
233 | multi_label (`bool`, *optional*, defaults to `False`):
234 | Whether or not multiple candidate labels can be true. If `False`, the scores are normalized such that
235 | the sum of the label likelihoods for each sequence is 1. If `True`, the labels are considered
236 | independent and probabilities are normalized for each candidate by doing a softmax of the entailment
237 | score vs. the contradiction score.
238 |
239 | Return:
240 | A `dict` or a list of `dict`: Each result comes as a dictionary with the following keys:
241 |
242 | - **sequence** (`str`) -- The sequence for which this is the output.
243 | - **labels** (`List[str]`) -- The labels sorted by order of likelihood.
244 | - **scores** (`List[float]`) -- The probabilities for each of the labels.
245 | """
246 | if len(args) == 0:
247 | pass
248 | elif len(args) == 1 and "candidate_labels" not in kwargs:
249 | kwargs["candidate_labels"] = args[0]
250 | else:
251 | raise ValueError(f"Unable to understand extra arguments {args}")
252 |
253 | return super().__call__(sequences, **kwargs)
254 |
255 | def preprocess(self, inputs, candidate_labels=None, hypothesis_template="This example is {}.", hypothesis_first = False, encoder_decoder = False):
256 | sequence_pairs, sequences = self._args_parser(inputs, candidate_labels, hypothesis_template, hypothesis_first)
257 |
258 | for i, (candidate_label, sequence_pair) in enumerate(zip(candidate_labels, sequence_pairs)):
259 | model_input = self._parse_and_tokenize(sequence_pair, encoder_decoder = encoder_decoder)
260 |
261 | yield {
262 | "candidate_label": candidate_label,
263 | "sequence": sequences[0],
264 | "is_last": i == len(candidate_labels) - 1,
265 | **model_input,
266 | }
267 |
268 | def _forward(self, inputs):
269 | candidate_label = inputs["candidate_label"]
270 | sequence = inputs["sequence"]
271 |         # copy instead of extending in place to avoid mutating the tokenizer's `model_input_names`
272 |         input_names = self.tokenizer.model_input_names + ['decoder_input_ids', 'decoder_attention_mask']
273 | model_inputs = {k: inputs[k] for k in input_names if k in inputs}
274 | # `XXXForSequenceClassification` models should not use `use_cache=True` even if it's supported
275 | model_forward = self.model.forward if self.framework == "pt" else self.model.call
276 | if "use_cache" in inspect.signature(model_forward).parameters.keys():
277 | model_inputs["use_cache"] = False
278 | outputs = self.model(**model_inputs)
279 |
280 | model_outputs = {
281 | "candidate_label": candidate_label,
282 | "sequence": sequence,
283 | "is_last": inputs["is_last"],
284 | **outputs,
285 | }
286 | return model_outputs
287 |
288 | def postprocess(self, model_outputs, multi_label=False):
289 | candidate_labels = [outputs["candidate_label"] for outputs in model_outputs]
290 | sequences = [outputs["sequence"] for outputs in model_outputs]
291 | logits = np.concatenate([output["logits"].numpy() for output in model_outputs])
292 | N = logits.shape[0]
293 | n = len(candidate_labels)
294 | num_sequences = N // n
295 | reshaped_outputs = logits.reshape((num_sequences, n, -1))
296 |
297 |         if multi_label and len(self.model.config.label2id) == 0:
298 |             scores = 1 / (1 + np.exp(-reshaped_outputs[..., -1]))  # sigmoid over the single relevance logit
299 |
300 | elif multi_label or len(candidate_labels) == 1:
301 | # softmax over the entailment vs. contradiction dim for each label independently
302 | entailment_id = self.entailment_id
303 | contradiction_id = -1 if entailment_id == 0 else 0
304 | entail_contr_logits = reshaped_outputs[..., [contradiction_id, entailment_id]]
305 | scores = np.exp(entail_contr_logits) / np.exp(entail_contr_logits).sum(-1, keepdims=True)
306 | scores = scores[..., 1]
307 |
308 | else:
309 | # softmax the "entailment" logits over all candidate labels
310 | entail_logits = reshaped_outputs[..., self.entailment_id]
311 | scores = np.exp(entail_logits) / np.exp(entail_logits).sum(-1, keepdims=True)
312 |
313 | top_inds = list(reversed(scores[0].argsort()))
314 | return {
315 | "sequence": sequences[0],
316 | "labels": [candidate_labels[i] for i in top_inds],
317 | "scores": scores[0, top_inds].tolist(),
318 | }
--------------------------------------------------------------------------------
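A usage sketch of the pipeline defined above, mirroring `tests/test_pipeline.py`; the checkpoint is the one used in the tests, and `encoder_decoder=True` would additionally be passed for T5-style models:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from liqfit.pipeline import ZeroShotClassificationPipeline

model_path = 'knowledgator/comprehend_it-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

classifier = ZeroShotClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    hypothesis_template='This example is {}.',
    hypothesis_first=False,
)

result = classifier(
    "one day I will see the world",
    candidate_labels=['travel', 'cooking', 'dancing'],
    multi_label=True,
)
print(result['labels'])  # labels sorted by score
print(result['scores'])  # matching probabilities
```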
/src/liqfit/utils/__init__.py:
--------------------------------------------------------------------------------
1 | from .standardization import convert_to_numpy
2 | from .standardization import convert_to_torch
3 | from .transforms import tokenize_and_align_label
4 | from .transforms import transform
5 | from .metrics import Accuracy
6 |
--------------------------------------------------------------------------------
/src/liqfit/utils/metrics.py:
--------------------------------------------------------------------------------
1 | import evaluate
2 | import numpy as np
3 | from transformers import EvalPrediction
4 |
5 |
6 | class Accuracy:
7 | def __init__(self):
8 | """Simple wrapper class around `evaluate.load("accuracy")`.
9 | """
10 | self.accuracy = evaluate.load("accuracy")
11 |
12 | def __call__(self, eval_pred: EvalPrediction):
13 | predictions, labels = eval_pred
14 | predictions = np.argmax(predictions, axis=1)
15 | return self.accuracy.compute(
16 | predictions=predictions, references=labels
17 | )
18 |
--------------------------------------------------------------------------------
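`Accuracy` is a thin callable intended to be passed as `compute_metrics` to a `transformers.Trainer`. Below it is called directly on a toy `EvalPrediction`; the numbers are made up for illustration only.

```python
import numpy as np
from transformers import EvalPrediction
from liqfit.utils import Accuracy

metric = Accuracy()

# Toy logits for 4 examples over 3 classes, plus their gold labels.
predictions = np.array([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.3, 0.2, 0.9],
                        [1.2, 0.4, 0.1]])
labels = np.array([0, 1, 2, 1])

print(metric(EvalPrediction(predictions=predictions, label_ids=labels)))
# {'accuracy': 0.75}
```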
/src/liqfit/utils/standardization.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 | from typing import List, Tuple
3 | import torch
4 | import numpy as np
5 |
6 |
7 | def convert_to_numpy(x: torch.Tensor | Tuple | List | np.ndarray) -> np.ndarray:
8 | """Converts torch.Tensor, Tuple, List or NumPy array to Numpy Array.
9 |
10 | Args:
11 | x (torch.Tensor | Tuple | List | np.ndarray): Input to convert to
12 | NumPy array.
13 |
14 | Returns:
15 | np.ndarray: Converted NumPy array.
16 | """
17 |     if isinstance(x, torch.Tensor):
18 | return x.detach().cpu().numpy()
19 | else:
20 | return np.array(x)
21 |
22 |
23 | def convert_to_torch(x: torch.Tensor | Tuple | List | np.ndarray) -> torch.Tensor:
24 | """Converts input to torch.Tensor
25 |
26 | Args:
27 |         x (torch.Tensor | Tuple | List | np.ndarray): Input to convert to torch.Tensor.
28 |
29 | Raises:
30 | ValueError: If the input is not a type of `torch.Tensor`,
31 | `Tuple`, `List`, `np.ndarray`
32 |
33 | Returns:
34 | torch.Tensor: Converted torch.Tensor.
35 | """
36 | if isinstance(x, (list, tuple)):
37 | return torch.tensor(x)
38 | elif isinstance(x, np.ndarray):
39 | return torch.from_numpy(x)
40 | elif isinstance(x, torch.Tensor):
41 | return x
42 | else:
43 | raise ValueError(
44 |             "Expected `torch.Tensor`, `List`, `Tuple` or `np.ndarray`. "
45 | f"Received: {type(x)}."
46 | )
47 |
--------------------------------------------------------------------------------
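The converters above are small symmetric helpers for moving data between frameworks. A minimal sketch of what they accept and return:

```python
import numpy as np
import torch
from liqfit.utils import convert_to_numpy, convert_to_torch

logits = torch.tensor([[0.2, 1.3, -0.5]])

as_numpy = convert_to_numpy(logits)          # detached, moved to CPU, converted
back_to_torch = convert_to_torch(as_numpy)   # zero-copy via torch.from_numpy

print(type(as_numpy).__name__, type(back_to_torch).__name__)
# ndarray Tensor

# Plain Python containers are also accepted.
print(convert_to_torch([1, 2, 3]))
print(convert_to_numpy((4, 5, 6)))
```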
/src/liqfit/utils/transforms.py:
--------------------------------------------------------------------------------
1 | from typing import Callable, Dict
2 | from datasets import Dataset
3 | from ..datasets import transform_dataset
4 |
5 |
6 | def tokenize_and_align_label(
7 | example: Dict,
8 | tokenizer: Callable,
9 | sources_column_name: str = "sources",
10 | targets_column_name: str = "targets",
11 | ):
12 |     """Tokenizes source and target sequences and concatenates them for the NLI training task.
13 |
14 | Args:
15 | example (Dict): Dictionary that contains the sources and target sequences.
16 |         tokenizer (Callable): Tokenizer function. If you are using a Hugging Face
17 |             tokenizer, you can wrap it with your configuration using
18 | `functools.partial`. Example:
19 | tokenizer_wrapped_function = \
20 | functools.partial(tokenizer.batch_encode_plus, padding=True,
21 | truncation=True, max_length=512) then pass
22 | `tokenizer_wrapped_function` to this function.
23 | sources_column_name (str, optional): Sources key name in the
24 | dictionary. Defaults to "sources".
25 | targets_column_name (str, optional): Targets key name in the
26 | dictionary. Defaults to "targets".
27 |
28 | Returns:
29 | torch.Tensor: A tensor of your tokenized input.
30 | """
31 | hypothesis = example[targets_column_name]
32 | seq = example[sources_column_name]
33 | tokenized_input = tokenizer([seq, hypothesis])
34 | return tokenized_input
35 |
36 |
37 | def transform(
38 | dataset: Dataset,
39 | classes: list,
40 | template: str,
41 | normalize_negatives: bool,
42 | positives: int,
43 | negatives: int,
44 | ):
45 |     """Transforms the dataset for the NLI training task.
46 |
47 | Args:
48 |         dataset (Dataset): Hugging Face Dataset instance.
49 | classes (List[str]): List of possible class labels.
50 | template (str, optional): Template string for generating examples.
51 |         normalize_negatives (bool, optional): Whether to normalize the number of
52 |             negative examples per positive example of a class.
53 | positives (int, optional): Number of positive examples to generate per source.
54 | negatives (int, optional): Number of negative examples to generate per source.
55 |
56 | Raises:
57 | ValueError: If there is no "{}" in the template. It should exist in
58 | order to format the template with the labels.
59 |
60 | Returns:
61 | Dataset: Transformed dataset.
62 | """
63 | if "{}" not in template:
64 | raise ValueError(
65 | "Cannot apply `.format()` function on the template. "
66 | 'Expected template to have "{}". '
67 | f"Received: {template}."
68 | )
69 |
70 | transformed_dataset = transform_dataset(
71 | dataset, classes, template, normalize_negatives, positives, negatives
72 | )
73 | tokenized_dataset = transformed_dataset.map(tokenize_and_align_label)
74 | return tokenized_dataset
75 |
--------------------------------------------------------------------------------
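A sketch of the `functools.partial` wrapping described in the `tokenize_and_align_label` docstring; the checkpoint name and the example dictionary are illustrative only.

```python
import functools
from transformers import AutoTokenizer
from liqfit.utils import tokenize_and_align_label

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-xsmall')

# Bind the tokenization settings once, then reuse the wrapped function
# (for example inside `datasets.Dataset.map`).
wrapped_tokenizer = functools.partial(
    tokenizer, padding=True, truncation=True, max_length=512
)

example = {
    "sources": "one day I will see the world",
    "targets": "This example is travel.",
}
encoded = tokenize_and_align_label(example, wrapped_tokenizer)
print(list(encoded.keys()))  # e.g. ['input_ids', 'token_type_ids', 'attention_mask']
```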
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Knowledgator/LiqFit/51ba2714813ae1cf110f7e600cd7f2663cdec39c/tests/__init__.py
--------------------------------------------------------------------------------
/tests/test_losses.py:
--------------------------------------------------------------------------------
1 | import unittest
2 |
3 | import torch
4 | from kornia.losses import focal_loss
5 | from liqfit.losses import focal_loss_with_mask
6 |
7 |
8 | class TestCorrectness(unittest.TestCase):
9 | def test_focal_loss_with_ignore_index(self):
10 | x = torch.tensor(
11 | [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]],
12 | dtype=torch.float32,
13 | )
14 | y = torch.tensor([[1, 2, 3]], dtype=torch.int64)
15 | y[:, -1] = -100
16 | loss = round(
17 | focal_loss_with_mask(
18 | x.reshape(-1, x.shape[-1]), y.reshape(-1)
19 | ).item(),
20 | 4,
21 | )
22 | output = 0.1795
23 | self.assertEqual(loss, output)
24 |
25 | def test_modified_loss_with_kornia_impl(self):
26 | x = torch.tensor(
27 | [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]],
28 | dtype=torch.float32,
29 | )
30 | y = torch.tensor([[1, 2, 3]], dtype=torch.int64)
31 | modified_loss = round(
32 | focal_loss_with_mask(
33 | x.reshape(-1, x.shape[-1]), y.reshape(-1), alpha=0.5
34 | ).item(),
35 | 4,
36 | )
37 | kornia_loss = round(
38 | focal_loss(
39 | x.reshape(-1, x.shape[-1]),
40 | y.reshape(-1),
41 | alpha=0.5,
42 | reduction="mean",
43 | ).item(),
44 | 4,
45 | )
46 | self.assertEqual(modified_loss, kornia_loss)
47 |
--------------------------------------------------------------------------------
/tests/test_models.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
3 | from liqfit.models import T5ForZeroShotClassification, T5ConfigWithLoss, DebertaV2ForZeroShotClassification, DebertaConfigWithLoss
4 | from liqfit.modeling import LiqFitModel, ClassificationHead
5 | from liqfit.modeling.pooling import FirstTokenPooling1D
6 | from liqfit.losses import CrossEntropyLoss
7 |
8 | def test_t5():
9 | device = "cuda" if torch.cuda.is_available() else "cpu"
10 |
11 | text = "one day I will see the world"
12 | label = "travel"
13 |
14 | tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-small')
15 |
16 | input_ids = tokenizer(text, return_tensors='pt')['input_ids']
17 | decoder_input_ids = tokenizer(label, return_tensors='pt')['input_ids']
18 |
19 | config = T5ConfigWithLoss()
20 | model = T5ForZeroShotClassification(config).to(device)
21 | outputs = model(input_ids = input_ids, decoder_input_ids = decoder_input_ids)
22 |
23 | def test_deberta():
24 | device = "cuda" if torch.cuda.is_available() else "cpu"
25 |
26 | text = "one day I will see the world. This example is travel."
27 |
28 | tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small')
29 |
30 | input_ids = tokenizer(text, return_tensors='pt')['input_ids']
31 |
32 | config = DebertaConfigWithLoss()
33 | model = DebertaV2ForZeroShotClassification(config).to(device)
34 | outputs = model(input_ids = input_ids)
35 |
36 | def test_liqfit_model_with_automodel_for_sequence_classification():
37 | device = "cuda" if torch.cuda.is_available() else "cpu"
38 |
39 | text = "one day I will see the world. This example is travel."
40 |
41 | tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small')
42 |
43 | input_ids = tokenizer(text, return_tensors='pt')['input_ids']
44 | labels = torch.tensor([1])
45 |
46 | backbone_model = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-xsmall')
47 |
48 | loss_func = CrossEntropyLoss(multi_target=True)
49 |
50 | model = LiqFitModel(backbone_model.config, backbone_model, loss_func=loss_func)
51 | outputs = model(input_ids = input_ids, labels=labels)
52 |
53 | def test_liqfit_model_with_head():
54 | device = "cuda" if torch.cuda.is_available() else "cpu"
55 |
56 | text = "one day I will see the world. This example is travel."
57 |
58 | tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small')
59 |
60 | input_ids = tokenizer(text, return_tensors='pt')['input_ids']
61 | labels = torch.tensor([1])
62 |
63 | backbone_model = AutoModel.from_pretrained('microsoft/deberta-v3-xsmall')
64 |
65 | pooler = FirstTokenPooling1D()
66 | loss_func = CrossEntropyLoss(multi_target=True)
67 | head = ClassificationHead(backbone_model.config.hidden_size, 3, pooler, loss_func)
68 |
69 | model = LiqFitModel(backbone_model.config, backbone_model, head)
70 | outputs = model(input_ids = input_ids, labels=labels)
71 |
--------------------------------------------------------------------------------
/tests/test_pipeline.py:
--------------------------------------------------------------------------------
1 | from transformers import AutoTokenizer, AutoModelForSequenceClassification
2 |
3 | from liqfit.pipeline import ZeroShotClassificationPipeline
4 |
5 |
6 | class TestStandardModelPipeline:
7 | sequence_to_classify = "one day I will see the world"
8 | candidate_labels = ['travel', 'cooking', 'dancing']
9 | template = 'This example is {}.'
10 | model_path = 'knowledgator/comprehend_it-base'
11 | tokenizer = AutoTokenizer.from_pretrained(model_path)
12 | model = AutoModelForSequenceClassification.from_pretrained(model_path)
13 |
14 | def test_standard_pipeline(self):
15 | classifier = ZeroShotClassificationPipeline(model=self.model,
16 | tokenizer=self.tokenizer,
17 | hypothesis_template = self.template,
18 | hypothesis_first = False)
19 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True)
20 |
21 |
22 | def test_hypothesis_first_pipeline(self):
23 | classifier = ZeroShotClassificationPipeline(model=self.model,
24 | tokenizer=self.tokenizer,
25 | hypothesis_template = self.template,
26 | hypothesis_first = True)
27 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True)
28 |
29 |
30 |
31 | class TestBinaryModelPipeline:
32 | sequence_to_classify = "one day I will see the world"
33 | candidate_labels = ['travel', 'cooking', 'dancing']
34 | template = 'This example is {}.'
35 | model_path = 'BAAI/bge-reranker-base'
36 | tokenizer = AutoTokenizer.from_pretrained(model_path)
37 | model = AutoModelForSequenceClassification.from_pretrained(model_path)
38 |
39 | def test_standard_pipeline(self):
40 | classifier = ZeroShotClassificationPipeline(model=self.model,
41 | tokenizer=self.tokenizer,
42 | hypothesis_template = self.template,
43 | hypothesis_first = False)
44 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True)
45 |
46 |
47 | def test_hypothesis_first_pipeline(self):
48 | classifier = ZeroShotClassificationPipeline(model=self.model,
49 | tokenizer=self.tokenizer,
50 | hypothesis_template = self.template,
51 | hypothesis_first = True)
52 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True)
53 |
54 | class TestEncoderDecoderModelPipeline:
55 | sequence_to_classify = "one day I will see the world"
56 | candidate_labels = ['travel', 'cooking', 'dancing']
57 | template = 'This example is {}.'
58 | model_path = 'knowledgator/mt5-comprehend-it-base'
59 | tokenizer = AutoTokenizer.from_pretrained(model_path)
60 | model = AutoModelForSequenceClassification.from_pretrained(model_path)
61 |
62 | def test_standard_pipeline(self):
63 | classifier = ZeroShotClassificationPipeline(model=self.model,
64 | tokenizer=self.tokenizer,
65 | hypothesis_template = self.template,
66 | hypothesis_first = False)
67 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True)
68 |
69 |
70 | def test_hypothesis_first_pipeline(self):
71 | classifier = ZeroShotClassificationPipeline(model=self.model,
72 | tokenizer=self.tokenizer,
73 | hypothesis_template = self.template,
74 | hypothesis_first = True)
75 | results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True)
76 |
77 |
78 | def test_encoder_decoder_pipeline(self):
79 | classifier = ZeroShotClassificationPipeline(model=self.model,
80 | tokenizer=self.tokenizer,
81 | hypothesis_template = self.template,
82 | hypothesis_first = True)
83 |         results = classifier(self.sequence_to_classify, self.candidate_labels, multi_label=True, encoder_decoder=True)
--------------------------------------------------------------------------------