130 |
131 | These commands assume that your Python version is 3.6+ and that the Python 3
132 | version of `pip` is available as `pip3`.
133 | It may be available as `pip` depending on how your system is configured.
134 |
135 | ```bash
136 | # [OPTIONAL] Activate a virtual environment
137 | pip3 install --upgrade virtualenv
138 | virtualenv -p python3 .envspam
139 | source .envspam/bin/activate
140 |
141 | # Install requirements (both shared and tutorial-specific)
142 | pip3 install -r requirements.txt
143 | pip3 install -r spam/requirements.txt
144 |
145 | # Launch the Jupyter notebook interface (making sure the right virtual environment is used)
146 | .envspam/bin/jupyter notebook spam
147 | ```
148 |
149 |
150 |
155 |
156 | These commands assume that your conda installation is Python 3.6+.
157 |
158 | ```bash
159 | # [OPTIONAL] Activate a virtual environment
160 | conda create --yes -n spam python=3.6
161 | conda activate spam
162 |
163 | # Install requirements (both shared and tutorial-specific)
164 | pip install environment_kernels
165 | # We specify PyTorch here to ensure compatibility, but it may not be necessary.
166 | conda install pytorch==1.1.0 -c pytorch
167 | conda install snorkel==0.9.5 -c conda-forge
168 | pip install -r spam/requirements.txt
169 |
170 | # Launch the Jupyter notebook interface
171 | jupyter notebook spam
172 | ```
173 |
174 | Make sure to select the right kernel (`conda_spam`) when running the Jupyter notebook.
175 |
176 |
177 |
8 | #
9 | # We want to classify each __candidate__, i.e., a pair of people mentioned together in a sentence, as being married at some point or not.
10 | #
11 | # In the above example, our candidate represents the possible relation `(Barack Obama, Michelle Obama)`. As readers, we know this relation is true, both from external knowledge and from the keyword `wedding` occurring later in the sentence.
12 | # We begin with some basic setup and data downloading.
13 | #
14 | # %% {"tags": ["md-exclude"]}
15 | # %matplotlib inline
16 |
17 | import os
18 | import pandas as pd
19 | import pickle
20 |
21 | if os.path.basename(os.getcwd()) == "snorkel-tutorials":
22 | os.chdir("spouse")
23 |
24 | # %%
25 | from utils import load_data
26 |
27 | ((df_dev, Y_dev), df_train, (df_test, Y_test)) = load_data()
28 |
29 | # %% [markdown]
30 | # **Input Data:** `df_dev`, `df_train`, and `df_test` are Pandas `DataFrame` objects, where each row represents a particular __candidate__. For our problem, a candidate consists of a sentence and two people mentioned in that sentence. The DataFrames contain the fields `sentence`, the candidate's sentence; `tokens`, the tokenized form of the sentence; and `person1_word_idx` and `person2_word_idx`, the `[start, end]` indices in the tokens at which the first and second person's names appear, respectively.
31 | #
32 | # We also have certain **preprocessed fields**, which we discuss a few cells below.
33 |
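# %% [markdown]
# For example, under the schema above we can slice a person mention directly out of
# `tokens` (a quick sketch, assuming the `[start, end]` indices are inclusive):

# %%
cand = df_dev.iloc[0]
start, end = cand["person1_word_idx"]
print(cand["tokens"][start : end + 1])
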
34 | # %% {"tags": ["md-exclude"]}
35 |
36 | # Don't truncate text fields in the display
37 | pd.set_option("display.max_colwidth", 0)
38 |
39 | df_dev.head()
40 |
41 | # %% [markdown]
42 | # Let's look at a candidate in the development set:
43 |
44 | # %%
45 | from preprocessors import get_person_text
46 |
47 | candidate = df_dev.loc[2]
48 | person_names = get_person_text(candidate).person_names
49 |
50 | print("Sentence: ", candidate["sentence"])
51 | print("Person 1: ", person_names[0])
52 | print("Person 2: ", person_names[1])
53 |
54 | # %% [markdown]
55 | # ### Preprocessing the Data
56 | #
57 | # In a real application, there is a lot of data preparation, parsing, and database loading that needs to be completed before we generate candidates and dive into writing labeling functions. Here we've pre-generated candidates in a Pandas DataFrame object per split (train, dev, test).
58 |
59 | # %% [markdown]
60 | # ### Labeling Function Helpers
61 | #
62 | # When writing labeling functions, there are several functions you will use over and over again. For text relation extraction tasks like this one, common functions include those for fetching the text between the mentions of the two people in a candidate, examining word windows around person mentions, and so on. We will wrap these functions as `preprocessors`.
63 |
64 | # %%
65 | from snorkel.preprocess import preprocessor
66 |
67 |
68 | @preprocessor()
69 | def get_text_between(cand):
70 | """
71 | Extract the text between the two person mentions and store it on the candidate as text_between
72 | """
73 | start = cand.person1_word_idx[1] + 1
74 | end = cand.person2_word_idx[0]
75 | cand.text_between = " ".join(cand.tokens[start:end])
76 | return cand
77 |
78 |
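# %% [markdown]
# As a quick check (illustrative), preprocessors are callable on individual data
# points, just like `get_person_text` above, so we can run this one on our example
# candidate and inspect the field it adds:

# %%
print(get_text_between(candidate).text_between)
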
79 | # %% [markdown]
80 | # ### Candidate Preprocessors
81 | #
82 | # For the purposes of the tutorial, we have three fields (`between_tokens`, `person1_right_tokens`, `person2_right_tokens`) preprocessed in the data, which can be used when creating labeling functions. We also provide the following set of `preprocessor`s for this task in `preprocessors.py`, along with the fields these populate.
83 | # * `get_person_text(cand)`: `person_names`
84 | # * `get_person_last_names(cand)`: `person_lastnames`
85 | # * `get_left_tokens(cand)`: `person1_left_tokens`, `person2_left_tokens`
86 |
87 | # %%
88 | from preprocessors import get_left_tokens, get_person_last_names
89 |
90 | POSITIVE = 1
91 | NEGATIVE = 0
92 | ABSTAIN = -1
93 |
94 | # %%
95 | from snorkel.labeling import labeling_function
96 |
97 | # Check for spouse-related words appearing between the person mentions
98 | spouses = {"spouse", "wife", "husband", "ex-wife", "ex-husband"}
99 |
100 |
101 | @labeling_function(resources=dict(spouses=spouses))
102 | def lf_husband_wife(x, spouses):
103 | return POSITIVE if len(spouses.intersection(set(x.between_tokens))) > 0 else ABSTAIN
104 |
105 |
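# %% [markdown]
# Labeling functions are also callable on individual data points, so we can
# spot-check this one on the example candidate from earlier (a sketch; the output
# is 1 for POSITIVE or -1 for ABSTAIN):

# %%
print(lf_husband_wife(candidate))
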
106 | # %%
107 | # Check for spouse-related words appearing to the left of the person mentions
108 | @labeling_function(resources=dict(spouses=spouses), pre=[get_left_tokens])
109 | def lf_husband_wife_left_window(x, spouses):
110 | if len(set(spouses).intersection(set(x.person1_left_tokens))) > 0:
111 | return POSITIVE
112 | elif len(set(spouses).intersection(set(x.person2_left_tokens))) > 0:
113 | return POSITIVE
114 | else:
115 | return ABSTAIN
116 |
117 |
118 | # %%
119 | # Check for the person mentions having the same last name
120 | @labeling_function(pre=[get_person_last_names])
121 | def lf_same_last_name(x):
122 | p1_ln, p2_ln = x.person_lastnames
123 |
124 | if p1_ln and p2_ln and p1_ln == p2_ln:
125 | return POSITIVE
126 | return ABSTAIN
127 |
128 |
129 | # %%
130 | # Check for the word `married` between person mentions
131 | @labeling_function()
132 | def lf_married(x):
133 | return POSITIVE if "married" in x.between_tokens else ABSTAIN
134 |
135 |
136 | # %%
137 | # Check for words that refer to `family` relationships between and to the left of the person mentions
138 | family = {
139 | "father",
140 | "mother",
141 | "sister",
142 | "brother",
143 | "son",
144 | "daughter",
145 | "grandfather",
146 | "grandmother",
147 | "uncle",
148 | "aunt",
149 | "cousin",
150 | }
151 | family = family.union({f + "-in-law" for f in family})
152 |
153 |
154 | @labeling_function(resources=dict(family=family))
155 | def lf_familial_relationship(x, family):
156 | return NEGATIVE if len(family.intersection(set(x.between_tokens))) > 0 else ABSTAIN
157 |
158 |
159 | @labeling_function(resources=dict(family=family), pre=[get_left_tokens])
160 | def lf_family_left_window(x, family):
161 | if len(set(family).intersection(set(x.person1_left_tokens))) > 0:
162 | return NEGATIVE
163 | elif len(set(family).intersection(set(x.person2_left_tokens))) > 0:
164 | return NEGATIVE
165 | else:
166 | return ABSTAIN
167 |
168 |
169 | # %%
170 | # Check for `other` relationship words between person mentions
171 | other = {"boyfriend", "girlfriend", "boss", "employee", "secretary", "co-worker"}
172 |
173 |
174 | @labeling_function(resources=dict(other=other))
175 | def lf_other_relationship(x, other):
176 | return NEGATIVE if len(other.intersection(set(x.between_tokens))) > 0 else ABSTAIN
177 |
178 |
179 | # %% [markdown]
180 | # ### Distant Supervision Labeling Functions
181 | #
182 | # In addition to writing labeling functions that encode pattern-matching heuristics, we can also write labeling functions that _distantly supervise_ data points. Here, we'll load in a list of known spouse pairs and check to see if the pair of persons in a candidate matches one of these.
183 | #
184 | # [**DBpedia**](http://wiki.dbpedia.org/): Our database of known spouses comes from DBpedia, which is a community-driven resource similar to Wikipedia but for curating structured data. We'll use a preprocessed snapshot as our knowledge base for all labeling function development.
185 | #
186 | # We can look at some example entries from DBpedia and use them in a simple distant supervision labeling function.
187 | #
188 | # Make sure `dbpedia.pkl` is in the `spouse/data` directory.
189 |
190 | # %%
191 | with open("data/dbpedia.pkl", "rb") as f:
192 | known_spouses = pickle.load(f)
193 |
194 | list(known_spouses)[0:5]
195 |
196 |
197 | # %%
198 | @labeling_function(resources=dict(known_spouses=known_spouses), pre=[get_person_text])
199 | def lf_distant_supervision(x, known_spouses):
200 | p1, p2 = x.person_names
201 | if (p1, p2) in known_spouses or (p2, p1) in known_spouses:
202 | return POSITIVE
203 | else:
204 | return ABSTAIN
205 |
206 |
207 | # %%
208 | from preprocessors import last_name
209 |
210 | # Last name pairs for known spouses
211 | last_names = set(
212 | [
213 | (last_name(x), last_name(y))
214 | for x, y in known_spouses
215 | if last_name(x) and last_name(y)
216 | ]
217 | )
218 |
219 |
220 | @labeling_function(resources=dict(last_names=last_names), pre=[get_person_last_names])
221 | def lf_distant_supervision_last_names(x, last_names):
222 | p1_ln, p2_ln = x.person_lastnames
223 |
224 | return (
225 | POSITIVE
226 | if (p1_ln != p2_ln)
227 | and ((p1_ln, p2_ln) in last_names or (p2_ln, p1_ln) in last_names)
228 | else ABSTAIN
229 | )
230 |
231 |
232 | # %% [markdown]
233 | # #### Apply Labeling Functions to the Data
234 | # We create a list of labeling functions and apply them to the data.
235 |
236 | # %%
237 | from snorkel.labeling import PandasLFApplier
238 |
239 | lfs = [
240 | lf_husband_wife,
241 | lf_husband_wife_left_window,
242 | lf_same_last_name,
243 | lf_married,
244 | lf_familial_relationship,
245 | lf_family_left_window,
246 | lf_other_relationship,
247 | lf_distant_supervision,
248 | lf_distant_supervision_last_names,
249 | ]
250 | applier = PandasLFApplier(lfs)
251 |
252 | # %% {"tags": ["md-exclude-output"]}
253 | from snorkel.labeling import LFAnalysis
254 |
255 | L_dev = applier.apply(df_dev)
256 | L_train = applier.apply(df_train)
257 |
258 | # %%
259 | LFAnalysis(L_dev, lfs).lf_summary(Y_dev)
260 |
261 | # %% [markdown]
262 | # ### Training the Label Model
263 | #
264 | # Now, we'll train a model of the LFs to estimate their weights and combine their outputs. Once the model is trained, we can combine the outputs of the LFs into a single, noise-aware training label set for our extractor.
265 |
266 | # %% {"tags": ["md-exclude-output"]}
267 | from snorkel.labeling.model import LabelModel
268 |
269 | label_model = LabelModel(cardinality=2, verbose=True)
270 | label_model.fit(L_train, Y_dev, n_epochs=5000, log_freq=500, seed=12345)
271 |
272 | # %% [markdown]
273 | # ### Label Model Metrics
274 | # Since our dataset is highly unbalanced (91% of the labels are negative), even a trivial baseline that always outputs negative can get a high accuracy. So we evaluate the label model using the F1 score and ROC-AUC rather than accuracy.
275 |
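# %% [markdown]
# To see concretely why accuracy is misleading here, consider a trivial baseline
# that always predicts NEGATIVE (a sketch):

# %%
import numpy as np

# With ~91% negative labels, this scores ~0.91 accuracy yet is useless as an extractor.
print(f"Always-NEGATIVE accuracy: {(np.zeros_like(Y_dev) == Y_dev).mean():.2f}")
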
276 | # %%
277 | from snorkel.analysis import metric_score
278 | from snorkel.utils import probs_to_preds
279 |
280 | probs_dev = label_model.predict_proba(L_dev)
281 | preds_dev = probs_to_preds(probs_dev)
282 | print(
283 | f"Label model f1 score: {metric_score(Y_dev, preds_dev, probs=probs_dev, metric='f1')}"
284 | )
285 | print(
286 | f"Label model roc-auc: {metric_score(Y_dev, preds_dev, probs=probs_dev, metric='roc_auc')}"
287 | )
288 |
289 | # %% [markdown]
290 | # ### Part 4: Training our End Extraction Model
291 | #
292 | # In this final section of the tutorial, we'll use our noisy training labels to train our end machine learning model. We start by filtering out training data points which did not receive a label from any LF, as these data points contain no signal.
293 | #
294 | # %%
295 | from snorkel.labeling import filter_unlabeled_dataframe
296 |
297 | probs_train = label_model.predict_proba(L_train)
298 | df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
299 | X=df_train, y=probs_train, L=L_train
300 | )
301 |
302 | # %% [markdown]
303 | # Next, we train a simple [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory) network for classifying candidates. `tf_model` contains functions for processing features and building the Keras model for training and evaluation.
304 |
305 | # %% {"tags": ["md-exclude-output"]}
306 | from tf_model import get_model, get_feature_arrays
307 | from utils import get_n_epochs
308 |
309 | X_train = get_feature_arrays(df_train_filtered)
310 | model = get_model()
311 | batch_size = 64
312 | model.fit(X_train, probs_train_filtered, batch_size=batch_size, epochs=get_n_epochs())
313 |
314 | # %% [markdown]
315 | # Finally, we evaluate the trained model by measuring its F1 score and ROC-AUC.
316 |
317 | # %%
318 | X_test = get_feature_arrays(df_test)
319 | probs_test = model.predict(X_test)
320 | preds_test = probs_to_preds(probs_test)
321 | print(
322 | f"Test F1 when trained with soft labels: {metric_score(Y_test, preds=preds_test, metric='f1')}"
323 | )
324 | print(
325 | f"Test ROC-AUC when trained with soft labels: {metric_score(Y_test, probs=probs_test, metric='roc_auc')}"
326 | )
327 |
328 | # %% [markdown]
329 | # ## Summary
330 | # In this tutorial, we showed how Snorkel can be used for Information Extraction. We demonstrated how to create LFs that leverage keywords and external knowledge bases (distant supervision). Finally, we showed how a model trained on the probabilistic outputs of the Label Model achieves comparable performance while generalizing to data points not covered by the LFs.
331 |
--------------------------------------------------------------------------------
/getting_started/getting_started.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # %% [markdown] {"tags": ["md-exclude"]}
3 | # # Getting Started with Snorkel
4 |
5 | # %% {"tags": ["md-exclude"]}
6 | import os
7 |
8 | # Make sure we're running from the getting_started/ directory
9 | if os.path.basename(os.getcwd()) == "snorkel-tutorials":
10 | os.chdir("getting_started")
11 |
12 | # %% [markdown]
13 | # ## Programmatically Building and Managing Training Data with Snorkel
14 | #
15 | # Snorkel is a system for _programmatically_ building and managing training datasets **without manual labeling**.
16 | # In Snorkel, users can develop large training datasets in hours or days rather than hand-labeling them over weeks or months.
17 | #
18 | # Snorkel currently exposes three key programmatic operations:
19 | # - **Labeling data**, e.g., using heuristic rules or distant supervision techniques
20 | # - **Transforming data**, e.g., rotating or stretching images to perform data augmentation
21 | # - **Slicing data** into different critical subsets for monitoring or targeted improvement
22 | #
23 | # Snorkel then automatically models, cleans, and integrates the resulting training data using novel, theoretically-grounded techniques.
24 |
28 | # %% [markdown]
29 | # In this quick walkthrough, we'll preview the high-level workflow and interfaces of Snorkel using a canonical machine learning problem: classifying spam.
30 | # We'll use a public [YouTube comments dataset](http://www.dt.fee.unicamp.br/~tiago//youtubespamcollection/), and see how **Snorkel can enable training a machine learning model without _any_ hand-labeled training data!**
31 | # For more detailed versions of the sections in this walkthrough, see the corresponding tutorials: ([Spam LFs](https://snorkel.org/use-cases/01-spam-tutorial), [Spam TFs](https://snorkel.org/use-cases/02-spam-data-augmentation-tutorial), [Spam SFs](https://snorkel.org/use-cases/03-spam-data-slicing-tutorial)).
32 |
33 | # %% [markdown]
34 | # We'll walk through five basic steps:
35 | #
36 | # 1. **Writing Labeling Functions (LFs):** First, rather than hand-labeling any training data, we'll programmatically label our _unlabeled_ dataset with LFs.
37 | # 2. **Modeling & Combining LFs:** Next, we'll use Snorkel's `LabelModel` to automatically learn the accuracies of our LFs and reweight and combine their outputs into a single, confidence-weighted training label per data point.
38 | # 3. **Writing Transformation Functions (TFs) for Data Augmentation:** Then, we'll augment this labeled training set by writing a simple TF.
39 | # 4. **Writing _Slicing Functions (SFs)_ for Data Subset Selection:** We'll also preview writing an SF to identify a critical subset or _slice_ of our training set.
40 | # 5. **Training a final ML model:** Finally, we'll train an ML model with our training set.
41 | #
42 | # We'll start first by loading the _unlabeled_ comments, which we'll use as our training data, as a Pandas `DataFrame`:
43 |
44 | # %%
45 | from utils import load_unlabeled_spam_dataset
46 |
47 | df_train = load_unlabeled_spam_dataset()
48 |
49 | # %% [markdown]
50 | # ## 1) Writing Labeling Functions
51 | #
52 | # _Labeling functions (LFs)_ are one of the core operators for building and managing training datasets programmatically in Snorkel.
53 | # The basic idea is simple: **a labeling function is a function that outputs a label for some subset of the training dataset**.
54 | # In our example here, each labeling function takes as input a comment data point, and either outputs a label (`SPAM = 1` or `NOT_SPAM = 0`) or abstains from labeling (`ABSTAIN = -1`):
55 |
56 | # %%
57 | # Define the label mappings for convenience
58 | ABSTAIN = -1
59 | NOT_SPAM = 0
60 | SPAM = 1
61 |
62 | # %% [markdown]
63 | # Labeling functions can be used to represent many heuristic and/or noisy strategies for labeling data, often referred to as [weak supervision](https://www.snorkel.org/blog/weak-supervision).
64 | # The basic idea of labeling functions, and other programmatic operators in Snorkel, is to let users inject domain information into machine learning models in higher-level, higher-bandwidth ways than manually labeling thousands or millions of individual data points.
65 | # **The key idea is that labeling functions do not need to be perfectly accurate**, and can in fact even be correlated with each other.
66 | # Snorkel will automatically estimate their accuracies and correlations in a [provably consistent way](https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly), and then reweight and combine their output labels, leading to high-quality training labels.
67 |
68 | # %% [markdown]
69 | # In our text data setting here, labeling functions use:
70 | #
71 | # Keyword matches:
72 |
73 | # %%
74 | from snorkel.labeling import labeling_function
75 |
76 |
77 | @labeling_function()
78 | def lf_keyword_my(x):
79 | """Many spam comments talk about 'my channel', 'my video', etc."""
80 | return SPAM if "my" in x.text.lower() else ABSTAIN
81 |
82 |
83 | # %% [markdown]
84 | # Regular expressions:
85 |
86 | # %%
87 | import re
88 |
89 |
90 | @labeling_function()
91 | def lf_regex_check_out(x):
92 | """Spam comments say 'check out my video', 'check it out', etc."""
93 | return SPAM if re.search(r"check.*out", x.text, flags=re.I) else ABSTAIN
94 |
95 |
96 | # %% [markdown]
97 | # Arbitrary heuristics:
98 |
99 | # %%
100 | @labeling_function()
101 | def lf_short_comment(x):
102 | """Non-spam comments are often short, such as 'cool video!'."""
103 | return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN
104 |
105 |
106 | # %% [markdown]
107 | # Third-party models:
108 |
109 | # %%
110 | from textblob import TextBlob
111 |
112 |
113 | @labeling_function()
114 | def lf_textblob_polarity(x):
115 | """
116 | We use a third-party sentiment classification model, TextBlob.
117 |
118 | We combine this with the heuristic that non-spam comments are often positive.
119 | """
120 | return NOT_SPAM if TextBlob(x.text).sentiment.polarity > 0.3 else ABSTAIN
121 |
122 |
123 | # %% [markdown]
124 | # And much more!
125 | # For many more types of labeling functions — including over data modalities beyond text — see the other [tutorials](https://snorkel.org/use-cases/) and [real-world examples](https://snorkel.org/resources/).
126 | #
127 | # In general, the process of developing labeling functions is, like any other development process, an iterative one that takes time.
128 | # However, in many cases it can be _orders-of-magnitude_ faster than hand-labeling training data.
129 | # For more detail on the process of developing labeling functions and other training data operators in Snorkel, see the [Introduction Tutorials](https://snorkel.org/use-cases).
130 |
131 | # %% [markdown]
132 | # ## 2) Combining & Cleaning the Labels
133 | #
134 | # Our next step is to apply the labeling functions we wrote to the unlabeled training data.
135 | # The result is a *label matrix*, `L_train`, where each row corresponds to a data point and each column corresponds to a labeling function.
136 | # Since the labeling functions have unknown accuracies and correlations, their output labels may overlap and conflict.
137 | # We use the `LabelModel` to automatically estimate their accuracies and correlations, reweight and combine their labels, and produce our final set of clean, integrated training labels:
138 |
139 | # %%
140 | from snorkel.labeling.model import LabelModel
141 | from snorkel.labeling import PandasLFApplier
142 |
143 | # Define the set of labeling functions (LFs)
144 | lfs = [lf_keyword_my, lf_regex_check_out, lf_short_comment, lf_textblob_polarity]
145 |
146 | # Apply the LFs to the unlabeled training data
147 | applier = PandasLFApplier(lfs)
148 | L_train = applier.apply(df_train)
149 |
150 | # Train the label model and compute the training labels
151 | label_model = LabelModel(cardinality=2, verbose=True)
152 | label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
153 | df_train["label"] = label_model.predict(L=L_train, tie_break_policy="abstain")
154 |
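# %% [markdown]
# A quick look (illustrative) at how the training labels came out, including abstains:

# %%
print(df_train.label.value_counts())
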
155 | # %% [markdown]
156 | # Note that we used the `LabelModel` to label data; however, on many data points, all the labeling functions abstain, and so the `LabelModel` abstains as well.
157 | # We'll filter these data points out of our training set now:
158 |
159 | # %%
160 | df_train = df_train[df_train.label != ABSTAIN]
161 |
162 | # %% [markdown]
163 | # Our ultimate goal is to use the resulting labeled training data points to train a machine learning model that can **generalize beyond the coverage of the labeling functions and the `LabelModel`**.
164 | # First, however, we'll explore some of Snorkel's other operators for building and managing training data.
165 |
166 | # %% [markdown]
167 | # ## 3) Writing Transformation Functions for Data Augmentation
168 | #
169 | # An increasingly popular and critical technique in modern machine learning is [data augmentation](https://www.snorkel.org/blog/tanda),
170 | # the strategy of artificially *augmenting* existing labeled training datasets by creating transformed copies of the data points.
171 | # Data augmentation is a practical and powerful method for injecting information about domain invariances into ML models via the data, rather than by trying to modify their internal architectures.
172 | # The canonical example is randomly rotating, stretching, and transforming images when training image classifiers — a ubiquitous technique in the field of computer vision today.
173 | # However, data augmentation is increasingly used in a range of settings, including text.
174 | #
175 | # Here, we implement a simple text data augmentation strategy — randomly replacing a word with a synonym.
176 | # We express this as a *transformation function (TF)*:
177 |
178 | # %%
179 | import random
180 |
181 | import nltk
182 | from nltk.corpus import wordnet as wn
183 |
184 | from snorkel.augmentation import transformation_function
185 |
186 | nltk.download("wordnet", quiet=True)
187 |
188 |
189 | def get_synonyms(word):
190 | """Get the synonyms of word from Wordnet."""
191 | lemmas = set().union(*[s.lemmas() for s in wn.synsets(word)])
192 | return list(set(l.name().lower().replace("_", " ") for l in lemmas) - {word})
193 |
194 |
195 | @transformation_function()
196 | def tf_replace_word_with_synonym(x):
197 | """Try to replace a random word with a synonym."""
198 | words = x.text.lower().split()
199 | idx = random.choice(range(len(words)))
200 | synonyms = get_synonyms(words[idx])
201 | if len(synonyms) > 0:
202 | x.text = " ".join(words[:idx] + [synonyms[0]] + words[idx + 1 :])
203 | return x
204 |
205 |
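# %% [markdown]
# As a quick sanity check (illustrative; the exact output depends on your WordNet
# version), we can peek at what the synonym helper returns for a common word:

# %%
print(get_synonyms("movie")[:3])
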
206 | # %% [markdown]
207 | # Next, we apply this transformation function to our training dataset:
208 |
209 | # %%
210 | from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier
211 |
212 | tf_policy = ApplyOnePolicy(n_per_original=2, keep_original=True)
213 | tf_applier = PandasTFApplier([tf_replace_word_with_synonym], tf_policy)
214 | df_train_augmented = tf_applier.apply(df_train)
215 |
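# %% [markdown]
# A quick check (illustrative) of how much the training set grew:

# %%
print(f"Original training set size: {len(df_train)}")
print(f"Augmented training set size: {len(df_train_augmented)}")
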
216 | # %% [markdown]
217 | # Note that a common challenge with data augmentation is figuring out how to tune and apply different transformation functions to best augment a training set.
218 | # This is most commonly done as an ad hoc manual process; however, in Snorkel, various approaches for using automatically learned data augmentation _policies_ are supported.
219 | # For more detail, see the [Spam TFs tutorial](https://snorkel.org/use-cases/02-spam-data-augmentation-tutorial).
220 |
221 | # %% [markdown]
222 | # ## 4) Writing a Slicing Function
223 | #
224 | # Finally, a third operator in Snorkel, *slicing functions (SFs)*, handles the reality that many datasets have certain subsets or _slices_ that are more important than others.
225 | # In Snorkel, we can write SFs to (a) monitor specific slices and (b) improve model performance over them by adding representational capacity targeted on a per-slice basis.
226 | #
227 | # Writing a slicing function is simple.
228 | # For example, we could write one that looks for suspiciously shortened links, which might be critical due to their likelihood of linking to malicious sites:
229 |
230 | # %%
231 | from snorkel.slicing import slicing_function
232 |
233 |
234 | @slicing_function()
235 | def short_link(x):
236 | """Return whether text matches common pattern for shortened ".ly" links."""
237 | return int(bool(re.search(r"\w+\.ly", x.text)))
238 |
239 |
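# %% [markdown]
# As a quick illustration (a sketch; see the Spam SFs tutorial for the full
# workflow), we can apply the SF with a `PandasSFApplier` and count how many
# training comments fall in this slice:

# %%
from snorkel.slicing import PandasSFApplier

S_train = PandasSFApplier([short_link]).apply(df_train)
print(f"{S_train['short_link'].sum()} comments in the short_link slice")
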
240 | # %% [markdown]
241 | # We can now use Snorkel to monitor the performance over this slice, and to add representational capacity to our model in order to potentially increase performance on this slice.
242 | # For a walkthrough of these steps, see the [Spam SFs tutorial](https://snorkel.org/use-cases/03-spam-data-slicing-tutorial).
243 |
244 | # %% [markdown]
245 | # ## 5) Training a Classifier
246 | #
247 | # The ultimate goal in Snorkel is to **create a training dataset**, which can then be plugged into an arbitrary machine learning framework (e.g. TensorFlow, Keras, PyTorch, Scikit-Learn, Ludwig, XGBoost) to train powerful machine learning models.
248 | # Here, to complete this initial walkthrough, we'll train an extremely simple model — a "bag of n-grams" logistic regression model in Scikit-Learn — using the weakly labeled and augmented training set we made with our labeling and transformation functions:
249 |
250 | # %%
251 | from sklearn.feature_extraction.text import CountVectorizer
252 | from sklearn.linear_model import LogisticRegression
253 |
254 | train_text = df_train_augmented.text.tolist()
255 | X_train = CountVectorizer(ngram_range=(1, 2)).fit_transform(train_text)
256 |
257 | clf = LogisticRegression(solver="lbfgs")
258 | clf.fit(X=X_train, y=df_train_augmented.label.values)
259 |
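# %% [markdown]
# To classify new comments (a sketch), we keep a handle on the fitted vectorizer
# (discarded after `fit_transform` above) so we can featurize at inference time:

# %%
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_text)
clf = LogisticRegression(solver="lbfgs").fit(X_train, df_train_augmented.label.values)

# 1 = SPAM, 0 = NOT_SPAM, per the label mapping above
print(clf.predict(vectorizer.transform(["please check out my new channel"])))
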
260 | # %% [markdown]
261 | # And that's it — you've trained your first model **without hand-labeling _any_ training data!**
262 | # Next, to learn more about Snorkel, check out the [tutorials](https://snorkel.org/use-cases/), [resources](https://snorkel.org/resources), and [documentation](https://snorkel.readthedocs.io) for much more on how to use Snorkel to power your own machine learning applications.
263 |
--------------------------------------------------------------------------------
/recsys/recsys_tutorial.py:
--------------------------------------------------------------------------------
1 | # %% [markdown]
2 | # # Recommender Systems Tutorial
3 | # In this tutorial, we'll provide a simple walkthrough of how to use Snorkel to build a recommender system.
4 | # We consider a setting similar to the [Netflix challenge](https://www.kaggle.com/netflix-inc/netflix-prize-data), but with books instead of movies.
5 | # We have a set of users and books, and for each user we know the set of books they have interacted with (read or marked as to-read).
6 | # We don't have the user's numerical ratings for the books they read, except in a small number of cases.
7 | # We also have some text reviews written by users.
8 | #
9 | # Our goal is to build a recommender system by training a classifier to predict whether a user will read and like any given book.
10 | # We'll train our model over a user-book pair to predict a `rating` (a `rating` of 1 means the user will read and like the book).
11 | # To simplify inference, we'll represent a user by the set of books they interacted with (rather than learning a specific representation for each user).
12 | # Once we have this model trained, we can use it to recommend books to a user when they visit the site.
13 | # For example, we can predict ratings for the user paired with each of a few thousand candidate books, then recommend the ten books with the highest predicted ratings.
14 | #
15 | # Of course, there are many other ways to approach this problem.
16 | # The field of [recommender systems](https://en.wikipedia.org/wiki/Recommender_system) is a very well-studied area with a wide variety of settings and approaches, and we focus on just one of them.
17 | #
18 | # We will use the [Goodreads](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home) dataset, from
19 | # "Item Recommendation on Monotonic Behavior Chains", RecSys'18 (Mengting Wan, Julian McAuley), and "Fine-Grained Spoiler Detection from Large-Scale Review Corpora", ACL'19 (Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley).
20 | # In this dataset, we have user interactions and reviews for Young Adult novels from the Goodreads website, along with metadata (like `title` and `authors`) for the novels.
21 |
22 | # %% {"tags": ["md-exclude"]}
23 | import logging
24 | import os
25 |
26 | logging.basicConfig(level=logging.INFO)
27 |
28 |
29 | if os.path.basename(os.getcwd()) == "snorkel-tutorials":
30 | os.chdir("recsys")
31 |
32 |
33 | # %% [markdown]
34 | # ## Loading Data
35 |
36 | # %% [markdown]
37 | # We start by running the `download_and_process_data` function.
38 | # The function returns the `df_train`, `df_test`, `df_dev`, `df_valid` dataframes, which correspond to our training, test, development, and validation sets.
39 | # Each of those dataframes has the following fields:
40 | # * `user_idx`: A unique identifier for a user.
41 | # * `book_idx`: A unique identifier for a book that is being rated by the user.
42 | # * `book_idxs`: The set of books that the user has interacted with (read or planned to read).
43 | # * `review_text`: Optional text review written by the user for the book.
44 | # * `rating`: Either `0` (which means the user did not read or did not like the book) or `1` (which means the user read and liked the book). The `rating` field is missing for `df_train`.
45 | # Our objective is to predict whether a given user (represented by the set of book_idxs the user has interacted with) will read and like any given book.
46 | # That is, we want to train a model that takes a set of `book_idxs` (the user) and a single `book_idx` (the book to rate) and predicts the `rating`.
47 | #
48 | # In addition, `download_and_process_data` also returns the `df_books` dataframe, which contains one row per book, along with metadata for that book (such as `title` and `first_author`).
49 |
50 | # %% {"tags": ["md-exclude-output"]}
51 | from utils import download_and_process_data
52 |
53 | (df_train, df_test, df_dev, df_valid), df_books = download_and_process_data()
54 |
55 | df_books.head()
56 |
57 | # %% [markdown]
58 | # We look at a sample of the labeled development set.
59 | # As an example, we want our final recommendations model to be able to predict that a user who has interacted with `book_idxs` (25743, 22318, 7662, 6857, 83, 14495, 30664, ...) would either not read or not like the book with `book_idx` 22764 (first row), while a user who has interacted with `book_idxs` (3880, 18078, 9092, 29933, 1511, 8560, ...) would read and like the book with `book_idx` 3181 (second row).
60 |
61 | # %%
62 | df_dev.sample(frac=1, random_state=12).head()
63 |
64 | # %% [markdown]
65 | # ## Writing Labeling Functions
66 |
67 | # %%
68 | POSITIVE = 1
69 | NEGATIVE = 0
70 | ABSTAIN = -1
71 |
72 | # %% [markdown]
73 | # If a user has interacted with several books written by an author, there is a good chance that the user will read and like other books by the same author.
74 | # We express this as a labeling function, using the `first_author` field in the `df_books` dataframe.
75 | # We picked the threshold 15 by plotting histograms and running error analysis using the dev set.
76 |
77 | # %%
78 | from snorkel.labeling.lf import labeling_function
79 |
80 | book_to_first_author = dict(zip(df_books.book_idx, df_books.first_author))
81 | first_author_to_books_df = df_books.groupby("first_author")[["book_idx"]].agg(set)
82 | first_author_to_books = dict(
83 | zip(first_author_to_books_df.index, first_author_to_books_df.book_idx)
84 | )
85 |
86 |
87 | @labeling_function(
88 | resources=dict(
89 | book_to_first_author=book_to_first_author,
90 | first_author_to_books=first_author_to_books,
91 | )
92 | )
93 | def shared_first_author(x, book_to_first_author, first_author_to_books):
94 | author = book_to_first_author[x.book_idx]
95 | same_author_books = first_author_to_books[author]
96 | num_read = len(set(x.book_idxs).intersection(same_author_books))
97 | return POSITIVE if num_read > 15 else ABSTAIN
98 |
99 |
100 | # %% [markdown]
101 | # We can also leverage the long text reviews written by users to guess whether they liked or disliked a book.
102 | # For example, the third `df_dev` entry above has a review with the text `'4.5 STARS'`, which indicates that the user liked the book.
103 | # We write a simple LF that looks for similar phrases to guess the user's rating of a book.
104 | # We interpret >= 4 stars to indicate a positive rating, while < 4 stars is negative.
105 |
106 | # %%
107 | low_rating_strs = [
108 | "one star",
109 | "1 star",
110 | "two star",
111 | "2 star",
112 | "3 star",
113 | "three star",
114 | "3.5 star",
115 | "2.5 star",
116 | "1 out of 5",
117 | "2 out of 5",
118 | "3 out of 5",
119 | ]
120 | high_rating_strs = ["5 stars", "five stars", "four stars", "4 stars", "4.5 stars"]
121 |
122 |
123 | @labeling_function(
124 | resources=dict(low_rating_strs=low_rating_strs, high_rating_strs=high_rating_strs)
125 | )
126 | def stars_in_review(x, low_rating_strs, high_rating_strs):
127 | if not isinstance(x.review_text, str):
128 | return ABSTAIN
129 | for low_rating_str in low_rating_strs:
130 | if low_rating_str in x.review_text.lower():
131 | return NEGATIVE
132 | for high_rating_str in high_rating_strs:
133 | if high_rating_str in x.review_text.lower():
134 | return POSITIVE
135 | return ABSTAIN
136 |
137 |
138 | # %% [markdown]
139 | # We can also run [TextBlob](https://textblob.readthedocs.io/en/dev/index.html), a tool that provides a pretrained sentiment analyzer, on the reviews, and use its polarity and subjectivity scores to estimate the user's rating for the book.
140 | # As usual, these thresholds were picked by analyzing the score distributions and running error analysis.
141 |
142 | # %%
143 | from snorkel.preprocess import preprocessor
144 | from textblob import TextBlob
145 |
146 |
147 | @preprocessor(memoize=True)
148 | def textblob_polarity(x):
149 | if isinstance(x.review_text, str):
150 | x.blob = TextBlob(x.review_text)
151 | else:
152 | x.blob = None
153 | return x
154 |
155 |
156 | # Label high polarity reviews as positive.
157 | @labeling_function(pre=[textblob_polarity])
158 | def polarity_positive(x):
159 | if x.blob:
160 | if x.blob.polarity > 0.3:
161 | return POSITIVE
162 | return ABSTAIN
163 |
164 |
165 | # Label high subjectivity reviews as positive.
166 | @labeling_function(pre=[textblob_polarity])
167 | def subjectivity_positive(x):
168 | if x.blob:
169 | if x.blob.subjectivity > 0.75:
170 | return POSITIVE
171 | return ABSTAIN
172 |
173 |
174 | # Label low polarity reviews as negative.
175 | @labeling_function(pre=[textblob_polarity])
176 | def polarity_negative(x):
177 | if x.blob:
178 | if x.blob.polarity < 0.0:
179 | return NEGATIVE
180 | return ABSTAIN
181 |
182 |
183 | # %% {"tags": ["md-exclude-output"]}
184 | from snorkel.labeling import PandasLFApplier, LFAnalysis
185 |
186 | lfs = [
187 | stars_in_review,
188 | shared_first_author,
189 | polarity_positive,
190 | subjectivity_positive,
191 | polarity_negative,
192 | ]
193 |
194 | applier = PandasLFApplier(lfs)
195 | L_dev = applier.apply(df_dev)
196 |
197 | # %%
198 | LFAnalysis(L_dev, lfs).lf_summary(df_dev.rating.values)
199 |
200 | # %% [markdown]
201 | # ### Applying labeling functions to the training set
202 | #
203 | # We apply the labeling functions to the training set, and then filter out data points unlabeled by any LF to form our final training set.
204 |
205 | # %% {"tags": ["md-exclude-output"]}
206 | from snorkel.labeling.model import LabelModel
207 |
208 | L_train = applier.apply(df_train)
209 | label_model = LabelModel(cardinality=2, verbose=True)
210 | label_model.fit(L_train, n_epochs=5000, seed=123, log_freq=20, lr=0.01)
211 | preds_train = label_model.predict(L_train)
212 |
213 | # %% {"tags": ["md-exclude-output"]}
214 | from snorkel.labeling import filter_unlabeled_dataframe
215 |
216 | df_train_filtered, preds_train_filtered = filter_unlabeled_dataframe(
217 | df_train, preds_train, L_train
218 | )
219 | df_train_filtered["rating"] = preds_train_filtered
220 |
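# %% [markdown]
# A quick check (illustrative) of how many training points survived filtering:

# %%
print(f"{len(df_train_filtered)} of {len(df_train)} training points kept after filtering")
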
221 | # %% [markdown]
222 | # ### Rating Prediction Model
223 | # We write a Keras model for predicting ratings given a user's book list and a book (which is being rated).
224 | # The model represents the list of books the user interacted with, `book_idxs`, by learning an embedding for each idx and averaging the embeddings in `book_idxs`.
225 | # It learns another embedding for the `book_idx`, the book to be rated.
226 | # Then it concatenates the two embeddings and uses an [MLP](https://en.wikipedia.org/wiki/Multilayer_perceptron) to compute the probability of the `rating` being 1.
227 | # This type of model is common in large-scale recommender systems, for example, the [YouTube recommender system](https://ai.google/research/pubs/pub45530).
228 |
229 | # %%
230 | import numpy as np
231 | import tensorflow as tf
232 | from utils import precision_batch, recall_batch, f1_batch
233 |
234 | n_books = max([max(df.book_idx) for df in [df_train, df_test, df_dev, df_valid]])
235 |
236 |
237 | # Keras model to predict rating given book_idxs and book_idx.
238 | def get_model(embed_dim=64, hidden_layer_sizes=[32]):
239 | # Compute embedding for book_idxs.
240 | len_book_idxs = tf.keras.layers.Input([])
241 | book_idxs = tf.keras.layers.Input([None])
242 | # book_idxs % n_books prevents an out-of-range embedding lookup (and a crash) if a book_idx in book_idxs is >= n_books.
243 | book_idxs_emb = tf.keras.layers.Embedding(n_books, embed_dim)(book_idxs % n_books)
244 | book_idxs_emb = tf.math.divide(
245 | tf.keras.backend.sum(book_idxs_emb, axis=1), tf.expand_dims(len_book_idxs, 1)
246 | )
247 | # Compute embedding for book_idx.
248 | book_idx = tf.keras.layers.Input([])
249 | book_idx_emb = tf.keras.layers.Embedding(n_books, embed_dim)(book_idx)
250 | input_layer = tf.keras.layers.concatenate([book_idxs_emb, book_idx_emb], 1)
251 | # Build Multi Layer Perceptron on input layer.
252 | cur_layer = input_layer
253 | for size in hidden_layer_sizes:
254 | cur_layer = tf.keras.layers.Dense(size, activation=tf.nn.relu)(cur_layer)
255 | output_layer = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(cur_layer)
256 | # Create and compile keras model.
257 | model = tf.keras.Model(
258 | inputs=[len_book_idxs, book_idxs, book_idx], outputs=[output_layer]
259 | )
260 | model.compile(
261 | "Adagrad",
262 | "binary_crossentropy",
263 | metrics=["accuracy", f1_batch, precision_batch, recall_batch],
264 | )
265 | return model
266 |
267 |
268 | # %% [markdown]
269 | # We use triples of (`book_idxs`, `book_idx`, `rating`) from our dataframes as training data points. In addition, we want to train the model to recognize when a user will not read a book. To create data points for that, we randomly sample a `book_id` not in `book_idxs` and use that with a `rating` of 0 as a _random negative_ data point. We create one such _random negative_ data point for every positive (`rating` 1) data point in our dataframe so that positive and negative data points are roughly balanced.
270 |
271 | # %%
272 | # Generator to turn dataframe into data points.
273 | def get_data_points_generator(df):
274 | def generator():
275 | for book_idxs, book_idx, rating in zip(df.book_idxs, df.book_idx, df.rating):
276 | # Remove book_idx from book_idxs so the model can't just look it up.
277 | book_idxs = tuple(filter(lambda x: x != book_idx, book_idxs))
278 | yield {
279 | "len_book_idxs": len(book_idxs),
280 | "book_idxs": book_idxs,
281 | "book_idx": book_idx,
282 | "label": rating,
283 | }
284 | if rating == 1:
285 | # Generate a random negative book_id not in book_idxs.
286 | random_negative = np.random.randint(0, n_books)
287 | while random_negative in book_idxs:
288 | random_negative = np.random.randint(0, n_books)
289 | yield {
290 | "len_book_idxs": len(book_idxs),
291 | "book_idxs": book_idxs,
292 | "book_idx": random_negative,
293 | "label": 0,
294 | }
295 |
296 | return generator
297 |
298 |
299 | def get_data_tensors(df):
300 | # Use generator to get data points each epoch, along with shuffling and batching.
301 | padded_shapes = {
302 | "len_book_idxs": [],
303 | "book_idxs": [None],
304 | "book_idx": [],
305 | "label": [],
306 | }
307 | dataset = (
308 | tf.data.Dataset.from_generator(
309 | get_data_points_generator(df), {k: tf.int64 for k in padded_shapes}
310 | )
311 | .shuffle(123)
312 | .repeat(None)
313 | .padded_batch(batch_size=256, padded_shapes=padded_shapes)
314 | )
315 | tensor_dict = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()
316 | return (
317 | (
318 | tensor_dict["len_book_idxs"],
319 | tensor_dict["book_idxs"],
320 | tensor_dict["book_idx"],
321 | ),
322 | tensor_dict["label"],
323 | )
324 |
325 |
326 | # %% [markdown]
327 | # We now train the model on the training data labeled by our LFs.
328 | #
329 | # %% {"tags": ["md-exclude-output"]}
330 | from utils import get_n_epochs
331 |
332 | model = get_model()
333 |
334 | X_train, Y_train = get_data_tensors(df_train_filtered)
335 | X_valid, Y_valid = get_data_tensors(df_valid)
336 | model.fit(
337 | X_train,
338 | Y_train,
339 | steps_per_epoch=300,
340 | validation_data=(X_valid, Y_valid),
341 | validation_steps=40,
342 | epochs=get_n_epochs(),
343 | verbose=1,
344 | )
345 | # %% [markdown]
346 | # Finally, we evaluate the model's predicted ratings on our test data.
347 | #
348 | # %%
349 | X_test, Y_test = get_data_tensors(df_test)
350 | _ = model.evaluate(X_test, Y_test, steps=30)
351 |
352 | # %% [markdown]
353 | # Our model has generalized quite well to our test set!
354 | # Note that we should additionally measure ranking metrics, like precision@10, before deploying to production.
355 |
356 | # %% [markdown]
357 | # ## Summary
358 | #
359 | # In this tutorial, we showed one way to use Snorkel for recommendations.
360 | # We used book metadata and review text to create LFs that estimate user ratings.
361 | # We used Snorkel's `LabelModel` to combine the outputs of those LFs.
362 | # Finally, we trained a model to predict whether a user will read and like a given book (and therefore what books should be recommended to the user) based only on what books the user has interacted with in the past.
363 | #
364 | # Here we demonstrated one way to use Snorkel for training a recommender system.
365 | # Note, however, that this approach could easily be adapted to take advantage of additional information as it is available (e.g., user profile data, denser user ratings, and so on.)
366 |
--------------------------------------------------------------------------------
/spam/02_spam_data_augmentation_tutorial.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # %% [markdown]
3 | # # 📈 Snorkel Intro Tutorial: Data Augmentation
4 |
5 | # %% [markdown]
6 | # In this tutorial, we will walk through the process of using *transformation functions* (TFs) to perform data augmentation.
7 | # As in the labeling tutorial, our goal is to train a classifier to label YouTube comments as `SPAM` or `HAM` (not spam).
8 | # In the [previous tutorial](https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb),
9 | # we demonstrated how to label training sets programmatically with Snorkel.
10 | # In this tutorial, we'll assume that step has already been done, and start with labeled training data,
11 | # which we'll aim to augment using transformation functions.
12 | #
13 | # %% [markdown] {"tags": ["md-exclude"]}
14 | # * For more details on the task, check out the [labeling tutorial](https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb)
15 | # * For an overview of Snorkel, visit [snorkel.org](https://snorkel.org)
16 | # * You can also check out the [Snorkel API documentation](https://snorkel.readthedocs.io/)
17 | #
18 | # %% [markdown]
19 | # Data augmentation is a popular technique for increasing the size of labeled training sets by applying class-preserving transformations to create copies of labeled data points.
20 | # In the image domain, it is a crucial factor in almost every state-of-the-art result today and is quickly gaining
21 | # popularity in text-based applications.
22 | # Snorkel models the data augmentation process by applying user-defined *transformation functions* (TFs) in sequence.
23 | # You can learn more about data augmentation in
24 | # [this blog post about our NeurIPS 2017 work on automatically learned data augmentation](https://snorkel.org/blog/tanda/).
25 | #
26 | # The tutorial is divided into four parts:
27 | # 1. **Loading Data**: We load a [YouTube comments dataset](http://www.dt.fee.unicamp.br/~tiago//youtubespamcollection/).
28 | # 2. **Writing Transformation Functions**: We write Transformation Functions (TFs) that can be applied to training data points to generate new training data points.
29 | # 3. **Applying Transformation Functions to Augment Our Dataset**: We apply a sequence of TFs to each training data point, using a random policy, to generate an augmented training set.
30 | # 4. **Training a Model**: We use the augmented training set to train an LSTM model for classifying new comments as `SPAM` or `HAM`.
31 |
32 | # %% [markdown] {"tags": ["md-exclude"]}
33 | # This next cell takes care of some notebook-specific housekeeping.
34 | # You can ignore it.
35 |
36 | # %% {"tags": ["md-exclude"]}
37 | import os
38 | import random
39 |
40 | import numpy as np
41 |
42 | # Make sure we're running from the spam/ directory
43 | if os.path.basename(os.getcwd()) == "snorkel-tutorials":
44 | os.chdir("spam")
45 |
46 | # Turn off TensorFlow logging messages
47 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
48 |
49 | # For reproducibility
50 | seed = 0
51 | os.environ["PYTHONHASHSEED"] = str(seed)
52 | np.random.seed(0)
53 | random.seed(0)
54 |
55 | # %% [markdown] {"tags": ["md-exclude"]}
56 | # If you want to display all comment text untruncated, change `DISPLAY_ALL_TEXT` to `True` below.
57 |
58 | # %% {"tags": ["md-exclude"]}
59 | import pandas as pd
60 |
61 |
62 | DISPLAY_ALL_TEXT = False
63 |
64 | pd.set_option("display.max_colwidth", 0 if DISPLAY_ALL_TEXT else 50)
65 |
66 | # %% [markdown] {"tags": ["md-exclude"]}
67 | # This next cell makes sure a spaCy English model is downloaded.
68 | # If this is your first time downloading this model, restart the kernel after executing the next cell.
69 |
70 | # %% {"tags": ["md-exclude"]}
71 | # Download the spaCy English model
72 | # ! python -m spacy download en_core_web_sm
73 |
74 | # %% [markdown]
75 | # ## 1. Loading Data
76 |
77 | # %% [markdown]
78 | # We load the Kaggle dataset and create Pandas DataFrame objects for the `train` and `test` sets.
79 | # The two main columns in the DataFrames are:
80 | # * **`text`**: Raw text content of the comment
81 | # * **`label`**: Whether the comment is `SPAM` (1) or `HAM` (0).
82 | #
83 | # For more details, check out the [labeling tutorial](https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb).
84 |
85 | # %%
86 | from utils import load_spam_dataset
87 |
88 | df_train, df_test = load_spam_dataset(load_train_labels=True)
89 |
90 | # We pull out the label vectors for ease of use later
91 | Y_train = df_train["label"].values
92 | Y_test = df_test["label"].values
93 |
94 |
95 | # %%
96 | df_train.head()
97 |
98 | # %% [markdown]
99 | # ## 2. Writing Transformation Functions (TFs)
100 | #
101 | # Transformation functions are functions that can be applied to a training data point to create another valid training data point of the same class.
102 | # For example, for image classification problems, it is common to rotate or crop images in the training data to create new training inputs.
103 | # Transformation functions should be atomic, e.g., a small rotation of an image or changing a single word in a sentence.
104 | # We then compose multiple transformation functions when applying them to training data points.
105 | #
106 | # Common ways to augment text include replacing words with their synonyms or replacing named entities with other entities.
107 | # More info can be found
108 | # [here](https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28) or
109 | # [here](https://towardsdatascience.com/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610).
110 | # Our basic modeling assumption is that applying these operations to a comment generally shouldn't change whether it is `SPAM` or not.
111 | #
112 | # Transformation functions in Snorkel are created with the
113 | # [`transformation_function` decorator](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.transformation_function.html#snorkel.augmentation.transformation_function),
114 | # which wraps a function that takes in a single data point and returns a transformed version of the data point.
115 | # If no transformation is possible, a TF can return `None` or the original data point.
116 | # If all the TFs applied to a data point return `None`, the data point won't be included in
117 | # the augmented dataset when we apply our TFs below.
118 | #
119 | # Just like the `labeling_function` decorator, the `transformation_function` decorator
120 | # accepts a `pre` argument for `Preprocessor` objects.
121 | # Here, we'll use a
122 | # [`SpacyPreprocessor`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/preprocess/snorkel.preprocess.nlp.SpacyPreprocessor.html#snorkel.preprocess.nlp.SpacyPreprocessor).
123 |
124 | # %%
125 | from snorkel.preprocess.nlp import SpacyPreprocessor
126 |
127 | spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)
128 |
129 | # %%
130 | import names
131 | from snorkel.augmentation import transformation_function
132 |
133 | # Pregenerate some random person names to replace existing ones with
134 | # for the transformation strategies below
135 | replacement_names = [names.get_full_name() for _ in range(50)]
136 |
137 |
138 | # Replace a random person name with a different person name.
139 | @transformation_function(pre=[spacy])
140 | def change_person(x):
141 | person_names = [ent.text for ent in x.doc.ents if ent.label_ == "PERSON"]
142 | # If there is at least one person name, replace a random one. Else return None.
143 | if person_names:
144 | name_to_replace = np.random.choice(person_names)
145 | replacement_name = np.random.choice(replacement_names)
146 | x.text = x.text.replace(name_to_replace, replacement_name)
147 | return x
148 |
149 |
150 | # Swap two adjectives at random.
151 | @transformation_function(pre=[spacy])
152 | def swap_adjectives(x):
153 | adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
154 | # Check that there are at least two adjectives to swap.
155 | if len(adjective_idxs) >= 2:
156 | idx1, idx2 = sorted(np.random.choice(adjective_idxs, 2, replace=False))
157 | # Swap tokens in positions idx1 and idx2.
158 | x.text = " ".join(
159 | [
160 | x.doc[:idx1].text,
161 | x.doc[idx2].text,
162 | x.doc[1 + idx1 : idx2].text,
163 | x.doc[idx1].text,
164 | x.doc[1 + idx2 :].text,
165 | ]
166 | )
167 | return x
168 |
169 |
170 | # %% [markdown]
171 | # We add some transformation functions that use `wordnet` from [NLTK](https://www.nltk.org/) to replace different parts of speech with their synonyms.
172 |
173 | # %% {"tags": ["md-exclude-output"]}
174 | import nltk
175 | from nltk.corpus import wordnet as wn
176 |
177 | nltk.download("wordnet")
178 |
179 |
180 | def get_synonym(word, pos=None):
181 | """Get synonym for word given its part-of-speech (pos)."""
182 | synsets = wn.synsets(word, pos=pos)
183 | # Return None if wordnet has no synsets (synonym sets) for this word and pos.
184 | if synsets:
185 | words = [lemma.name() for lemma in synsets[0].lemmas()]
186 | if words[0].lower() != word.lower(): # Skip if synonym is same as word.
187 | # Multi word synonyms in wordnet use '_' as a separator e.g. reckon_with. Replace it with space.
188 | return words[0].replace("_", " ")
189 |
190 |
191 | def replace_token(spacy_doc, idx, replacement):
192 | """Replace token in position idx with replacement."""
193 | return " ".join([spacy_doc[:idx].text, replacement, spacy_doc[1 + idx :].text])
194 |
195 |
196 | @transformation_function(pre=[spacy])
197 | def replace_verb_with_synonym(x):
198 | # Get indices of verb tokens in sentence.
199 | verb_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "VERB"]
200 | if verb_idxs:
201 | # Pick random verb idx to replace.
202 | idx = np.random.choice(verb_idxs)
203 | synonym = get_synonym(x.doc[idx].text, pos="v")
204 | # If there's a valid verb synonym, replace it. Otherwise, return None.
205 | if synonym:
206 | x.text = replace_token(x.doc, idx, synonym)
207 | return x
208 |
209 |
210 | @transformation_function(pre=[spacy])
211 | def replace_noun_with_synonym(x):
212 | # Get indices of noun tokens in sentence.
213 | noun_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "NOUN"]
214 | if noun_idxs:
215 | # Pick random noun idx to replace.
216 | idx = np.random.choice(noun_idxs)
217 | synonym = get_synonym(x.doc[idx].text, pos="n")
218 | # If there's a valid noun synonym, replace it. Otherwise, return None.
219 | if synonym:
220 | x.text = replace_token(x.doc, idx, synonym)
221 | return x
222 |
223 |
224 | @transformation_function(pre=[spacy])
225 | def replace_adjective_with_synonym(x):
226 | # Get indices of adjective tokens in sentence.
227 | adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
228 | if adjective_idxs:
229 | # Pick random adjective idx to replace.
230 | idx = np.random.choice(adjective_idxs)
231 | synonym = get_synonym(x.doc[idx].text, pos="a")
232 | # If there's a valid adjective synonym, replace it. Otherwise, return None.
233 | if synonym:
234 | x.text = replace_token(x.doc, idx, synonym)
235 | return x
236 |
237 |
238 | # %%
239 | tfs = [
240 | change_person,
241 | swap_adjectives,
242 | replace_verb_with_synonym,
243 | replace_noun_with_synonym,
244 | replace_adjective_with_synonym,
245 | ]
246 |
247 | # %% [markdown]
248 | # Let's check out a few examples of transformed data points to see what our TFs are doing.
249 |
250 | # %%
251 | from utils import preview_tfs
252 |
253 | preview_tfs(df_train, tfs)
254 |
255 | # %% [markdown]
256 | # We notice a couple of things about the TFs.
257 | #
258 | # * Sometimes they make trivial changes (`"website"` to `"web site"` for `replace_noun_with_synonym`).
259 | # This can still be helpful for training our model, because it teaches the model to be invariant to such small changes.
260 | # * Sometimes they introduce incorrect grammar to the sentence (e.g. `swap_adjectives` swapping `"young"` and `"more"` above).
261 | #
262 | # The TFs are expected to be heuristic strategies that indeed preserve the class most of the time, but
263 | # [don't need to be perfect](https://arxiv.org/pdf/1901.11196.pdf).
264 | # This is especially true when using automated
265 | # [data augmentation techniques](https://snorkel.org/blog/tanda/)
266 | # which can learn to avoid particularly corrupted data points.
267 | # As we'll see below, Snorkel is compatible with such learned augmentation policies.
268 |
269 | # %% [markdown]
270 | # ## 3. Applying Transformation Functions
271 |
272 | # %% [markdown]
273 | # We'll first define a `Policy` to determine what sequence of TFs to apply to each data point.
274 | # We'll start with a [`RandomPolicy`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.RandomPolicy.html)
275 | # that samples `sequence_length=2` TFs to apply uniformly at random per data point.
276 | # The `n_per_original` argument determines how many augmented data points to generate per original data point.
277 |
278 | # %%
279 | from snorkel.augmentation import RandomPolicy
280 |
281 | random_policy = RandomPolicy(
282 | len(tfs), sequence_length=2, n_per_original=2, keep_original=True
283 | )
284 |
285 | # %% [markdown]
286 | # In some cases, we can do better than uniform random sampling.
287 | # We might have domain knowledge that some TFs should be applied more frequently than others,
288 | # or have trained an [automated data augmentation model](https://snorkel.org/blog/tanda/)
289 | # that learned a sampling distribution for the TFs.
290 | # Snorkel supports this use case with a
291 | # [`MeanFieldPolicy`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.MeanFieldPolicy.html),
292 | # which allows you to specify a sampling distribution for the TFs.
293 | # We give higher probabilities to the `replace_[X]_with_synonym` TFs, since those provide more information to the model.
294 |
295 | # %%
296 | from snorkel.augmentation import MeanFieldPolicy
297 |
298 | mean_field_policy = MeanFieldPolicy(
299 | len(tfs),
300 | sequence_length=2,
301 | n_per_original=2,
302 | keep_original=True,
303 | p=[0.05, 0.05, 0.3, 0.3, 0.3],
304 | )
305 |
306 | # %% [markdown]
307 | # To apply one or more TFs that we've written to a collection of data points according to our policy, we use a
308 | # [`PandasTFApplier`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/augmentation/snorkel.augmentation.PandasTFApplier.html)
309 | # because our data points are represented with a Pandas DataFrame.
310 |
311 | # %% {"tags": ["md-exclude-output"]}
312 | from snorkel.augmentation import PandasTFApplier
313 |
314 | tf_applier = PandasTFApplier(tfs, mean_field_policy)
315 | df_train_augmented = tf_applier.apply(df_train)
316 | Y_train_augmented = df_train_augmented["label"].values
317 |
318 | # %%
319 | print(f"Original training set size: {len(df_train)}")
320 | print(f"Augmented training set size: {len(df_train_augmented)}")
321 |
322 | # %% [markdown]
323 | # We have almost tripled our dataset using TFs!
324 | # Note that despite `n_per_original` being set to 2, our dataset may not exactly triple in size,
325 | # because sometimes TFs return `None` instead of a new data point
326 | # (e.g. `change_person` when applied to a sentence with no persons).
327 | # If you prefer to have exact proportions for your dataset, you can have TFs that can't perform a
328 | # valid transformation return the original data point rather than `None` (as they do here).
329 |
330 |
331 | # %% [markdown]
332 | # ## 4. Training A Model
333 | #
334 | # Our final step is to use the augmented data to train a model. We train an LSTM (Long Short-Term Memory) model, which is a standard architecture for text processing tasks.
335 |
336 | # %% [markdown] {"tags": ["md-exclude"]}
337 | # The next cell makes Keras results reproducible. You can ignore it.
338 |
339 | # %% {"tags": ["md-exclude"]}
340 | import tensorflow as tf
341 |
342 | session_conf = tf.compat.v1.ConfigProto(
343 | intra_op_parallelism_threads=1, inter_op_parallelism_threads=1
344 | )
345 |
346 | tf.compat.v1.set_random_seed(0)
347 | sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
348 | tf.compat.v1.keras.backend.set_session(sess)
349 |
350 | # %% [markdown]
351 | # Now we'll train our LSTM on both the original and augmented datasets to compare performance.
352 |
353 | # %% {"tags": ["md-exclude-output"]}
354 | from utils import featurize_df_tokens, get_keras_lstm
355 |
356 | X_train = featurize_df_tokens(df_train)
357 | X_train_augmented = featurize_df_tokens(df_train_augmented)
358 | X_test = featurize_df_tokens(df_test)
359 |
360 |
361 | def train_and_test(X_train, Y_train, X_test=X_test, Y_test=Y_test, num_buckets=30000):
362 | # Define a vanilla LSTM model with Keras
363 | lstm_model = get_keras_lstm(num_buckets)
364 | lstm_model.fit(X_train, Y_train, epochs=5, verbose=0)
365 | preds_test = lstm_model.predict(X_test)[:, 0] > 0.5
366 | return (preds_test == Y_test).mean()
367 |
368 |
369 | acc_augmented = train_and_test(X_train_augmented, Y_train_augmented)
370 | acc_original = train_and_test(X_train, Y_train)
371 |
372 | # %%
373 | print(f"Test Accuracy (original training data): {100 * acc_original:.1f}%")
374 | print(f"Test Accuracy (augmented training data): {100 * acc_augmented:.1f}%")
375 |
376 |
377 | # %% [markdown]
378 | # So using the augmented dataset indeed improved our model!
379 | # There is a lot more you can do with data augmentation, so try a few ideas
380 | # out on your own!
381 |
--------------------------------------------------------------------------------