├── README.md
├── VLE_datasets
│   └── v1
│       └── VLE_12k_dataset_v1.csv
├── docs
│   └── figs
│       ├── categories.jpg
│       ├── duration.jpg
│       ├── features.jpg
│       └── tokens.jpg
├── helper_code
│   ├── __init__.py
│   ├── feature_extraction
│   │   ├── __init__.py
│   │   ├── _api_utils.py
│   │   ├── _text_utils.py
│   │   ├── content_based_features.py
│   │   └── wikipedia_based_features.py
│   ├── helper_tools
│   │   ├── __init__.py
│   │   ├── evaluation_metrics.py
│   │   └── io_utils.py
│   └── models
│       ├── __init__.py
│       └── regression
│           ├── __init__.py
│           └── train_rf_regression_full_cv.py
└── setup.py
/README.md:
--------------------------------------------------------------------------------
1 | # Video lectures dataset
2 |
3 | This repository contains the dataset and the source code of the experiments reported using the VLE
4 | Dataset. The VLE dataset provides a set of statistics aimed at studying population-based (context-agnostic)
5 | engagement in video lectures, together with other conventional metrics of subjective assessment such as average star
6 | ratings and number of views. We believe the dataset will serve the community applying AI in Education to further
7 | understand which features of educational material make it engaging for learners.
8 |
9 | ## Use-Cases
10 | The dataset is particularly suited to addressing the cold-start problem found in educational recommender systems:
11 | i) ***user cold-start***, where new users join the system and we may not have enough information about their context,
12 | so we may simply recommend population-based engaging lectures for a specific query topic, and ii) ***item cold-start***,
13 | where newly released educational content has no user engagement data yet, so an engagement prediction model is
14 | necessary. To the best of our knowledge, this is the first dataset to tackle such a task in
15 | education/scientific recommendations at this scale.
16 |
17 | The dataset is a step towards improving the **sustainability** of future knowledge systems, with a direct impact on
18 | scalable, automatic quality assurance and personalised education. It
19 | improves **transparency** by allowing the interpretation of humanly intuitive features and their influence on
20 | population-based engagement prediction.
21 |
22 | ## Impact
23 |
24 | The VLE dataset is a resource contribution to the information retrieval,
25 | multimedia analysis, educational data mining, learning analytics and AI in Education research communities, as it
26 | enables a new line of research geared towards next-generation information and knowledge management within
27 | educational repositories, Massive Open Online Course platforms and other video/document platforms. This dataset complements the ongoing effort of understanding learner engagement in video lectures, and it improves the research landscape by formally
28 | establishing ***two objectively measurable novel tasks*** related to predicting engagement of educational
29 | videos, while making a significantly larger, more focused dataset and its baselines available to the research community, with more relevance to AI in education.
30 | The AI in Education, Intelligent Tutoring Systems and Educational Data Mining communities are growing rapidly
31 | and will benefit from this dataset, as it directly addresses issues in these knowledge fields. The simultaneously growing need for scalable, personalised learning solutions makes this dataset a central piece within the
32 | community that will enable improving scalable quality assurance and personalised educational recommendation in the years
33 | to come. The value of this dataset to the field is expected to grow with subsequent
34 | versions that will add more videos and more features.
35 |
36 |
37 | ## Using the Dataset and Its Tools
38 | The resource is developed in a way that any researcher with very basic technological literacy can start building on top
39 | of this dataset.
40 | - The dataset is provided in `Comma Separated Values (CSV)` format, making it *human-readable* while being accessible
41 | through a wide range of data manipulation and statistical software suites.
42 | - The resource includes
43 | `helper_tools`
44 | that provides a set of functions that any researcher with *basic python knowledge* can use to interact with the dataset
45 | and also evaluate the built models.
46 | - `models.regression`
47 | provides *well-documented example code snippets* that 1) enable the researcher to reproduce the results reported for the
48 | baseline models and 2) serve as example code for building novel models using the VLE dataset.
49 | - `feature_extraction`
50 | module presents the programming logic of how features in the dataset are calculated. The feature extraction logic is
51 | presented in the form of *well-documented (PEP-8 standard, Google Docstrings format)* Python functions that can be used
52 | to 1) understand the logic behind feature extraction or 2) apply the feature extraction logic to your own lecture records
53 | to generate more data.
54 |
55 | ## Structure of the Resource
56 |
57 | The repository is divided into two distinct top-level components.
58 | 1. `VLE_datasets`: This section stores the different versions of VLE datasets (current version: `v1`)
59 | 2. `helper_code`: This module stores all the code related to manipulating and managing the datasets.
60 |
61 | In addition, there are two files:
62 | - `README.md`: The main source of information for understanding and working with the VLE datasets.
63 | - `setup.py`: Python setup file that will install the support tools to your local python environment.
64 |
65 | ### Table of Contents
66 | - [VLE Datasets](#vle-datasets)
67 | - [Anonymity](#anonymity)
68 | - [Versions](#versions)
69 | - [Features](#features)
70 | - [General Features](#general-features)
71 | - [Content-based Features](#content-based-features)
72 | - [Wikipedia-based Features](#wikipedia-based-features)
73 | - [Video-specific Features](#video-specific-features)
74 | - [Labels](#labels)
75 | - [Explicit Rating](#explicit-rating)
76 | - [Popularity](#popularity)
77 | - [Watch Time/Engagement](#watch-timeengagement)
78 | - [VLE Dataset v1](#vle-dataset-v1-12k)
79 | - [Lecture Duration Distribution](#lecture-duration-distribution)
80 | - [Lecture Categories](#lecture-categories)
81 | - [`helper_code` Module](#helper_code-module)
82 | - [`feature_extraction`](#feature_extraction)
83 | - [`helper_tools`](#helper_tools)
84 | - [`models`](#models)
85 | - [References](#references)
86 |
87 |
88 | ## VLE Datasets
89 | This section makes the VLE datasets publicly available. The VLE dataset is constructed using the
90 | aggregated video lecture consumption data from a popular OER repository,
91 | [VideoLectures.Net](http://videolectures.net/). These videos are recorded when researchers are presenting their work at
92 | peer-reviewed conferences. Lectures are reviewed and hence material is controlled for correctness of knowledge and
93 | pedagogical robustness. Specifically, the dataset is comparatively
94 | more useful when building e-learning systems for Artificial Intelligence and Computer Science education, as the majority of
95 | lectures in the dataset belong to these topics.
96 |
97 | ### Versions
98 | All the relevant datasets are available as Comma Separated Values (CSV) files within a dataset subdirectory
99 | (e.g. `v1/VLE_12k_dataset_v1.csv`). At present, a dataset consisting of around 12,000 lectures is publicly available.
100 |
101 | | Dataset | Number of Lectures | Number of Users | Number of Star Ratings | Log Recency | URL |
102 | |---------|--------------------|-------------------|--------------------------|----|-----|
103 | | ***v1*** | 11568 | Over 1.1 Million | 2127 | Until February 01, 2021 | /VLE_datasets/v1 |
104 |
105 | The latest dataset of this collection is `v1`. The tools required to load
106 | and manipulate the datasets are found in the `helper_code.helper_tools.io_utils` module.
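
As a minimal usage sketch (assuming the helper tools have been installed into your Python environment, e.g. with `pip install -e .` via the provided `setup.py`, and that the repository root is the working directory):

```python
from helper_code.helper_tools.io_utils import load_lecture_dataset, get_label_from_dataset

# load the v1 dataset with all feature groups (col_version=3: content + Wikipedia + video features)
lectures = load_lecture_dataset("VLE_datasets/v1/VLE_12k_dataset_v1.csv", col_version=3)

label = get_label_from_dataset("median")  # median normalised engagement time column
print(lectures[label].describe())
```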
107 |
108 | ### Anonymity
109 | We restrict the final dataset to lectures that have been viewed by at least 5 unique users, to preserve the anonymity of
110 | users and to have reliable engagement measurements. Additionally, a range of techniques is used to preserve the
111 | anonymity of the data authors through the remaining features. Rarely occurring values of the *Lecture Type* feature were
112 | grouped together into an `other` category. The *Language* feature is grouped into `en` and `non-en` categories.
113 | Similarly, the *Domain* feature groups the Life Sciences, Physics, Technology, Mathematics, Computer Science, Data Science
114 | and Computers subjects into a `stem` category and the other subjects into a `misc` category. The
115 | *Published Date* is rounded to the nearest 10 days, and the *Lecture Duration* is rounded to the nearest 10 seconds.
116 | Gaussian white noise (10%) is added to the *Title Word Count* feature, which is then rounded to the nearest integer.
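
A minimal sketch of these rounding and noise steps (illustrative only, with assumed column names; this is not the exact script used to build the released dataset):

```python
import numpy as np
import pandas as pd

def anonymise_numeric_features(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Round date/duration features and perturb the title word count, as described above."""
    rng = np.random.default_rng(seed)
    df = df.copy()
    # Published Date (days since 01/01/1970), rounded to the nearest 10 days
    df["published_date_days"] = (df["published_date_days"] / 10).round() * 10
    # Lecture Duration rounded to the nearest 10 seconds
    df["duration"] = (df["duration"] / 10).round() * 10
    # Gaussian white noise (std = 10% of the value) added to Title Word Count, rounded to an integer
    noise = rng.normal(0.0, 0.1 * df["title_word_count"].abs())
    df["title_word_count"] = (df["title_word_count"] + noise).round().astype(int)
    return df
```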
117 |
118 |
119 | ### Features
120 | There are 4 main types of features extracted from the video lectures. These features can be categorised into [six quality
121 | verticals](https://www.k4all.org/wp-content/uploads/2019/08/IJCAI_paper_on_quality.pdf).
122 |
123 | All the features that are included in the dataset are summarised in Table 1.
124 |
125 |
126 | Table 1: Features extracted and available in the VLE dataset with their variable type (Continuous vs.
127 | Categorical) and their quality vertical.
128 |
129 |
130 |
131 |
132 | | Variable Type | Name | Quality Vertical | Description |
133 | |---------------|------|------------------|-------------|
134 | | **Metadata-based Features** | | | |
135 | | cat. | Language | - | Language of instruction of the video lecture |
136 | | cat. | Domain | - | Subject area (STEM or Miscellaneous) |
137 | | **Content-based Features** | | | |
138 | | con. | Word Count | Topic Coverage | Word Count of Transcript |
139 | | con. | Title Word Count | Topic Coverage | Word Count of Title |
140 | | con. | Document Entropy | Topic Coverage | Document Entropy of Transcript |
141 | | con. | Easiness (FK Easiness) | Understandability | FK Easiness readability score of Transcript text |
142 | | con. | Stop-word Presence Rate | Understandability | Stopword Presence Rate of Transcript text |
143 | | con. | Stop-word Coverage Rate | Understandability | Stopword Coverage Rate of Transcript text |
144 | | con. | Preposition Rate | Presentation | Preposition Rate of Transcript text |
145 | | con. | Auxiliary Rate | Presentation | Auxiliary Rate of Transcript text |
146 | | con. | To Be Rate | Presentation | To-Be Verb Rate of Transcript text |
147 | | con. | Conjunction Rate | Presentation | Conjunction Rate of Transcript text |
148 | | con. | Normalisation Rate | Presentation | Normalisation Rate of Transcript text |
149 | | con. | Pronoun Rate | Presentation | Pronoun Rate of Transcript text |
150 | | con. | Published Date | Freshness | Duration between 01/01/1970 and the lecture published date (in days) |
151 | | **Wikipedia-based Features** | | | |
152 | | cat. | Top-5 Authoritative Topic URLs | Authority | 5 Most Authoritative Topic URLs based on PageRank Score (5 features in this group) |
153 | | con. | Top-5 PageRank Scores | Authority | PageRank Scores of the top-5 most authoritative topics |
154 | | cat. | Top-5 Covered Topic URLs | Topic Coverage | 5 Most Covered Topic URLs based on Cosine Similarity Score (5 features in this group) |
155 | | con. | Top-5 Cosine Similarities | Topic Coverage | Cosine Similarity Scores of the top-5 most covered topics |
156 | | **Video-based Features** | | | |
157 | | con. | Lecture Duration | Topic Coverage | Duration of the video (in seconds) |
158 | | cat. | Is Chunked | Presentation | If the lecture consists of multiple videos |
159 | | cat. | Lecture Type | Presentation | Type of lecture (lecture, tutorial, invited talk etc.) |
160 | | con. | Speaker speed | Presentation | Speaker speed (words per minute) |
161 | | con. | Silence Period Rate (SPR) | Presentation | Fraction of silence in the lecture video |
294 |
295 |
296 |
297 |
298 | #### General Features
299 | Features extracted from the lecture metadata, associated with the language and subject of the material.
300 |
301 | #### Content-based Features
302 | Features that have been extracted from the contents discussed within the lecture. For English lectures, these features are
303 | extracted from the content transcript; for non-English lectures, they are extracted from the English
304 | translation. The transcription and translation services are provided by the
305 | [TransLectures](https://www.mllp.upv.es/projects/translectures/) project.
306 |
307 | #### Textual Feature Extraction
308 |
309 | Different groups of word tokens are used when calculating features such as `Preposition Rate`, `Auxiliary Rate` etc.
310 | as proposed by [Dalip et al.](https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.23650).
311 |
312 | The features are calculated using the formulae listed below:
313 |
314 | ![Formulae used for textual feature extraction](./docs/figs/features.jpg)
315 |
316 | The tokens used during feature extraction are listed below:
317 |
318 | ![Word-token groups used during feature extraction](./docs/figs/tokens.jpg)
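
A usage sketch of the corresponding functions in `helper_code.feature_extraction.content_based_features` (the transcript below is a toy placeholder; `get_readability_features` additionally requires the `textatistic` package):

```python
from helper_code.feature_extraction._text_utils import shallow_word_segment
from helper_code.feature_extraction.content_based_features import (
    compute_entropy, compute_preposition_rate, compute_stop_word_presence_rate,
    get_readability_features, word_count)

transcript = "In this lecture we talk about the basics of machine learning and its applications."
words = shallow_word_segment(transcript.lower())

print(word_count(transcript))                  # Word Count
print(compute_entropy(words))                  # Document Entropy
print(compute_stop_word_presence_rate(words))  # Stop-word Presence Rate
print(compute_preposition_rate(words))         # Preposition Rate
print(get_readability_features(transcript))    # FK Easiness
```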
319 |
320 | #### Wikipedia-based Features
321 | Two feature groups relating to *content authority* and *topic coverage* are extracted by connecting the lecture
322 | transcript to Wikipedia. [Entity Linking](http://www.wikifier.org/) technology is used to identify Wikipedia concepts
323 | that are associated with the lecture contents.
324 |
325 | - Most Authoritative Topics
326 | The Wikipedia topics in the lecture are used to build a semantic graph of the lecture, where the *Semantic Relatedness*
327 | is calculated using the Milne and Witten method [(4)](https://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf).
328 | PageRank is run on the semantic graph to identify the most authoritative topics within the lecture. The top-5 most
329 | authoritative topic URLs and their respective PageRank values are included in the dataset.
330 |
331 | - Most Covered Topics
332 | Similarly, the [Cosine Similarity](https://www.sciencedirect.com/topics/computer-science/cosine-similarity) between
333 | the Wikipedia topic page and the lecture transcript is used to rank the Wikipedia topics that are most covered in the
334 | video lecture. The top-5 most covered topic URLs and their respective cosine similarity values are included in the
335 | dataset.
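
A usage sketch of the corresponding helpers in `helper_code.feature_extraction.wikipedia_based_features` (an API key from http://www.wikifier.org/register.html is required; the transcript is a toy placeholder, and the sentence segmentation step needs the spaCy `en_core_web_sm` model):

```python
from helper_code.feature_extraction.wikipedia_based_features import (
    get_authority_wiki_features, get_coverage_wiki_features)

api_key = "YOUR_WIKIFIER_API_KEY"  # placeholder
transcript = "In this lecture we talk about the basics of machine learning and its applications."

# top-5 most authoritative topics (keys such as topic_1_pageRank_url / topic_1_pageRank_val)
print(get_authority_wiki_features(transcript, api_key, top_n=5))
# top-5 most covered topics (keys such as topic_1_cosine_url / topic_1_cosine_val)
print(get_coverage_wiki_features(transcript, api_key, top_n=5))
```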
336 |
337 | #### Video-specific Features
338 | Video-specific features are extracted and included in the dataset. Most of the features in this category are motivated
339 | by prior analyses of engagement in video lectures [(5)](https://doi.org/10.1145/2556325.2566239).
340 |
341 | ### Labels
342 | There are several target labels available in the VLE dataset. These target labels are created by aggregating
343 | available explicit and implicit feedback measures in the repository. Mainly, the labels can be constructed as three
344 | different types of quantifications of learners' subjective assessment of a video lecture. The labels available with the
345 | dataset are outlined in Table 2:
346 |
347 | Table 2: Labels in VLE dataset with their variable type (Continuous vs. Categorical), value interval
348 | and category.
349 |
350 |
351 |
352 |
353 | | Type | Label | Range Interval | Category |
354 | |------|-------|----------------|----------|
355 | | cont. | Mean Star Rating | [1,5) | Explicit Rating |
356 | | cont. | View Count | (5,∞) | Popularity |
357 | | cont. | SMNET | [0,1) | Watch Time |
358 | | cont. | SANET | [0,1) | Watch Time |
359 | | cont. | Std. of NET | (0,1) | Watch Time |
360 | | cont. | Number of User Sessions | (5,∞) | Watch Time |
395 |
396 |
397 |
398 |
399 | #### Explicit Rating
400 | In terms of rating labels, *Mean Star Rating* is provided for the video lecture using a star rating scale from 1 to 5
401 | stars. As expected, explicit ratings are scarce and thus only populated in a subset of resources (1250 lectures).
402 | Lecture records are labelled with `-1` where star rating labels are missing. The data source does not provide access
403 | to ratings from individual users. Instead, only the aggregated average rating is available.
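
For example, when modelling explicit ratings, the unrated lectures can be filtered out as below (a sketch; `mean_star_rating` is the rating column name used in `helper_code.helper_tools.io_utils`):

```python
import pandas as pd

lectures = pd.read_csv("VLE_datasets/v1/VLE_12k_dataset_v1.csv")
rated_lectures = lectures[lectures["mean_star_rating"] != -1]  # keep only lectures with a star rating
```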
404 |
405 | #### Popularity
406 |
407 | A popularity-based target label is created by extracting the *View Count* of the lectures. The total
408 | number of views for each video lecture as of February 17, 2018 is extracted from the metadata and provided with
409 | the dataset.
410 |
411 | #### Watch Time/Engagement
412 |
413 | The majority of learner engagement labels in the VLE dataset are based on watch time. We aggregate the user
414 | view logs and use the `Normalised Engagement Time (NET)` to compute the *Median of Normalised Engagement (MNET)*, as it
415 | has been proposed as the gold standard for engagement with educational materials in previous work
416 | [(5)](https://doi.org/10.1145/2556325.2566239). We also calculate the *Average of Normalised Engagement (ANET)*.
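
As a rough sketch of how such watch-time labels can be derived from raw view logs (our reading of NET from the cited prior work; not the exact aggregation pipeline used to build this dataset):

```python
import numpy as np

def normalised_engagement_times(watch_times_sec, duration_sec):
    """Normalised Engagement Time (NET) per user session: watch time / lecture duration, capped at 1."""
    return np.minimum(np.asarray(watch_times_sec, dtype=float) / duration_sec, 1.0)

net = normalised_engagement_times([120, 300, 540, 600], duration_sec=600)
mnet = np.median(net)  # Median of Normalised Engagement (MNET)
anet = np.mean(net)    # Average of Normalised Engagement (ANET)
print(mnet, anet)
```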
417 |
418 | ## VLE Dataset v1 (12k)
419 |
420 | ***VLE Dataset v1*** is the latest addition of video lecture engagement data to this collection.
421 | This dataset contains all the English lectures from our previous release, together with additional lectures.
422 |
423 | ### Lecture Duration Distribution
424 |
425 | Lecture duration is evidenced to be one of the most influential features when it comes to engagement with video
426 | lectures. Consistent with the observations of our previous work [(2)](https://educationaldatamining.org/files/conferences/EDM2020/papers/paper_62.pdf),
427 | the new dataset also shows a bimodal distribution of duration, as the density plot below illustrates.
428 |
429 | ![Lecture duration distribution](./docs/figs/duration.jpg)
430 |
431 | ### Lecture Categories
432 |
433 | The dataset contains lectures belonging to diverse topic categories. To preserve anonymity,
434 | we have grouped these lectures into `stem` and `misc` groups. The original data source has around 21
435 | top-level categories, whose distribution is presented below.
436 |
437 | ![Distribution of top-level lecture categories](./docs/figs/categories.jpg)
438 |
439 | Although the majority of the lectures
440 | belong to the Computer Science category, the dataset covers a diverse range of other categories. The predictive performance on non-CS lectures
441 | has also been empirically tested.
442 |
443 | ## `helper_code` Module
444 | This section contains the code that enables the research community to work with the VLE dataset. The folder
445 | structure in this section logically separates the code into three modules.
446 | ### `feature_extraction`
447 | This section contains the programming logic of the functions used for feature extraction. The main use of this module
448 | is to populate the features for your own lecture corpus using the exact programming logic that was used
449 | to populate the VLE data. Several files with feature-extraction-related functions are found in this module.
450 | - `_api_utils.py`: Internal functions relevant to making API calls to the [Wikifier](http://www.wikifier.org/).
451 | - `_text_utils.py`: Internal functions relevant to utility functions for handling text.
452 | - `content_based_features`: Functions and logic associated with extracting content-based features.
453 | - `wikipedia_based_features`: Functions and logic associated with extracting Wikipedia-based features.
454 |
455 | ### `helper_tools`
456 | This module includes the helper tools that are useful in working with the dataset. The two main submodules contain
457 | helper functions relating to evaluation and input-output operations.
458 | - `evaluation_metrics`: contains the helper functions to compute Root Mean Square Error (RMSE), Spearman's Rank Order
459 | Correlation Coefficient (SROCC) and Pairwise Ranking Accuracy (Pairwise). See the example after this list.
460 | - `io_utils`: contains the helper functions that are required for loading and manipulating the dataset.
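
For example (a toy sketch with made-up labels and predictions):

```python
from helper_code.helper_tools.evaluation_metrics import get_rmse, get_spearman_r

# purely illustrative labels and predictions
y_train, train_pred = [0.2, 0.5, 0.7, 0.9], [0.25, 0.45, 0.6, 0.8]
y_test, test_pred = [0.1, 0.4, 0.9], [0.15, 0.5, 0.8]

train_rmse, test_rmse = get_rmse(y_train, y_test, train_pred, test_pred)
train_srocc, test_srocc = get_spearman_r(y_train, y_test, train_pred, test_pred)
print(test_rmse, test_srocc.correlation)
```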
461 |
462 | ### `models`
463 | This module contains the Python scripts that have been used to create the current baselines. Currently, `regression`
464 | models have been proposed as baseline models for the tasks. The `models/regression/train_rf_regression_full_cv.py` script can be used to
465 | reproduce the baseline performance of the Random Forests (RF) models.
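
A condensed sketch of how such a baseline can be trained with the helper functions (not a verbatim copy of `train_rf_regression_full_cv.py`; the fold id and hyperparameters below are placeholders):

```python
from sklearn.ensemble import RandomForestRegressor

from helper_code.helper_tools.io_utils import (
    load_lecture_dataset, get_features_from_dataset, get_label_from_dataset, get_fold_from_dataset)
from helper_code.helper_tools.evaluation_metrics import get_rmse

lectures = load_lecture_dataset("VLE_datasets/v1/VLE_12k_dataset_v1.csv", col_version=1)
columns, lectures = get_features_from_dataset(1, lectures)  # content-based feature set
label = get_label_from_dataset("median")                    # median normalised engagement time

fold_train_df, fold_test_df = get_fold_from_dataset(lectures, fold=1)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(fold_train_df[columns], fold_train_df[label])

train_rmse, test_rmse = get_rmse(fold_train_df[label], fold_test_df[label],
                                 model.predict(fold_train_df[columns]),
                                 model.predict(fold_test_df[columns]))
print(train_rmse, test_rmse)
```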
466 |
--------------------------------------------------------------------------------
/docs/figs/categories.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/docs/figs/categories.jpg
--------------------------------------------------------------------------------
/docs/figs/duration.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/docs/figs/duration.jpg
--------------------------------------------------------------------------------
/docs/figs/features.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/docs/figs/features.jpg
--------------------------------------------------------------------------------
/docs/figs/tokens.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/docs/figs/tokens.jpg
--------------------------------------------------------------------------------
/helper_code/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/__init__.py
--------------------------------------------------------------------------------
/helper_code/feature_extraction/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/feature_extraction/__init__.py
--------------------------------------------------------------------------------
/helper_code/feature_extraction/_api_utils.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | import requests
4 | import ujson as json
5 |
6 | ERROR_KEY = u'error'
7 | STATUS_FIELD = u'status'
8 |
9 | URL_FIELD = u'url'
10 | PAGERANK_FIELD = u'pageRank'
11 |
12 | COSINE_FIELD = u'cosine'
13 |
14 | ANNOTATION_DATA_FIELD = u'annotation_data'
15 |
16 | _WIKIFIER_URL = u"http://www.wikifier.org/annotate-article"
17 | _WIKIFIER_MAX_SERVER_LIMIT = 25000
18 | WIKIFIER_MAX_CHAR_CEILING = round(_WIKIFIER_MAX_SERVER_LIMIT * .99) # 99% of max allowed num chars for a post request
19 |
20 |
21 | def _get_wikififier_concepts(resp):
22 | annotations = [{URL_FIELD: ann[URL_FIELD],
23 | COSINE_FIELD: ann[COSINE_FIELD],
24 | PAGERANK_FIELD: ann[PAGERANK_FIELD]} for ann in resp.get("annotations", [])]
25 |
26 | return {
27 | ANNOTATION_DATA_FIELD: annotations,
28 | STATUS_FIELD: resp[STATUS_FIELD]
29 | }
30 |
31 |
32 | def _get_wikifier_response(text, api_key, df_ignore, words_ignore):
33 | params = {"text": text,
34 | "userKey": api_key,
35 | "nTopDfValuesToIgnore": df_ignore,
36 | "nWordsToIgnoreFromList": words_ignore}
37 | r = requests.post(_WIKIFIER_URL, params)
38 | if r.status_code == 200:
39 | resp = json.loads(r.content)
40 | if ERROR_KEY in resp:
41 | raise ValueError("error in response : {}".format(resp[ERROR_KEY]))
42 | return resp
43 | else:
44 | raise ValueError("http status code 200 expected, got status code {} instead".format(r.status_code))
45 |
46 |
47 | def wikify(text, key, df_ignore, words_ignore):
48 | """This function takes in a text representation of a lecture transcript and associates relevant Wikipedia topics to
49 | it using www.wikifier.org entity linking technology.
50 |
51 | Args:
52 | text (str): text that needs to be Wikified (usually lecture transcript string)
53 | key (str): API key for Wikifier obtained from http://www.wikifier.org/register.html
54 | df_ignore (int): Most common words to ignore based on Document frequency
55 | words_ignore (int): Most common words to ignore based on Term frequency
56 |
57 | Returns:
58 | ({str: val}): a dict with the status of the request and the list of Wiki topics linked to the text
59 | """
60 | try:
61 | resp = _get_wikifier_response(text, key, df_ignore, words_ignore)
62 | resp[STATUS_FIELD] = 'success'
63 | except ValueError as e:
64 | try:
65 | STATUS_ = e.message
66 | except:
67 | STATUS_ = e.args[0]
68 | return {
69 | STATUS_FIELD: STATUS_
70 | }
71 | time.sleep(0.5)
72 | return _get_wikififier_concepts(resp)
73 |
--------------------------------------------------------------------------------
/helper_code/feature_extraction/_text_utils.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | SENTENCE_AGGREGATOR = " "
4 | LEN_SENTENCE_AGGR = len(SENTENCE_AGGREGATOR)
5 |
6 |
7 | def _make_regex_with_escapes(escapers):
8 | words_regex = r'{}[^_\W]+{}'
9 |
10 | temp_regexes = []
11 | for escaper_pair in escapers:
12 | (start, end) = escaper_pair
13 | temp_regexes.append(words_regex.format(start, end))
14 |
15 | return r"|".join(temp_regexes)
16 |
17 |
18 | def shallow_word_segment(phrase, escape_pairs=None):
19 | """ Takes in a string phrase and segments it into words based on simple regex
20 |
21 | Args:
22 | phrase (str): phrase to be segmented to words
23 | escape_pairs ([(str, str)]): list of tuples where each tuple is a pair of substrings that should not be
24 | used as word separators. The motivation is to escape special tags such as [HESITATION], ~SILENCE~
25 | IMPORTANT: Raw regex has to be used when defining escape pairs:
26 | ("[", "]") will not work as [] are special chars in regex. Instead use ("\[", "\]")
27 |
28 | Returns:
29 | ([str]): list of words extracted from the phrase
30 | """
31 | if escape_pairs is None:
32 | escape_pairs = []
33 |
34 | escape_pairs.append(("", ""))
35 |
36 | _regex = _make_regex_with_escapes(escape_pairs)
37 | return re.findall(_regex, phrase, flags=re.UNICODE)
38 |
39 |
40 | def _segment_sentences(text):
41 | """segments a text into a set of sentences
42 |
43 | Args:
44 | text (str): text to be segmented into sentences
45 |
46 | Yields:
47 | (str): each sentence in the text
48 | """
49 | import en_core_web_sm
50 | nlp = en_core_web_sm.load()
51 |
52 | text_sentences = nlp(text)
53 |
54 | for sentence in text_sentences.sents:
55 | yield sentence.text
56 |
57 |
58 | def partition_text(text, max_size):
59 | """takes a text string and creates a list of substrings that are shorter than a given length
60 |
61 | Args:
62 | text (str): text to be partitioned (usually a lecture transcript)
63 | max_size (int): maximum number of characters one partition should contain
64 |
65 | Returns:
66 | chunks([str]): list of sub strings where each substring is shorter than the given length
67 | """
68 | # get sentences
69 | sentences = _segment_sentences(text)
70 |
71 | chunks = []
72 |
73 | temp_sents = []
74 | temp_len = 0
75 | for sentence in sentences:
76 | len_sentence = len(sentence)
77 | expected_len = temp_len + LEN_SENTENCE_AGGR + len_sentence # estimate length cost
78 | if expected_len > max_size: # if it goes above threshold,
79 | if len(temp_sents) > 0:
80 | chunks.append(SENTENCE_AGGREGATOR.join(temp_sents)) # first load the preceding chunk
81 | temp_sents = []
82 | temp_len = 0
83 |
84 | temp_sents.append(sentence) # then aggregate the sentence to the temp chunk
85 | temp_len += len_sentence
86 |
87 | if len(temp_sents) > 0:
88 | chunks.append(SENTENCE_AGGREGATOR.join(temp_sents)) # send the remainder chunk
89 |
90 | return chunks
91 |
--------------------------------------------------------------------------------
/helper_code/feature_extraction/content_based_features.py:
--------------------------------------------------------------------------------
1 | """
2 | This file contains the functions that can be used to extract content-based features from a lecture transcript string and
3 | lecture title
4 | """
5 | from collections import Counter
6 | from helper_code.feature_extraction._text_utils import shallow_word_segment
7 |
8 | STOPWORDS = frozenset(
9 | [u'all', 'show', 'anyway', 'fifty', 'four', 'go', 'mill', 'find', 'seemed', 'one', 'whose', 're', u'herself',
10 | 'whoever', 'behind', u'should', u'to', u'only', u'under', 'herein', u'do', u'his', 'get', u'very', 'de',
11 | 'none', 'cannot', 'every', u'during', u'him', u'did', 'cry', 'beforehand', u'these', u'she', 'thereupon',
12 | u'where', 'ten', 'eleven', 'namely', 'besides', u'are', u'further', 'sincere', 'even', u'what', 'please',
13 | 'yet', 'couldnt', 'enough', u'above', u'between', 'neither', 'ever', 'across', 'thin', u'we', 'full',
14 | 'never', 'however', u'here', 'others', u'hers', 'along', 'fifteen', u'both', 'last', 'many', 'whereafter',
15 | 'wherever', u'against', 'etc', u's', 'became', 'whole', 'otherwise', 'among', 'via', 'co', 'afterwards',
16 | 'seems', 'whatever', 'alone', 'moreover', 'throughout', u'from', 'would', 'two', u'been', 'next', u'few',
17 | 'much', 'call', 'therefore', 'interest', u'themselves', 'thru', u'until', 'empty', u'more', 'fire',
18 | 'latterly', 'hereby', 'else', 'everywhere', 'former', u'those', 'must', u'me', u'myself', u'this', 'bill',
19 | u'will', u'while', 'anywhere', 'nine', u'can', u'of', u'my', 'whenever', 'give', 'almost', u'is', 'thus',
20 | u'it', 'cant', u'itself', 'something', u'in', 'ie', u'if', 'inc', 'perhaps', 'six', 'amount', u'same',
21 | 'wherein', 'beside', u'how', 'several', 'whereas', 'see', 'may', u'after', 'upon', 'hereupon', u'such', u'a',
22 | u'off', 'whereby', 'third', u'i', 'well', 'rather', 'without', u'so', u'the', 'con', u'yours', u'just',
23 | 'less', u'being', 'indeed', u'over', 'move', 'front', 'already', u'through', u'yourselves', 'still', u'its',
24 | u'before', 'thence', 'somewhere', u'had', 'except', u'ours', u'has', 'might', 'thereafter', u'then', u'them',
25 | 'someone', 'around', 'thereby', 'five', u'they', u'not', u'now', u'nor', 'hereafter', 'name', 'always',
26 | 'whither', u't', u'each', 'become', 'side', 'therein', 'twelve', u'because', 'often', u'doing', 'eg',
27 | u'some', 'back', u'our', 'beyond', u'ourselves', u'out', u'for', 'bottom', 'since', 'forty', 'per',
28 | 'everything', u'does', 'three', 'either', u'be', 'amoungst', 'whereupon', 'nowhere', 'although', 'found',
29 | 'sixty', 'anyhow', u'by', u'on', u'about', 'anything', u'theirs', 'could', 'put', 'keep', 'whence', 'due',
30 | 'ltd', 'hence', 'onto', u'or', 'first', u'own', 'seeming', 'formerly', u'into', 'within', u'yourself',
31 | u'down', 'everyone', 'done', 'another', 'thick', u'your', u'her', u'whom', 'twenty', 'top', u'there',
32 | 'system', 'least', 'anyone', u'their', u'too', 'hundred', u'was', u'himself', 'elsewhere', 'mostly', u'that',
33 | 'becoming', 'nobody', u'but', 'somehow', 'part', u'with', u'than', u'he', 'made', 'whether', u'up', 'us',
34 | 'nevertheless', u'below', 'un', u'were', 'toward', u'and', 'describe', u'am', 'mine', u'an', 'meanwhile',
35 | u'as', 'sometime', u'at', u'have', 'seem', u'any', 'fill', u'again', 'hasnt', u'no', 'latter', u'when',
36 | 'detail', 'also', u'other', 'take', u'which', 'becomes', u'you', 'towards', 'though', u'who', u'most',
37 | 'eight', 'amongst', 'nothing', u'why', u'don', 'noone', 'sometimes', 'together', 'serious', u'having',
38 | u'once'])
39 |
40 | CONJ_WORDS = frozenset(["and", "but", "or", "yet", "nor"])
41 |
42 | NORM_SUFFIXES = ["tion", "ment", "ence", "ance"]
43 |
44 | TO_BE_VERBS = frozenset(["be", "being", "was", "were", "been", "are", "is"])
45 |
46 | PREPOSITION_WORDS = frozenset(
47 | ["aboard", "about", "above", "according to", "across from", "after", "against", "alongside", "alongside of",
48 | "along with", "amid", "among", "apart from", "around", "aside from", "at", "away from", "back of", "because of",
49 | "before", "behind", "below", "beneath", "beside", "besides", "between", "beyond", "but", "by means of",
50 | "concerning", "considering", "despite", "down", "down from", "during", "except", "except for", "excepting for",
51 | "from among", "from between", "from under", "in addition to", "in behalf of", "in front of", "in place of",
52 | "in regard to", "inside of", "inside", "in spite of", "instead of", "into", "like", "near to", "off ",
53 | "on account of", "on behalf of", "onto", "on top of", "on", "opposite", "out of", "out", "outside", "outside of",
54 | "over to", "over", "owing to", "past", "prior to", "regarding", "round about", "round", "since", "subsequent to",
55 | "together", "with", "throughout", "through", "till", "toward", "under", "underneath", "until", "unto", "up",
56 | "up to", "upon", "with", "within", "without", "across", "along", "by", "of", "in", "to", "near", "of", "from"])
57 |
58 | AUXILIARY_VERBS = frozenset(
59 | ["will", "shall", "cannot", "may", "need to", "would", "should", "could", "might", "must", "ought", "ought to",
60 | "can’t", "can"])
61 |
62 | PRONOUN_WORDS = frozenset(
63 | ["i", "me", "we", "us", "you", "he", "him", "she", "her", "it", "they", "them", "thou", "thee", "ye", "myself",
64 | "yourself", "himself", "herself", "itself", "ourselves", "yourselves", "themselves", "oneself", "my", "mine",
65 | "his", "hers", "yours", "ours", "theirs", "its", "our", "that", "their", "these", "this", "those", "you"])
66 |
67 |
68 | def get_stopwords(additional_stopwords=set()):
69 | """returns the default stopword set aggregated to a custom stopword set provided.
70 |
71 | Args:
72 | additional_stopwords ({str}): set of additional stopwords to be added to the default stopword set.
73 |
74 | Returns:
75 | {str}: frozenset of stopwords
76 | """
77 | return STOPWORDS | additional_stopwords
78 |
79 |
80 | def word_count(s):
81 | """returns word count of a string
82 |
83 | Args:
84 | s (str): string to be word counted.
85 |
86 | Returns:
87 | (int): number of words in the string
88 |
89 | """
90 | return len(shallow_word_segment(s))
91 |
92 |
93 | def title_word_count(title):
94 | """returns word count of a title
95 |
96 | Args:
97 | title (str): title string to be word counted.
98 |
99 | Returns:
100 | (int): number of words in the title
101 |
102 | """
103 | return word_count(title)
104 |
105 |
106 | def compute_entropy(word_list):
107 | """Computes document entropy of a transcript calculated
108 | according to https://people.cs.umass.edu/~yanlei/publications/wsdm11.pdf
109 |
110 | Args:
111 | word_list ([str]): list of words in the transcript.
112 |
113 | Returns:
114 | (float): document entropy value
115 | """
116 | from scipy.stats import entropy
117 |
118 | word_histogram = Counter(word_list)
119 | total_word_count = float(len(word_list))
120 |
121 | word_probs = []
122 | for _, count in word_histogram.items():
123 | mle_pw = count / float(total_word_count)
124 | word_probs.append(mle_pw)
125 |
126 | return entropy(word_probs, base=2)
127 |
128 |
129 | def get_readability_features(text):
130 | """get FK easiness readability score from a text
131 | calculated according to https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
132 |
133 | Args:
134 | text (str): text string of the transcript
135 |
136 | Returns:
137 | easiness (float): FK ease readability score for the text.
138 |
139 | """
140 | # remove non words
141 | from textatistic import Textatistic
142 |
143 | try:
144 | text_score_obj = Textatistic(text)
145 | easiness = text_score_obj.flesch_score
146 | except ZeroDivisionError:
147 | easiness = 100.0
148 |
149 | return easiness
150 |
151 |
152 | def compute_stop_word_presence_rate(word_list, stop_word_set=get_stopwords()):
153 | """returns the stopword presence rate
154 | calculated according to fracStops in https://people.cs.umass.edu/~yanlei/publications/wsdm11.pdf
155 |
156 | Args:
157 | word_list ([str]): list of words in the transcript.
158 | stop_word_set ({set}): set of stopwords to be used
159 |
160 | Returns:
161 | (float): stopword presence rate
162 | """
163 | word_count = float(len(word_list))
164 |
165 | # if no words, return 0
166 | if word_count == 0:
167 | return 0.
168 |
169 | stopwords_count = 0.
170 |
171 | for w in word_list:
172 | if w in stop_word_set:
173 | stopwords_count += 1
174 |
175 | return stopwords_count / word_count
176 |
177 |
178 | def compute_stop_word_coverage_rate(word_list, stop_word_set=get_stopwords()):
179 | """returns the stopword coverage rate
180 | calculated according to stopCover in https://people.cs.umass.edu/~yanlei/publications/wsdm11.pdf
181 |
182 | Args:
183 | word_list ([str]): list of words in the transcript.
184 | stop_word_set ({set}): set of stopwords to be used
185 |
186 | Returns:
187 | (float): stopword coverage rate
188 | """
189 | word_count = float(len(word_list))
190 |
191 | # if no words, return 0
192 | if word_count == 0:
193 | return 0.
194 |
195 | stopwords_cardinality = float(len(stop_word_set))
196 | stopwords_present = set()
197 |
198 | for w in word_list:
199 | if w in stop_word_set:
200 | stopwords_present.add(w)
201 |
202 | return len(stopwords_present) / stopwords_cardinality
203 |
204 |
205 | def compute_conjunction_rate(word_list):
206 | """Compute conjugation word rate
207 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507
208 |
209 | Args:
210 | word_list ([str]): list of words in the transcript.
211 |
212 | Returns:
213 | (float): conjunction word rate
214 | """
215 | word_count = float(len(word_list))
216 |
217 | if word_count == 0:
218 | return 0.
219 |
220 | qualified_count = 0
221 |
222 | for word in word_list:
223 | if word.strip() in CONJ_WORDS:
224 | qualified_count += 1
225 |
226 | return qualified_count / word_count
227 |
228 |
229 | def compute_normalization_rate(word_list):
230 | """Compute normalization suffix rate
231 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507
232 |
233 | Args:
234 | word_list ([str]): list of words in the transcript.
235 |
236 | Returns:
237 | (float): normalization suffix rate
238 | """
239 | word_count = float(len(word_list))
240 |
241 | if word_count == 0:
242 | return 0.
243 |
244 | qualified_count = 0
245 |
246 | for word in word_list:
247 | _word_ = word.strip()
248 | for suffix in NORM_SUFFIXES:
249 | if _word_.endswith(suffix):
250 | qualified_count += 1
251 | break
252 |
253 | return qualified_count / word_count
254 |
255 |
256 | def compute_preposition_rate(word_list):
257 | """Compute preposition word rate
258 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507
259 |
260 | Args:
261 | word_list ([str]): list of words in the transcript.
262 |
263 | Returns:
264 | (float): preposition word rate
265 | """
266 | word_count = float(len(word_list))
267 |
268 | if word_count == 0:
269 | return 0.
270 |
271 | qualified_count = 0
272 |
273 | for word in word_list:
274 | if word.strip() in PREPOSITION_WORDS:
275 | qualified_count += 1
276 |
277 | return qualified_count / word_count
278 |
279 |
280 | def compute_tobe_verb_rate(word_list):
281 | """Compute to-be verb word rate
282 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507
283 |
284 | Args:
285 | word_list ([str]): list of words in the transcript.
286 |
287 | Returns:
288 | (float): to-be verb word rate
289 | """
290 | word_count = float(len(word_list))
291 |
292 | if word_count == 0:
293 | return 0.
294 |
295 | qualified_count = 0
296 |
297 | for word in word_list:
298 | if word.strip() in TO_BE_VERBS:
299 | qualified_count += 1
300 |
301 | return qualified_count / word_count
302 |
303 |
304 | def compute_auxiliary_verb_rate(word_list):
305 | """Compute auxiliary verb word rate
306 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507
307 |
308 | Args:
309 | word_list ([str]): list of words in the transcript.
310 |
311 | Returns:
312 | (float): auxiliary verb word rate
313 | """
314 | word_count = float(len(word_list))
315 |
316 | if word_count == 0:
317 | return 0.
318 |
319 | qualified_count = 0
320 |
321 | for word in word_list:
322 | if word.strip() in AUXILIARY_VERBS:
323 | qualified_count += 1
324 |
325 | return qualified_count / word_count
326 |
327 |
328 | def compute_pronouns_rate(word_list):
329 | """Compute pronoun word rate
330 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507
331 |
332 | Args:
333 | word_list ([str]): list of words in the transcript.
334 |
335 | Returns:
336 | (float): pronoun word rate
337 | """
338 | word_count = float(len(word_list))
339 |
340 | if word_count == 0:
341 | return 0.
342 |
343 | qualified_count = 0
344 |
345 | for word in word_list:
346 | if word.strip() in PRONOUN_WORDS:
347 | qualified_count += 1
348 |
349 | return qualified_count / word_count
350 |
--------------------------------------------------------------------------------
/helper_code/feature_extraction/wikipedia_based_features.py:
--------------------------------------------------------------------------------
1 | from collections import defaultdict
2 |
3 | from helper_code.feature_extraction._api_utils import wikify
4 | from helper_code.feature_extraction._text_utils import partition_text
5 |
6 | # values for Doc Frequency and Words to Ignore, more details about these variables
7 | # found at: http://www.wikifier.org/info.html
8 | DF_IGNORE_VAL = 50
9 | WORDS_IGNORE_VAL = 50
10 |
11 |
12 | def get_wikipedia_topic_features(text, api_key, chunk_size=5000):
13 | """ get Wikification for the transcript using http://www.wikifier.org
14 |
15 | Args:
16 | text (str): text that needs to be Wikified
17 | api_key (str): API key for Wikifier obtained from http://www.wikifier.org/register.html
18 | chunk_size (int): maximum number of characters to be included in each Wikified fragment.
19 |
20 | Returns:
21 | enrichments ([{str: val}]): list of annotated chunks from the transcript
22 |
23 | """
24 | text_partitions = partition_text(text, max_size=chunk_size)
25 |
26 | enrichments = []
27 | i = 1
28 | for text_part in text_partitions:
29 | temp_record = {}
30 | annotations = wikify(text_part, api_key, DF_IGNORE_VAL, WORDS_IGNORE_VAL)
31 | temp_record["part"] = i
32 | temp_record["text"] = text_part
33 | temp_record["annotations"] = annotations
34 | enrichments.append(temp_record)
35 | i += 1
36 |
37 | return enrichments
38 |
39 |
40 | def get_ranked_topics(chunks, option, top_n):
41 | """ ranks the topics using the aggregated score across multiple Wikified chunks of the text.
42 |
43 | Args:
44 | chunks ([{str: val}]): list of Wikified chunks for the transcript
45 | option {str}: pageRank or cosine
46 | top_n (int): n top ranked topics of interest
47 |
48 | Returns:
49 | final_rec ({str:val}): dict with key for top_n_url or top_n_value and the URL or value of the topic
50 |
51 | """
52 | chunks = list(chunks)
53 |
54 | total_length = sum([len(part["text"]) for part in chunks])
55 |
56 | records = defaultdict(list)
57 | for part in chunks:
58 | annotations = part["annotations"]["annotation_data"]
59 | weight = len(part["text"])
60 | norm = weight / total_length
61 | for concept in annotations:
62 | url = concept["url"]
63 | val = concept.get(option, 0.)
64 | records[url].append(val * norm)
65 |
66 | rec = [(title, sum(val)) for title, val in records.items()]
67 |
68 | # sort by normalised weight
69 | rec.sort(key=lambda l: l[1], reverse=True)
70 | n_recs = rec[:top_n]
71 |
72 | final_rec = {}
73 | for idx, item in enumerate(n_recs):
74 | url, val = item
75 | _idx = idx + 1
76 | final_rec["topic_{}_{}_url".format(_idx, option)] = url
77 | final_rec["topic_{}_{}_val".format(_idx, option)] = val
78 |
79 | return final_rec
80 |
81 |
82 | def get_authority_wiki_features(text, api_key, top_n):
83 | """ returns top-n most authoritative Wikipedia topics with PageRank scores.
84 | Calculated using http://www.wikifier.org/
85 |
86 | Args:
87 | text (str): text that needs to be Wikified for authority
88 | api_key (str): API key for Wikifier obtained from http://www.wikifier.org/register.html
89 | top_n (int): n top ranking topics to be returned with PageRank scores
90 |
91 | Returns:
92 | ranked_topic_records ({str:val}): dict with key for top_n_url or top_n_value and the URL or value of the topic
93 |
94 | """
95 | enriched_chunks = get_wikipedia_topic_features(text, api_key)
96 | ranked_topic_records = get_ranked_topics(enriched_chunks, "pageRank", top_n)
97 |
98 | return ranked_topic_records
99 |
100 |
101 | def get_coverage_wiki_features(text, api_key, top_n):
102 | """ returns top-n most covered Wikipedia topics with cosine similarity scores.
103 | Calculated using http://www.wikifier.org/
104 |
105 | Args:
106 | text (str): text that needs to be Wikified for coverage
107 | api_key (str): API key for Wikifier obtained from http://www.wikifier.org/register.html
108 | top_n (int): n top ranking topics to be returned with cosine scores
109 |
110 | Returns:
111 | ranked_topic_records ({str:val}): dict with key for top_n_url or top_n_value and the URL or value of the topic
112 |
113 | """
114 | enriched_chunks = get_wikipedia_topic_features(text, api_key)
115 | ranked_topic_records = get_ranked_topics(enriched_chunks, "cosine", top_n)
116 |
117 | return ranked_topic_records
118 |
--------------------------------------------------------------------------------
/helper_code/helper_tools/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/helper_tools/__init__.py
--------------------------------------------------------------------------------
/helper_code/helper_tools/evaluation_metrics.py:
--------------------------------------------------------------------------------
1 | from sklearn import metrics as skm
2 | from scipy.stats import spearmanr
3 | import numpy as np
4 |
5 | from helper_code.helper_tools.io_utils import get_pairwise_version, ID_COL, DOMAIN_COL
6 |
7 |
8 | def get_rmse(Y_train, Y_test, train_pred, test_pred):
9 | """ calculates Root Mean Squared Error (RMSE) for train and test data
10 |
11 | Args:
12 | Y_train ([float]): actual label in training data
13 | Y_test ([float]): actual label in testing data
14 | train_pred ([float]): predicted label for training data
15 | test_pred ([float]): predicted label for testing data
16 |
17 | Returns:
18 | train_rmse (float): metric for training data
19 | test_rmse (float): metric for testing data
20 |
21 | """
22 | train_rmse = np.sqrt(skm.mean_squared_error(Y_train, train_pred))
23 | test_rmse = np.sqrt(skm.mean_squared_error(Y_test, test_pred))
24 |
25 | return train_rmse, test_rmse
26 |
27 |
28 | def get_spearman_r(Y_train, Y_test, train_pred, test_pred):
29 | """ calculates Spearman's Rank Order Correlation (SROCC) for train and test data
30 |
31 | Args:
32 | Y_train ([float]): actual label in training data
33 | Y_test ([float]): actual label in testing data
34 | train_pred ([float]): predicted label for training data
35 | test_pred ([float]): predicted label for testing data
36 |
37 | Returns:
38 | train_spearman (float, float): r value and p-value metric for training data
39 | test_spearman (float, float): r value and p-value metric for testing data
40 |
41 | """
42 | train_spearman = spearmanr(Y_train, train_pred)
43 | test_spearman = spearmanr(Y_test, test_pred)
44 |
45 | return train_spearman, test_spearman
46 |
47 |
48 | def get_pairwise_accuracy(spark, label, fold_train_df, fold_test_df, train_pred, test_pred):
49 | """calculates Pairwise Accuracy (Pairwise) for train and test data
50 |
51 | Args:
52 | spark (SparkSession): Apache Spark session object
53 | label (str): label (median, mean etc.)
54 | fold_train_df (pd.DataFrame): training fold pandas DataFrame
55 | fold_test_df (pd.DataFrame): testing fold pandas DataFrame
56 | train_pred ([float]): predicted labels on training data
57 | test_pred ([float]): predicted labels on testing data
59 |
60 | Returns:
61 | train_accuracy (float): pairwise accuracy metric for training data
62 | test_accuracy (float): pairwise accuracy metric for testing data
63 |
64 | """
65 | train_Y_p = spark.createDataFrame(fold_train_df)
66 | train_Y_p = get_pairwise_version(train_Y_p, is_gap=True, label_only=True).toPandas()[
67 | [ID_COL, "_" + ID_COL, "gap_" + label, DOMAIN_COL]]
68 |
69 | predict_train_Y_p = fold_train_df
70 | predict_train_Y_p[label] = train_pred
71 | predict_train_Y_p = spark.createDataFrame(predict_train_Y_p)
72 | predict_train_Y_p = \
73 | get_pairwise_version(predict_train_Y_p, is_gap=True, label_only=True).toPandas()[
74 | [ID_COL, "_" + ID_COL, "gap_" + label, DOMAIN_COL]]
75 |
76 | test_Y_p = spark.createDataFrame(fold_test_df)
77 | test_Y_p = get_pairwise_version(test_Y_p, is_gap=True, label_only=True).toPandas()[
78 | [ID_COL, "_" + ID_COL, "gap_" + label, DOMAIN_COL]]
79 |
80 | predict_test_Y_p = fold_test_df
81 | predict_test_Y_p[label] = test_pred
82 | predict_test_Y_p = spark.createDataFrame(predict_test_Y_p)
83 | predict_test_Y_p = \
84 | get_pairwise_version(predict_test_Y_p, is_gap=True, label_only=True).toPandas()[
85 | [ID_COL, "_" + ID_COL, "gap_" + label, DOMAIN_COL]]
86 |
87 | train_accuracy = skm.accuracy_score(train_Y_p["gap_" + label] > 0.,
88 | predict_train_Y_p["gap_" + label] > 0., normalize=True)
89 | test_accuracy = skm.accuracy_score(test_Y_p["gap_" + label] > 0.,
90 | predict_test_Y_p["gap_" + label] > 0., normalize=True)
91 |
92 | return train_accuracy, test_accuracy
93 |
--------------------------------------------------------------------------------
/helper_code/helper_tools/io_utils.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 |
3 | MIN_NUM_SESSIONS = 5
4 |
5 | GEN_FEATURES = ['id', 'fold']
6 |
7 | CONT_COLS = ['categories', 'freshness',
8 | 'auxiliary_rate', 'conjugate_rate', 'normalization_rate', 'tobe_verb_rate', 'preposition_rate',
9 | 'pronoun_rate', 'document_entropy', 'easiness', 'fraction_stopword_coverage',
10 | 'fraction_stopword_presence', 'title_word_count', 'word_count']
11 |
12 | WIKI_COLS = ['auth_topic_rank_1_url', 'auth_topic_rank_1_score',
13 | 'auth_topic_rank_2_url', 'auth_topic_rank_2_score',
14 | 'auth_topic_rank_3_url', 'auth_topic_rank_3_score',
15 | 'auth_topic_rank_4_url', 'auth_topic_rank_4_score',
16 | 'auth_topic_rank_5_url', 'auth_topic_rank_5_score',
17 | 'coverage_topic_rank_1_url', 'coverage_topic_rank_1_score',
18 | 'coverage_topic_rank_2_url', 'coverage_topic_rank_2_score',
19 | 'coverage_topic_rank_3_url', 'coverage_topic_rank_3_score',
20 | 'coverage_topic_rank_4_url', 'coverage_topic_rank_4_score',
21 | 'coverage_topic_rank_5_url', 'coverage_topic_rank_5_score']
22 |
23 | VID_COLS = ['duration', 'type', 'has_parts', "speaker_speed", 'silent_period_rate']
24 |
25 | MEAN_ENGAGEMENT_RATE = 'mean_engagement'
26 | MED_ENGAGEMENT_RATE = 'med_engagement'
27 | MEAN_STAR_RATING = 'mean_star_rating'
28 | VIEW_COUNT = "view_count"
29 |
30 | LABEL_COLS = [MEAN_ENGAGEMENT_RATE, MED_ENGAGEMENT_RATE]
31 |
32 | DOMAIN_COL = "categories"
33 | ID_COL = "id"
34 |
35 | COL_VER_1 = CONT_COLS
36 | COL_VER_2 = CONT_COLS + WIKI_COLS
37 | COL_VER_3 = CONT_COLS + WIKI_COLS + VID_COLS
38 |
39 |
40 | def vectorise_video_features(lectures, columns):
41 | """converts the video specific categorical variables to one-hot encoding
42 |
43 | Args:
44 | lectures (pd.DataFrame): pandas DataFrame that has lectures dataset
45 | columns ([str]): set of feature columns
46 |
47 | Returns:
48 | lectures (pd.DataFrame): pandas DataFrame with updated columns that are one hot encoded
49 | columns ([str]): updated set of feature columns
50 |
51 | """
52 | dummies = pd.get_dummies(lectures['type']).rename(columns=lambda x: 'type_' + str(x))
53 | for col in dummies.columns:
54 | lectures[col] = dummies[col]
55 | columns += list(dummies.columns)
56 | columns.remove("type")
57 |
58 | return lectures, columns
59 |
60 |
61 | def _wikititle_from_url(url):
62 | title = url.split("/")[-1]
63 | return title.strip()
64 |
65 |
66 | def vectorise_wiki_features(lectures, columns):
67 | """converts the wikipedia specific categorical variables (topic 1 page rank and topic 1 cosine) to one-hot encoding
68 |
69 | Args:
70 | lectures (pd.DataFrame): pandas DataFrame that has lectures dataset
71 | columns ([str]): set of feature columns
72 |
73 | Returns:
74 | lectures (pd.DataFrame): pandas DataFrame with updated columns that are one hot encoded
75 | columns ([str]): updated set of feature columns
76 |
77 | """
78 | # get pageRank URL
79 | col_name = "auth_topic_rank_1_url"
80 | lectures[col_name] = lectures[col_name].apply(_wikititle_from_url)
81 |
82 | dummies = pd.get_dummies(lectures[col_name]).rename(columns=lambda x: 'authority_' + str(x))
83 | for col in dummies.columns:
84 | lectures[col] = dummies[col]
85 | columns += list(dummies.columns)
86 |
87 | # get cosine URL
88 | col_name = "coverage_topic_rank_1_url"
89 | lectures[col_name] = lectures[col_name].apply(_wikititle_from_url)
90 |
91 | dummies = pd.get_dummies(lectures[col_name]).rename(columns=lambda x: 'coverage_' + str(x))
92 | for col in dummies.columns:
93 | lectures[col] = dummies[col]
94 | columns += list(dummies.columns)
95 |
96 | for col in WIKI_COLS:
97 | columns.remove(col)
98 |
99 | lectures.drop(WIKI_COLS, axis='columns', inplace=True)
100 |
101 | return lectures, columns
102 |
103 |
104 | def _numerise_categorical(lectures):
105 | # lectures["language"] = lectures["language"].apply(lambda l: 1 if l == "en" else 0)
106 | lectures["categories"] = lectures["categories"].apply(lambda l: 1 if l == "stem" else 0)
107 |
108 | return lectures
109 |
110 |
111 | def transform_features(lectures):
112 | """converts the string represented metadata related features to numeric features.
113 | Args:
114 | lectures (pd.DataFrame): pandas DataFrame that has lectures dataset
115 |
116 | Returns:
117 | lectures (pd.DataFrame): pandas DataFrame with updated columns that are one hot encoded
118 |
119 | """
120 | lectures = _numerise_categorical(lectures)
121 |
122 | return lectures
123 |
124 |
125 | def load_lecture_dataset(input_filepath, col_version=1):
126 | """ takes in a distributed path to lecture data and pulls the data
127 |
128 | Args:
129 | input_filepath (str): input filepath where the dataset CSV file is.
130 | col_version (int): column version that defines the set of features being considered (could be 1, 2 or 3)
131 |
132 | Returns:
133 | lectures (pd.DataFrame): pandas DataFrame containing all the relevant fields from lectures data
134 | """
135 | if col_version == 1:
136 | columns = GEN_FEATURES + COL_VER_1
137 | elif col_version == 2:
138 | columns = GEN_FEATURES + COL_VER_2
139 | else:
140 | columns = GEN_FEATURES + COL_VER_3
141 |
142 | lectures = pd.read_csv(input_filepath)
143 |
144 | columns += LABEL_COLS
145 |
146 | return lectures[columns]
147 |
148 |
149 | def get_label_from_dataset(label_param):
150 | """gets actual label column name based on parameter
151 |
152 | Args:
153 | label_param (str): label parameter defined in the training scripts.
154 |
155 | Returns:
156 | (str): column name of the relevant label as per the dataset.
157 |
158 | """
159 | if label_param == "mean":
160 | return MEAN_ENGAGEMENT_RATE
161 | elif label_param == "median":
162 | return MED_ENGAGEMENT_RATE
163 | elif label_param == "rating":
164 | return MEAN_STAR_RATING
165 | else:
166 | return VIEW_COUNT
167 |
168 |
169 | def get_features_from_dataset(col_cat, lectures):
170 | """returns the correct set of feature column names and the relevant columns from the dataset.
171 |
172 | Args:
173 | col_cat (int): column category parameter that defines the final feature set.
174 | lectures (pd.DataFrame): pandas DataFrame with the full dataset including all features
175 |
176 | Returns:
177 | lectures (pd.DataFrame): pandas DataFrame with the full dataset including relevant features
178 | columns ([str]): list of column names relevant to the column category
179 |
180 | """
181 | if col_cat == 1:
182 | columns = COL_VER_1
183 | lectures = transform_features(lectures)
184 |
185 | if col_cat == 2:
186 | columns = COL_VER_2
187 | lectures = transform_features(lectures)
188 | # add wiki features
189 | lectures, columns = vectorise_wiki_features(lectures, columns)
190 |
191 | if col_cat == 3:
192 | columns = COL_VER_3
193 | lectures = transform_features(lectures)
194 | # add wiki features
195 | lectures, columns = vectorise_wiki_features(lectures, columns)
196 | # add video features
197 | lectures, columns = vectorise_video_features(lectures, columns)
198 |
199 | return columns, lectures
200 |
201 |
202 | def get_fold_from_dataset(lectures, fold):
203 | fold_train_df = lectures[lectures["fold"] != fold].reset_index(drop=True)
204 | fold_test_df = lectures[lectures["fold"] == fold].reset_index(drop=True)
205 |
206 | return fold_train_df, fold_test_df
207 |
208 |
209 | def get_pairwise_version(df, is_gap, label_only=False):
210 | """Get the pairwise representation of the lecture dataset
211 |
212 | Args:
213 | df (spark.DataFrame): spark DataFrame that needs to be transformed to pairwise format
214 | is_gap (bool): whether the pairwise gap (difference) between feature values is calculated
215 | label_only (bool): whether only the label columns are considered
216 |
217 | Returns:
218 | cross_df (spark.DataFrame): spark DataFrame transformed to pairwise format
219 |
220 | """
221 | from pyspark.sql import functions as func
222 |
223 | if label_only:
224 | tmp_columns = [ID_COL, DOMAIN_COL] + LABEL_COLS
225 | df = df.select(tmp_columns)
226 |
227 | _df = df
228 | cols = set(_df.columns)
229 |
230 | # rename columns
231 | for col in cols:
232 | _df = _df.withColumnRenamed(col, "_" + col)
233 |
234 | # do category wise pairing on different observations from the category
235 | cross_df = (df.crossJoin(_df).
236 | where(func.col(ID_COL) != func.col("_" + ID_COL)))
237 |
238 | if is_gap:
239 | gap_cols = cols
240 | for col in gap_cols:
241 | cross_df = cross_df.withColumn("gap_" + col, func.col(col) - func.col("_" + col))
242 |
243 | if label_only:
244 | cross_df = cross_df.select(
245 | [ID_COL, "_" + ID_COL, DOMAIN_COL] + ["gap_" + c for c in LABEL_COLS])
246 |
247 | return cross_df
248 |
--------------------------------------------------------------------------------
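The helpers above are typically chained together to prepare training folds. Below is a minimal usage sketch (not part of the repository), assuming the v1 dataset CSV is available locally; the file path, feature category and fold id are placeholders:

from helper_code.helper_tools.io_utils import (load_lecture_dataset, get_label_from_dataset,
                                               get_features_from_dataset, get_fold_from_dataset)

# load the dataset with feature category 1 and resolve the target label column
lectures = load_lecture_dataset("VLE_datasets/v1/VLE_12k_dataset_v1.csv", col_version=1)
label = get_label_from_dataset("median")

# transform the metadata features and select the feature columns for the chosen category
columns, lectures = get_features_from_dataset(1, lectures)

# split into train/test partitions using the pre-assigned "fold" column
fold_train_df, fold_test_df = get_fold_from_dataset(lectures, 1)
X_train, Y_train = fold_train_df[columns], fold_train_df[label]

--------------------------------------------------------------------------------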
/helper_code/models/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/models/__init__.py
--------------------------------------------------------------------------------
/helper_code/models/regression/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/models/regression/__init__.py
--------------------------------------------------------------------------------
/helper_code/models/regression/train_rf_regression_full_cv.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import SparkSession
2 |
3 | import pandas as pd
4 | import numpy as np
5 |
6 | from os.path import join
7 | from sklearn.ensemble import RandomForestRegressor
8 | import joblib
9 |
10 | from sklearn.model_selection import GridSearchCV
11 |
12 | from helper_code.helper_tools.evaluation_metrics import get_rmse, get_spearman_r, get_pairwise_accuracy
13 | from helper_code.helper_tools.io_utils import load_lecture_dataset, get_fold_from_dataset, \
14 | get_label_from_dataset, get_features_from_dataset
15 |
16 |
17 | def main(args):
18 | spark = (SparkSession.
19 | builder.
20 | config("spark.driver.memory", "20g").
21 | config("spark.executor.memory", "20g").
22 | config("spark.driver.maxResultSize", "20g").
23 | config("spark.rpc.lookupTimeout", "300s").
24 | config("spark.rpc.lookupTimeout", "300s").
25 | config("spark.master", "local[{}]".format(args["k_folds"]))).getOrCreate()
26 |
27 | spark.sparkContext.setLogLevel("ERROR")
28 |
29 | performance_values = []
30 |
31 | folds = args["k_folds"]
32 | jobs = args["n_jobs"]
33 | col_cat = args['feature_cat']
34 |
35 | lectures = load_lecture_dataset(args["training_data_filepath"], col_version=col_cat)
36 |
37 | label = get_label_from_dataset(args["label"])
38 |
39 | columns, lectures = get_features_from_dataset(col_cat, lectures)
40 | print(columns)
41 | cnt = 1
42 | # iterate over the k folds, training and evaluating one model per fold
43 | for i in range(folds):
44 | fold_train_df, fold_test_df = get_fold_from_dataset(lectures, cnt)
45 |
46 | X_train, Y_train = fold_train_df[columns], np.array(fold_train_df[label])
47 | X_test, Y_test = fold_test_df[columns], np.array(fold_test_df[label])
48 |
49 | if args["is_log"]:
50 | # log transformation of the data
51 | Y_train = np.log(Y_train)
52 | Y_test = np.log(Y_test)
53 |
54 | params = {'n_estimators': [100, 500, 750, 1000, 2000, 5000],
55 | 'max_depth': [3, 5, 10, 25]}
56 |
57 | print("\n\n\n ========== dataset {} created !!! ===========\n\n".format(cnt))
58 | print("no. of features: {}".format(X_train.shape[1]))
59 | print("training data size: {}".format(len(X_train)))
60 | print("testing data size: {}\n\n".format(len(X_test)))
61 |
62 | grid_model = GridSearchCV(RandomForestRegressor(), params, cv=folds, n_jobs=jobs, refit=True)
63 | grid_model.fit(X_train, Y_train)
64 |
65 | train_pred = grid_model.predict(X_train)
66 |
67 | print("Model Trained...")
68 |
69 | test_pred = grid_model.predict(X_test)
70 |
71 | joblib.dump(grid_model, join(args["output_dir"], "model_{}.pkl".format(cnt)), compress=True)
72 |
73 | if args["is_log"]:
74 | Y_train = np.exp(Y_train)
75 | train_pred = np.exp(train_pred)
76 | Y_test = np.exp(Y_test)
77 | test_pred = np.exp(test_pred)
78 |
79 | train_rmse, test_rmse = get_rmse(Y_train, Y_test, train_pred, test_pred)
80 |
81 | train_spearman, test_spearman = get_spearman_r(Y_train, Y_test, train_pred, test_pred)
82 |
83 | best_model = {}
84 | best_model["params"] = "{}_{}".format(grid_model.best_estimator_.n_estimators,
85 | grid_model.best_estimator_.max_depth)
86 | best_model["n_estimators"] = grid_model.best_estimator_.n_estimators
87 | best_model["max_depth"] \
88 | = grid_model.best_estimator_.max_depth
89 |
90 | best_model["train_rmse"] = train_rmse
91 | best_model["test_rmse"] = test_rmse
92 | best_model["train_spearman_r"] = train_spearman.correlation
93 | best_model["test_spearman_r"] = test_spearman.correlation
94 | best_model["train_spearman_p"] = train_spearman.pvalue
95 | best_model["test_spearman_p"] = test_spearman.pvalue
96 | best_model["fold_id"] = cnt
97 |
98 | print("Model: {}".format(best_model["params"]))
99 |
100 | performance_values.append(best_model)
101 | pd.DataFrame(performance_values).to_csv(join(args["output_dir"], "results.csv"), index=False)
102 |
103 | cnt += 1
104 |
105 |
106 | if __name__ == '__main__':
107 | """this script takes in the relevant parameters to train a context-agnostic engagement prediction model using a
108 | Random Forest Regressor (RF). The script outputs a "results.csv" with the evaluation metrics and k model
109 | files in joblib pickle format to the output directory.
110 |
111 | eg: command to run this script:
112 |
113 | python helper_code/models/regression/train_rf_regression_full_cv.py
114 | --training-data-filepath /path/to/v1/VLE_12k_dataset_v1.csv --output-dir path/to/output/directory --n-jobs 8
115 | --is-log --feature-cat 1 --k-folds 5 --label median
116 |
117 | """
118 | import argparse
119 |
120 | parser = argparse.ArgumentParser()
121 |
122 | parser.add_argument('--training-data-filepath', type=str, required=True,
123 | help="filepath where the training data is. Should be a CSV in the right format")
124 | parser.add_argument('--output-dir', type=str, required=True,
125 | help="output file dir where the models and the results are saved")
126 | parser.add_argument('--n-jobs', type=int, default=8,
127 | help="number of parallel jobs to run")
128 | parser.add_argument('--k-folds', type=int, default=5,
129 | help="Number of folds to be used in k-fold cross validation")
130 | parser.add_argument('--label', default='median', choices=['median', 'mean'],
131 |                     help="Defines what label should be used for training")
132 | parser.add_argument('--feature-cat', type=int, default=1,
133 |                     help="defines which feature set (1, 2 or 3) should be used for training")
134 | parser.add_argument('--is-log', action='store_true', help="Defines if the label should be log transformed.")
135 |
136 | args = vars(parser.parse_args())
137 |
138 | main(args)
139 |
--------------------------------------------------------------------------------
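Each fold's refit GridSearchCV object is persisted as "model_<fold>.pkl" in the output directory. The sketch below (not part of the repository) shows how one of those models could be reloaded and re-scored on its held-out fold; the dataset and model paths are placeholders, and it assumes the model was trained with --is-log, so predictions are mapped back with np.exp():

import joblib
import numpy as np

from helper_code.helper_tools.evaluation_metrics import get_rmse, get_spearman_r
from helper_code.helper_tools.io_utils import (load_lecture_dataset, get_label_from_dataset,
                                               get_features_from_dataset, get_fold_from_dataset)

# prepare the fold exactly as main() does
col_cat, fold = 1, 1
lectures = load_lecture_dataset("/path/to/v1/VLE_12k_dataset_v1.csv", col_version=col_cat)
columns, lectures = get_features_from_dataset(col_cat, lectures)
label = get_label_from_dataset("median")
fold_train_df, fold_test_df = get_fold_from_dataset(lectures, fold)

# the saved object is the refit GridSearchCV, so predict() delegates to the best RF estimator
model = joblib.load("/path/to/output/directory/model_1.pkl")
train_pred = model.predict(fold_train_df[columns])
test_pred = model.predict(fold_test_df[columns])

# the model in this sketch was trained on log-transformed labels, so undo the log transformation
train_pred, test_pred = np.exp(train_pred), np.exp(test_pred)

Y_train, Y_test = np.array(fold_train_df[label]), np.array(fold_test_df[label])
train_rmse, test_rmse = get_rmse(Y_train, Y_test, train_pred, test_pred)
train_spearman, test_spearman = get_spearman_r(Y_train, Y_test, train_pred, test_pred)
print(train_rmse, test_rmse, test_spearman.correlation)

--------------------------------------------------------------------------------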
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 |
3 | setup(
4 | name='',
5 | version='1.0',
6 | packages=['helper_code', 'helper_code.helper_tools', 'helper_code.feature_extraction',
7 | 'helper_code.models', 'helper_code.models.regression'],
8 | url='',
9 | license='',
10 | author='',
11 | author_email='',
12 | description='This Python package includes the datasets and the helper functions that allow building models for predicting context-agnostic (population-based) engagement of video lectures.',
13 | install_requires=[
14 | 'numpy>=1.14.1',
15 | 'pandas>=0.22.0',
16 | 'scipy>=1.0.1',
17 | 'nltk>=3.2.5',
18 | 'ujson>=1.35',
19 | 'scikit-learn>=0.19.1',
20 | 'pyspark>=2.4.5',
21 | 'textatistic>=0.0.1']
22 | )
23 |
--------------------------------------------------------------------------------