├── README.md
├── VLE_datasets
│   └── v1
│       └── VLE_12k_dataset_v1.csv
├── docs
│   └── figs
│       ├── categories.jpg
│       ├── duration.jpg
│       ├── features.jpg
│       └── tokens.jpg
├── helper_code
│   ├── __init__.py
│   ├── feature_extraction
│   │   ├── __init__.py
│   │   ├── _api_utils.py
│   │   ├── _text_utils.py
│   │   ├── content_based_features.py
│   │   └── wikipedia_based_features.py
│   ├── helper_tools
│   │   ├── __init__.py
│   │   ├── evaluation_metrics.py
│   │   └── io_utils.py
│   └── models
│       ├── __init__.py
│       └── regression
│           ├── __init__.py
│           └── train_rf_regression_full_cv.py
└── setup.py

/README.md:
--------------------------------------------------------------------------------
# Video lectures dataset

This repository contains the dataset and source code of the experiments conducted and reported using the VLE
Dataset. The VLE dataset provides a set of statistics aimed at studying population-based (context-agnostic)
engagement in video lectures, together with conventional subjective assessment metrics such as average star
ratings and number of views. We believe the dataset will serve the community applying AI in Education to further
understand which features of educational material make it engaging for learners.

## Use-Cases
The dataset is particularly suited to addressing the cold-start problem found in educational recommender systems:
i) ***user cold-start***, where new users join the system and we may not have enough information about their context,
so we may simply recommend population-based engaging lectures for a specific query topic, and ii) ***item cold-start***,
where new educational content is released for which we may not have user engagement data yet, so an engagement
prediction model is necessary. To the best of our knowledge, this is the first dataset to tackle such a task in
education/scientific recommendations at this scale.

The dataset is a pivotal milestone in uplifting the **sustainability** of future knowledge systems, with a direct impact on
scalable, automatic quality assurance and personalised education. It
improves **transparency** by allowing the interpretation of humanly intuitive features and their influence on
population-based engagement prediction.

## Impact

The VLE dataset is an impactful resource contribution to the information retrieval,
multimedia analysis, educational data mining, learning analytics and AI in education research communities, as it
enables a new line of research geared towards next-generation information and knowledge management within
educational repositories, Massively Open Online Course platforms and other video/document platforms. This dataset complements the ongoing effort to understand learner engagement in video lectures, and it improves the research landscape by formally
establishing ***two objectively measurable novel tasks*** related to predicting engagement of educational
videos, while making a significantly larger, more focused dataset and its baselines available to the research community with more relevance to AI in education.
The AI in Education, Intelligent Tutoring Systems and Educational Data Mining communities are on a rapid growth trajectory
and will benefit from this dataset as it directly addresses issues related to these knowledge fields.
The simultaneously growing need for scalable, personalised learning solutions makes this dataset a central piece within
the community, enabling improvements in scalable quality assurance and personalised educational recommendation in the years
to come. The value of this dataset to the field is expected to last, and it will increase as subsequent
versions of the dataset become available with more videos and more features.


## Using the Dataset and Its Tools
The resource is developed so that any researcher with very basic technological literacy can start building on top
of this dataset.
- The dataset is provided in `Comma Separated Values (CSV)` format, making it *human-readable* while being accessible
through a wide range of data manipulation and statistical software suites.
- The resource includes `helper_tools`,
which provides a set of functions that any researcher with *basic Python knowledge* can use to interact with the dataset
and also evaluate the built models.
- `models.regression`
provides *well-documented example code snippets* that 1) enable the researcher to reproduce the results reported for the
baseline models and 2) serve as example code for understanding how to build novel models using the VLE dataset.
- The `feature_extraction`
module presents the programming logic of how the features in the dataset are calculated. The feature extraction logic is
presented in the form of *well-documented (PEP-8 standard, Google Docstrings format)* Python functions that can be used
to 1) understand the logic behind feature extraction or 2) apply the feature extraction logic to your own lecture records
to generate more data.

## Structure of the Resource

The repository divides the resources into two distinct components at the top level.
1. `VLE_datasets`: This section stores the different versions of the VLE datasets (current version: `v1`).
2. `helper_code`: This module stores all the code related to manipulating and managing the datasets.

In addition, there are two files:
- `README.md`: The main source of information for understanding and working with the VLE datasets.
- `setup.py`: Python setup file that will install the support tools to your local Python environment.

### Table of Contents
- [VLE Datasets](#vle-datasets)
- [Anonymity](#anonymity)
- [Versions](#versions)
- [Features](#features)
- [General Features](#general-features)
- [Content-based Features](#content-based-features)
- [Wikipedia-based Features](#wikipedia-based-features)
- [Video-specific Features](#video-specific-features)
- [Labels](#labels)
- [Explicit Rating](#explicit-rating)
- [Popularity](#popularity)
- [Watch Time/Engagement](#watch-timeengagement)
- [VLE Dataset v1](#vle-dataset-v1-12k)
- [Lecture Duration Distribution](#lecture-duration-distribution)
- [Lecture Categories](#lecture-categories)
- [`helper_code` Module](#helper_code-module)
- [`feature_extraction`](#feature_extraction)
- [`helper_tools`](#helper_tools)
- [`models`](#models)
- [References](#references)


## VLE Datasets
This section makes the VLE datasets publicly available.
The VLE dataset is constructed using
aggregated video lecture consumption data coming from a popular OER repository,
[VideoLectures.Net](http://videolectures.net/). These videos are recorded when researchers present their work at
peer-reviewed conferences. Lectures are reviewed, and hence the material is controlled for correctness of knowledge and
pedagogical robustness. Specifically, the dataset is comparatively
more useful when building e-learning systems for Artificial Intelligence and Computer Science education, as the majority of
lectures in the dataset belong to these topics.

### Versions
All the relevant datasets are available as Comma Separated Values (CSV) files within a dataset subdirectory
(e.g. `v1/VLE_12k_dataset_v1.csv`). At present, a dataset consisting of around 12,000 lectures is available publicly.

| Dataset | Number of Lectures | Number of Users | Number of Star Ratings | Log Recency | URL |
|---------|--------------------|-------------------|--------------------------|----|-----|
| ***v1*** | 11568 | Over 1.1 Million | 2127 | Until February 01, 2021 | /VLE_datasets/v1 |

The latest dataset of this collection is `v1`. The tools required to load
and manipulate the datasets are found in the `helper_code.helper_tools.io_utils` module.

### Anonymity
We restrict the final dataset to lectures that have been viewed by at least 5 unique users, both to preserve the anonymity of
users and to have reliable engagement measurements. Additionally, a range of techniques is used to preserve the
anonymity of the data authors with respect to the remaining features. Rarely occurring values in the *Lecture Type* feature were
grouped together to create the `other` category. The *Language* feature is grouped into `en` and `non-en` categories.
Similarly, the Domain category groups the Life Sciences, Physics, Technology, Mathematics, Computer Science, Data Science
and Computers subjects into the `stem` category and the other subjects into the `misc` category. *Published Date* is
rounded to the nearest 10 days. *Lecture Duration* is rounded to the nearest 10 seconds.
Gaussian white noise (10%) is added to the *Title Word Count* feature, and the result is rounded to the nearest integer.


### Features
There are four main types of features extracted from the video lectures. These features can be categorised into [six quality
verticals](https://www.k4all.org/wp-content/uploads/2019/08/IJCAI_paper_on_quality.pdf).

All the features included in the dataset are summarised in Table 1.


Table 1: Features extracted and available in the VLE dataset with their variable type (Continuous vs.
Categorical) and their quality vertical.
| Variable Type | Name | Quality Vertical | Description |
|---------------|------|------------------|-------------|
| **Metadata-based Features** | | | |
| cat. | Language | - | Language of instruction of the video lecture |
| cat. | Domain | - | Subject area (STEM or Miscellaneous) |
| **Content-based Features** | | | |
| con. | Word Count | Topic Coverage | Word Count of Transcript |
| con. | Title Word Count | Topic Coverage | Word Count of Title |
| con. | Document Entropy | Topic Coverage | Document Entropy of Transcript |
| con. | Easiness (FK Easiness) | Understandability | FK Easiness score of Transcript text |
| con. | Stop-word Presence Rate | Understandability | Stopword Presence Rate of Transcript text |
| con. | Stop-word Coverage Rate | Understandability | Stopword Coverage Rate of Transcript text |
| con. | Preposition Rate | Presentation | Preposition Rate of Transcript text |
| con. | Auxiliary Rate | Presentation | Auxiliary Rate of Transcript text |
| con. | To Be Rate | Presentation | To-Be Verb Rate of Transcript text |
| con. | Conjunction Rate | Presentation | Conjunction Rate of Transcript text |
| con. | Normalisation Rate | Presentation | Normalisation Rate of Transcript text |
| con. | Pronoun Rate | Presentation | Pronoun Rate of Transcript text |
| con. | Published Date | Freshness | Duration between 01/01/1970 and the lecture published date (in days) |
| **Wikipedia-based Features** | | | |
| cat. | Top-5 Authoritative Topic URLs | Authority | 5 Most Authoritative Topic URLs based on PageRank Score (5 features in this group) |
| con. | Top-5 PageRank Scores | Authority | PageRank Scores of the top-5 most authoritative topics |
| cat. | Top-5 Covered Topic URLs | Topic Coverage | 5 Most Covered Topic URLs based on Cosine Similarity Score (5 features in this group) |
| con. | Top-5 Cosine Similarities | Topic Coverage | Cosine Similarity Scores of the top-5 most covered topics |
| **Video-based Features** | | | |
| con. | Lecture Duration | Topic Coverage | Duration of the video (in seconds) |
| cat. | Is Chunked | Presentation | Whether the lecture consists of multiple videos |
| cat. | Lecture Type | Presentation | Type of lecture (lecture, tutorial, invited talk etc.) |
| con. | Speaker Speed | Presentation | Speaker speed (words per minute) |
| con. | Silence Period Rate (SPR) | Presentation | Fraction of silence in the lecture video |
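For illustration, several of the content-based features in Table 1 can be recomputed for a new transcript with the functions in `helper_code.feature_extraction.content_based_features` (shown later in this repository). The snippet below is a minimal sketch, assuming the package has been installed locally (e.g. via `setup.py`), its optional dependencies (`scipy`, `textatistic`) are available, and the transcript is plain English text.

```python
# Minimal sketch: recomputing a few content-based features from Table 1 for a new
# transcript, using helper_code/feature_extraction helpers from this repository.
from helper_code.feature_extraction._text_utils import shallow_word_segment
from helper_code.feature_extraction.content_based_features import (
    word_count,
    compute_entropy,
    get_readability_features,
    compute_stop_word_presence_rate,
    compute_stop_word_coverage_rate,
    compute_preposition_rate,
)

transcript = "In this lecture we introduce the basics of machine learning ..."

words = shallow_word_segment(transcript)  # tokenise the transcript into words

features = {
    "word_count": word_count(transcript),                     # Word Count
    "document_entropy": compute_entropy(words),               # Document Entropy
    "easiness": get_readability_features(transcript),         # FK Easiness
    "stopword_presence": compute_stop_word_presence_rate(words),
    "stopword_coverage": compute_stop_word_coverage_rate(words),
    "preposition_rate": compute_preposition_rate(words),
}
print(features)
```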
#### General Features
Features extracted from the lecture metadata, associated with the language and subject of the materials.

#### Content-based Features
Features extracted from the content discussed within the lecture. For English lectures, these features are extracted
from the content transcript; for non-English lectures, they are extracted from the English translation. The
transcription and translation services are provided by the
[TransLectures](https://www.mllp.upv.es/projects/translectures/) project.

#### Textual Feature Extraction

Different groups of word tokens are used when calculating features such as `Preposition Rate`, `Auxiliary Rate` etc.,
as proposed by [Dalip et al.](https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.23650).

The features are calculated using the formulae listed below:

![Feature formulae](docs/figs/features.jpg)

The tokens used during feature extraction are listed below:

![Token groups](docs/figs/tokens.jpg)

#### Wikipedia-based Features
Two feature groups, associated with *content authority* and *topic coverage*, are extracted by connecting the lecture
transcript to Wikipedia. [Entity Linking](http://www.wikifier.org/) technology is used to identify Wikipedia concepts
that are associated with the lecture contents.

- Most Authoritative Topics:
The Wikipedia topics in the lecture are used to build a semantic graph of the lecture, where *Semantic Relatedness*
is calculated using the Milne and Witten method [(4)](https://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf).
PageRank is run on the semantic graph to identify the most authoritative topics within the lecture. The top-5 most
authoritative topic URLs and their respective PageRank values are included in the dataset.

- Most Covered Topics:
Similarly, the [Cosine Similarity](https://www.sciencedirect.com/topics/computer-science/cosine-similarity) between
the Wikipedia topic page and the lecture transcript is used to rank the Wikipedia topics that are most covered in the
video lecture. The top-5 most covered topic URLs and their respective cosine similarity values are included in the
dataset.

#### Video-specific Features
Video-specific features are extracted and included in the dataset. Most of the features in this category are motivated
by prior analyses of engagement in video lectures [(5)](https://doi.org/10.1145/2556325.2566239).

### Labels
There are several target labels available in the VLE dataset. These target labels are created by aggregating the
available explicit and implicit feedback measures in the repository. The labels can be constructed as three
different types of quantifications of learner subjective assessment of a video lecture. The labels available with the
dataset are outlined in Table 2.

Table 2: Labels in the VLE dataset with their variable type (Continuous vs. Categorical), value interval
and category.
| Type | Label | Range Interval | Category |
|------|-------|----------------|----------|
| cont. | Mean Star Rating | [1,5) | Explicit Rating |
| cont. | View Count | (5,∞) | Popularity |
| cont. | SMNET | [0,1) | Watch Time |
| cont. | SANET | [0,1) | Watch Time |
| cont. | Std. of NET | (0,1) | Watch Time |
| cont. | Number of User Sessions | (5,∞) | Watch Time |
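As a quick orientation to these labels, the sketch below loads the v1 CSV and summarises an engagement label using the helpers in `helper_code.helper_tools.io_utils` (shown later in this repository). This is a minimal sketch: it assumes the column names expected by `io_utils` (e.g. `mean_engagement`, `med_engagement`, `fold`) match the released CSV, and the file path is only an example pointing at a local copy.

```python
# Minimal sketch: loading the v1 dataset and inspecting a watch-time label with the
# helpers in helper_code/helper_tools/io_utils.py. Adjust csv_path to your local copy.
import pandas as pd

from helper_code.helper_tools.io_utils import (
    load_lecture_dataset,
    get_label_from_dataset,
)

csv_path = "VLE_datasets/v1/VLE_12k_dataset_v1.csv"  # example path

# col_version=1 selects the metadata + content-based feature columns plus the labels
lectures = load_lecture_dataset(csv_path, col_version=1)

label = get_label_from_dataset("median")  # maps to the median engagement column
print(lectures[label].describe())

# the raw CSV can also be read directly to inspect all available columns
raw = pd.read_csv(csv_path)
print(raw.columns.tolist())
```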
#### Explicit Rating
In terms of rating labels, the *Mean Star Rating* is provided for each video lecture using a star rating scale from 1 to 5
stars. As expected, explicit ratings are scarce and thus only populated in a subset of resources (1250 lectures).
Lecture records are labelled with `-1` where star rating labels are missing. The data source does not provide access
to ratings from individual users. Instead, only the aggregated average rating is available.

#### Popularity

A popularity-based target label is created by extracting the *View Count* of the lectures. The total
number of views for each video lecture as of February 17, 2018 is extracted from the metadata and provided with
the dataset.

#### Watch Time/Engagement

The majority of learner engagement labels in the VLE dataset are based on watch time. We aggregate the user
view logs and use the `Normalised Engagement Time (NET)` to compute the *Median of Normalised Engagement (MNET)*, as it
has been proposed as the gold standard for engagement with educational materials in previous work
[(5)](https://doi.org/10.1145/2556325.2566239). We also calculate the *Average of Normalised Engagement (ANET)*.

## VLE Dataset v1 (12k)

The ***VLE Dataset v1*** is the latest addition of video lecture engagement data to this collection.
This dataset contains all the English lectures from our previous release, together with additional lectures.

### Lecture Duration Distribution

Lecture duration is known to be one of the most influential features when it comes to engagement with video
lectures. Similar to the observations in our previous work [(2)](https://educationaldatamining.org/files/conferences/EDM2020/papers/paper_62.pdf),
the new dataset also shows a bimodal duration distribution, as presented in the density plot below.

![Lecture duration distribution](docs/figs/duration.jpg)

### Lecture Categories

The dataset contains lectures belonging to diverse topic categories. To preserve anonymity,
we have grouped these lectures into the `stem` and `misc` groups. The original data source has around 21
top-level categories, whose distribution is presented below.

![Lecture category distribution](docs/figs/categories.jpg)

Although the majority of lectures
belong to the Computer Science category, the dataset also covers a diverse set of other categories. The predictive performance on non-CS lectures
has also been empirically tested.

## `helper_code` Module
This section contains the code that enables the research community to work with the VLE dataset. The folder
structure in this section logically separates the code into three modules.
### `feature_extraction`
This section contains the programming logic of the functions used for feature extraction. This module is mainly useful
when one is interested in populating the features for their own lecture corpus using the exact programming logic used
to populate the VLE data. Several files with feature-extraction-related functions are found in this module.
- `_api_utils.py`: Internal functions relevant to making API calls to the [Wikifier](http://www.wikifier.org/).
- `_text_utils.py`: Internal utility functions for handling text.
- `content_based_features`: Functions and logic associated with extracting content-based features.
- `wikipedia_based_features`: Functions and logic associated with extracting Wikipedia-based features.

### `helper_tools`
This module includes the helper tools that are useful in working with the dataset. The two main submodules contain
helper functions relating to evaluation and input-output operations.
- `evaluation_metrics`: contains the helper functions to compute Root Mean Square Error (RMSE), Spearman's Rank Order
Correlation Coefficient (SROCC) and Pairwise Ranking Accuracy (Pairwise).
- `io_utils`: contains the helper functions that are required for loading and manipulating the dataset.

### `models`
This module contains the Python scripts that have been used to create the current baseline. Currently, `regression`
models have been proposed as baseline models for the tasks. The `models/regression/train_rf_regression_full_cv.py` script can be used to
reproduce the baseline performance of the Random Forests (RF) models.
--------------------------------------------------------------------------------
/docs/figs/categories.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/docs/figs/categories.jpg
--------------------------------------------------------------------------------
/docs/figs/duration.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/docs/figs/duration.jpg
--------------------------------------------------------------------------------
/docs/figs/features.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/docs/figs/features.jpg
--------------------------------------------------------------------------------
/docs/figs/tokens.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/docs/figs/tokens.jpg
--------------------------------------------------------------------------------
/helper_code/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/__init__.py
--------------------------------------------------------------------------------
/helper_code/feature_extraction/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/feature_extraction/__init__.py
--------------------------------------------------------------------------------
/helper_code/feature_extraction/_api_utils.py:
--------------------------------------------------------------------------------
1 | import time
2 | 
3 | import requests
4 | import ujson as json
5 | 
6 | ERROR_KEY = u'error'
7 | STATUS_FIELD = u'status'
8 | 
9 | URL_FIELD = u'url'
10 | PAGERANK_FIELD = u'pageRank'
11 | 
12 | COSINE_FIELD = u'cosine'
13 | 
14 | ANNOTATION_DATA_FIELD = u'annotation_data'
15 | 
16 | _WIKIFIER_URL = u"http://www.wikifier.org/annotate-article"
17 | _WIKIFIER_MAX_SERVER_LIMIT = 25000
18 | WIKIFIER_MAX_CHAR_CEILING = round(_WIKIFIER_MAX_SERVER_LIMIT * 
.99) # 99% of max allowed num chars for a post request 19 | 20 | 21 | def _get_wikififier_concepts(resp): 22 | annotations = [{URL_FIELD: ann[URL_FIELD], 23 | COSINE_FIELD: ann[COSINE_FIELD], 24 | PAGERANK_FIELD: ann[PAGERANK_FIELD]} for ann in resp.get("annotations", [])] 25 | 26 | return { 27 | ANNOTATION_DATA_FIELD: annotations, 28 | STATUS_FIELD: resp[STATUS_FIELD] 29 | } 30 | 31 | 32 | def _get_wikifier_response(text, api_key, df_ignore, words_ignore): 33 | params = {"text": text, 34 | "userKey": api_key, 35 | "nTopDfValuesToIgnore": df_ignore, 36 | "nWordsToIgnoreFromList": words_ignore} 37 | r = requests.post(_WIKIFIER_URL, params) 38 | if r.status_code == 200: 39 | resp = json.loads(r.content) 40 | if ERROR_KEY in resp: 41 | raise ValueError("error in response : {}".format(resp[ERROR_KEY])) 42 | return resp 43 | else: 44 | raise ValueError("http status code 200 expected, got status code {} instead".format(r.status_code)) 45 | 46 | 47 | def wikify(text, key, df_ignore, words_ignore): 48 | """This function takes in a text representation of a lecture transcript and associates relevant Wikipedia topics to 49 | it using www.wikifier.org entity linking technology. 50 | 51 | Args: 52 | text (str): text that needs to be Wikified (usually lecture transcript string) 53 | key (str): API key for Wikifier obtained from http://www.wikifier.org/register.html 54 | df_ignore (int): Most common words to ignore based on Document frequency 55 | words_ignore (int): Most common words to ignore based on Term frequency 56 | 57 | Returns: 58 | [{key:val}]: a dict with status of the request and the list of Wiki topics linked to the text 59 | """ 60 | try: 61 | resp = _get_wikifier_response(text, key, df_ignore, words_ignore) 62 | resp[STATUS_FIELD] = 'success' 63 | except ValueError as e: 64 | try: 65 | STATUS_ = e.message 66 | except: 67 | STATUS_ = e.args[0] 68 | return { 69 | STATUS_FIELD: STATUS_ 70 | } 71 | time.sleep(0.5) 72 | return _get_wikififier_concepts(resp) 73 | -------------------------------------------------------------------------------- /helper_code/feature_extraction/_text_utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | SENTENCE_AGGREGATOR = " " 4 | LEN_SENTENCE_AGGR = len(SENTENCE_AGGREGATOR) 5 | 6 | 7 | def _make_regex_with_escapes(escapers): 8 | words_regex = r'{}[^_\W]+{}' 9 | 10 | temp_regexes = [] 11 | for escaper_pair in escapers: 12 | (start, end) = escaper_pair 13 | temp_regexes.append(words_regex.format(start, end)) 14 | 15 | return r"|".join(temp_regexes) 16 | 17 | 18 | def shallow_word_segment(phrase, escape_pairs=None): 19 | """ Takes in a string phrase and segments it into words based on simple regex 20 | 21 | Args: 22 | phrase (str): phrase to be segmented to words 23 | escape_pairs ([(str, str)]): list of tuples where each tuple is a pair of substrings that should not be 24 | used as word seperators. The motivation is to escapte special tags such as [HESITATION], ~SILENCE~ 25 | IMPORTANT: Row regex has to be used when definng escapte pairs 26 | ("[", "]") will not work as [] are special chars in regex. 
Instead ("\[", "\]") 27 | 28 | Returns: 29 | ([str]): list of words extracted from the phrase 30 | """ 31 | if escape_pairs is None: 32 | escape_pairs = [] 33 | 34 | escape_pairs.append(("", "")) 35 | 36 | _regex = _make_regex_with_escapes(escape_pairs) 37 | return re.findall(_regex, phrase, flags=re.UNICODE) 38 | 39 | 40 | def _segment_sentences(text): 41 | """segments a text into a set of sentences 42 | 43 | Args: 44 | text: 45 | 46 | Returns: 47 | 48 | """ 49 | import en_core_web_sm 50 | nlp = en_core_web_sm.load() 51 | 52 | text_sentences = nlp(text) 53 | 54 | for sentence in text_sentences.sents: 55 | yield sentence.text 56 | 57 | 58 | def partition_text(text, max_size): 59 | """takes a text string and creates a list of substrings that are shorter than a given length 60 | 61 | Args: 62 | text (str): text to be partitioned (usually a lecture transcript) 63 | max_size (int): maximum number of characters one partition should contain 64 | 65 | Returns: 66 | chunks([str]): list of sub strings where each substring is shorter than the given length 67 | """ 68 | # get sentences 69 | sentences = _segment_sentences(text) 70 | 71 | chunks = [] 72 | 73 | temp_sents = [] 74 | temp_len = 0 75 | for sentence in sentences: 76 | len_sentence = len(sentence) 77 | expected_len = temp_len + LEN_SENTENCE_AGGR + len_sentence # estimate length cost 78 | if expected_len > max_size: # if it goes above threshold, 79 | if len(temp_sents) > 0: 80 | chunks.append(SENTENCE_AGGREGATOR.join(temp_sents)) # first load the preceding chunk 81 | temp_sents = [] 82 | temp_len = 0 83 | 84 | temp_sents.append(sentence) # then aggregate the sentence to the temp chunk 85 | temp_len += len_sentence 86 | 87 | if len(temp_sents) > 0: 88 | chunks.append(SENTENCE_AGGREGATOR.join(temp_sents)) # send the remainder chunk 89 | 90 | return chunks 91 | -------------------------------------------------------------------------------- /helper_code/feature_extraction/content_based_features.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file contains the functions that can be used to extract content-based features from a lecture transcript string and 3 | lecture title 4 | """ 5 | from collections import Counter 6 | from helper_code.feature_extraction._text_utils import shallow_word_segment 7 | 8 | STOPWORDS = frozenset( 9 | [u'all', 'show', 'anyway', 'fifty', 'four', 'go', 'mill', 'find', 'seemed', 'one', 'whose', 're', u'herself', 10 | 'whoever', 'behind', u'should', u'to', u'only', u'under', 'herein', u'do', u'his', 'get', u'very', 'de', 11 | 'none', 'cannot', 'every', u'during', u'him', u'did', 'cry', 'beforehand', u'these', u'she', 'thereupon', 12 | u'where', 'ten', 'eleven', 'namely', 'besides', u'are', u'further', 'sincere', 'even', u'what', 'please', 13 | 'yet', 'couldnt', 'enough', u'above', u'between', 'neither', 'ever', 'across', 'thin', u'we', 'full', 14 | 'never', 'however', u'here', 'others', u'hers', 'along', 'fifteen', u'both', 'last', 'many', 'whereafter', 15 | 'wherever', u'against', 'etc', u's', 'became', 'whole', 'otherwise', 'among', 'via', 'co', 'afterwards', 16 | 'seems', 'whatever', 'alone', 'moreover', 'throughout', u'from', 'would', 'two', u'been', 'next', u'few', 17 | 'much', 'call', 'therefore', 'interest', u'themselves', 'thru', u'until', 'empty', u'more', 'fire', 18 | 'latterly', 'hereby', 'else', 'everywhere', 'former', u'those', 'must', u'me', u'myself', u'this', 'bill', 19 | u'will', u'while', 'anywhere', 'nine', u'can', u'of', u'my', 'whenever', 'give', 
'almost', u'is', 'thus', 20 | u'it', 'cant', u'itself', 'something', u'in', 'ie', u'if', 'inc', 'perhaps', 'six', 'amount', u'same', 21 | 'wherein', 'beside', u'how', 'several', 'whereas', 'see', 'may', u'after', 'upon', 'hereupon', u'such', u'a', 22 | u'off', 'whereby', 'third', u'i', 'well', 'rather', 'without', u'so', u'the', 'con', u'yours', u'just', 23 | 'less', u'being', 'indeed', u'over', 'move', 'front', 'already', u'through', u'yourselves', 'still', u'its', 24 | u'before', 'thence', 'somewhere', u'had', 'except', u'ours', u'has', 'might', 'thereafter', u'then', u'them', 25 | 'someone', 'around', 'thereby', 'five', u'they', u'not', u'now', u'nor', 'hereafter', 'name', 'always', 26 | 'whither', u't', u'each', 'become', 'side', 'therein', 'twelve', u'because', 'often', u'doing', 'eg', 27 | u'some', 'back', u'our', 'beyond', u'ourselves', u'out', u'for', 'bottom', 'since', 'forty', 'per', 28 | 'everything', u'does', 'three', 'either', u'be', 'amoungst', 'whereupon', 'nowhere', 'although', 'found', 29 | 'sixty', 'anyhow', u'by', u'on', u'about', 'anything', u'theirs', 'could', 'put', 'keep', 'whence', 'due', 30 | 'ltd', 'hence', 'onto', u'or', 'first', u'own', 'seeming', 'formerly', u'into', 'within', u'yourself', 31 | u'down', 'everyone', 'done', 'another', 'thick', u'your', u'her', u'whom', 'twenty', 'top', u'there', 32 | 'system', 'least', 'anyone', u'their', u'too', 'hundred', u'was', u'himself', 'elsewhere', 'mostly', u'that', 33 | 'becoming', 'nobody', u'but', 'somehow', 'part', u'with', u'than', u'he', 'made', 'whether', u'up', 'us', 34 | 'nevertheless', u'below', 'un', u'were', 'toward', u'and', 'describe', u'am', 'mine', u'an', 'meanwhile', 35 | u'as', 'sometime', u'at', u'have', 'seem', u'any', 'fill', u'again', 'hasnt', u'no', 'latter', u'when', 36 | 'detail', 'also', u'other', 'take', u'which', 'becomes', u'you', 'towards', 'though', u'who', u'most', 37 | 'eight', 'amongst', 'nothing', u'why', u'don', 'noone', 'sometimes', 'together', 'serious', u'having', 38 | u'once']) 39 | 40 | CONJ_WORDS = frozenset(["and", "but", "or", "yet", "nor"]) 41 | 42 | NORM_SUFFIXES = ["tion", "ment", "ence", "ance"] 43 | 44 | TO_BE_VERBS = frozenset(["be", "being", "was", "were", "been", "are", "is"]) 45 | 46 | PREPOSITION_WORDS = frozenset( 47 | ["aboard", "about", "above", "according to", "across from", "after", "against", "alongside", "alongside of", 48 | "along with", "amid", "among", "apart from", "around", "aside from", "at", "away from", "back of", "because of", 49 | "before", "behind", "below", "beneath", "beside", "besides", "between", "beyond", "but", "by means of", 50 | "concerning", "considering", "despite", "down", "down from", "during", "except", "except for", "excepting for", 51 | "from among", "from between", "from under", "in addition to", "in behalf of", "in front of", "in place of", 52 | "in regard to", "inside of", "inside", "in spite of", "instead of", "into", "like", "near to", "off ", 53 | "on account of", "on behalf of", "onto", "on top of", "on", "opposite", "out of", "out", "outside", "outside of", 54 | "over to", "over", "owing to", "past", "prior to", "regarding", "round about", "round", "since", "subsequent to", 55 | "together", "with", "throughout", "through", "till", "toward", "under", "underneath", "until", "unto", "up", 56 | "up to", "upon", "with", "within", "without", "across", "along", "by", "of", "in", "to", "near", "of", "from"]) 57 | 58 | AUXILIARY_VERBS = frozenset( 59 | ["will", "shall", "cannot", "may", "need to", "would", "should", "could", "might", 
"must", "ought", "ought to", 60 | "can’t", "can"]) 61 | 62 | PRONOUN_WORDS = frozenset( 63 | ["i", "me", "we", "us", "you", "he", "him", "she", "her", "it", "they", "them", "thou", "thee", "ye", "myself", 64 | "yourself", "himself", "herself", "itself", "ourselves", "yourselves", "themselves", "oneself", "my", "mine", 65 | "his", "hers", "yours", "ours", "theirs", "its", "our", "that", "their", "these", "this", "those", "you"]) 66 | 67 | 68 | def get_stopwords(additional_stopwords=set()): 69 | """returns the default stopword set aggregated to a custom stopword set provided. 70 | 71 | Args: 72 | additional_stopwords ({str}): set of additional stopwords to be added to the default stopword set. 73 | 74 | Returns: 75 | {str}: frozenset of stopwords 76 | """ 77 | return STOPWORDS | additional_stopwords 78 | 79 | 80 | def word_count(s): 81 | """returns word count of a string 82 | 83 | Args: 84 | s (str): string to be word counted. 85 | 86 | Returns: 87 | (int): number of words in the string 88 | 89 | """ 90 | return len(shallow_word_segment(s)) 91 | 92 | 93 | def title_word_count(title): 94 | """returns word count of a title 95 | 96 | Args: 97 | title (str): title string to be word counted. 98 | 99 | Returns: 100 | (int): number of words in the title 101 | 102 | """ 103 | return word_count(title) 104 | 105 | 106 | def compute_entropy(word_list): 107 | """Computes document entropy of a transcript calculated 108 | according to https://people.cs.umass.edu/~yanlei/publications/wsdm11.pdf 109 | 110 | Args: 111 | word_list ([str]): list of words in the transcript. 112 | 113 | Returns: 114 | (float): document entropy value 115 | """ 116 | from scipy.stats import entropy 117 | 118 | word_histogram = Counter(word_list) 119 | total_word_count = float(len(word_list)) 120 | 121 | word_probs = [] 122 | for _, count in word_histogram.items(): 123 | mle_pw = count / float(total_word_count) 124 | word_probs.append(mle_pw) 125 | 126 | return entropy(word_probs, base=2) 127 | 128 | 129 | def get_readability_features(text): 130 | """get FK easiness readability score from a text 131 | calculated according to https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests 132 | 133 | Args: 134 | text (str): text string of the transcript 135 | 136 | Returns: 137 | easiness (float): FK ease readability score for the text. 138 | 139 | """ 140 | # remove non words 141 | from textatistic import Textatistic 142 | 143 | try: 144 | text_score_obj = Textatistic(text) 145 | easiness = text_score_obj.flesch_score 146 | except ZeroDivisionError: 147 | easiness = 100.0 148 | 149 | return easiness 150 | 151 | 152 | def compute_stop_word_presence_rate(word_list, stop_word_set=get_stopwords()): 153 | """returns the stopword presence rate 154 | calculated according to fracStops in https://people.cs.umass.edu/~yanlei/publications/wsdm11.pdf 155 | 156 | Args: 157 | word_list ([str]): list of words in the transcript. 158 | stop_word_set ({set}): set of stopwords to be used 159 | 160 | Returns: 161 | (float): stopword presence rate 162 | """ 163 | word_count = float(len(word_list)) 164 | 165 | # if no words, return 0 166 | if word_count == 0: 167 | return 0. 168 | 169 | stopwords_count = 0. 
170 | 171 | for w in word_list: 172 | if w in stop_word_set: 173 | stopwords_count += 1 174 | 175 | return stopwords_count / word_count 176 | 177 | 178 | def compute_stop_word_coverage_rate(word_list, stop_word_set=get_stopwords()): 179 | """returns the stopword coverage rate 180 | calculated according to stopCover in https://people.cs.umass.edu/~yanlei/publications/wsdm11.pdf 181 | 182 | Args: 183 | word_list ([str]): list of words in the transcript. 184 | stop_word_set ({set}): set of stopwords to be used 185 | 186 | Returns: 187 | (float): stopword coverage rate 188 | """ 189 | word_count = float(len(word_list)) 190 | 191 | # if no words, return 0 192 | if word_count == 0: 193 | return 0. 194 | 195 | stopwords_cardinality = float(len(stop_word_set)) 196 | stopwords_present = set() 197 | 198 | for w in word_list: 199 | if w in stop_word_set: 200 | stopwords_present.add(w) 201 | 202 | return len(stopwords_present) / stopwords_cardinality 203 | 204 | 205 | def compute_conjunction_rate(word_list): 206 | """Compute conjugation word rate 207 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507 208 | 209 | Args: 210 | word_list ([str]): list of words in the transcript. 211 | 212 | Returns: 213 | (float): conjugation word rate 214 | """ 215 | word_count = float(len(word_list)) 216 | 217 | if word_count == 0: 218 | return 0. 219 | 220 | qualified_count = 0 221 | 222 | for word in word_list: 223 | if word.strip() in CONJ_WORDS: 224 | qualified_count += 1 225 | 226 | return qualified_count / word_count 227 | 228 | 229 | def compute_normalization_rate(word_list): 230 | """Compute normalization suffix rate 231 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507 232 | 233 | Args: 234 | word_list ([str]): list of words in the transcript. 235 | 236 | Returns: 237 | (float): normalization suffix rate 238 | """ 239 | word_count = float(len(word_list)) 240 | 241 | if word_count == 0: 242 | return 0. 243 | 244 | qualified_count = 0 245 | 246 | for word in word_list: 247 | _word_ = word.strip() 248 | for suffix in NORM_SUFFIXES: 249 | if _word_.endswith(suffix): 250 | qualified_count += 1 251 | break 252 | 253 | return qualified_count / word_count 254 | 255 | 256 | def compute_preposition_rate(word_list): 257 | """Compute preposition word rate 258 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507 259 | 260 | Args: 261 | word_list ([str]): list of words in the transcript. 262 | 263 | Returns: 264 | (float): preposition word rate 265 | """ 266 | word_count = float(len(word_list)) 267 | 268 | if word_count == 0: 269 | return 0. 270 | 271 | qualified_count = 0 272 | 273 | for word in word_list: 274 | if word.strip() in PREPOSITION_WORDS: 275 | qualified_count += 1 276 | 277 | return qualified_count / word_count 278 | 279 | 280 | def compute_tobe_verb_rate(word_list): 281 | """Compute to-be verb word rate 282 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507 283 | 284 | Args: 285 | word_list ([str]): list of words in the transcript. 286 | 287 | Returns: 288 | (float): to-be verb word 289 | """ 290 | word_count = float(len(word_list)) 291 | 292 | if word_count == 0: 293 | return 0. 
294 | 295 | qualified_count = 0 296 | 297 | for word in word_list: 298 | if word.strip() in TO_BE_VERBS: 299 | qualified_count += 1 300 | 301 | return qualified_count / word_count 302 | 303 | 304 | def compute_auxiliary_verb_rate(word_list): 305 | """Compute auxiliary verb word rate 306 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507 307 | 308 | Args: 309 | word_list ([str]): list of words in the transcript. 310 | 311 | Returns: 312 | (float): auxiliary verb word rate 313 | """ 314 | word_count = float(len(word_list)) 315 | 316 | if word_count == 0: 317 | return 0. 318 | 319 | qualified_count = 0 320 | 321 | for word in word_list: 322 | if word.strip() in AUXILIARY_VERBS: 323 | qualified_count += 1 324 | 325 | return qualified_count / word_count 326 | 327 | 328 | def compute_pronouns_rate(word_list): 329 | """Compute pronoun word rate 330 | calculated according to https://dl.acm.org/doi/pdf/10.1145/2063504.2063507 331 | 332 | Args: 333 | word_list ([str]): list of words in the transcript. 334 | 335 | Returns: 336 | (float): pronoun word rate 337 | """ 338 | word_count = float(len(word_list)) 339 | 340 | if word_count == 0: 341 | return 0. 342 | 343 | qualified_count = 0 344 | 345 | for word in word_list: 346 | if word.strip() in PRONOUN_WORDS: 347 | qualified_count += 1 348 | 349 | return qualified_count / word_count 350 | -------------------------------------------------------------------------------- /helper_code/feature_extraction/wikipedia_based_features.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | 3 | from helper_code.feature_extraction._api_utils import wikify 4 | from helper_code.feature_extraction._text_utils import partition_text 5 | 6 | # values for Doc Frequency and Words to Ignore, more details about these variables 7 | # found at: http://www.wikifier.org/info.html 8 | DF_IGNORE_VAL = 50 9 | WORDS_IGNORE_VAL = 50 10 | 11 | 12 | def get_wikipedia_topic_features(text, api_key, chunk_size=5000): 13 | """ get Wikification for the transcript using http://www.wikifier.org 14 | 15 | Args: 16 | text (str): text that needs to be Wikified 17 | api_key (str): API key for Wikifier obtained from http://www.wikifier.org/register.html 18 | chunk_size (int): maximum number of characters that need included in each Wikified fragment. 19 | 20 | Returns: 21 | enrichments ([{str: val}]): list of annotated chunks from the transcript 22 | 23 | """ 24 | text_partitions = partition_text(text, max_size=chunk_size) 25 | 26 | enrichments = [] 27 | i = 1 28 | for text_part in text_partitions: 29 | temp_record = {} 30 | annotations = wikify(text_part, api_key, DF_IGNORE_VAL, WORDS_IGNORE_VAL) 31 | temp_record["part"] = i 32 | temp_record["text"] = text_part 33 | temp_record["annotations"] = annotations 34 | enrichments.append(temp_record) 35 | i += 1 36 | 37 | return enrichments 38 | 39 | 40 | def get_ranked_topics(chunks, option, top_n): 41 | """ ranks the topics using the aggregated score across multiple Wikified chunks of the text. 
42 | 43 | Args: 44 | chunks ([{str: val}]): list of Wikified chunks for the transcript 45 | option {str}: pageRank or cosine 46 | top_n (int): n top ranked topics of interest 47 | 48 | Returns: 49 | final_rec ({str:val}): dict with key for top_n_url or top_n_value and the URL or value of the topic 50 | 51 | """ 52 | chunks = list(chunks) 53 | 54 | total_length = sum([len(part["text"]) for part in chunks]) 55 | 56 | records = defaultdict(list) 57 | for part in chunks: 58 | annotations = part["annotations"]["annotation_data"] 59 | weight = len(part["text"]) 60 | norm = weight / total_length 61 | for concept in annotations: 62 | url = concept["url"] 63 | val = concept.get(option, 0.) 64 | records[url].append(val * norm) 65 | 66 | rec = [(title, sum(val)) for title, val in records.items()] 67 | 68 | # sort by normalised weight 69 | rec.sort(key=lambda l: l[1], reverse=True) 70 | n_recs = rec[:top_n] 71 | 72 | final_rec = {} 73 | for idx, item in enumerate(n_recs): 74 | url, val = item 75 | _idx = idx + 1 76 | final_rec["topic_{}_{}_url".format(_idx, option)] = url 77 | final_rec["topic_{}_{}_val".format(_idx, option)] = val 78 | 79 | return final_rec 80 | 81 | 82 | def get_authority_wiki_features(text, api_key, top_n): 83 | """ returns top-n most authoritative Wikipedia topics with PageRank scores. 84 | Calculated using http://www.wikifier.org/ 85 | 86 | Args: 87 | text (str): text that needs to be Wikified for authority 88 | api_key (str): API key for Wikifier obtained from http://www.wikifier.org/register.html 89 | top_n (int): n top ranking topics to be returned with PageRank scores 90 | 91 | Returns: 92 | ranked_topic_records ({str:val}): dict with key for top_n_url or top_n_value and the URL or value of the topic 93 | 94 | """ 95 | enriched_chunks = get_wikipedia_topic_features(text, api_key) 96 | ranked_topic_records = get_ranked_topics(enriched_chunks, "pageRank", top_n) 97 | 98 | return ranked_topic_records 99 | 100 | 101 | def get_coverage_wiki_features(text, api_key, top_n): 102 | """ returns top-n most covered Wikipedia topics with cosine similarity scores. 
103 | Calculated using http://www.wikifier.org/ 104 | 105 | Args: 106 | text (str): text that needs to be Wikified for coverage 107 | api_key (str): API key for Wikifier obtained from http://www.wikifier.org/register.html 108 | top_n (int): n top ranking topics to be returned with cosine scores 109 | 110 | Returns: 111 | ranked_topic_records ({str:val}): dict with key for top_n_url or top_n_value and the URL or value of the topic 112 | 113 | """ 114 | enriched_chunks = get_wikipedia_topic_features(text, api_key) 115 | ranked_topic_records = get_ranked_topics(enriched_chunks, "cosine", top_n) 116 | 117 | return ranked_topic_records 118 | -------------------------------------------------------------------------------- /helper_code/helper_tools/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/helper_tools/__init__.py -------------------------------------------------------------------------------- /helper_code/helper_tools/evaluation_metrics.py: -------------------------------------------------------------------------------- 1 | from sklearn import metrics as skm 2 | from scipy.stats import spearmanr 3 | import numpy as np 4 | 5 | from helper_code.helper_tools.io_utils import get_pairwise_version, ID_COL, DOMAIN_COL 6 | 7 | 8 | def get_rmse(Y_train, Y_test, train_pred, test_pred): 9 | """ calculates Root Mean Squared Error (RMSE) for train and test data 10 | 11 | Args: 12 | Y_train ([float]): actual label in training data 13 | Y_test ([float]): actual label in testing data 14 | train_pred ([float]): predicted label for training data 15 | test_pred ([float]): predicted label for testing data 16 | 17 | Returns: 18 | train_rmse (float): metric for training data 19 | test_rmse (float): metric for testing data 20 | 21 | """ 22 | train_rmse = np.sqrt(skm.mean_squared_error(Y_train, train_pred)) 23 | test_rmse = np.sqrt(skm.mean_squared_error(Y_test, test_pred)) 24 | 25 | return train_rmse, test_rmse 26 | 27 | 28 | def get_spearman_r(Y_train, Y_test, train_pred, test_pred): 29 | """ calculates Spearman's Rank Order Correlation (SROCC) for train and test data 30 | 31 | Args: 32 | Y_train ([float]): actual label in training data 33 | Y_test ([float]): actual label in testing data 34 | train_pred ([float]): predicted label for training data 35 | test_pred ([float]): predicted label for testing data 36 | 37 | Returns: 38 | train_spearman (float, float): r value and pvalue metric for training data 39 | test_spearman (float, float): r value and pvalue metric metric for testing data 40 | 41 | """ 42 | train_spearman = spearmanr(Y_train, train_pred) 43 | test_spearman = spearmanr(Y_test, test_pred) 44 | 45 | return train_spearman, test_spearman 46 | 47 | 48 | def get_pairwise_accuracy(spark, label, fold_train_df, fold_test_df, train_pred, test_pred): 49 | """calculates Pairwise Accuracy (Pairwise) for train and test data 50 | 51 | Args: 52 | spark (SparkSession): Apache Spark session object 53 | label (str): label (median, mean etc.) 
54 | fold_train_df (pd.DataFrame): training fold pandas DataFrame 55 | fold_test_df (pd.DataFrame): testing fold pandas DataFrame 56 | train_pred ([float]): predicted labels on training data 57 | test_pred ([float]): predicted labels on testing data 58 | test_pred ([float]): predicted labels on testing data 59 | 60 | Returns: 61 | train_accuracy (float): pairwise accuracy metric for training data 62 | test_accuracy (float): pairwise accuracy metric metric for testing data 63 | 64 | """ 65 | train_Y_p = spark.createDataFrame(fold_train_df) 66 | train_Y_p = get_pairwise_version(train_Y_p, is_gap=True, label_only=True).toPandas()[ 67 | [ID_COL, "_" + ID_COL, "gap_" + label, DOMAIN_COL]] 68 | 69 | predict_train_Y_p = fold_train_df 70 | predict_train_Y_p[label] = train_pred 71 | predict_train_Y_p = spark.createDataFrame(predict_train_Y_p) 72 | predict_train_Y_p = \ 73 | get_pairwise_version(predict_train_Y_p, is_gap=True, label_only=True).toPandas()[ 74 | [ID_COL, "_" + ID_COL, "gap_" + label, DOMAIN_COL]] 75 | 76 | test_Y_p = spark.createDataFrame(fold_test_df) 77 | test_Y_p = get_pairwise_version(test_Y_p, is_gap=True, label_only=True).toPandas()[ 78 | [ID_COL, "_" + ID_COL, "gap_" + label, DOMAIN_COL]] 79 | 80 | predict_test_Y_p = fold_test_df 81 | predict_test_Y_p[label] = test_pred 82 | predict_test_Y_p = spark.createDataFrame(predict_test_Y_p) 83 | predict_test_Y_p = \ 84 | get_pairwise_version(predict_test_Y_p, is_gap=True, label_only=True).toPandas()[ 85 | [ID_COL, "_" + ID_COL, "gap_" + label, DOMAIN_COL]] 86 | 87 | train_accuracy = skm.accuracy_score(train_Y_p["gap_" + label] > 0., 88 | predict_train_Y_p["gap_" + label] > 0., normalize=True) 89 | test_accuracy = skm.accuracy_score(test_Y_p["gap_" + label] > 0., 90 | predict_test_Y_p["gap_" + label] > 0., normalize=True) 91 | 92 | return train_accuracy, test_accuracy 93 | -------------------------------------------------------------------------------- /helper_code/helper_tools/io_utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | 3 | MIN_NUM_SESSIONS = 5 4 | 5 | GEN_FEATURES = ['id', 'fold'] 6 | 7 | CONT_COLS = ['categories', 'freshness', 8 | 'auxiliary_rate', 'conjugate_rate', 'normalization_rate', 'tobe_verb_rate', 'preposition_rate', 9 | 'pronoun_rate', 'document_entropy', 'easiness', 'fraction_stopword_coverage', 10 | 'fraction_stopword_presence', 'title_word_count', 'word_count'] 11 | 12 | WIKI_COLS = ['auth_topic_rank_1_url', 'auth_topic_rank_1_score', 13 | 'auth_topic_rank_2_url', 'auth_topic_rank_2_score', 14 | 'auth_topic_rank_3_url', 'auth_topic_rank_3_score', 15 | 'auth_topic_rank_4_url', 'auth_topic_rank_4_score', 16 | 'auth_topic_rank_5_url', 'auth_topic_rank_5_score', 17 | 'coverage_topic_rank_1_url', 'coverage_topic_rank_1_score', 18 | 'coverage_topic_rank_2_url', 'coverage_topic_rank_2_score', 19 | 'coverage_topic_rank_3_url', 'coverage_topic_rank_3_score', 20 | 'coverage_topic_rank_4_url', 'coverage_topic_rank_4_score', 21 | 'coverage_topic_rank_5_url', 'coverage_topic_rank_5_score'] 22 | 23 | VID_COLS = ['duration', 'type', 'has_parts', "speaker_speed", 'silent_period_rate'] 24 | 25 | MEAN_ENGAGEMENT_RATE = 'mean_engagement' 26 | MED_ENGAGEMENT_RATE = 'med_engagement' 27 | MEAN_STAR_RATING = 'mean_star_rating' 28 | VIEW_COUNT = "view_count" 29 | 30 | LABEL_COLS = [MEAN_ENGAGEMENT_RATE, MED_ENGAGEMENT_RATE] 31 | 32 | DOMAIN_COL = "categories" 33 | ID_COL = "id" 34 | 35 | COL_VER_1 = CONT_COLS 36 | COL_VER_2 = CONT_COLS + WIKI_COLS 37 | COL_VER_3 = 
CONT_COLS + WIKI_COLS + VID_COLS 38 | 39 | 40 | def vectorise_video_features(lectures, columns): 41 | """converts the video specific categorical variables to one-hot encoding 42 | 43 | Args: 44 | lectures (pd.DataFrame): pandas DataFrame that has lectures dataset 45 | columns ([str]): set of feature columns 46 | 47 | Returns: 48 | lectures (pd.DataFrame): pandas DataFrame with updated columns that are one hot encoded 49 | columns ([str]): updated set of feature columns 50 | 51 | """ 52 | dummies = pd.get_dummies(lectures['type']).rename(columns=lambda x: 'type_' + str(x)) 53 | for col in dummies.columns: 54 | lectures[col] = dummies[col] 55 | columns += list(dummies.columns) 56 | columns.remove("type") 57 | 58 | return lectures, columns 59 | 60 | 61 | def _wikititle_from_url(url): 62 | title = url.split("/")[-1] 63 | return title.strip() 64 | 65 | 66 | def vectorise_wiki_features(lectures, columns): 67 | """converts the wikipedia specific categorical variables (topic 1 page rank and topic 1 cosine) to one-hot encoding 68 | 69 | Args: 70 | lectures (pd.DataFrame): pandas DataFrame that has lectures dataset 71 | columns ([str]): set of feature columns 72 | 73 | Returns: 74 | lectures (pd.DataFrame): pandas DataFrame with updated columns that are one hot encoded 75 | columns ([str]): updated set of feature columns 76 | 77 | """ 78 | # get pageRank URL 79 | col_name = "auth_topic_rank_1_url" 80 | lectures[col_name] = lectures[col_name].apply(_wikititle_from_url) 81 | 82 | dummies = pd.get_dummies(lectures[col_name]).rename(columns=lambda x: 'authority_' + str(x)) 83 | for col in dummies.columns: 84 | lectures[col] = dummies[col] 85 | columns += list(dummies.columns) 86 | 87 | # get cosine URL 88 | col_name = "coverage_topic_rank_1_url" 89 | lectures[col_name] = lectures[col_name].apply(_wikititle_from_url) 90 | 91 | dummies = pd.get_dummies(lectures[col_name]).rename(columns=lambda x: 'coverage_' + str(x)) 92 | for col in dummies.columns: 93 | lectures[col] = dummies[col] 94 | columns += list(dummies.columns) 95 | 96 | for col in WIKI_COLS: 97 | columns.remove(col) 98 | 99 | lectures.drop(WIKI_COLS, axis='columns', inplace=True) 100 | 101 | return lectures, columns 102 | 103 | 104 | def _numerise_categorical(lectures): 105 | # lectures["language"] = lectures["language"].apply(lambda l: 1 if l == "en" else 0) 106 | lectures["categories"] = lectures["categories"].apply(lambda l: 1 if l == "stem" else 0) 107 | 108 | return lectures 109 | 110 | 111 | def transform_features(lectures): 112 | """converts the string represented metadata related features to numeric features. 113 | Args: 114 | lectures (pd.DataFrame): pandas DataFrame that has lectures dataset 115 | 116 | Returns: 117 | lectures (pd.DataFrame): pandas DataFrame with updated columns that are one hot encoded 118 | 119 | """ 120 | lectures = _numerise_categorical(lectures) 121 | 122 | return lectures 123 | 124 | 125 | def load_lecture_dataset(input_filepath, col_version=1): 126 | """ takes in a distributed path to lecture data and pulls the data 127 | 128 | Args: 129 | input_filepath (str): input filepath where the dataset CSV file is. 
130 | col_version (int): column version that defines the set of features being considered (could be 1, 2 or 3) 131 | 132 | Returns: 133 | lectures (pd.DataFrame): pandas DataFrame containing all the relevant fields from lectures data 134 | """ 135 | if col_version == 1: 136 | columns = GEN_FEATURES + COL_VER_1 137 | elif col_version == 2: 138 | columns = GEN_FEATURES + COL_VER_2 139 | else: 140 | columns = GEN_FEATURES + COL_VER_3 141 | 142 | lectures = pd.read_csv(input_filepath) 143 | 144 | columns += LABEL_COLS 145 | 146 | return lectures[columns] 147 | 148 | 149 | def get_label_from_dataset(label_param): 150 | """gets actual label column name based on parameter 151 | 152 | Args: 153 | label_param (str): label parameter defined in the traning scripts. 154 | 155 | Returns: 156 | (str): column name of the relevant label as per the dataset. 157 | 158 | """ 159 | if label_param == "mean": 160 | return MEAN_ENGAGEMENT_RATE 161 | elif label_param == "median": 162 | return MED_ENGAGEMENT_RATE 163 | elif label_param == "rating": 164 | return MEAN_STAR_RATING 165 | else: 166 | return VIEW_COUNT 167 | 168 | 169 | def get_features_from_dataset(col_cat, lectures): 170 | """returns the correct set of feature column names and the relevant columns from the dataset. 171 | 172 | Args: 173 | col_cat (int): column category parameter that defines the final feature set. 174 | lectures (pd.DataFrame): pandas DataFrame with the full dataset including all features 175 | 176 | Returns: 177 | lectures (pd.DataFrame): pandas DataFrame with the full dataset including relevant features 178 | columns ([str]): list of column names relevant to the column category 179 | 180 | """ 181 | if col_cat == 1: 182 | columns = COL_VER_1 183 | lectures = transform_features(lectures) 184 | 185 | if col_cat == 2: 186 | columns = COL_VER_2 187 | lectures = transform_features(lectures) 188 | # add wiki features 189 | lectures, columns = vectorise_wiki_features(lectures, columns) 190 | 191 | if col_cat == 3: 192 | columns = COL_VER_3 193 | lectures = transform_features(lectures) 194 | # add wiki features 195 | lectures, columns = vectorise_wiki_features(lectures, columns) 196 | # add video features 197 | lectures, columns = vectorise_video_features(lectures, columns) 198 | 199 | return columns, lectures 200 | 201 | 202 | def get_fold_from_dataset(lectures, fold): 203 | fold_train_df = lectures[lectures["fold"] != fold].reset_index(drop=True) 204 | fold_test_df = lectures[lectures["fold"] == fold].reset_index(drop=True) 205 | 206 | return fold_train_df, fold_test_df 207 | 208 | 209 | def get_pairwise_version(df, is_gap, label_only=False): 210 | """Get the pairwise representation of the lecture dataset 211 | 212 | Args: 213 | df (spark.DataFrame): spark DataFrame that need to transformed to pairwise format 214 | is_gap (bool): is gap between the features are calculated 215 | label_only (bool): are only the labels being considered 216 | 217 | Returns: 218 | cross_df (spark.DataFrame): park DataFrame that is transformed to pairwise format 219 | 220 | """ 221 | from pyspark.sql import functions as func 222 | 223 | if label_only: 224 | tmp_columns = [ID_COL, DOMAIN_COL] + LABEL_COLS 225 | df = df.select(tmp_columns) 226 | 227 | _df = df 228 | cols = set(_df.columns) 229 | 230 | # rename columns 231 | for col in cols: 232 | _df = _df.withColumnRenamed(col, "_" + col) 233 | 234 | # do category wise pairing on different observations from the category 235 | cross_df = (df.crossJoin(_df). 
236 |                 where(func.col(ID_COL) != func.col("_" + ID_COL)))
237 | 
238 |     if is_gap:
239 |         gap_cols = cols
240 |         for col in gap_cols:
241 |             cross_df = cross_df.withColumn("gap_" + col, func.col(col) - func.col("_" + col))
242 | 
243 |     if label_only:
244 |         cross_df = cross_df.select(
245 |             [ID_COL, "_" + ID_COL, DOMAIN_COL] + ["gap_" + c for c in LABEL_COLS])
246 | 
247 |     return cross_df
248 | 
--------------------------------------------------------------------------------
/helper_code/models/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/models/__init__.py
--------------------------------------------------------------------------------
/helper_code/models/regression/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sahanbull/VLE-Dataset/4e6b43b16931fd56a6033ef9cc61c9e19c1c8c9b/helper_code/models/regression/__init__.py
--------------------------------------------------------------------------------
/helper_code/models/regression/train_rf_regression_full_cv.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import SparkSession
2 | 
3 | import pandas as pd
4 | import numpy as np
5 | 
6 | from os.path import join
7 | from sklearn.ensemble import RandomForestRegressor
8 | import joblib
9 | 
10 | from sklearn.model_selection import GridSearchCV
11 | 
12 | from helper_code.helper_tools.evaluation_metrics import get_rmse, get_spearman_r, get_pairwise_accuracy
13 | from helper_code.helper_tools.io_utils import load_lecture_dataset, get_fold_from_dataset, \
14 |     get_label_from_dataset, get_features_from_dataset
15 | 
16 | 
17 | def main(args):
18 |     spark = (SparkSession.
19 |              builder.
20 |              config("spark.driver.memory", "20g").
21 |              config("spark.executor.memory", "20g").
22 |              config("spark.driver.maxResultSize", "20g").
23 |              config("spark.rpc.lookupTimeout", "300s").
24 |              # run Spark locally with one worker thread per cross-validation fold
25 |              config("spark.master", "local[{}]".format(args["k_folds"]))).getOrCreate()
26 | 
27 |     spark.sparkContext.setLogLevel("ERROR")
28 | 
29 |     performance_values = []
30 | 
31 |     folds = args["k_folds"]
32 |     jobs = args["n_jobs"]
33 |     col_cat = args['feature_cat']
34 | 
35 |     lectures = load_lecture_dataset(args["training_data_filepath"], col_version=col_cat)
36 | 
37 |     label = get_label_from_dataset(args["label"])
38 | 
39 |     columns, lectures = get_features_from_dataset(col_cat, lectures)
40 |     print(columns)
41 |     cnt = 1
42 |     # iterate over the k folds, holding one fold out for testing each time
43 |     for i in range(folds):
44 |         fold_train_df, fold_test_df = get_fold_from_dataset(lectures, cnt)
45 | 
46 |         X_train, Y_train = fold_train_df[columns], np.array(fold_train_df[label])
47 |         X_test, Y_test = fold_test_df[columns], np.array(fold_test_df[label])
48 | 
49 |         if args["is_log"]:
50 |             # log transformation of the label
51 |             Y_train = np.log(Y_train)
52 |             Y_test = np.log(Y_test)
53 | 
54 |         params = {'n_estimators': [100, 500, 750, 1000, 2000, 5000],
55 |                   'max_depth': [3, 5, 10, 25]}
56 | 
57 |         print("\n\n\n ========== dataset {} created !!! ===========\n\n".format(cnt))
58 |         print("no. of features: {}".format(X_train.shape[1]))
59 |         print("training data size: {}".format(len(X_train)))
60 |         print("testing data size: {}\n\n".format(len(X_test)))
61 | 
62 |         grid_model = GridSearchCV(RandomForestRegressor(), params, cv=folds, n_jobs=jobs, refit=True)
63 |         grid_model.fit(X_train, Y_train)
64 | 
65 |         train_pred = grid_model.predict(X_train)
66 | 
67 |         print("Model Trained...")
68 | 
69 |         test_pred = grid_model.predict(X_test)
70 | 
71 |         joblib.dump(grid_model, join(args["output_dir"], "model_{}.pkl".format(cnt)), compress=True)
72 | 
73 |         if args["is_log"]:  # transform labels and predictions back to the original scale
74 |             Y_train = np.exp(Y_train)
75 |             train_pred = np.exp(train_pred)
76 |             Y_test = np.exp(Y_test)
77 |             test_pred = np.exp(test_pred)
78 | 
79 |         train_rmse, test_rmse = get_rmse(Y_train, Y_test, train_pred, test_pred)
80 | 
81 |         train_spearman, test_spearman = get_spearman_r(Y_train, Y_test, train_pred, test_pred)
82 | 
83 |         best_model = {}
84 |         best_model["params"] = "{}_{}".format(grid_model.best_estimator_.n_estimators,
85 |                                               grid_model.best_estimator_.max_depth)
86 |         best_model["n_estimators"] = grid_model.best_estimator_.n_estimators
87 |         best_model["max_depth"] \
88 |             = grid_model.best_estimator_.max_depth
89 | 
90 |         best_model["train_rmse"] = train_rmse
91 |         best_model["test_rmse"] = test_rmse
92 |         best_model["train_spearman_r"] = train_spearman.correlation
93 |         best_model["test_spearman_r"] = test_spearman.correlation
94 |         best_model["train_spearman_p"] = train_spearman.pvalue
95 |         best_model["test_spearman_p"] = test_spearman.pvalue
96 |         best_model["fold_id"] = cnt
97 | 
98 |         print("Model: {}".format(best_model["params"]))
99 | 
100 |         performance_values.append(best_model)
101 |         pd.DataFrame(performance_values).to_csv(join(args["output_dir"], "results.csv"), index=False)
102 | 
103 |         cnt += 1
104 | 
105 | 
106 | if __name__ == '__main__':
107 |     """this script takes in the relevant parameters to train a context-agnostic engagement prediction model using a
108 |     Random Forest Regressor (RF). The script outputs a "results.csv" with the evaluation metrics and k model
109 |     files in joblib pickle format to the output directory.
110 | 
111 |     e.g. command to run this script:
112 | 
113 |     python helper_code/models/regression/train_rf_regression_full_cv.py
114 |         --training-data-filepath /path/to/v1/VLE_12k_dataset_v1.csv --output-dir path/to/output/directory --n-jobs 8
115 |         --is-log --feature-cat 1 --k-folds 5 --label median
116 | 
117 |     """
118 |     import argparse
119 | 
120 |     parser = argparse.ArgumentParser()
121 | 
122 |     parser.add_argument('--training-data-filepath', type=str, required=True,
123 |                         help="filepath where the training data is. Should be a CSV in the right format")
124 |     parser.add_argument('--output-dir', type=str, required=True,
125 |                         help="output directory where the models and the results are saved")
126 |     parser.add_argument('--n-jobs', type=int, default=8,
127 |                         help="number of parallel jobs to run")
128 |     parser.add_argument('--k-folds', type=int, default=5,
129 |                         help="Number of folds to be used in k-fold cross validation")
130 |     parser.add_argument('--label', default='median', choices=['median', 'mean'],
131 |                         help="Defines what label should be used for training")
132 |     parser.add_argument('--feature-cat', type=int, default=1,
133 |                         help="defines which feature set (1, 2 or 3) should be used for training")
134 |     parser.add_argument('--is-log', action='store_true', help="Defines if the label should be log transformed.")
135 | 
136 |     args = vars(parser.parse_args())
137 | 
138 |     main(args)
139 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | 
3 | setup(
4 |     name='',
5 |     version='1.0',
6 |     packages=['helper_code', 'helper_code.helper_tools',
7 |               'helper_code.feature_extraction', 'helper_code.models', 'helper_code.models.regression'],
8 |     url='',
9 |     license='',
10 |     author='',
11 |     author_email='',
12 |     description='This python package includes the datasets and the helper functions that allow building models for predicting context-agnostic (population-based) engagement of video lectures.',
13 |     install_requires=[
14 |         'numpy>=1.14.1',
15 |         'pandas>=0.22.0',
16 |         'scipy>=1.0.1',
17 |         'nltk>=3.2.5',
18 |         'ujson>=1.35',
19 |         'scikit-learn>=0.19.1',
20 |         'pyspark>=2.4.5',
21 |         'textatistic>=0.0.1']
22 | )
23 | 
--------------------------------------------------------------------------------
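
Example: putting the helper tools together. The snippet below is a minimal, illustrative sketch (not a file in the repository) showing how the `io_utils` and `evaluation_metrics` helpers above can be combined to load the dataset, build the category-1 feature set and train a single Random Forest on one fold, mirroring `train_rf_regression_full_cv.py` without the Spark session and grid search. The CSV path, the held-out fold id and the hyperparameters are illustrative assumptions; adjust them to your setup.

```python
# Illustrative sketch: load the VLE dataset, build the feature matrix for
# feature category 1, train a Random Forest on one fold and report RMSE.
# The CSV path, fold id and hyperparameters below are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

from helper_code.helper_tools.io_utils import (load_lecture_dataset, get_features_from_dataset,
                                               get_fold_from_dataset, get_label_from_dataset)
from helper_code.helper_tools.evaluation_metrics import get_rmse

lectures = load_lecture_dataset("path/to/VLE_12k_dataset_v1.csv", col_version=1)
columns, lectures = get_features_from_dataset(1, lectures)
label = get_label_from_dataset("median")

train_df, test_df = get_fold_from_dataset(lectures, 1)  # hold out fold 1 for testing
X_train, y_train = train_df[columns], np.log(np.array(train_df[label]))  # log label, as with --is-log
X_test, y_test = test_df[columns], np.log(np.array(test_df[label]))

model = RandomForestRegressor(n_estimators=500, max_depth=10)  # illustrative hyperparameters
model.fit(X_train, y_train)

# back-transform labels and predictions to the original scale before evaluating
train_rmse, test_rmse = get_rmse(np.exp(y_train), np.exp(y_test),
                                 np.exp(model.predict(X_train)), np.exp(model.predict(X_test)))
print("train RMSE: {:.4f}, test RMSE: {:.4f}".format(train_rmse, test_rmse))
```

The log/exp round trip mirrors the `--is-log` option of the baseline script; drop it if you prefer to train on the raw label.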
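For the pairwise, ranking-oriented view of the data, `get_pairwise_version` above expects a Spark DataFrame. The sketch below is again only illustrative: it assumes that `ID_COL`, `DOMAIN_COL` and `LABEL_COLS` can be imported from `io_utils` (they are referenced by the function but defined outside the excerpt shown here) and that the id and domain columns are included in the general features loaded by `load_lecture_dataset`.

```python
# Illustrative sketch: build the pairwise (gap) representation of the labels with Spark.
# Assumes ID_COL, DOMAIN_COL and LABEL_COLS are importable module-level constants of io_utils.
from pyspark.sql import SparkSession

from helper_code.helper_tools.io_utils import (load_lecture_dataset, get_pairwise_version,
                                               ID_COL, DOMAIN_COL, LABEL_COLS)

spark = SparkSession.builder.master("local[2]").getOrCreate()

lectures = load_lecture_dataset("path/to/VLE_12k_dataset_v1.csv", col_version=1)
sample = lectures[[ID_COL, DOMAIN_COL] + LABEL_COLS].head(100)  # small sample keeps the cross join cheap
sdf = spark.createDataFrame(sample)

# every lecture is paired with every other lecture; "gap_" columns hold the label differences
pairs = get_pairwise_version(sdf, is_gap=True, label_only=True)
pairs.show(5)
```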