├── README.md
├── data_wrangling.ipynb
├── env_ideal_profiles.yaml
├── helper.py
├── ideal_profiles.ipynb
├── ideal_profiles_2.ipynb
├── process_text.py
├── scrape_data.py
└── stopwords.csv

/README.md:
--------------------------------------------------------------------------------
1 | # Ideal Profiles
2 | What does an ideal Data Scientist's profile look like? This project aims to provide a quantitative answer based on job postings. I scraped job posting data from Indeed and analyzed the frequencies of various Data Science skills. The analysis can then be used not only as an objective keyword reference for resume optimization, but also as a Data Science learning roadmap!
3 | 
4 | The related Medium posts are:
5 | - [What Does an Ideal Data Scientist’s Profile Look Like?](https://towardsdatascience.com/what-does-an-ideal-data-scientists-profile-look-like-7d7bd78ff7ab)
6 | - [Navigating the Data Science Careers Landscape](https://hackernoon.com/navigating-the-data-science-career-landscape-db746a61ac62)
7 | - [Scraping Job Posting Data from Indeed using Selenium and BeautifulSoup](https://towardsdatascience.com/scraping-job-posting-data-from-indeed-using-selenium-and-beautifulsoup-dfc86230baac)
8 | - [Building an End-To-End Data Science Project](https://towardsdatascience.com/building-an-end-to-end-data-science-project-28e853c0cae3)
9 | 
10 | 
11 | ## How to Use
12 | If you want to run the code locally, download the repo, build your Anaconda environment using the `env_ideal_profiles.yaml` file, and download geckodriver (see Requirements below). You can then start scraping data by running `python scrape_data.py` in Anaconda Prompt. Once you have the raw data, clean it with the `data_wrangling.ipynb` Jupyter Notebook. Finally, the `ideal_profiles_2.ipynb` Notebook can be used to make various plots. Refer to the Files list below for the role of each file, and to the usage sketch further down for a programmatic end-to-end example.
13 | 
14 | 
15 | ## Requirements
16 | - Windows 10 OS
17 | - Firefox Web Browser 63.0.3
18 | - Anaconda 3
19 | - geckodriver v0.22.0 (geckodriver-v0.22.0-win64.zip, available [here](https://github.com/mozilla/geckodriver/releases))
20 | - pandas (see the yaml file for version number, same below)
21 | - numpy
22 | - matplotlib
23 | - json
24 | - re
25 | - csv
26 | - wordcloud
27 | - nltk
28 | - bs4 (BeautifulSoup)
29 | - selenium
30 | 
31 | 
32 | ## Files
33 | - `scrape_data.py`: scrapes the data from Indeed.ca
34 | - `process_text.py`: performs various text-related operations such as removing digits, tokenizing, and checking term frequency
35 | - `helper.py`: contains data loading and various plotting functions
36 | - `data_wrangling.ipynb`: gathers the raw text data, counts term frequencies, and stores the result in a pandas dataframe
37 | - `ideal_profiles.ipynb`: creates spider plots to visualize various Data Science roles' skill requirements based on intuition
38 | - `ideal_profiles_2.ipynb`: creates skill distribution and word cloud plots to represent ideal profiles quantitatively
39 | - `stopwords.csv`: contains the stop words for word cloud plotting
40 | - `env_ideal_profiles.yaml`: the Anaconda environment file for setting up the project environment
41 | 
42 | 
43 | ## Contribute
44 | Any contribution is welcome!
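
## Usage Sketch
For reference, here is a minimal Python sketch of the workflow described in How to Use, calling the repo's functions directly instead of going through the notebooks. It is only an illustration: the `num_pages` value and the small `skills_to_check` dict are placeholders (the full skill dictionary lives in `data_wrangling.ipynb`), and it assumes the Anaconda environment, Firefox, and geckodriver are already set up.

```python
# Run from the repo root on Windows (the scraper saves a lower-cased JSON file
# name, which still matches plot_profile's lookup on a case-insensitive file system).
from scrape_data import get_data
from helper import plot_profile
from process_text import check_freq

# 1. Scrape roughly 10 result pages of postings from Indeed.ca and save them
#    to data_scientist.json.
get_data('Data Scientist', num_pages=10, location='Toronto')

# 2. Reload the saved postings as a list of raw posting texts.
text_list = plot_profile(title='Data Scientist', first_n_postings=100,
                         return_text_list=True)

# 3. Count how often selected skills are mentioned across the postings.
skills_to_check = {'Programming Languages': ['Python', 'R', 'SQL']}
print(check_freq(dict_to_check=skills_to_check, text_list=text_list))
```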
45 | 
46 | 
47 | ## To-do's
48 | - Allow querying Indeed USA instead of the Canadian site, and increase the number of postings to scrape
49 | - Allow showing the context for specific words in the word clouds
50 | - Update all docstrings and comments
51 | - Refactor towards an object-oriented (OOP) design
52 | - Code refactoring: apply the single responsibility principle to functions
53 | - Add Data Analyst and AI Engineer roles
54 | - Allow showing the percentage of mentions for a given skill, i.e., out of 1,000 job postings, what proportion mention it?
55 | 
56 | 
57 | ## License
58 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
59 | 
--------------------------------------------------------------------------------
/data_wrangling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": 1,
6 |    "metadata": {},
7 |    "outputs": [],
8 |    "source": [
9 |     "import pandas as pd\n",
10 |     "#import nltk\n",
11 |     "from matplotlib import pyplot as plt\n",
12 |     "from scrape_data import *\n",
13 |     "from process_text import *\n",
14 |     "from helper import *"
15 |    ]
16 |   },
17 |   {
18 |    "cell_type": "code",
19 |    "execution_count": 2,
20 |    "metadata": {},
21 |    "outputs": [],
22 |    "source": [
23 |     "# Initialize the dict to store the text lists for the titles below\n",
24 |     "text_lists = {}\n",
25 |     "titles = ['Data Scientist', 'Machine Learning Engineer', 'Data Engineer']\n",
26 |     "# Grab the text list for each title and store it in the dict\n",
27 |     "for title in titles:\n",
28 |     "    text_lists[title] = plot_profile(title=title, first_n_postings=120, return_text_list=True)"
29 |    ]
30 |   },
31 |   {
32 |    "cell_type": "code",
33 |    "execution_count": 3,
34 |    "metadata": {},
35 |    "outputs": [],
36 |    "source": [
37 |     "# Make the dict of skills to investigate\n",
38 |     "\n",
39 |     "languages = ['Python', 'R', 'SQL', 'Java', 'C', 'C++', 'C#', 'Scala', 'Perl', 'Julia', \n",
40 |     "             'Javascript', 'HTML', 'CSS', 'PHP', 'Ruby', 'Lua', 'MATLAB', 'SAS'] \n",
41 |     "\n",
42 |     "big_data = ['Hadoop', 'MapReduce', 'Hive', 'Pig', 'Cascading', 'Scalding', 'Cascalog', 'HBase', 'Sqoop', \n",
43 |     "            'Mahout', 'Oozie', 'Flume', 'ZooKeeper', 'Spark', 'Storm', 'Shark', 'Impala', 'Elasticsearch', \n",
44 |     "            'Kafka', 'Flink', 'Kinesis', 'Presto', 'Hume', 'Airflow', 'Azkaban', 'Luigi', 'Cassandra']\n",
45 |     "\n",
46 |     "dl = ['TensorFlow', 'Keras', 'PyTorch', 'Theano', 'Deeplearning4J', 'Caffe', 'TFLearn', 'Torch', \n",
47 |     "      'OpenCV', 'MXNet', 'Microsoft Cognitive Toolkit', 'Lasagne']\n",
48 |     "\n",
49 |     "cloud = ['AWS', 'GCP', 'Azure']\n",
50 |     "\n",
51 |     "ml = ['Natural Language Processing', 'Computer Vision', 'Speech Recognition', 'Fraud Detection',\n",
52 |     "      'Recommender System', 'Image Recognition', 'Object Detection', 'Chatbot', 'Sentiment Analysis']\n",
53 |     "\n",
54 |     "visualization = ['Dimple', 'D3.js', 'Ggplot', 'Shiny', 'Plotly', 'Matplotlib', 'Seaborn', \n",
55 |     "                 'Bokeh', 'Tableau']\n",
56 |     "\n",
57 |     "other = ['Pandas', 'Numpy', 'Scipy', 'Sklearn', 'Scikit-Learn', 'Docker', 'Git', 'Jira', 'Kaggle']\n",
58 |     "\n",
59 |     "dict_to_check = {'Programming Languages': languages,\n",
60 |     "                 'Big Data Technologies': big_data,\n",
61 |     "                 'Deep Learning Frameworks': dl,\n",
62 |     "                 'Cloud Computing Platforms': cloud,\n",
63 |     "                 'Machine Learning Application': ml,\n",
64 |     "                 'Visualization Tools': visualization,\n",
65 |     "                 'Other': other}"
66 |    ]
67 |   },
68 |   {
69 |    "cell_type": "code",
70 |    "execution_count": 4,
71 |    "metadata": {},
72 | 
"outputs": [], 73 | "source": [ 74 | "# Check the frequency and store in dict\n", 75 | "freq_dict = {}\n", 76 | "for title in text_lists.keys():\n", 77 | " freq_dict[title] = check_freq(dict_to_check=dict_to_check, text_list=text_lists[title])" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 5, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "data": { 87 | "text/html": [ 88 | "
" 256 | ], 257 | "text/plain": [ 258 | " Python R SQL Java C C++ \\\n", 259 | "Data Engineer Big Data Technologies NaN NaN NaN NaN NaN NaN \n", 260 | " Cloud Computing Platforms NaN NaN NaN NaN NaN NaN \n", 261 | " Deep Learning Frameworks NaN NaN NaN NaN NaN NaN \n", 262 | " Machine Learning Application NaN NaN NaN NaN NaN NaN \n", 263 | " Other NaN NaN NaN NaN NaN NaN \n", 264 | "\n", 265 | " C# Scala Perl Julia ... \\\n", 266 | "Data Engineer Big Data Technologies NaN NaN NaN NaN ... \n", 267 | " Cloud Computing Platforms NaN NaN NaN NaN ... \n", 268 | " Deep Learning Frameworks NaN NaN NaN NaN ... \n", 269 | " Machine Learning Application NaN NaN NaN NaN ... \n", 270 | " Other NaN NaN NaN NaN ... \n", 271 | "\n", 272 | " Tableau Pandas Numpy Scipy \\\n", 273 | "Data Engineer Big Data Technologies NaN NaN NaN NaN \n", 274 | " Cloud Computing Platforms NaN NaN NaN NaN \n", 275 | " Deep Learning Frameworks NaN NaN NaN NaN \n", 276 | " Machine Learning Application NaN NaN NaN NaN \n", 277 | " Other NaN 3.0 0.0 2.0 \n", 278 | "\n", 279 | " Sklearn Scikit-Learn Docker \\\n", 280 | "Data Engineer Big Data Technologies NaN NaN NaN \n", 281 | " Cloud Computing Platforms NaN NaN NaN \n", 282 | " Deep Learning Frameworks NaN NaN NaN \n", 283 | " Machine Learning Application NaN NaN NaN \n", 284 | " Other 0.0 0.0 8.0 \n", 285 | "\n", 286 | " Git Jira Kaggle \n", 287 | "Data Engineer Big Data Technologies NaN NaN NaN \n", 288 | " Cloud Computing Platforms NaN NaN NaN \n", 289 | " Deep Learning Frameworks NaN NaN NaN \n", 290 | " Machine Learning Application NaN NaN NaN \n", 291 | " Other 29.0 1.0 0.0 \n", 292 | "\n", 293 | "[5 rows x 87 columns]" 294 | ] 295 | }, 296 | "execution_count": 5, 297 | "metadata": {}, 298 | "output_type": "execute_result" 299 | } 300 | ], 301 | "source": [ 302 | "# Convert the dict to a pandas df\n", 303 | "df = pd.DataFrame.from_dict({(i,j): freq_dict[i][j] \n", 304 | " for i in freq_dict.keys()\n", 305 | " for j in freq_dict[i].keys()},\n", 306 | " orient='index')\n", 307 | "df.head()" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": 6, 313 | "metadata": {}, 314 | "outputs": [ 315 | { 316 | "data": { 317 | "text/html": [ 318 | "
" 484 | ], 485 | "text/plain": [ 486 | " level_0 level_1 Python R SQL Java C \\\n", 487 | "0 Data Engineer Big Data Technologies NaN NaN NaN NaN NaN \n", 488 | "1 Data Engineer Cloud Computing Platforms NaN NaN NaN NaN NaN \n", 489 | "2 Data Engineer Deep Learning Frameworks NaN NaN NaN NaN NaN \n", 490 | "3 Data Engineer Machine Learning Application NaN NaN NaN NaN NaN \n", 491 | "4 Data Engineer Other NaN NaN NaN NaN NaN \n", 492 | "\n", 493 | " C++ C# Scala ... Tableau Pandas Numpy Scipy Sklearn \\\n", 494 | "0 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 495 | "1 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 496 | "2 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 497 | "3 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 498 | "4 NaN NaN NaN ... NaN 3.0 0.0 2.0 0.0 \n", 499 | "\n", 500 | " Scikit-Learn Docker Git Jira Kaggle \n", 501 | "0 NaN NaN NaN NaN NaN \n", 502 | "1 NaN NaN NaN NaN NaN \n", 503 | "2 NaN NaN NaN NaN NaN \n", 504 | "3 NaN NaN NaN NaN NaN \n", 505 | "4 0.0 8.0 29.0 1.0 0.0 \n", 506 | "\n", 507 | "[5 rows x 89 columns]" 508 | ] 509 | }, 510 | "execution_count": 6, 511 | "metadata": {}, 512 | "output_type": "execute_result" 513 | } 514 | ], 515 | "source": [ 516 | "# Reset the index to include both title and category as columns\n", 517 | "df = df.reset_index()\n", 518 | "df.head()" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 7, 524 | "metadata": {}, 525 | "outputs": [ 526 | { 527 | "data": { 528 | "text/html": [ 529 | "
" 695 | ], 696 | "text/plain": [ 697 | " title category Python R SQL Java C \\\n", 698 | "0 Data Engineer Big Data Technologies NaN NaN NaN NaN NaN \n", 699 | "1 Data Engineer Cloud Computing Platforms NaN NaN NaN NaN NaN \n", 700 | "2 Data Engineer Deep Learning Frameworks NaN NaN NaN NaN NaN \n", 701 | "3 Data Engineer Machine Learning Application NaN NaN NaN NaN NaN \n", 702 | "4 Data Engineer Other NaN NaN NaN NaN NaN \n", 703 | "\n", 704 | " C++ C# Scala ... Tableau Pandas Numpy Scipy Sklearn \\\n", 705 | "0 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 706 | "1 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 707 | "2 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 708 | "3 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 709 | "4 NaN NaN NaN ... NaN 3.0 0.0 2.0 0.0 \n", 710 | "\n", 711 | " Scikit-Learn Docker Git Jira Kaggle \n", 712 | "0 NaN NaN NaN NaN NaN \n", 713 | "1 NaN NaN NaN NaN NaN \n", 714 | "2 NaN NaN NaN NaN NaN \n", 715 | "3 NaN NaN NaN NaN NaN \n", 716 | "4 0.0 8.0 29.0 1.0 0.0 \n", 717 | "\n", 718 | "[5 rows x 89 columns]" 719 | ] 720 | }, 721 | "execution_count": 7, 722 | "metadata": {}, 723 | "output_type": "execute_result" 724 | } 725 | ], 726 | "source": [ 727 | "# Rename the first two columns\n", 728 | "df.rename({'level_0': 'title', 'level_1': 'category'}, axis='columns', inplace=True)\n", 729 | "df.head()" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": 8, 735 | "metadata": {}, 736 | "outputs": [ 737 | { 738 | "data": { 739 | "text/html": [ 740 | "
" 803 | ], 804 | "text/plain": [ 805 | " title category variable value\n", 806 | "0 Data Engineer Big Data Technologies Python NaN\n", 807 | "1 Data Engineer Cloud Computing Platforms Python NaN\n", 808 | "2 Data Engineer Deep Learning Frameworks Python NaN\n", 809 | "3 Data Engineer Machine Learning Application Python NaN\n", 810 | "4 Data Engineer Other Python NaN" 811 | ] 812 | }, 813 | "execution_count": 8, 814 | "metadata": {}, 815 | "output_type": "execute_result" 816 | } 817 | ], 818 | "source": [ 819 | "value_vars = df.columns.tolist()[2:] # the list of column names except the first two\n", 820 | "# Transform from wide to long for plotting\n", 821 | "df = pd.melt(df, id_vars=['title', 'category'], value_vars=value_vars)\n", 822 | "df.head()" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": 9, 828 | "metadata": {}, 829 | "outputs": [ 830 | { 831 | "data": { 832 | "text/html": [ 833 | "
" 896 | ], 897 | "text/plain": [ 898 | " title category skill frequency\n", 899 | "0 Data Engineer Big Data Technologies Python NaN\n", 900 | "1 Data Engineer Cloud Computing Platforms Python NaN\n", 901 | "2 Data Engineer Deep Learning Frameworks Python NaN\n", 902 | "3 Data Engineer Machine Learning Application Python NaN\n", 903 | "4 Data Engineer Other Python NaN" 904 | ] 905 | }, 906 | "execution_count": 9, 907 | "metadata": {}, 908 | "output_type": "execute_result" 909 | } 910 | ], 911 | "source": [ 912 | "# Rename the last two columns\n", 913 | "df.rename({'variable': 'skill', 'value': 'frequency'}, axis='columns', inplace=True)\n", 914 | "df.head()" 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": 10, 920 | "metadata": {}, 921 | "outputs": [ 922 | { 923 | "data": { 924 | "text/html": [ 925 | "
" 988 | ], 989 | "text/plain": [ 990 | " title category skill frequency\n", 991 | "5 Data Engineer Programming Languages Python 52.0\n", 992 | "12 Data Scientist Programming Languages Python 103.0\n", 993 | "19 Machine Learning Engineer Programming Languages Python 71.0\n", 994 | "26 Data Engineer Programming Languages R 5.0\n", 995 | "33 Data Scientist Programming Languages R 19.0" 996 | ] 997 | }, 998 | "execution_count": 10, 999 | "metadata": {}, 1000 | "output_type": "execute_result" 1001 | } 1002 | ], 1003 | "source": [ 1004 | "# Subset to non null values in the freq column\n", 1005 | "df = df[df['frequency'].notnull()]\n", 1006 | "df.head()" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "code", 1011 | "execution_count": 11, 1012 | "metadata": {}, 1013 | "outputs": [ 1014 | { 1015 | "data": { 1016 | "text/html": [ 1017 | "
" 1080 | ], 1081 | "text/plain": [ 1082 | " title category skill frequency\n", 1083 | "0 Data Engineer Programming Languages Python 52.0\n", 1084 | "1 Data Scientist Programming Languages Python 103.0\n", 1085 | "2 Machine Learning Engineer Programming Languages Python 71.0\n", 1086 | "3 Data Engineer Programming Languages R 5.0\n", 1087 | "4 Data Scientist Programming Languages R 19.0" 1088 | ] 1089 | }, 1090 | "execution_count": 11, 1091 | "metadata": {}, 1092 | "output_type": "execute_result" 1093 | } 1094 | ], 1095 | "source": [ 1096 | "# Reset the index\n", 1097 | "df.reset_index(drop=True, inplace=True)\n", 1098 | "df.head()" 1099 | ] 1100 | }, 1101 | { 1102 | "cell_type": "code", 1103 | "execution_count": 12, 1104 | "metadata": {}, 1105 | "outputs": [ 1106 | { 1107 | "data": { 1108 | "text/plain": [ 1109 | "title object\n", 1110 | "category object\n", 1111 | "skill object\n", 1112 | "frequency int32\n", 1113 | "dtype: object" 1114 | ] 1115 | }, 1116 | "execution_count": 12, 1117 | "metadata": {}, 1118 | "output_type": "execute_result" 1119 | } 1120 | ], 1121 | "source": [ 1122 | "df = df.astype({'frequency': int})\n", 1123 | "df.dtypes" 1124 | ] 1125 | }, 1126 | { 1127 | "cell_type": "code", 1128 | "execution_count": 13, 1129 | "metadata": {}, 1130 | "outputs": [], 1131 | "source": [ 1132 | "df.to_csv('skill_frequencies.csv')" 1133 | ] 1134 | } 1135 | ], 1136 | "metadata": { 1137 | "kernelspec": { 1138 | "display_name": "Python 3", 1139 | "language": "python", 1140 | "name": "python3" 1141 | }, 1142 | "language_info": { 1143 | "codemirror_mode": { 1144 | "name": "ipython", 1145 | "version": 3 1146 | }, 1147 | "file_extension": ".py", 1148 | "mimetype": "text/x-python", 1149 | "name": "python", 1150 | "nbconvert_exporter": "python", 1151 | "pygments_lexer": "ipython3", 1152 | "version": "3.6.5" 1153 | } 1154 | }, 1155 | "nbformat": 4, 1156 | "nbformat_minor": 2 1157 | } 1158 | -------------------------------------------------------------------------------- /env_ideal_profiles.yaml: -------------------------------------------------------------------------------- 1 | name: base 2 | channels: 3 | - conda-forge 4 | - anaconda-fusion 5 | - defaults 6 | dependencies: 7 | - conda=4.5.11=py36_1000 8 | - selenium=3.14.1=py36hfa6e2cd_1000 9 | - wordcloud=1.4.1=py36_0 10 | - _ipyw_jlab_nb_ext_conf=0.1.0=py36he6757f0_0 11 | - alabaster=0.7.10=py36hcd07829_0 12 | - anaconda=5.2.0=py36_3 13 | - anaconda-client=1.6.14=py36_0 14 | - anaconda-navigator=1.8.7=py36_0 15 | - anaconda-project=0.8.2=py36hfad2e28_0 16 | - asn1crypto=0.24.0=py36_0 17 | - astroid=1.6.3=py36_0 18 | - astropy=3.0.2=py36h452e1ab_1 19 | - attrs=18.1.0=py36_0 20 | - babel=2.5.3=py36_0 21 | - backcall=0.1.0=py36_0 22 | - backports=1.0=py36h81696a8_1 23 | - backports.shutil_get_terminal_size=1.0.0=py36h79ab834_2 24 | - beautifulsoup4=4.6.0=py36hd4cc5e8_1 25 | - bitarray=0.8.1=py36hfa6e2cd_1 26 | - bkcharts=0.2=py36h7e685f7_0 27 | - blas=1.0=mkl 28 | - blaze=0.11.3=py36h8a29ca5_0 29 | - bleach=2.1.3=py36_0 30 | - blosc=1.14.3=he51fdeb_0 31 | - bokeh=0.12.16=py36_0 32 | - boto=2.48.0=py36h1a776d2_1 33 | - bottleneck=1.2.1=py36hd119dfa_0 34 | - bzip2=1.0.6=hfa6e2cd_5 35 | - ca-certificates=2018.03.07=0 36 | - certifi=2018.4.16=py36_0 37 | - cffi=1.11.5=py36h945400d_0 38 | - chardet=3.0.4=py36h420ce6e_1 39 | - click=6.7=py36hec8c647_0 40 | - cloudpickle=0.5.3=py36_0 41 | - clyent=1.2.2=py36hb10d595_1 42 | - colorama=0.3.9=py36h029ae33_0 43 | - comtypes=1.1.4=py36_0 44 | - conda-build=3.10.5=py36_0 45 | - conda-env=2.6.0=h36134e3_1 46 | 
- conda-verify=2.0.0=py36h065de53_0 47 | - console_shortcut=0.1.1=h6bb2dd7_3 48 | - contextlib2=0.5.5=py36he5d52c0_0 49 | - cryptography=2.2.2=py36hfa6e2cd_0 50 | - curl=7.60.0=h7602738_0 51 | - cycler=0.10.0=py36h009560c_0 52 | - cython=0.28.2=py36hfa6e2cd_0 53 | - cytoolz=0.9.0.1=py36hfa6e2cd_0 54 | - dask=0.17.5=py36_0 55 | - dask-core=0.17.5=py36_0 56 | - datashape=0.5.4=py36h5770b85_0 57 | - decorator=4.3.0=py36_0 58 | - distributed=1.21.8=py36_0 59 | - docutils=0.14=py36h6012d8f_0 60 | - entrypoints=0.2.3=py36hfd66bb0_2 61 | - et_xmlfile=1.0.1=py36h3d2d736_0 62 | - fastcache=1.0.2=py36hfa6e2cd_2 63 | - filelock=3.0.4=py36_0 64 | - flask=1.0.2=py36_1 65 | - flask-cors=3.0.4=py36_0 66 | - freetype=2.8=h51f8f2c_1 67 | - get_terminal_size=1.0.0=h38e98db_0 68 | - gevent=1.3.0=py36hfa6e2cd_0 69 | - glob2=0.6=py36hdf76b57_0 70 | - greenlet=0.4.13=py36hfa6e2cd_0 71 | - h5py=2.7.1=py36h3bdd7fb_2 72 | - hdf5=1.10.2=hac2f561_1 73 | - heapdict=1.0.0=py36_2 74 | - html5lib=1.0.1=py36h047fa9f_0 75 | - icc_rt=2017.0.4=h97af966_0 76 | - icu=58.2=ha66f8fd_1 77 | - idna=2.6=py36h148d497_1 78 | - imageio=2.3.0=py36_0 79 | - imagesize=1.0.0=py36_0 80 | - intel-openmp=2018.0.0=8 81 | - ipykernel=4.8.2=py36_0 82 | - ipython=6.4.0=py36_0 83 | - ipython_genutils=0.2.0=py36h3c5d0ee_0 84 | - ipywidgets=7.2.1=py36_0 85 | - isort=4.3.4=py36_0 86 | - itsdangerous=0.24=py36hb6c5a24_1 87 | - jdcal=1.4=py36_0 88 | - jedi=0.12.0=py36_1 89 | - jinja2=2.10=py36h292fed1_0 90 | - jpeg=9b=hb83a4c4_2 91 | - jsonschema=2.6.0=py36h7636477_0 92 | - jupyter=1.0.0=py36_4 93 | - jupyter_client=5.2.3=py36_0 94 | - jupyter_console=5.2.0=py36h6d89b47_1 95 | - jupyter_core=4.4.0=py36h56e9d50_0 96 | - jupyterlab=0.32.1=py36_0 97 | - jupyterlab_launcher=0.10.5=py36_0 98 | - kiwisolver=1.0.1=py36h12c3424_0 99 | - lazy-object-proxy=1.3.1=py36hd1c21d2_0 100 | - libcurl=7.60.0=hc4dcbb0_0 101 | - libiconv=1.15=h1df5818_7 102 | - libpng=1.6.34=h79bbb47_0 103 | - libsodium=1.0.16=h9d3ae62_0 104 | - libssh2=1.8.0=hd619d38_4 105 | - libtiff=4.0.9=hb8ad9f9_1 106 | - libxml2=2.9.8=hadb2253_1 107 | - libxslt=1.1.32=hf6f1972_0 108 | - llvmlite=0.23.1=py36hcacf6c6_0 109 | - locket=0.2.0=py36hfed976d_1 110 | - lxml=4.2.1=py36heafd4d3_0 111 | - lzo=2.10=h6df0209_2 112 | - m2w64-gcc-libgfortran=5.3.0=6 113 | - m2w64-gcc-libs=5.3.0=7 114 | - m2w64-gcc-libs-core=5.3.0=7 115 | - m2w64-gmp=6.1.0=2 116 | - m2w64-libwinpthread-git=5.0.0.4634.697f757=2 117 | - markupsafe=1.0=py36h0e26971_1 118 | - matplotlib=2.2.2=py36h153e9ff_1 119 | - mccabe=0.6.1=py36hb41005a_1 120 | - menuinst=1.4.14=py36hfa6e2cd_0 121 | - mistune=0.8.3=py36hfa6e2cd_1 122 | - mkl=2018.0.2=1 123 | - mkl-service=1.1.2=py36h57e144c_4 124 | - mkl_fft=1.0.1=py36h452e1ab_0 125 | - mkl_random=1.0.1=py36h9258bd6_0 126 | - more-itertools=4.1.0=py36_0 127 | - mpmath=1.0.0=py36hacc8adf_2 128 | - msgpack-python=0.5.6=py36he980bc4_0 129 | - msys2-conda-epoch=20160418=1 130 | - multipledispatch=0.5.0=py36_0 131 | - navigator-updater=0.2.1=py36_0 132 | - nbconvert=5.3.1=py36h8dc0fde_0 133 | - nbformat=4.4.0=py36h3a5bc1b_0 134 | - networkx=2.1=py36_0 135 | - nltk=3.3.0=py36_0 136 | - nose=1.3.7=py36h1c3779e_2 137 | - notebook=5.5.0=py36_0 138 | - numba=0.38.0=py36h830ac7b_0 139 | - numexpr=2.6.5=py36hcd2f87e_0 140 | - numpy=1.14.3=py36h9fa60d3_1 141 | - numpy-base=1.14.3=py36h555522e_1 142 | - numpydoc=0.8.0=py36_0 143 | - odo=0.5.1=py36h7560279_0 144 | - olefile=0.45.1=py36_0 145 | - openpyxl=2.5.3=py36_0 146 | - openssl=1.0.2o=h8ea7d77_0 147 | - packaging=17.1=py36_0 148 | - 
pandas=0.23.0=py36h830ac7b_0 149 | - pandoc=1.19.2.1=hb2460c7_1 150 | - pandocfilters=1.4.2=py36h3ef6317_1 151 | - parso=0.2.0=py36_0 152 | - partd=0.3.8=py36hc8e763b_0 153 | - path.py=11.0.1=py36_0 154 | - pathlib2=2.3.2=py36_0 155 | - patsy=0.5.0=py36_0 156 | - pep8=1.7.1=py36_0 157 | - pickleshare=0.7.4=py36h9de030f_0 158 | - pillow=5.1.0=py36h0738816_0 159 | - pip=10.0.1=py36_0 160 | - pkginfo=1.4.2=py36_1 161 | - plotly=3.4.1=py36h28b3542_0 162 | - pluggy=0.6.0=py36hc7daf1e_0 163 | - ply=3.11=py36_0 164 | - prompt_toolkit=1.0.15=py36h60b8f86_0 165 | - psutil=5.4.5=py36hfa6e2cd_0 166 | - py=1.5.3=py36_0 167 | - pycodestyle=2.4.0=py36_0 168 | - pycosat=0.6.3=py36h413d8a4_0 169 | - pycparser=2.18=py36hd053e01_1 170 | - pycrypto=2.6.1=py36hfa6e2cd_8 171 | - pycurl=7.43.0.1=py36h74b6da3_0 172 | - pyflakes=1.6.0=py36h0b975d6_0 173 | - pygments=2.2.0=py36hb010967_0 174 | - pylint=1.8.4=py36_0 175 | - pyodbc=4.0.23=py36h6538335_0 176 | - pyopenssl=18.0.0=py36_0 177 | - pyparsing=2.2.0=py36h785a196_1 178 | - pyqt=5.9.2=py36h1aa27d4_0 179 | - pysocks=1.6.8=py36_0 180 | - pytables=3.4.3=py36he6f6034_1 181 | - pytest=3.5.1=py36_0 182 | - pytest-arraydiff=0.2=py36_0 183 | - pytest-astropy=0.3.0=py36_0 184 | - pytest-doctestplus=0.1.3=py36_0 185 | - pytest-openfiles=0.3.0=py36_0 186 | - pytest-remotedata=0.2.1=py36_0 187 | - python=3.6.5=h0c2934d_0 188 | - python-dateutil=2.7.3=py36_0 189 | - pytz=2018.4=py36_0 190 | - pywavelets=0.5.2=py36hc649158_0 191 | - pywin32=223=py36hfa6e2cd_1 192 | - pywinpty=0.5.1=py36_0 193 | - pyyaml=3.12=py36h1d1928f_1 194 | - pyzmq=17.0.0=py36hfa6e2cd_1 195 | - qt=5.9.5=vc14he4a7d60_0 196 | - qtawesome=0.4.4=py36h5aa48f6_0 197 | - qtconsole=4.3.1=py36h99a29a9_0 198 | - qtpy=1.4.1=py36_0 199 | - requests=2.18.4=py36h4371aae_1 200 | - retrying=1.3.3=py36_2 201 | - rope=0.10.7=py36had63a69_0 202 | - ruamel_yaml=0.15.35=py36hfa6e2cd_1 203 | - scikit-image=0.13.1=py36hfa6e2cd_1 204 | - scikit-learn=0.19.1=py36h53aea1b_0 205 | - scipy=1.1.0=py36h672f292_0 206 | - seaborn=0.8.1=py36h9b69545_0 207 | - send2trash=1.5.0=py36_0 208 | - setuptools=39.1.0=py36_0 209 | - simplegeneric=0.8.1=py36_2 210 | - singledispatch=3.4.0.3=py36h17d0c80_0 211 | - sip=4.19.8=py36h6538335_0 212 | - six=1.11.0=py36h4db2310_1 213 | - snappy=1.1.7=h777316e_3 214 | - snowballstemmer=1.2.1=py36h763602f_0 215 | - sortedcollections=0.6.1=py36_0 216 | - sortedcontainers=1.5.10=py36_0 217 | - sphinx=1.7.4=py36_0 218 | - sphinxcontrib=1.0=py36hbbac3d2_1 219 | - sphinxcontrib-websupport=1.0.1=py36hb5e5916_1 220 | - spyder=3.2.8=py36_0 221 | - sqlalchemy=1.2.7=py36ha85dd04_0 222 | - sqlite=3.23.1=h35aae40_0 223 | - statsmodels=0.9.0=py36h452e1ab_0 224 | - sympy=1.1.1=py36h96708e0_0 225 | - tblib=1.3.2=py36h30f5020_0 226 | - terminado=0.8.1=py36_1 227 | - testpath=0.3.1=py36h2698cfe_0 228 | - tk=8.6.7=hcb92d03_3 229 | - toolz=0.9.0=py36_0 230 | - tornado=5.0.2=py36_0 231 | - traitlets=4.3.2=py36h096827d_0 232 | - typing=3.6.4=py36_0 233 | - unicodecsv=0.14.1=py36h6450c06_0 234 | - urllib3=1.22=py36h276f60a_0 235 | - vc=14=h0510ff6_3 236 | - vs2015_runtime=14.0.25123=3 237 | - wcwidth=0.1.7=py36h3d5aa90_0 238 | - webencodings=0.5.1=py36h67c50ae_1 239 | - werkzeug=0.14.1=py36_0 240 | - wheel=0.31.1=py36_0 241 | - widgetsnbextension=3.2.1=py36_0 242 | - win_inet_pton=1.0.1=py36he67d7fd_1 243 | - win_unicode_console=0.5=py36hcdbd4b5_0 244 | - wincertstore=0.2=py36h7fe50ca_0 245 | - winpty=0.4.3=4 246 | - wrapt=1.10.11=py36he5f5981_0 247 | - xlrd=1.1.0=py36h1cb58dc_1 248 | - xlsxwriter=1.0.4=py36_0 249 | - 
xlwings=0.11.8=py36_0 250 | - xlwt=1.3.0=py36h1a4751e_0 251 | - yaml=0.1.7=hc54c509_2 252 | - zeromq=4.2.5=hc6251cf_0 253 | - zict=0.1.3=py36h2d8e73e_0 254 | - zlib=1.2.11=h8395fce_2 255 | - pip: 256 | - tables==3.4.3 257 | prefix: D:\Anaconda3 258 | 259 | -------------------------------------------------------------------------------- /helper.py: -------------------------------------------------------------------------------- 1 | import json 2 | import re, csv 3 | from wordcloud import WordCloud, STOPWORDS 4 | from matplotlib import pyplot as plt 5 | from process_text import * 6 | import pandas as pd 7 | import numpy as np 8 | 9 | 10 | 11 | def load_data(file_name): 12 | """ 13 | Open the saved json data file and load the data into a dict. 14 | 15 | Parameters: 16 | file_name: the saved file name, e.g. "machine_learning_engineer.json" 17 | 18 | Returns: 19 | postings_dict: data in dict format 20 | 21 | """ 22 | 23 | with open(file_name, 'r') as f: 24 | postings_dict = json.load(f) 25 | return postings_dict 26 | 27 | 28 | 29 | def plot_wc(text, max_words=200, stopwords_list=[], to_file_name=None): 30 | """ 31 | Make a word cloud plot using the given text. 32 | 33 | Parameters: 34 | text -- the text as a string 35 | 36 | Returns: 37 | None 38 | """ 39 | wordcloud = WordCloud().generate(text) 40 | stopwords = set(STOPWORDS) 41 | stopwords.update(stopwords_list) 42 | 43 | wordcloud = WordCloud(background_color='white', 44 | stopwords=stopwords, 45 | #prefer_horizontal=1, 46 | max_words=max_words, 47 | min_font_size=6, 48 | scale=1, 49 | width = 800, height = 800, 50 | random_state=8).generate(text) 51 | 52 | plt.figure(figsize=[16,12]) 53 | plt.imshow(wordcloud, interpolation="bilinear") 54 | plt.axis("off") 55 | plt.show() 56 | 57 | if to_file_name: 58 | to_file_name = to_file_name + ".png" 59 | wordcloud.to_file(to_file_name) 60 | 61 | 62 | 63 | def plot_profile(title, 64 | first_n_postings, 65 | max_words=200, 66 | return_posting=False, 67 | return_tokens=False, 68 | return_text_list=False): 69 | """ 70 | Loads the corresponding json file, extracts the first_n job postings and plot the wordcloud profile. 71 | 72 | Parameters: 73 | title: the job title such as "data scientist" 74 | first_n_postings: int, the first n job postings to use for the plot. 75 | 76 | Returns: 77 | nth_posting: the nth job posting as a string. This helps to verify the first_n_postings param used. 78 | 79 | """ 80 | # Convert title to full file name then load the data 81 | file_name = '_'.join(title.split()) + '.json' 82 | data = load_data(file_name) 83 | 84 | # Only of the two can be True 85 | if (return_posting + return_tokens + return_text_list) >= 2: 86 | print('You can only return one of these: a posting, tokens, text list! 
\nPlease try again.') 87 | return None 88 | 89 | if return_posting: 90 | n_posting = data[str(first_n_postings)] 91 | return n_posting 92 | 93 | text_list = make_text_list(data, first_n_postings) 94 | 95 | if return_text_list: 96 | return text_list 97 | elif return_tokens: 98 | tokens = tokenize_list(text_list, return_string=False) 99 | return tokens 100 | else: 101 | # Get the tokens joined as a string 102 | text = tokenize_list(text_list, return_string=True) 103 | # Get the stop words to use 104 | with open('stopwords.csv', 'r', newline='') as f: 105 | reader = csv.reader(f) 106 | stop_list = list(reader)[0] 107 | to_file_name = '_'.join(title.split()) 108 | plot_wc(text, max_words, stopwords_list=stop_list, to_file_name=to_file_name) 109 | 110 | 111 | 112 | def plot_title(df, title, save_figure=False): 113 | """ 114 | Plots the skill frequencies of all skill categories for a given title. 115 | 116 | Params: 117 | df: (pandas df) the frequency df 118 | title: (str) one of the three job titles: 119 | 'data scientist', 'machine learning engineer', 'data engineer' 120 | 121 | Returns: 122 | None 123 | 124 | """ 125 | categories = df.category.unique() 126 | titles = list(df.title.unique()) 127 | 128 | # Ensure input is valid 129 | if title.title() not in titles: 130 | print('Title invalid. Please try again!') 131 | return None 132 | title = title.title() 133 | # Subset df to the given title 134 | df_title = df.query('title==@title') 135 | # Set up the parameters for the plotting grid 136 | nrows=4 137 | ncols=2 138 | figsize = (15, 20) 139 | # Add a dummy category name to match the grid 140 | categories = np.append(categories, 'Empty').reshape(4, 2) 141 | 142 | # Generate the plotting objects 143 | fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize) 144 | 145 | # Loop thru the axes of the figure 146 | for row in range(nrows): 147 | for col in range(ncols): 148 | cat = categories[row, col] 149 | # Subset to one category for each subplot 150 | df_cat = df_title.query('category==@cat') 151 | df_cat = df_cat.sort_values(by='frequency', ascending=False) 152 | # Find the correspoinding axis in axes 153 | ax = axes[row, col] 154 | # Handle errors for the empty last subplot 155 | try: 156 | df_cat.plot(x='skill', y='frequency', kind='bar', ax=ax) 157 | ax.set(title=cat, xlabel='', ylabel='Frequency') 158 | ax.get_legend().remove() # remove legend 159 | for tick in ax.get_xticklabels(): 160 | tick.set_rotation(60) 161 | except: 162 | fig.delaxes(ax) 163 | 164 | # Add the figure title 165 | fig_title = title + ' Skills Distribution' 166 | fig.suptitle(fig_title, y=0.92, verticalalignment='bottom', fontsize=30) 167 | plt.subplots_adjust(hspace=0.9) # make sure the figure title doesn't overlap with subplot titles 168 | plt.show() 169 | 170 | if save_figure: 171 | figure_name = fig_title + '.png' 172 | fig.savefig(figure_name) 173 | 174 | 175 | 176 | def plot_skill(df, cat, save_figure=False): 177 | """ 178 | Plots the skill frequencies of all job titles for a given skill category. 179 | 180 | Params: 181 | df: (pandas df) the frequency df 182 | cat: (str) one of the seven skill categories: 183 | 'Programming Languages', 'Big Data Technologies'... 184 | 185 | Returns: 186 | None 187 | 188 | """ 189 | categories = list(df.category.unique()) 190 | titles = list(df.title.unique()) 191 | 192 | if cat.title() not in categories: 193 | print('Category invalid. 
Please try again!')
194 |         return None
195 |     cat = cat.title()
196 | 
197 |     # Subset df to the given category
198 |     df_cat = df.query('category==@cat')
199 | 
200 |     # Set up the parameters for the plotting grid
201 |     nrows = len(titles)
202 |     ncols = 1
203 |     figsize = (10, 12)
204 | 
205 |     # Generate the plotting objects
206 |     fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
207 | 
208 |     # Loop thru the axes of the figure
209 |     for row in range(nrows):
210 |         title = titles[row]
211 |         # Subset to one title for each subplot
212 |         df_title = df_cat.query('title==@title')
213 |         df_title = df_title.sort_values(by='frequency', ascending=False)
214 |         # Find the corresponding axis in axes
215 |         ax = axes[row]
216 |         df_title.plot(x='skill', y='frequency', kind='bar', ax=ax)
217 |         ax.set(title=title, xlabel='', ylabel='Frequency')
218 |         ax.get_legend().remove()  # remove legend
219 |         for tick in ax.get_xticklabels():
220 |             tick.set_rotation(30)
221 | 
222 |     # Add the figure title
223 |     fig_title = cat + ' Distribution'
224 |     fig.suptitle(fig_title, y=0.95, verticalalignment='baseline', fontsize=30)
225 |     plt.subplots_adjust(hspace=0.36)  # make sure the figure title doesn't overlap with subplot titles
226 |     plt.show()
227 | 
228 |     if save_figure:
229 |         figure_name = fig_title + '.png'
230 |         fig.savefig(figure_name)
--------------------------------------------------------------------------------
/process_text.py:
--------------------------------------------------------------------------------
1 | from string import digits
2 | #from nltk import word_tokenize
3 | import re
4 | from nltk.corpus import stopwords
5 | from nltk.stem.snowball import SnowballStemmer
6 | 
7 | 
8 | 
9 | def make_text_list(postings_dict, first_n_postings=100):
10 |     """
11 |     Extract the texts from postings_dict into a list of strings
12 | 
13 |     Parameters:
14 |         postings_dict: (dict) the loaded job postings, keyed by the posting index as a string
15 |         first_n_postings: (int) number of postings to extract, starting from index 0
16 | 
17 |     Returns:
18 |         text_list: list of job posting texts
19 | 
20 |     """
21 | 
22 |     text_list = []
23 |     for i in range(0, first_n_postings+1):
24 |         # Since some indices could be missing due to errors in scraping,
25 |         # handle the exception here to keep the loop error-free
26 |         try:
27 |             text_list.append(postings_dict[str(i)]['posting'])
28 |         except:
29 |             continue
30 | 
31 |     return text_list
32 | 
33 | 
34 | 
35 | def remove_digits(token):
36 |     """
37 |     Remove digits from a token
38 | 
39 |     Params:
40 |         token: (str) a string token
41 | 
42 |     Returns:
43 |         cleaned_token: (str) the cleaned token
44 | 
45 |     """
46 |     # Remove digits from the token
47 |     remove_digits = str.maketrans('', '', digits)
48 |     token = token.translate(remove_digits)
49 |     return token
50 | 
51 | 
52 | 
53 | def tokenize_text(text, stem=False):
54 |     """
55 |     Tokenize, stem and remove stop words for the given text
56 | 
57 |     Parameters:
58 |         text: a text string
59 | 
60 |     Returns:
61 |         tokens: the processed text as a list of tokens
62 |     """
63 |     stop_words = set(stopwords.words('english'))
64 |     #tokens = word_tokenize(text.lower())
65 | 
66 |     # Change "C++" to "Cpp" to avoid being removed below
67 |     #tokens = ['cpp' if token=='c++' else token for token in tokens]
68 |     # Same with C#
69 |     #tokens = ['csharp' if token=='c#' else token for token in tokens]
70 |     # Remove digits
71 |     #tokens = [remove_digits(token) for token in tokens]
72 |     # Remove non-alphabetic tokens and stopwords
73 |     #tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
74 | 
75 |     # Use Regex to tokenize
76 |     # Replace any non-word characters except .+# with a space
77 |     text = re.sub(r"[^\w.+#]", " ", text)
78 |     # Two cases to replace with a space:
79 |     # Case 1: \d+\.?\d+\s -- any number of digits followed by a space, with or without
80 |     #         a dot in between
81 |     # Case 2: \d+\+ -- any number of digits followed by a plus sign
82 |     text = re.sub(r"\d+\.?\d+\s|\d+\+", " ", text)
83 |     tokens = text.lower().split()
84 |     tokens = [token for token in tokens if token not in stop_words]
85 | 
86 |     # Stem tokens
87 |     if stem:
88 |         stemmer = SnowballStemmer("english")
89 |         tokens = [stemmer.stem(i) for i in tokens]
90 | 
91 |     return tokens
92 | 
93 | 
94 | 
95 | def tokenize_list(text_list, stem=False, return_string=False):
96 |     """
97 |     Tokenize the given list of texts and then combine the lists of tokens for plotting
98 | 
99 |     Parameters:
100 |         text_list -- list of job posting strings
101 | 
102 |     Returns:
103 |         text -- a single text string (if return_string), otherwise the list of all tokens
104 |     """
105 |     # Split the text based on slash, space and newline, then take set
106 |     #text = [set(re.split('/| |\n|', i)) for i in text]
107 |     #text = [set(re.split('\W', i)) for i in text_list]
108 | 
109 |     text_list_tokenized = [tokenize_text(text=i, stem=stem) for i in text_list]
110 | 
111 |     tokens = []
112 |     # Combine all token lists into one big list of tokens
113 |     for i in text_list_tokenized:
114 |         tokens += i
115 | 
116 |     if return_string:
117 |         text = ' '.join(tokens)
118 |         return text
119 | 
120 |     # Return the list of all tokens
121 |     return tokens
122 | 
123 | 
124 | 
125 | def check_freq(dict_to_check, text_list):
126 |     """
127 |     Checks each given skill's frequency in a list of posting strings.
128 | 
129 |     Params:
130 |         dict_to_check: (dict) a dict of skill strings to check frequency for, format:
131 |             {'languages': ['Python', 'R'..],
132 |              'big data': ['AWS', 'Azure'...],
133 |              ..}
134 |         text_list: (list) a list of posting strings to search in
135 | 
136 |     Returns:
137 |         freq: (dict) frequency counts
138 | 
139 |     """
140 |     freq = {}
141 | 
142 |     # Join the text together and convert words to lowercase
143 |     text = ' '.join(text_list).lower()
144 | 
145 |     for category, skill_list in dict_to_check.items():
146 |         # Initialize each category as a dictionary
147 |         freq[category] = {}
148 |         for skill in skill_list:
149 |             if len(skill) == 1:  # pad single letter skills such as "R" with spaces
150 |                 skill_name = ' ' + skill.lower() + ' '
151 |             else:
152 |                 skill_name = skill.lower()
153 |             freq[category][skill] = text.count(skill_name)
154 | 
155 |     return freq
156 | 
--------------------------------------------------------------------------------
/scrape_data.py:
--------------------------------------------------------------------------------
1 | import re
2 | import json
3 | from bs4 import BeautifulSoup
4 | from selenium import webdriver
5 | 
6 | 
7 | 
8 | def get_soup(url):
9 |     """
10 |     Given the url of a page, this function returns the soup object.
11 | 
12 |     Parameters:
13 |         url: the link to get soup object for
14 | 
15 |     Returns:
16 |         soup: soup object
17 |     """
18 |     driver = webdriver.Firefox()
19 |     driver.get(url)
20 |     html = driver.page_source
21 |     soup = BeautifulSoup(html, 'html.parser')
22 |     driver.close()
23 | 
24 |     return soup
25 | 
26 | 
27 | 
28 | def grab_job_links(soup):
29 |     """
30 |     Grab all non-sponsored job posting links from an Indeed search result page using the given soup object
31 | 
32 |     Parameters:
33 |         soup: the soup object corresponding to a search result page
34 |             e.g. https://ca.indeed.com/jobs?q=data+scientist&l=Toronto&start=20
35 | 
36 |     Returns:
37 |         urls: a python list of job posting urls
38 | 
39 |     """
40 |     urls = []
41 | 
42 |     # Loop thru all the posting links
43 |     for link in soup.find_all('h2', {'class': 'jobtitle'}):
44 |         # Sponsored postings use an "a target" attribute rather than "a href", so they are skipped automatically
45 |         partial_url = link.a.get('href')
46 |         # This is a partial url, we need to attach the prefix
47 |         url = 'https://ca.indeed.com' + partial_url
48 |         # Collect the full posting url
49 |         urls.append(url)
50 | 
51 |     return urls
52 | 
53 | 
54 | 
55 | def get_urls(query, num_pages, location):
56 |     """
57 |     Get all the job posting URLs resulting from a specific search.
58 | 
59 |     Parameters:
60 |         query: job title to query
61 |         num_pages: number of pages needed
62 |         location: city to search in
63 | 
64 |     Returns:
65 |         urls: a list of job posting URLs (when num_pages is valid)
66 |         max_pages: maximum number of pages allowed (when num_pages is invalid)
67 |     """
68 |     # We always need the first page
69 |     base_url = 'https://ca.indeed.com/jobs?q={}&l={}'.format(query, location)
70 |     soup = get_soup(base_url)
71 |     urls = grab_job_links(soup)
72 | 
73 |     # Get the total number of postings found
74 |     posting_count_string = soup.find(name='div', attrs={'id':"searchCount"}).get_text()
75 |     posting_count_string = posting_count_string[posting_count_string.find('of')+2:].strip()
76 |     #print('posting_count_string: {}'.format(posting_count_string))
77 |     #print('type is: {}'.format(type(posting_count_string)))
78 | 
79 |     try:
80 |         posting_count = int(posting_count_string)
81 |     except ValueError:  # deal with special case when parsed string is "360 jobs"
82 |         try:
83 |             posting_count = int(re.search(r'\d+', posting_count_string).group(0))
84 |         except (AttributeError, ValueError):
85 |             posting_count = 330  # fall back to 330 when unable to get the total
86 |     #print('posting_count: {}'.format(posting_count))
87 |     #print('\ntype: {}'.format(type(posting_count)))
88 | 
89 |     # Limit the number of pages to get
90 |     max_pages = round(posting_count / 10) - 3
91 |     if num_pages > max_pages:
92 |         print('returning max_pages!!')
93 |         return max_pages
94 | 
95 |     # Additional work is needed when more than 1 page is requested
96 |     if num_pages >= 2:
97 |         # Start loop from page 2 since page 1 has been dealt with above
98 |         for i in range(2, num_pages+1):
99 |             num = (i-1) * 10
100 |             base_url = 'https://ca.indeed.com/jobs?q={}&l={}&start={}'.format(query, location, num)
101 |             try:
102 |                 soup = get_soup(base_url)
103 |                 # We always combine the results back to the list
104 |                 urls += grab_job_links(soup)
105 |             except:
106 |                 continue
107 | 
108 |     # Check to ensure the number of urls gotten is correct
109 |     #assert len(urls) == num_pages * 10, "There are missing job links, check code!"
110 | 
111 |     return urls
112 | 
113 | 
114 | 
115 | def get_posting(url):
116 |     """
117 |     Get the text portion, including both the title and the job description, of the job posting at a given url
118 | 
119 |     Parameters:
120 |         url: The job posting link
121 | 
122 |     Returns:
123 |         title: the job title, lower-cased
124 |         posting: the job posting content, lower-cased
125 |     """
126 |     # Get the url content as BS object
127 |     soup = get_soup(url)
128 | 
129 |     # The job title is held in the h3 tag
130 |     title = soup.find(name='h3').getText().lower()
131 |     posting = soup.find(name='div', attrs={'class': "jobsearch-JobComponent"}).get_text()
132 | 
133 |     return title, posting.lower()
134 | 
135 | 
136 |     #if 'data scientist' in title:    # We'll proceed to grab the job posting text if the title is correct
137 |         # All the text info is contained in the div element with the below class, extract the text.
138 |         #posting = soup.find(name='div', attrs={'class': "jobsearch-JobComponent"}).get_text()
139 |         #return title, posting.lower()
140 |     #else:
141 |         #return False
142 | 
143 |     # Get rid of numbers and symbols other than given
144 |     #text = re.sub("[^a-zA-Z'+#&]", " ", text)
145 |     # Convert to lower case and split to list and then set
146 |     #text = text.lower().strip()
147 | 
148 |     #return text
149 | 
150 | 
151 | 
152 | def get_data(query, num_pages, location='Toronto'):
153 |     """
154 |     Get all the job posting data and save it in a json file using the structure below:
155 | 
156 |         {<index>: {'title': ..., 'posting': ..., 'url': ...}, ...}
157 | 
158 |     The json file name has this format: "<query>.json"
159 | 
160 |     Parameters:
161 |         query: Indeed query keyword such as 'Data Scientist'
162 |         num_pages: Number of search result pages needed
163 |         location: location to search for
164 | 
165 |     Returns:
166 |         postings_dict: Python dict including all posting data
167 | 
168 |     """
169 |     # Convert the queried title to Indeed format
170 |     query = '+'.join(query.lower().split())
171 | 
172 |     postings_dict = {}
173 |     urls = get_urls(query, num_pages, location)
174 | 
175 |     # Continue only if the requested number of pages is valid (when invalid, a number is returned instead of a list)
176 |     if isinstance(urls, list):
177 |         num_urls = len(urls)
178 |         for i, url in enumerate(urls):
179 |             try:
180 |                 title, posting = get_posting(url)
181 |                 postings_dict[i] = {}
182 |                 postings_dict[i]['title'], postings_dict[i]['posting'], postings_dict[i]['url'] = \
183 |                     title, posting, url
184 |             except:
185 |                 continue
186 | 
187 |             percent = (i+1) / num_urls
188 |             # Print the progress; the "end" arg keeps the message on the same line
189 |             print("Progress: {:2.0f}%".format(100*percent), end='\r')
190 | 
191 |         # Save the dict as a json file
192 |         file_name = query.replace('+', '_') + '.json'
193 |         with open(file_name, 'w') as f:
194 |             json.dump(postings_dict, f)
195 | 
196 |         print('All {} postings have been scraped and saved!'.format(num_urls))
197 |         #return postings_dict
198 |     else:
199 |         print("Due to similar results, maximum number of pages is only {}. Please try again!".format(urls))
200 | 
201 | 
202 | 
203 | # If script is run directly, we'll take input from the user
204 | if __name__ == "__main__":
205 |     queries = ["data scientist", "machine learning engineer", "data engineer"]
206 | 
207 |     while True:
208 |         query = input("Please enter the title to scrape data for: \n").lower()
209 |         if query in queries:
210 |             break
211 |         else:
212 |             print("Invalid title! 
Please try again.") 213 | 214 | while True: 215 | num_pages = input("Please enter the number of pages needed (integer only): \n") 216 | try: 217 | num_pages = int(num_pages) 218 | break 219 | except: 220 | print("Invalid number of pages! Please try again.") 221 | 222 | get_data(query, num_pages, location='Toronto') 223 | 224 | -------------------------------------------------------------------------------- /stopwords.csv: -------------------------------------------------------------------------------- 1 | "experience","job","work","working","skills","new","company","years","technology","ago","save","jobapply","nowapply","using","strong","ability","days","knowledge","opportunity","tools","related","including","original","understanding","us","role","degree","one","requirements","canada","required","toronto","world","provide","industry","help","saying","reviewsread","looking","preferred","sitesave","applicants","applications","part","field","etc","apply","across","position","life","application","employment","best","key","use","well","following","please","like","opportunities","within","nowsave","drive","qualifications","responsibilities","employees","global","must","equal","able","various","join","candidate","high","needs","education","time","meet","need",,"status","accommodation","diverse","successful","may","background","candidates","language","good","excellent","career","also","level","employer","flexible","companies","canadian","want","culture","grow","closely","available","relevant","diversity","approaches","group","used","demonstrated","full","languages","top","professional","multiple","type","description","based","sources","disability","location","day","current","take","national","highly","events","gender","individuals","variety","better","order","similar","concepts","effectively","way","offer","record","great","sets","different","next","human","include","ensure","plus","ontario","minimum","every","disabilities","data","team","benefit","understand","onapply","applying","benefits","around","office","require","future","asset","real","contribute","review","hand","responsible" 2 | --------------------------------------------------------------------------------