├── README.md
├── data_wrangling.ipynb
├── env_ideal_profiles.yaml
├── helper.py
├── ideal_profiles.ipynb
├── ideal_profiles_2.ipynb
├── process_text.py
├── scrape_data.py
└── stopwords.csv

/README.md:
--------------------------------------------------------------------------------
1 | # Ideal Profiles
2 | What does an ideal Data Scientist's profile look like? This project aims to provide a quantitative answer based on job postings. I scraped job posting data from Indeed and analyzed the frequencies of various Data Science skills. The analysis can then be used not only as an objective keyword reference for resume optimization, but also as a Data Science learning roadmap!
3 | 
4 | The related Medium posts are:
5 | - [What Does an Ideal Data Scientist’s Profile Look Like?](https://towardsdatascience.com/what-does-an-ideal-data-scientists-profile-look-like-7d7bd78ff7ab)
6 | - [Navigating the Data Science Careers Landscape](https://hackernoon.com/navigating-the-data-science-career-landscape-db746a61ac62)
7 | - [Scraping Job Posting Data from Indeed using Selenium and BeautifulSoup](https://towardsdatascience.com/scraping-job-posting-data-from-indeed-using-selenium-and-beautifulsoup-dfc86230baac)
8 | - [Building an End-To-End Data Science Project](https://towardsdatascience.com/building-an-end-to-end-data-science-project-28e853c0cae3)
9 | 
10 | 
11 | ## How to Use
12 | If you want to run the code locally, download the repo, build your Anaconda environment using the `env_ideal_profiles.yaml` file, and download geckodriver (see Requirements below). You can then start scraping data by running `python scrape_data.py` in Anaconda Prompt. Once you have the raw data, clean it with the `data_wrangling.ipynb` Jupyter Notebook. Finally, the `ideal_profiles_2.ipynb` Notebook can be used to make various plots. Refer to the Files list below for the role of each file, and to the usage sketch further down for a programmatic end-to-end example.
13 | 
14 | 
15 | ## Requirements
16 | - Windows 10 OS
17 | - Firefox Web Browser 63.0.3
18 | - Anaconda 3
19 | - geckodriver v0.22.0 (geckodriver-v0.22.0-win64.zip, available [here](https://github.com/mozilla/geckodriver/releases))
20 | - pandas (see the yaml file for version number, same below)
21 | - numpy
22 | - matplotlib
23 | - json
24 | - re
25 | - csv
26 | - wordcloud
27 | - nltk
28 | - bs4 (BeautifulSoup)
29 | - selenium
30 | 
31 | 
32 | ## Files
33 | - `scrape_data.py`: scrapes the data from Indeed.ca
34 | - `process_text.py`: performs various text-related operations such as removing digits, tokenizing, and checking term frequency
35 | - `helper.py`: contains data loading and various plotting functions
36 | - `data_wrangling.ipynb`: gathers the raw text data, counts term frequencies, and stores the result in a pandas dataframe
37 | - `ideal_profiles.ipynb`: creates spider plots to visualize various Data Science roles' skill requirements based on intuition
38 | - `ideal_profiles_2.ipynb`: creates skill distribution and word cloud plots to represent ideal profiles quantitatively
39 | - `stopwords.csv`: contains the stop words for word cloud plotting
40 | - `env_ideal_profiles.yaml`: the Anaconda environment file for setting up the project environment
41 | 
42 | 
43 | ## Contribute
44 | Any contribution is welcome!
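
## Usage Sketch
For reference, here is a minimal Python sketch of the workflow described in How to Use, calling the repo's functions directly instead of going through the notebooks. It is only an illustration: the `num_pages` value and the small `skills_to_check` dict are placeholders (the full skill dictionary lives in `data_wrangling.ipynb`), and it assumes the Anaconda environment, Firefox, and geckodriver are already set up.

```python
# Run from the repo root on Windows (the scraper saves a lower-cased JSON file
# name, which still matches plot_profile's lookup on a case-insensitive file system).
from scrape_data import get_data
from helper import plot_profile
from process_text import check_freq

# 1. Scrape roughly 10 result pages of postings from Indeed.ca and save them
#    to data_scientist.json.
get_data('Data Scientist', num_pages=10, location='Toronto')

# 2. Reload the saved postings as a list of raw posting texts.
text_list = plot_profile(title='Data Scientist', first_n_postings=100,
                         return_text_list=True)

# 3. Count how often selected skills are mentioned across the postings.
skills_to_check = {'Programming Languages': ['Python', 'R', 'SQL']}
print(check_freq(dict_to_check=skills_to_check, text_list=text_list))
```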
45 | 
46 | 
47 | ## To-do's
48 | - Allow querying Indeed USA instead of the Canadian site, and increase the number of postings to scrape
49 | - Allow showing the context for specific words in the word clouds
50 | - Update all docstrings and comments
51 | - Refactor towards an object-oriented (OOP) design
52 | - Code refactoring: apply the single responsibility principle to functions
53 | - Add Data Analyst and AI Engineer roles
54 | - Allow showing the percentage of mentions for a given skill, i.e., out of 1,000 job postings, what proportion mention it?
55 | 
56 | 
57 | ## License
58 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
59 | 
--------------------------------------------------------------------------------
/data_wrangling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": 1,
6 |    "metadata": {},
7 |    "outputs": [],
8 |    "source": [
9 |     "import pandas as pd\n",
10 |     "#import nltk\n",
11 |     "from matplotlib import pyplot as plt\n",
12 |     "from scrape_data import *\n",
13 |     "from process_text import *\n",
14 |     "from helper import *"
15 |    ]
16 |   },
17 |   {
18 |    "cell_type": "code",
19 |    "execution_count": 2,
20 |    "metadata": {},
21 |    "outputs": [],
22 |    "source": [
23 |     "# Initialize the dict to store the text lists for the titles below\n",
24 |     "text_lists = {}\n",
25 |     "titles = ['Data Scientist', 'Machine Learning Engineer', 'Data Engineer']\n",
26 |     "# Grab the text list for each title and store it in the dict\n",
27 |     "for title in titles:\n",
28 |     "    text_lists[title] = plot_profile(title=title, first_n_postings=120, return_text_list=True)"
29 |    ]
30 |   },
31 |   {
32 |    "cell_type": "code",
33 |    "execution_count": 3,
34 |    "metadata": {},
35 |    "outputs": [],
36 |    "source": [
37 |     "# Make the dict of skills to investigate\n",
38 |     "\n",
39 |     "languages = ['Python', 'R', 'SQL', 'Java', 'C', 'C++', 'C#', 'Scala', 'Perl', 'Julia', \n",
40 |     "             'Javascript', 'HTML', 'CSS', 'PHP', 'Ruby', 'Lua', 'MATLAB', 'SAS'] \n",
41 |     "\n",
42 |     "big_data = ['Hadoop', 'MapReduce', 'Hive', 'Pig', 'Cascading', 'Scalding', 'Cascalog', 'HBase', 'Sqoop', \n",
43 |     "            'Mahout', 'Oozie', 'Flume', 'ZooKeeper', 'Spark', 'Storm', 'Shark', 'Impala', 'Elasticsearch', \n",
44 |     "            'Kafka', 'Flink', 'Kinesis', 'Presto', 'Hume', 'Airflow', 'Azkaban', 'Luigi', 'Cassandra']\n",
45 |     "\n",
46 |     "dl = ['TensorFlow', 'Keras', 'PyTorch', 'Theano', 'Deeplearning4J', 'Caffe', 'TFLearn', 'Torch', \n",
47 |     "      'OpenCV', 'MXNet', 'Microsoft Cognitive Toolkit', 'Lasagne']\n",
48 |     "\n",
49 |     "cloud = ['AWS', 'GCP', 'Azure']\n",
50 |     "\n",
51 |     "ml = ['Natural Language Processing', 'Computer Vision', 'Speech Recognition', 'Fraud Detection',\n",
52 |     "      'Recommender System', 'Image Recognition', 'Object Detection', 'Chatbot', 'Sentiment Analysis']\n",
53 |     "\n",
54 |     "visualization = ['Dimple', 'D3.js', 'Ggplot', 'Shiny', 'Plotly', 'Matplotlib', 'Seaborn', \n",
55 |     "                 'Bokeh', 'Tableau']\n",
56 |     "\n",
57 |     "other = ['Pandas', 'Numpy', 'Scipy', 'Sklearn', 'Scikit-Learn', 'Docker', 'Git', 'Jira', 'Kaggle']\n",
58 |     "\n",
59 |     "dict_to_check = {'Programming Languages': languages,\n",
60 |     "                 'Big Data Technologies': big_data,\n",
61 |     "                 'Deep Learning Frameworks': dl,\n",
62 |     "                 'Cloud Computing Platforms': cloud,\n",
63 |     "                 'Machine Learning Application': ml,\n",
64 |     "                 'Visualization Tools': visualization,\n",
65 |     "                 'Other': other}"
66 |    ]
67 |   },
68 |   {
69 |    "cell_type": "code",
70 |    "execution_count": 4,
71 |    "metadata": {},
72 | 
"outputs": [], 73 | "source": [ 74 | "# Check the frequency and store in dict\n", 75 | "freq_dict = {}\n", 76 | "for title in text_lists.keys():\n", 77 | " freq_dict[title] = check_freq(dict_to_check=dict_to_check, text_list=text_lists[title])" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 5, 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "data": { 87 | "text/html": [ 88 | "
" 256 | ], 257 | "text/plain": [ 258 | " Python R SQL Java C C++ \\\n", 259 | "Data Engineer Big Data Technologies NaN NaN NaN NaN NaN NaN \n", 260 | " Cloud Computing Platforms NaN NaN NaN NaN NaN NaN \n", 261 | " Deep Learning Frameworks NaN NaN NaN NaN NaN NaN \n", 262 | " Machine Learning Application NaN NaN NaN NaN NaN NaN \n", 263 | " Other NaN NaN NaN NaN NaN NaN \n", 264 | "\n", 265 | " C# Scala Perl Julia ... \\\n", 266 | "Data Engineer Big Data Technologies NaN NaN NaN NaN ... \n", 267 | " Cloud Computing Platforms NaN NaN NaN NaN ... \n", 268 | " Deep Learning Frameworks NaN NaN NaN NaN ... \n", 269 | " Machine Learning Application NaN NaN NaN NaN ... \n", 270 | " Other NaN NaN NaN NaN ... \n", 271 | "\n", 272 | " Tableau Pandas Numpy Scipy \\\n", 273 | "Data Engineer Big Data Technologies NaN NaN NaN NaN \n", 274 | " Cloud Computing Platforms NaN NaN NaN NaN \n", 275 | " Deep Learning Frameworks NaN NaN NaN NaN \n", 276 | " Machine Learning Application NaN NaN NaN NaN \n", 277 | " Other NaN 3.0 0.0 2.0 \n", 278 | "\n", 279 | " Sklearn Scikit-Learn Docker \\\n", 280 | "Data Engineer Big Data Technologies NaN NaN NaN \n", 281 | " Cloud Computing Platforms NaN NaN NaN \n", 282 | " Deep Learning Frameworks NaN NaN NaN \n", 283 | " Machine Learning Application NaN NaN NaN \n", 284 | " Other 0.0 0.0 8.0 \n", 285 | "\n", 286 | " Git Jira Kaggle \n", 287 | "Data Engineer Big Data Technologies NaN NaN NaN \n", 288 | " Cloud Computing Platforms NaN NaN NaN \n", 289 | " Deep Learning Frameworks NaN NaN NaN \n", 290 | " Machine Learning Application NaN NaN NaN \n", 291 | " Other 29.0 1.0 0.0 \n", 292 | "\n", 293 | "[5 rows x 87 columns]" 294 | ] 295 | }, 296 | "execution_count": 5, 297 | "metadata": {}, 298 | "output_type": "execute_result" 299 | } 300 | ], 301 | "source": [ 302 | "# Convert the dict to a pandas df\n", 303 | "df = pd.DataFrame.from_dict({(i,j): freq_dict[i][j] \n", 304 | " for i in freq_dict.keys()\n", 305 | " for j in freq_dict[i].keys()},\n", 306 | " orient='index')\n", 307 | "df.head()" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": 6, 313 | "metadata": {}, 314 | "outputs": [ 315 | { 316 | "data": { 317 | "text/html": [ 318 | "
" 484 | ], 485 | "text/plain": [ 486 | " level_0 level_1 Python R SQL Java C \\\n", 487 | "0 Data Engineer Big Data Technologies NaN NaN NaN NaN NaN \n", 488 | "1 Data Engineer Cloud Computing Platforms NaN NaN NaN NaN NaN \n", 489 | "2 Data Engineer Deep Learning Frameworks NaN NaN NaN NaN NaN \n", 490 | "3 Data Engineer Machine Learning Application NaN NaN NaN NaN NaN \n", 491 | "4 Data Engineer Other NaN NaN NaN NaN NaN \n", 492 | "\n", 493 | " C++ C# Scala ... Tableau Pandas Numpy Scipy Sklearn \\\n", 494 | "0 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 495 | "1 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 496 | "2 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 497 | "3 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 498 | "4 NaN NaN NaN ... NaN 3.0 0.0 2.0 0.0 \n", 499 | "\n", 500 | " Scikit-Learn Docker Git Jira Kaggle \n", 501 | "0 NaN NaN NaN NaN NaN \n", 502 | "1 NaN NaN NaN NaN NaN \n", 503 | "2 NaN NaN NaN NaN NaN \n", 504 | "3 NaN NaN NaN NaN NaN \n", 505 | "4 0.0 8.0 29.0 1.0 0.0 \n", 506 | "\n", 507 | "[5 rows x 89 columns]" 508 | ] 509 | }, 510 | "execution_count": 6, 511 | "metadata": {}, 512 | "output_type": "execute_result" 513 | } 514 | ], 515 | "source": [ 516 | "# Reset the index to include both title and category as columns\n", 517 | "df = df.reset_index()\n", 518 | "df.head()" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 7, 524 | "metadata": {}, 525 | "outputs": [ 526 | { 527 | "data": { 528 | "text/html": [ 529 | "
" 695 | ], 696 | "text/plain": [ 697 | " title category Python R SQL Java C \\\n", 698 | "0 Data Engineer Big Data Technologies NaN NaN NaN NaN NaN \n", 699 | "1 Data Engineer Cloud Computing Platforms NaN NaN NaN NaN NaN \n", 700 | "2 Data Engineer Deep Learning Frameworks NaN NaN NaN NaN NaN \n", 701 | "3 Data Engineer Machine Learning Application NaN NaN NaN NaN NaN \n", 702 | "4 Data Engineer Other NaN NaN NaN NaN NaN \n", 703 | "\n", 704 | " C++ C# Scala ... Tableau Pandas Numpy Scipy Sklearn \\\n", 705 | "0 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 706 | "1 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 707 | "2 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 708 | "3 NaN NaN NaN ... NaN NaN NaN NaN NaN \n", 709 | "4 NaN NaN NaN ... NaN 3.0 0.0 2.0 0.0 \n", 710 | "\n", 711 | " Scikit-Learn Docker Git Jira Kaggle \n", 712 | "0 NaN NaN NaN NaN NaN \n", 713 | "1 NaN NaN NaN NaN NaN \n", 714 | "2 NaN NaN NaN NaN NaN \n", 715 | "3 NaN NaN NaN NaN NaN \n", 716 | "4 0.0 8.0 29.0 1.0 0.0 \n", 717 | "\n", 718 | "[5 rows x 89 columns]" 719 | ] 720 | }, 721 | "execution_count": 7, 722 | "metadata": {}, 723 | "output_type": "execute_result" 724 | } 725 | ], 726 | "source": [ 727 | "# Rename the first two columns\n", 728 | "df.rename({'level_0': 'title', 'level_1': 'category'}, axis='columns', inplace=True)\n", 729 | "df.head()" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": 8, 735 | "metadata": {}, 736 | "outputs": [ 737 | { 738 | "data": { 739 | "text/html": [ 740 | "
" 803 | ], 804 | "text/plain": [ 805 | " title category variable value\n", 806 | "0 Data Engineer Big Data Technologies Python NaN\n", 807 | "1 Data Engineer Cloud Computing Platforms Python NaN\n", 808 | "2 Data Engineer Deep Learning Frameworks Python NaN\n", 809 | "3 Data Engineer Machine Learning Application Python NaN\n", 810 | "4 Data Engineer Other Python NaN" 811 | ] 812 | }, 813 | "execution_count": 8, 814 | "metadata": {}, 815 | "output_type": "execute_result" 816 | } 817 | ], 818 | "source": [ 819 | "value_vars = df.columns.tolist()[2:] # the list of column names except the first two\n", 820 | "# Transform from wide to long for plotting\n", 821 | "df = pd.melt(df, id_vars=['title', 'category'], value_vars=value_vars)\n", 822 | "df.head()" 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": 9, 828 | "metadata": {}, 829 | "outputs": [ 830 | { 831 | "data": { 832 | "text/html": [ 833 | "
" 896 | ], 897 | "text/plain": [ 898 | " title category skill frequency\n", 899 | "0 Data Engineer Big Data Technologies Python NaN\n", 900 | "1 Data Engineer Cloud Computing Platforms Python NaN\n", 901 | "2 Data Engineer Deep Learning Frameworks Python NaN\n", 902 | "3 Data Engineer Machine Learning Application Python NaN\n", 903 | "4 Data Engineer Other Python NaN" 904 | ] 905 | }, 906 | "execution_count": 9, 907 | "metadata": {}, 908 | "output_type": "execute_result" 909 | } 910 | ], 911 | "source": [ 912 | "# Rename the last two columns\n", 913 | "df.rename({'variable': 'skill', 'value': 'frequency'}, axis='columns', inplace=True)\n", 914 | "df.head()" 915 | ] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "execution_count": 10, 920 | "metadata": {}, 921 | "outputs": [ 922 | { 923 | "data": { 924 | "text/html": [ 925 | "
" 988 | ], 989 | "text/plain": [ 990 | " title category skill frequency\n", 991 | "5 Data Engineer Programming Languages Python 52.0\n", 992 | "12 Data Scientist Programming Languages Python 103.0\n", 993 | "19 Machine Learning Engineer Programming Languages Python 71.0\n", 994 | "26 Data Engineer Programming Languages R 5.0\n", 995 | "33 Data Scientist Programming Languages R 19.0" 996 | ] 997 | }, 998 | "execution_count": 10, 999 | "metadata": {}, 1000 | "output_type": "execute_result" 1001 | } 1002 | ], 1003 | "source": [ 1004 | "# Subset to non null values in the freq column\n", 1005 | "df = df[df['frequency'].notnull()]\n", 1006 | "df.head()" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "code", 1011 | "execution_count": 11, 1012 | "metadata": {}, 1013 | "outputs": [ 1014 | { 1015 | "data": { 1016 | "text/html": [ 1017 | "
" 1080 | ], 1081 | "text/plain": [ 1082 | " title category skill frequency\n", 1083 | "0 Data Engineer Programming Languages Python 52.0\n", 1084 | "1 Data Scientist Programming Languages Python 103.0\n", 1085 | "2 Machine Learning Engineer Programming Languages Python 71.0\n", 1086 | "3 Data Engineer Programming Languages R 5.0\n", 1087 | "4 Data Scientist Programming Languages R 19.0" 1088 | ] 1089 | }, 1090 | "execution_count": 11, 1091 | "metadata": {}, 1092 | "output_type": "execute_result" 1093 | } 1094 | ], 1095 | "source": [ 1096 | "# Reset the index\n", 1097 | "df.reset_index(drop=True, inplace=True)\n", 1098 | "df.head()" 1099 | ] 1100 | }, 1101 | { 1102 | "cell_type": "code", 1103 | "execution_count": 12, 1104 | "metadata": {}, 1105 | "outputs": [ 1106 | { 1107 | "data": { 1108 | "text/plain": [ 1109 | "title object\n", 1110 | "category object\n", 1111 | "skill object\n", 1112 | "frequency int32\n", 1113 | "dtype: object" 1114 | ] 1115 | }, 1116 | "execution_count": 12, 1117 | "metadata": {}, 1118 | "output_type": "execute_result" 1119 | } 1120 | ], 1121 | "source": [ 1122 | "df = df.astype({'frequency': int})\n", 1123 | "df.dtypes" 1124 | ] 1125 | }, 1126 | { 1127 | "cell_type": "code", 1128 | "execution_count": 13, 1129 | "metadata": {}, 1130 | "outputs": [], 1131 | "source": [ 1132 | "df.to_csv('skill_frequencies.csv')" 1133 | ] 1134 | } 1135 | ], 1136 | "metadata": { 1137 | "kernelspec": { 1138 | "display_name": "Python 3", 1139 | "language": "python", 1140 | "name": "python3" 1141 | }, 1142 | "language_info": { 1143 | "codemirror_mode": { 1144 | "name": "ipython", 1145 | "version": 3 1146 | }, 1147 | "file_extension": ".py", 1148 | "mimetype": "text/x-python", 1149 | "name": "python", 1150 | "nbconvert_exporter": "python", 1151 | "pygments_lexer": "ipython3", 1152 | "version": "3.6.5" 1153 | } 1154 | }, 1155 | "nbformat": 4, 1156 | "nbformat_minor": 2 1157 | } 1158 | -------------------------------------------------------------------------------- /env_ideal_profiles.yaml: -------------------------------------------------------------------------------- 1 | name: base 2 | channels: 3 | - conda-forge 4 | - anaconda-fusion 5 | - defaults 6 | dependencies: 7 | - conda=4.5.11=py36_1000 8 | - selenium=3.14.1=py36hfa6e2cd_1000 9 | - wordcloud=1.4.1=py36_0 10 | - _ipyw_jlab_nb_ext_conf=0.1.0=py36he6757f0_0 11 | - alabaster=0.7.10=py36hcd07829_0 12 | - anaconda=5.2.0=py36_3 13 | - anaconda-client=1.6.14=py36_0 14 | - anaconda-navigator=1.8.7=py36_0 15 | - anaconda-project=0.8.2=py36hfad2e28_0 16 | - asn1crypto=0.24.0=py36_0 17 | - astroid=1.6.3=py36_0 18 | - astropy=3.0.2=py36h452e1ab_1 19 | - attrs=18.1.0=py36_0 20 | - babel=2.5.3=py36_0 21 | - backcall=0.1.0=py36_0 22 | - backports=1.0=py36h81696a8_1 23 | - backports.shutil_get_terminal_size=1.0.0=py36h79ab834_2 24 | - beautifulsoup4=4.6.0=py36hd4cc5e8_1 25 | - bitarray=0.8.1=py36hfa6e2cd_1 26 | - bkcharts=0.2=py36h7e685f7_0 27 | - blas=1.0=mkl 28 | - blaze=0.11.3=py36h8a29ca5_0 29 | - bleach=2.1.3=py36_0 30 | - blosc=1.14.3=he51fdeb_0 31 | - bokeh=0.12.16=py36_0 32 | - boto=2.48.0=py36h1a776d2_1 33 | - bottleneck=1.2.1=py36hd119dfa_0 34 | - bzip2=1.0.6=hfa6e2cd_5 35 | - ca-certificates=2018.03.07=0 36 | - certifi=2018.4.16=py36_0 37 | - cffi=1.11.5=py36h945400d_0 38 | - chardet=3.0.4=py36h420ce6e_1 39 | - click=6.7=py36hec8c647_0 40 | - cloudpickle=0.5.3=py36_0 41 | - clyent=1.2.2=py36hb10d595_1 42 | - colorama=0.3.9=py36h029ae33_0 43 | - comtypes=1.1.4=py36_0 44 | - conda-build=3.10.5=py36_0 45 | - conda-env=2.6.0=h36134e3_1 46 | 
- conda-verify=2.0.0=py36h065de53_0 47 | - console_shortcut=0.1.1=h6bb2dd7_3 48 | - contextlib2=0.5.5=py36he5d52c0_0 49 | - cryptography=2.2.2=py36hfa6e2cd_0 50 | - curl=7.60.0=h7602738_0 51 | - cycler=0.10.0=py36h009560c_0 52 | - cython=0.28.2=py36hfa6e2cd_0 53 | - cytoolz=0.9.0.1=py36hfa6e2cd_0 54 | - dask=0.17.5=py36_0 55 | - dask-core=0.17.5=py36_0 56 | - datashape=0.5.4=py36h5770b85_0 57 | - decorator=4.3.0=py36_0 58 | - distributed=1.21.8=py36_0 59 | - docutils=0.14=py36h6012d8f_0 60 | - entrypoints=0.2.3=py36hfd66bb0_2 61 | - et_xmlfile=1.0.1=py36h3d2d736_0 62 | - fastcache=1.0.2=py36hfa6e2cd_2 63 | - filelock=3.0.4=py36_0 64 | - flask=1.0.2=py36_1 65 | - flask-cors=3.0.4=py36_0 66 | - freetype=2.8=h51f8f2c_1 67 | - get_terminal_size=1.0.0=h38e98db_0 68 | - gevent=1.3.0=py36hfa6e2cd_0 69 | - glob2=0.6=py36hdf76b57_0 70 | - greenlet=0.4.13=py36hfa6e2cd_0 71 | - h5py=2.7.1=py36h3bdd7fb_2 72 | - hdf5=1.10.2=hac2f561_1 73 | - heapdict=1.0.0=py36_2 74 | - html5lib=1.0.1=py36h047fa9f_0 75 | - icc_rt=2017.0.4=h97af966_0 76 | - icu=58.2=ha66f8fd_1 77 | - idna=2.6=py36h148d497_1 78 | - imageio=2.3.0=py36_0 79 | - imagesize=1.0.0=py36_0 80 | - intel-openmp=2018.0.0=8 81 | - ipykernel=4.8.2=py36_0 82 | - ipython=6.4.0=py36_0 83 | - ipython_genutils=0.2.0=py36h3c5d0ee_0 84 | - ipywidgets=7.2.1=py36_0 85 | - isort=4.3.4=py36_0 86 | - itsdangerous=0.24=py36hb6c5a24_1 87 | - jdcal=1.4=py36_0 88 | - jedi=0.12.0=py36_1 89 | - jinja2=2.10=py36h292fed1_0 90 | - jpeg=9b=hb83a4c4_2 91 | - jsonschema=2.6.0=py36h7636477_0 92 | - jupyter=1.0.0=py36_4 93 | - jupyter_client=5.2.3=py36_0 94 | - jupyter_console=5.2.0=py36h6d89b47_1 95 | - jupyter_core=4.4.0=py36h56e9d50_0 96 | - jupyterlab=0.32.1=py36_0 97 | - jupyterlab_launcher=0.10.5=py36_0 98 | - kiwisolver=1.0.1=py36h12c3424_0 99 | - lazy-object-proxy=1.3.1=py36hd1c21d2_0 100 | - libcurl=7.60.0=hc4dcbb0_0 101 | - libiconv=1.15=h1df5818_7 102 | - libpng=1.6.34=h79bbb47_0 103 | - libsodium=1.0.16=h9d3ae62_0 104 | - libssh2=1.8.0=hd619d38_4 105 | - libtiff=4.0.9=hb8ad9f9_1 106 | - libxml2=2.9.8=hadb2253_1 107 | - libxslt=1.1.32=hf6f1972_0 108 | - llvmlite=0.23.1=py36hcacf6c6_0 109 | - locket=0.2.0=py36hfed976d_1 110 | - lxml=4.2.1=py36heafd4d3_0 111 | - lzo=2.10=h6df0209_2 112 | - m2w64-gcc-libgfortran=5.3.0=6 113 | - m2w64-gcc-libs=5.3.0=7 114 | - m2w64-gcc-libs-core=5.3.0=7 115 | - m2w64-gmp=6.1.0=2 116 | - m2w64-libwinpthread-git=5.0.0.4634.697f757=2 117 | - markupsafe=1.0=py36h0e26971_1 118 | - matplotlib=2.2.2=py36h153e9ff_1 119 | - mccabe=0.6.1=py36hb41005a_1 120 | - menuinst=1.4.14=py36hfa6e2cd_0 121 | - mistune=0.8.3=py36hfa6e2cd_1 122 | - mkl=2018.0.2=1 123 | - mkl-service=1.1.2=py36h57e144c_4 124 | - mkl_fft=1.0.1=py36h452e1ab_0 125 | - mkl_random=1.0.1=py36h9258bd6_0 126 | - more-itertools=4.1.0=py36_0 127 | - mpmath=1.0.0=py36hacc8adf_2 128 | - msgpack-python=0.5.6=py36he980bc4_0 129 | - msys2-conda-epoch=20160418=1 130 | - multipledispatch=0.5.0=py36_0 131 | - navigator-updater=0.2.1=py36_0 132 | - nbconvert=5.3.1=py36h8dc0fde_0 133 | - nbformat=4.4.0=py36h3a5bc1b_0 134 | - networkx=2.1=py36_0 135 | - nltk=3.3.0=py36_0 136 | - nose=1.3.7=py36h1c3779e_2 137 | - notebook=5.5.0=py36_0 138 | - numba=0.38.0=py36h830ac7b_0 139 | - numexpr=2.6.5=py36hcd2f87e_0 140 | - numpy=1.14.3=py36h9fa60d3_1 141 | - numpy-base=1.14.3=py36h555522e_1 142 | - numpydoc=0.8.0=py36_0 143 | - odo=0.5.1=py36h7560279_0 144 | - olefile=0.45.1=py36_0 145 | - openpyxl=2.5.3=py36_0 146 | - openssl=1.0.2o=h8ea7d77_0 147 | - packaging=17.1=py36_0 148 | - 
pandas=0.23.0=py36h830ac7b_0 149 | - pandoc=1.19.2.1=hb2460c7_1 150 | - pandocfilters=1.4.2=py36h3ef6317_1 151 | - parso=0.2.0=py36_0 152 | - partd=0.3.8=py36hc8e763b_0 153 | - path.py=11.0.1=py36_0 154 | - pathlib2=2.3.2=py36_0 155 | - patsy=0.5.0=py36_0 156 | - pep8=1.7.1=py36_0 157 | - pickleshare=0.7.4=py36h9de030f_0 158 | - pillow=5.1.0=py36h0738816_0 159 | - pip=10.0.1=py36_0 160 | - pkginfo=1.4.2=py36_1 161 | - plotly=3.4.1=py36h28b3542_0 162 | - pluggy=0.6.0=py36hc7daf1e_0 163 | - ply=3.11=py36_0 164 | - prompt_toolkit=1.0.15=py36h60b8f86_0 165 | - psutil=5.4.5=py36hfa6e2cd_0 166 | - py=1.5.3=py36_0 167 | - pycodestyle=2.4.0=py36_0 168 | - pycosat=0.6.3=py36h413d8a4_0 169 | - pycparser=2.18=py36hd053e01_1 170 | - pycrypto=2.6.1=py36hfa6e2cd_8 171 | - pycurl=7.43.0.1=py36h74b6da3_0 172 | - pyflakes=1.6.0=py36h0b975d6_0 173 | - pygments=2.2.0=py36hb010967_0 174 | - pylint=1.8.4=py36_0 175 | - pyodbc=4.0.23=py36h6538335_0 176 | - pyopenssl=18.0.0=py36_0 177 | - pyparsing=2.2.0=py36h785a196_1 178 | - pyqt=5.9.2=py36h1aa27d4_0 179 | - pysocks=1.6.8=py36_0 180 | - pytables=3.4.3=py36he6f6034_1 181 | - pytest=3.5.1=py36_0 182 | - pytest-arraydiff=0.2=py36_0 183 | - pytest-astropy=0.3.0=py36_0 184 | - pytest-doctestplus=0.1.3=py36_0 185 | - pytest-openfiles=0.3.0=py36_0 186 | - pytest-remotedata=0.2.1=py36_0 187 | - python=3.6.5=h0c2934d_0 188 | - python-dateutil=2.7.3=py36_0 189 | - pytz=2018.4=py36_0 190 | - pywavelets=0.5.2=py36hc649158_0 191 | - pywin32=223=py36hfa6e2cd_1 192 | - pywinpty=0.5.1=py36_0 193 | - pyyaml=3.12=py36h1d1928f_1 194 | - pyzmq=17.0.0=py36hfa6e2cd_1 195 | - qt=5.9.5=vc14he4a7d60_0 196 | - qtawesome=0.4.4=py36h5aa48f6_0 197 | - qtconsole=4.3.1=py36h99a29a9_0 198 | - qtpy=1.4.1=py36_0 199 | - requests=2.18.4=py36h4371aae_1 200 | - retrying=1.3.3=py36_2 201 | - rope=0.10.7=py36had63a69_0 202 | - ruamel_yaml=0.15.35=py36hfa6e2cd_1 203 | - scikit-image=0.13.1=py36hfa6e2cd_1 204 | - scikit-learn=0.19.1=py36h53aea1b_0 205 | - scipy=1.1.0=py36h672f292_0 206 | - seaborn=0.8.1=py36h9b69545_0 207 | - send2trash=1.5.0=py36_0 208 | - setuptools=39.1.0=py36_0 209 | - simplegeneric=0.8.1=py36_2 210 | - singledispatch=3.4.0.3=py36h17d0c80_0 211 | - sip=4.19.8=py36h6538335_0 212 | - six=1.11.0=py36h4db2310_1 213 | - snappy=1.1.7=h777316e_3 214 | - snowballstemmer=1.2.1=py36h763602f_0 215 | - sortedcollections=0.6.1=py36_0 216 | - sortedcontainers=1.5.10=py36_0 217 | - sphinx=1.7.4=py36_0 218 | - sphinxcontrib=1.0=py36hbbac3d2_1 219 | - sphinxcontrib-websupport=1.0.1=py36hb5e5916_1 220 | - spyder=3.2.8=py36_0 221 | - sqlalchemy=1.2.7=py36ha85dd04_0 222 | - sqlite=3.23.1=h35aae40_0 223 | - statsmodels=0.9.0=py36h452e1ab_0 224 | - sympy=1.1.1=py36h96708e0_0 225 | - tblib=1.3.2=py36h30f5020_0 226 | - terminado=0.8.1=py36_1 227 | - testpath=0.3.1=py36h2698cfe_0 228 | - tk=8.6.7=hcb92d03_3 229 | - toolz=0.9.0=py36_0 230 | - tornado=5.0.2=py36_0 231 | - traitlets=4.3.2=py36h096827d_0 232 | - typing=3.6.4=py36_0 233 | - unicodecsv=0.14.1=py36h6450c06_0 234 | - urllib3=1.22=py36h276f60a_0 235 | - vc=14=h0510ff6_3 236 | - vs2015_runtime=14.0.25123=3 237 | - wcwidth=0.1.7=py36h3d5aa90_0 238 | - webencodings=0.5.1=py36h67c50ae_1 239 | - werkzeug=0.14.1=py36_0 240 | - wheel=0.31.1=py36_0 241 | - widgetsnbextension=3.2.1=py36_0 242 | - win_inet_pton=1.0.1=py36he67d7fd_1 243 | - win_unicode_console=0.5=py36hcdbd4b5_0 244 | - wincertstore=0.2=py36h7fe50ca_0 245 | - winpty=0.4.3=4 246 | - wrapt=1.10.11=py36he5f5981_0 247 | - xlrd=1.1.0=py36h1cb58dc_1 248 | - xlsxwriter=1.0.4=py36_0 249 | - 
xlwings=0.11.8=py36_0 250 | - xlwt=1.3.0=py36h1a4751e_0 251 | - yaml=0.1.7=hc54c509_2 252 | - zeromq=4.2.5=hc6251cf_0 253 | - zict=0.1.3=py36h2d8e73e_0 254 | - zlib=1.2.11=h8395fce_2 255 | - pip: 256 | - tables==3.4.3 257 | prefix: D:\Anaconda3 258 | 259 | -------------------------------------------------------------------------------- /helper.py: -------------------------------------------------------------------------------- 1 | import json 2 | import re, csv 3 | from wordcloud import WordCloud, STOPWORDS 4 | from matplotlib import pyplot as plt 5 | from process_text import * 6 | import pandas as pd 7 | import numpy as np 8 | 9 | 10 | 11 | def load_data(file_name): 12 | """ 13 | Open the saved json data file and load the data into a dict. 14 | 15 | Parameters: 16 | file_name: the saved file name, e.g. "machine_learning_engineer.json" 17 | 18 | Returns: 19 | postings_dict: data in dict format 20 | 21 | """ 22 | 23 | with open(file_name, 'r') as f: 24 | postings_dict = json.load(f) 25 | return postings_dict 26 | 27 | 28 | 29 | def plot_wc(text, max_words=200, stopwords_list=[], to_file_name=None): 30 | """ 31 | Make a word cloud plot using the given text. 32 | 33 | Parameters: 34 | text -- the text as a string 35 | 36 | Returns: 37 | None 38 | """ 39 | wordcloud = WordCloud().generate(text) 40 | stopwords = set(STOPWORDS) 41 | stopwords.update(stopwords_list) 42 | 43 | wordcloud = WordCloud(background_color='white', 44 | stopwords=stopwords, 45 | #prefer_horizontal=1, 46 | max_words=max_words, 47 | min_font_size=6, 48 | scale=1, 49 | width = 800, height = 800, 50 | random_state=8).generate(text) 51 | 52 | plt.figure(figsize=[16,12]) 53 | plt.imshow(wordcloud, interpolation="bilinear") 54 | plt.axis("off") 55 | plt.show() 56 | 57 | if to_file_name: 58 | to_file_name = to_file_name + ".png" 59 | wordcloud.to_file(to_file_name) 60 | 61 | 62 | 63 | def plot_profile(title, 64 | first_n_postings, 65 | max_words=200, 66 | return_posting=False, 67 | return_tokens=False, 68 | return_text_list=False): 69 | """ 70 | Loads the corresponding json file, extracts the first_n job postings and plot the wordcloud profile. 71 | 72 | Parameters: 73 | title: the job title such as "data scientist" 74 | first_n_postings: int, the first n job postings to use for the plot. 75 | 76 | Returns: 77 | nth_posting: the nth job posting as a string. This helps to verify the first_n_postings param used. 78 | 79 | """ 80 | # Convert title to full file name then load the data 81 | file_name = '_'.join(title.split()) + '.json' 82 | data = load_data(file_name) 83 | 84 | # Only of the two can be True 85 | if (return_posting + return_tokens + return_text_list) >= 2: 86 | print('You can only return one of these: a posting, tokens, text list! 
\nPlease try again.') 87 | return None 88 | 89 | if return_posting: 90 | n_posting = data[str(first_n_postings)] 91 | return n_posting 92 | 93 | text_list = make_text_list(data, first_n_postings) 94 | 95 | if return_text_list: 96 | return text_list 97 | elif return_tokens: 98 | tokens = tokenize_list(text_list, return_string=False) 99 | return tokens 100 | else: 101 | # Get the tokens joined as a string 102 | text = tokenize_list(text_list, return_string=True) 103 | # Get the stop words to use 104 | with open('stopwords.csv', 'r', newline='') as f: 105 | reader = csv.reader(f) 106 | stop_list = list(reader)[0] 107 | to_file_name = '_'.join(title.split()) 108 | plot_wc(text, max_words, stopwords_list=stop_list, to_file_name=to_file_name) 109 | 110 | 111 | 112 | def plot_title(df, title, save_figure=False): 113 | """ 114 | Plots the skill frequencies of all skill categories for a given title. 115 | 116 | Params: 117 | df: (pandas df) the frequency df 118 | title: (str) one of the three job titles: 119 | 'data scientist', 'machine learning engineer', 'data engineer' 120 | 121 | Returns: 122 | None 123 | 124 | """ 125 | categories = df.category.unique() 126 | titles = list(df.title.unique()) 127 | 128 | # Ensure input is valid 129 | if title.title() not in titles: 130 | print('Title invalid. Please try again!') 131 | return None 132 | title = title.title() 133 | # Subset df to the given title 134 | df_title = df.query('title==@title') 135 | # Set up the parameters for the plotting grid 136 | nrows=4 137 | ncols=2 138 | figsize = (15, 20) 139 | # Add a dummy category name to match the grid 140 | categories = np.append(categories, 'Empty').reshape(4, 2) 141 | 142 | # Generate the plotting objects 143 | fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize) 144 | 145 | # Loop thru the axes of the figure 146 | for row in range(nrows): 147 | for col in range(ncols): 148 | cat = categories[row, col] 149 | # Subset to one category for each subplot 150 | df_cat = df_title.query('category==@cat') 151 | df_cat = df_cat.sort_values(by='frequency', ascending=False) 152 | # Find the correspoinding axis in axes 153 | ax = axes[row, col] 154 | # Handle errors for the empty last subplot 155 | try: 156 | df_cat.plot(x='skill', y='frequency', kind='bar', ax=ax) 157 | ax.set(title=cat, xlabel='', ylabel='Frequency') 158 | ax.get_legend().remove() # remove legend 159 | for tick in ax.get_xticklabels(): 160 | tick.set_rotation(60) 161 | except: 162 | fig.delaxes(ax) 163 | 164 | # Add the figure title 165 | fig_title = title + ' Skills Distribution' 166 | fig.suptitle(fig_title, y=0.92, verticalalignment='bottom', fontsize=30) 167 | plt.subplots_adjust(hspace=0.9) # make sure the figure title doesn't overlap with subplot titles 168 | plt.show() 169 | 170 | if save_figure: 171 | figure_name = fig_title + '.png' 172 | fig.savefig(figure_name) 173 | 174 | 175 | 176 | def plot_skill(df, cat, save_figure=False): 177 | """ 178 | Plots the skill frequencies of all job titles for a given skill category. 179 | 180 | Params: 181 | df: (pandas df) the frequency df 182 | cat: (str) one of the seven skill categories: 183 | 'Programming Languages', 'Big Data Technologies'... 184 | 185 | Returns: 186 | None 187 | 188 | """ 189 | categories = list(df.category.unique()) 190 | titles = list(df.title.unique()) 191 | 192 | if cat.title() not in categories: 193 | print('Category invalid. 
Please try again!')
194 |         return None
195 |     cat = cat.title()
196 | 
197 |     # Subset df to the given category
198 |     df_cat = df.query('category==@cat')
199 | 
200 |     # Set up the parameters for the plotting grid
201 |     nrows = len(titles)
202 |     ncols = 1
203 |     figsize = (10, 12)
204 | 
205 |     # Generate the plotting objects
206 |     fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
207 | 
208 |     # Loop thru the axes of the figure
209 |     for row in range(nrows):
210 |         title = titles[row]
211 |         # Subset to one title for each subplot
212 |         df_title = df_cat.query('title==@title')
213 |         df_title = df_title.sort_values(by='frequency', ascending=False)
214 |         # Find the corresponding axis in axes
215 |         ax = axes[row]
216 |         df_title.plot(x='skill', y='frequency', kind='bar', ax=ax)
217 |         ax.set(title=title, xlabel='', ylabel='Frequency')
218 |         ax.get_legend().remove()  # remove legend
219 |         for tick in ax.get_xticklabels():
220 |             tick.set_rotation(30)
221 | 
222 |     # Add the figure title
223 |     fig_title = cat + ' Distribution'
224 |     fig.suptitle(fig_title, y=0.95, verticalalignment='baseline', fontsize=30)
225 |     plt.subplots_adjust(hspace=0.36)  # make sure the figure title doesn't overlap with subplot titles
226 |     plt.show()
227 | 
228 |     if save_figure:
229 |         figure_name = fig_title + '.png'
230 |         fig.savefig(figure_name)
--------------------------------------------------------------------------------
/process_text.py:
--------------------------------------------------------------------------------
1 | from string import digits
2 | #from nltk import word_tokenize
3 | import re
4 | from nltk.corpus import stopwords
5 | from nltk.stem.snowball import SnowballStemmer
6 | 
7 | 
8 | 
9 | def make_text_list(postings_dict, first_n_postings=100):
10 |     """
11 |     Extract the texts from postings_dict into a list of strings
12 | 
13 |     Parameters:
14 |         postings_dict: (dict) the loaded job postings, keyed by the posting index as a string
15 |         first_n_postings: (int) number of postings to extract, starting from index 0
16 | 
17 |     Returns:
18 |         text_list: list of job posting texts
19 | 
20 |     """
21 | 
22 |     text_list = []
23 |     for i in range(0, first_n_postings+1):
24 |         # Since some indices could be missing due to errors in scraping,
25 |         # handle the exception here to keep the loop error-free
26 |         try:
27 |             text_list.append(postings_dict[str(i)]['posting'])
28 |         except:
29 |             continue
30 | 
31 |     return text_list
32 | 
33 | 
34 | 
35 | def remove_digits(token):
36 |     """
37 |     Remove digits from a token
38 | 
39 |     Params:
40 |         token: (str) a string token
41 | 
42 |     Returns:
43 |         cleaned_token: (str) the cleaned token
44 | 
45 |     """
46 |     # Remove digits from the token
47 |     remove_digits = str.maketrans('', '', digits)
48 |     token = token.translate(remove_digits)
49 |     return token
50 | 
51 | 
52 | 
53 | def tokenize_text(text, stem=False):
54 |     """
55 |     Tokenize, stem and remove stop words for the given text
56 | 
57 |     Parameters:
58 |         text: a text string
59 | 
60 |     Returns:
61 |         tokens: the processed text as a list of tokens
62 |     """
63 |     stop_words = set(stopwords.words('english'))
64 |     #tokens = word_tokenize(text.lower())
65 | 
66 |     # Change "C++" to "Cpp" to avoid being removed below
67 |     #tokens = ['cpp' if token=='c++' else token for token in tokens]
68 |     # Same with C#
69 |     #tokens = ['csharp' if token=='c#' else token for token in tokens]
70 |     # Remove digits
71 |     #tokens = [remove_digits(token) for token in tokens]
72 |     # Remove non-alphabetic tokens and stopwords
73 |     #tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
74 | 
75 |     # Use Regex to tokenize
76 |     # Replace any non-word characters except .+# with a space
77 |     text = re.sub(r"[^\w.+#]", " ", text)
78 |     # Two cases to replace with a space:
79 |     # Case 1: \d+\.?\d+\s -- any number of digits followed by a space, with or without
80 |     #         a dot in between
81 |     # Case 2: \d+\+ -- any number of digits followed by a plus sign
82 |     text = re.sub(r"\d+\.?\d+\s|\d+\+", " ", text)
83 |     tokens = text.lower().split()
84 |     tokens = [token for token in tokens if token not in stop_words]
85 | 
86 |     # Stem tokens
87 |     if stem:
88 |         stemmer = SnowballStemmer("english")
89 |         tokens = [stemmer.stem(i) for i in tokens]
90 | 
91 |     return tokens
92 | 
93 | 
94 | 
95 | def tokenize_list(text_list, stem=False, return_string=False):
96 |     """
97 |     Tokenize the given list of texts and then combine the lists of tokens for plotting
98 | 
99 |     Parameters:
100 |         text_list -- list of job posting strings
101 | 
102 |     Returns:
103 |         text -- a single text string (if return_string), otherwise the list of all tokens
104 |     """
105 |     # Split the text based on slash, space and newline, then take set
106 |     #text = [set(re.split('/| |\n|', i)) for i in text]
107 |     #text = [set(re.split('\W', i)) for i in text_list]
108 | 
109 |     text_list_tokenized = [tokenize_text(text=i, stem=stem) for i in text_list]
110 | 
111 |     tokens = []
112 |     # Combine all token lists into one big list of tokens
113 |     for i in text_list_tokenized:
114 |         tokens += i
115 | 
116 |     if return_string:
117 |         text = ' '.join(tokens)
118 |         return text
119 | 
120 |     # Return the list of all tokens
121 |     return tokens
122 | 
123 | 
124 | 
125 | def check_freq(dict_to_check, text_list):
126 |     """
127 |     Checks each given skill's frequency in a list of posting strings.
128 | 
129 |     Params:
130 |         dict_to_check: (dict) a dict of skill strings to check frequency for, format:
131 |             {'languages': ['Python', 'R'..],
132 |              'big data': ['AWS', 'Azure'...],
133 |              ..}
134 |         text_list: (list) a list of posting strings to search in
135 | 
136 |     Returns:
137 |         freq: (dict) frequency counts
138 | 
139 |     """
140 |     freq = {}
141 | 
142 |     # Join the text together and convert words to lowercase
143 |     text = ' '.join(text_list).lower()
144 | 
145 |     for category, skill_list in dict_to_check.items():
146 |         # Initialize each category as a dictionary
147 |         freq[category] = {}
148 |         for skill in skill_list:
149 |             if len(skill) == 1:  # pad single letter skills such as "R" with spaces
150 |                 skill_name = ' ' + skill.lower() + ' '
151 |             else:
152 |                 skill_name = skill.lower()
153 |             freq[category][skill] = text.count(skill_name)
154 | 
155 |     return freq
156 | 
--------------------------------------------------------------------------------
/scrape_data.py:
--------------------------------------------------------------------------------
1 | import re
2 | import json
3 | from bs4 import BeautifulSoup
4 | from selenium import webdriver
5 | 
6 | 
7 | 
8 | def get_soup(url):
9 |     """
10 |     Given the url of a page, this function returns the soup object.
11 | 
12 |     Parameters:
13 |         url: the link to get soup object for
14 | 
15 |     Returns:
16 |         soup: soup object
17 |     """
18 |     driver = webdriver.Firefox()
19 |     driver.get(url)
20 |     html = driver.page_source
21 |     soup = BeautifulSoup(html, 'html.parser')
22 |     driver.close()
23 | 
24 |     return soup
25 | 
26 | 
27 | 
28 | def grab_job_links(soup):
29 |     """
30 |     Grab all non-sponsored job posting links from an Indeed search result page using the given soup object
31 | 
32 |     Parameters:
33 |         soup: the soup object corresponding to a search result page
34 |             e.g. https://ca.indeed.com/jobs?q=data+scientist&l=Toronto&start=20
35 | 
36 |     Returns:
37 |         urls: a python list of job posting urls
38 | 
39 |     """
40 |     urls = []
41 | 
42 |     # Loop thru all the posting links
43 |     for link in soup.find_all('h2', {'class': 'jobtitle'}):
44 |         # Sponsored postings use an "a target" attribute rather than "a href", so they are skipped automatically
45 |         partial_url = link.a.get('href')
46 |         # This is a partial url, we need to attach the prefix
47 |         url = 'https://ca.indeed.com' + partial_url
48 |         # Collect the full posting url
49 |         urls.append(url)
50 | 
51 |     return urls
52 | 
53 | 
54 | 
55 | def get_urls(query, num_pages, location):
56 |     """
57 |     Get all the job posting URLs resulting from a specific search.
58 | 
59 |     Parameters:
60 |         query: job title to query
61 |         num_pages: number of pages needed
62 |         location: city to search in
63 | 
64 |     Returns:
65 |         urls: a list of job posting URLs (when num_pages is valid)
66 |         max_pages: maximum number of pages allowed (when num_pages is invalid)
67 |     """
68 |     # We always need the first page
69 |     base_url = 'https://ca.indeed.com/jobs?q={}&l={}'.format(query, location)
70 |     soup = get_soup(base_url)
71 |     urls = grab_job_links(soup)
72 | 
73 |     # Get the total number of postings found
74 |     posting_count_string = soup.find(name='div', attrs={'id':"searchCount"}).get_text()
75 |     posting_count_string = posting_count_string[posting_count_string.find('of')+2:].strip()
76 |     #print('posting_count_string: {}'.format(posting_count_string))
77 |     #print('type is: {}'.format(type(posting_count_string)))
78 | 
79 |     try:
80 |         posting_count = int(posting_count_string)
81 |     except ValueError:  # deal with special case when parsed string is "360 jobs"
82 |         try:
83 |             posting_count = int(re.search(r'\d+', posting_count_string).group(0))
84 |         except (AttributeError, ValueError):
85 |             posting_count = 330  # fall back to 330 when unable to get the total
86 |     #print('posting_count: {}'.format(posting_count))
87 |     #print('\ntype: {}'.format(type(posting_count)))
88 | 
89 |     # Limit the number of pages to get
90 |     max_pages = round(posting_count / 10) - 3
91 |     if num_pages > max_pages:
92 |         print('returning max_pages!!')
93 |         return max_pages
94 | 
95 |     # Additional work is needed when more than 1 page is requested
96 |     if num_pages >= 2:
97 |         # Start loop from page 2 since page 1 has been dealt with above
98 |         for i in range(2, num_pages+1):
99 |             num = (i-1) * 10
100 |             base_url = 'https://ca.indeed.com/jobs?q={}&l={}&start={}'.format(query, location, num)
101 |             try:
102 |                 soup = get_soup(base_url)
103 |                 # We always combine the results back to the list
104 |                 urls += grab_job_links(soup)
105 |             except:
106 |                 continue
107 | 
108 |     # Check to ensure the number of urls gotten is correct
109 |     #assert len(urls) == num_pages * 10, "There are missing job links, check code!"
110 | 
111 |     return urls
112 | 
113 | 
114 | 
115 | def get_posting(url):
116 |     """
117 |     Get the text portion, including both the title and the job description, of the job posting at a given url
118 | 
119 |     Parameters:
120 |         url: The job posting link
121 | 
122 |     Returns:
123 |         title: the job title, lower-cased
124 |         posting: the job posting content, lower-cased
125 |     """
126 |     # Get the url content as BS object
127 |     soup = get_soup(url)
128 | 
129 |     # The job title is held in the h3 tag
130 |     title = soup.find(name='h3').getText().lower()
131 |     posting = soup.find(name='div', attrs={'class': "jobsearch-JobComponent"}).get_text()
132 | 
133 |     return title, posting.lower()
134 | 
135 | 
136 |     #if 'data scientist' in title:    # We'll proceed to grab the job posting text if the title is correct
137 |         # All the text info is contained in the div element with the below class, extract the text.
138 |         #posting = soup.find(name='div', attrs={'class': "jobsearch-JobComponent"}).get_text()
139 |         #return title, posting.lower()
140 |     #else:
141 |         #return False
142 | 
143 |     # Get rid of numbers and symbols other than given
144 |     #text = re.sub("[^a-zA-Z'+#&]", " ", text)
145 |     # Convert to lower case and split to list and then set
146 |     #text = text.lower().strip()
147 | 
148 |     #return text
149 | 
150 | 
151 | 
152 | def get_data(query, num_pages, location='Toronto'):
153 |     """
154 |     Get all the job posting data and save it in a json file using the structure below:
155 | 
156 |         {<index>: {'title': ..., 'posting': ..., 'url': ...}, ...}
157 | 
158 |     The json file name has this format: "<query>.json"
159 | 
160 |     Parameters:
161 |         query: Indeed query keyword such as 'Data Scientist'
162 |         num_pages: Number of search result pages needed
163 |         location: location to search for
164 | 
165 |     Returns:
166 |         postings_dict: Python dict including all posting data
167 | 
168 |     """
169 |     # Convert the queried title to Indeed format
170 |     query = '+'.join(query.lower().split())
171 | 
172 |     postings_dict = {}
173 |     urls = get_urls(query, num_pages, location)
174 | 
175 |     # Continue only if the requested number of pages is valid (when invalid, a number is returned instead of a list)
176 |     if isinstance(urls, list):
177 |         num_urls = len(urls)
178 |         for i, url in enumerate(urls):
179 |             try:
180 |                 title, posting = get_posting(url)
181 |                 postings_dict[i] = {}
182 |                 postings_dict[i]['title'], postings_dict[i]['posting'], postings_dict[i]['url'] = \
183 |                     title, posting, url
184 |             except:
185 |                 continue
186 | 
187 |             percent = (i+1) / num_urls
188 |             # Print the progress; the "end" arg keeps the message on the same line
189 |             print("Progress: {:2.0f}%".format(100*percent), end='\r')
190 | 
191 |         # Save the dict as a json file
192 |         file_name = query.replace('+', '_') + '.json'
193 |         with open(file_name, 'w') as f:
194 |             json.dump(postings_dict, f)
195 | 
196 |         print('All {} postings have been scraped and saved!'.format(num_urls))
197 |         #return postings_dict
198 |     else:
199 |         print("Due to similar results, maximum number of pages is only {}. Please try again!".format(urls))
200 | 
201 | 
202 | 
203 | # If script is run directly, we'll take input from the user
204 | if __name__ == "__main__":
205 |     queries = ["data scientist", "machine learning engineer", "data engineer"]
206 | 
207 |     while True:
208 |         query = input("Please enter the title to scrape data for: \n").lower()
209 |         if query in queries:
210 |             break
211 |         else:
212 |             print("Invalid title! 
Please try again.") 213 | 214 | while True: 215 | num_pages = input("Please enter the number of pages needed (integer only): \n") 216 | try: 217 | num_pages = int(num_pages) 218 | break 219 | except: 220 | print("Invalid number of pages! Please try again.") 221 | 222 | get_data(query, num_pages, location='Toronto') 223 | 224 | -------------------------------------------------------------------------------- /stopwords.csv: -------------------------------------------------------------------------------- 1 | "experience","job","work","working","skills","new","company","years","technology","ago","save","jobapply","nowapply","using","strong","ability","days","knowledge","opportunity","tools","related","including","original","understanding","us","role","degree","one","requirements","canada","required","toronto","world","provide","industry","help","saying","reviewsread","looking","preferred","sitesave","applicants","applications","part","field","etc","apply","across","position","life","application","employment","best","key","use","well","following","please","like","opportunities","within","nowsave","drive","qualifications","responsibilities","employees","global","must","equal","able","various","join","candidate","high","needs","education","time","meet","need",,"status","accommodation","diverse","successful","may","background","candidates","language","good","excellent","career","also","level","employer","flexible","companies","canadian","want","culture","grow","closely","available","relevant","diversity","approaches","group","used","demonstrated","full","languages","top","professional","multiple","type","description","based","sources","disability","location","day","current","take","national","highly","events","gender","individuals","variety","better","order","similar","concepts","effectively","way","offer","record","great","sets","different","next","human","include","ensure","plus","ontario","minimum","every","disabilities","data","team","benefit","understand","onapply","applying","benefits","around","office","require","future","asset","real","contribute","review","hand","responsible" 2 | --------------------------------------------------------------------------------