├── README.md
├── data_wrangling.ipynb
├── env_ideal_profiles.yaml
├── helper.py
├── ideal_profiles.ipynb
├── ideal_profiles_2.ipynb
├── process_text.py
├── scrape_data.py
└── stopwords.csv
/README.md:
--------------------------------------------------------------------------------
1 | # Ideal Profiles
2 | What does an ideal Data Scientist's profile look like? This project aims to provide a quantitative answer based on job postings. I scraped job posting data from Indeed and analyzed the frequencies of various Data Science skills. The analysis can be used not only as an objective keyword reference for resume optimization, but also as a Data Science learning road map!
3 |
4 | The related Medium posts are:
5 | - [What Does an Ideal Data Scientist’s Profile Look Like?](https://towardsdatascience.com/what-does-an-ideal-data-scientists-profile-look-like-7d7bd78ff7ab)
6 | - [Navigating the Data Science Careers Landscape](https://hackernoon.com/navigating-the-data-science-career-landscape-db746a61ac62)
7 | - [Scraping Job Posting Data from Indeed using Selenium and BeautifulSoup](https://towardsdatascience.com/scraping-job-posting-data-from-indeed-using-selenium-and-beautifulsoup-dfc86230baac)
8 | - [Building an End-To-End Data Science Project](https://towardsdatascience.com/building-an-end-to-end-data-science-project-28e853c0cae3)
9 |
10 |
11 | ## How to Use
12 | If you want to run the code locally, download the repo, build your Anaconda environment using the `env_ideal_profiles.yaml` file, and download geckodriver (see Requirements below). Then start data scraping by running `python scrape_data.py` in Anaconda Prompt. Once you have the raw data, clean it with the `data_wrangling.ipynb` Jupyter Notebook. Finally, use the `ideal_profiles_2.ipynb` Notebook to make various plots. Refer to the Files list below for the roles of the different files.
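If you prefer to drive the pipeline from Python instead of the notebooks, the sketch below shows one possible flow using the functions already defined in this repo (`get_data`, `plot_profile` and `check_freq`); the skill lists here are only illustrative:

```python
from scrape_data import get_data       # scraping postings from Indeed.ca
from helper import plot_profile        # loading saved postings / plotting
from process_text import check_freq    # counting skill keyword frequencies

# 1. Scrape ~10 result pages of postings; this saves data_scientist.json
get_data('Data Scientist', num_pages=10, location='Toronto')

# 2. Load the saved postings back as a list of raw text strings
text_list = plot_profile('Data Scientist', first_n_postings=100,
                         return_text_list=True)

# 3. Count how often each skill keyword appears across the postings
skills = {'Programming Languages': ['Python', 'R', 'SQL'],
          'Cloud Computing Platforms': ['AWS', 'GCP', 'Azure']}
print(check_freq(dict_to_check=skills, text_list=text_list))
```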
13 |
14 |
15 | ## Requirements
16 | - Windows 10 OS
17 | - Firefox Web Browser 63.0.3
18 | - Anaconda 3
19 | - geckodriver v0.22.0 (geckodriver-v0.22.0-win64.zip, available [here](https://github.com/mozilla/geckodriver/releases))
20 | - pandas (see the yaml file for version number, same below)
21 | - numpy
22 | - matplotlib
23 | - json
24 | - re
25 | - csv
26 | - wordcloud
27 | - nltk
28 | - bs4 (BeautifulSoup)
29 | - selenium
30 |
31 |
32 | ## Files
33 | - `scrape_data.py`: scrapes the data from Indeed.ca
34 | - `process_text.py`: performs various text-related operations such as removing digits, tokenizing, and checking term frequency
35 | - `helper.py`: contains data loading and various plotting functions
36 | - `data_wrangling.ipynb`: gathers the raw text data, counts term frequency and stores the result in a pandas dataframe
37 | - `ideal_profiles.ipynb`: creates spider plots to visualize various Data Science roles' skill requirements based on intuition
38 | - `ideal_profiles_2.ipynb`: creates skill distribution and word cloud plots to represent ideal profiles quantitatively
39 | - `stopwords.csv`: contains the stop words for word cloud plotting
40 | - `env_ideal_profiles.yaml`: the Anaconda environment file for setting up the project environment
41 |
42 |
43 | ## Contribute
44 | Any contribution is welcome!
45 |
46 |
47 | ## To-do's
48 | - Allow querying Indeed USA instead of the Canadian site and increase the number of postings to scrape
49 | - Allow showing context for specific words in word clouds
50 | - Update all docstrings and comments
51 | - OOP
52 | - Code refactoring - single responsibility principle for functions
53 | - Add Data Analyst and AI Engineer roles
54 | - Allow showing the Percentage of Mentions for a certain skill, i.e., out of 1,000 job postings, what proportion mentions the given skill? (see the sketch below)
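
A possible way to implement the last item, assuming the raw posting texts have been loaded with `plot_profile(..., return_text_list=True)`; the `mention_rate` helper below is hypothetical and not part of the repo yet:

```python
def mention_rate(skill, text_list):
    """Return the proportion of postings that mention the given skill at least once."""
    skill = skill.lower()
    hits = sum(1 for posting in text_list if skill in posting.lower())
    return hits / len(text_list)

# e.g. mention_rate('TensorFlow', text_list) == 0.12 would mean 12% of postings mention it
```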
55 |
56 |
57 | ## License
58 | [MIT](https://opensource.org/licenses/MIT)
59 |
--------------------------------------------------------------------------------
/data_wrangling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pandas as pd\n",
10 | "#import nltk\n",
11 | "from matplotlib import pyplot as plt\n",
12 | "from scrape_data import *\n",
13 | "from process_text import *\n",
14 | "from helper import *"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 2,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "# Initialize the dict to store all text lists for below titles\n",
24 | "text_lists = {}\n",
25 | "titles = ['Data Scientist', 'Machine Learning Engineer', 'Data Engineer']\n",
26 | "# Grab the tokens list and store them in the dict\n",
27 | "for title in titles:\n",
28 | " text_lists[title] = plot_profile(title=title, first_n_postings=120, return_text_list=True)"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 3,
34 | "metadata": {},
35 | "outputs": [],
36 | "source": [
37 | "# Make the dict of skills to investigate\n",
38 | "\n",
39 | "languages = ['Python', 'R', 'SQL', 'Java', 'C', 'C++', 'C#', 'Scala', 'Perl', 'Julia', \n",
40 | " 'Javascript', 'HTML', 'CSS', 'PHP', 'Ruby', 'Lua', 'MATLAB', 'SAS'] \n",
41 | "\n",
42 | "big_data = ['Hadoop', 'MapReduce', 'Hive', 'Pig', 'Cascading', 'Scalding', 'Cascalog', 'HBase', 'Sqoop', \n",
43 | " 'Mahout', 'Oozie', 'Flume', 'ZooKeeper', 'Spark', 'Storm', 'Shark', 'Impala', 'Elasticsearch', \n",
44 | " 'Kafka', 'Flink', 'Kinesis', 'Presto', 'Hume', 'Airflow', 'Azkabhan', 'Luigi', 'Cassandra']\n",
45 | "\n",
46 | "dl = ['TensorFlow', 'Keras', 'PyTorch', 'Theano', 'Deeplearning4J', 'Caffe', 'TFLearn', 'Torch', \n",
47 | " 'OpenCV', 'MXNet', 'Microsoft Cognitive Toolkit', 'Lasagne']\n",
48 | "\n",
49 | "cloud = ['AWS', 'GCP', 'Azure']\n",
50 | "\n",
51 | "ml = ['Natural Language Processing', 'Computer Vision', 'Speech Recognition', 'Fraud Detection',\n",
52 | " 'Recommender System', 'Image Recognition', 'Object Dectection', 'Chatbot', 'Sentiment Analysis']\n",
53 | "\n",
54 | "visualization = ['Dimple', 'D3.js', 'Ggplot', 'Shiny', 'Plotly', 'Matplotlib', 'Seaborn', \n",
55 | " 'Bokeh', 'Tableau']\n",
56 | "\n",
57 | "other = ['Pandas', 'Numpy', 'Scipy', 'Sklearn', 'Scikit-Learn', 'Docker', 'Git', 'Jira', 'Kaggle']\n",
58 | "\n",
59 | "dict_to_check = {'Programming Languages': languages,\n",
60 | " 'Big Data Technologies': big_data,\n",
61 | " 'Deep Learning Frameworks': dl,\n",
62 | " 'Cloud Computing Platforms': cloud,\n",
63 | " 'Machine Learning Application': ml,\n",
64 | " 'Visualization Tools': visualization,\n",
65 | " 'Other': other}"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 4,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "# Check the frequency and store in dict\n",
75 | "freq_dict = {}\n",
76 | "for title in text_lists.keys():\n",
77 | " freq_dict[title] = check_freq(dict_to_check=dict_to_check, text_list=text_lists[title])"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 5,
83 | "metadata": {},
84 | "outputs": [
85 | {
86 | "data": {
257 | "text/plain": [
258 | " Python R SQL Java C C++ \\\n",
259 | "Data Engineer Big Data Technologies NaN NaN NaN NaN NaN NaN \n",
260 | " Cloud Computing Platforms NaN NaN NaN NaN NaN NaN \n",
261 | " Deep Learning Frameworks NaN NaN NaN NaN NaN NaN \n",
262 | " Machine Learning Application NaN NaN NaN NaN NaN NaN \n",
263 | " Other NaN NaN NaN NaN NaN NaN \n",
264 | "\n",
265 | " C# Scala Perl Julia ... \\\n",
266 | "Data Engineer Big Data Technologies NaN NaN NaN NaN ... \n",
267 | " Cloud Computing Platforms NaN NaN NaN NaN ... \n",
268 | " Deep Learning Frameworks NaN NaN NaN NaN ... \n",
269 | " Machine Learning Application NaN NaN NaN NaN ... \n",
270 | " Other NaN NaN NaN NaN ... \n",
271 | "\n",
272 | " Tableau Pandas Numpy Scipy \\\n",
273 | "Data Engineer Big Data Technologies NaN NaN NaN NaN \n",
274 | " Cloud Computing Platforms NaN NaN NaN NaN \n",
275 | " Deep Learning Frameworks NaN NaN NaN NaN \n",
276 | " Machine Learning Application NaN NaN NaN NaN \n",
277 | " Other NaN 3.0 0.0 2.0 \n",
278 | "\n",
279 | " Sklearn Scikit-Learn Docker \\\n",
280 | "Data Engineer Big Data Technologies NaN NaN NaN \n",
281 | " Cloud Computing Platforms NaN NaN NaN \n",
282 | " Deep Learning Frameworks NaN NaN NaN \n",
283 | " Machine Learning Application NaN NaN NaN \n",
284 | " Other 0.0 0.0 8.0 \n",
285 | "\n",
286 | " Git Jira Kaggle \n",
287 | "Data Engineer Big Data Technologies NaN NaN NaN \n",
288 | " Cloud Computing Platforms NaN NaN NaN \n",
289 | " Deep Learning Frameworks NaN NaN NaN \n",
290 | " Machine Learning Application NaN NaN NaN \n",
291 | " Other 29.0 1.0 0.0 \n",
292 | "\n",
293 | "[5 rows x 87 columns]"
294 | ]
295 | },
296 | "execution_count": 5,
297 | "metadata": {},
298 | "output_type": "execute_result"
299 | }
300 | ],
301 | "source": [
302 | "# Convert the dict to a pandas df\n",
303 | "df = pd.DataFrame.from_dict({(i,j): freq_dict[i][j] \n",
304 | " for i in freq_dict.keys()\n",
305 | " for j in freq_dict[i].keys()},\n",
306 | " orient='index')\n",
307 | "df.head()"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": 6,
313 | "metadata": {},
314 | "outputs": [
315 | {
316 | "data": {
485 | "text/plain": [
486 | " level_0 level_1 Python R SQL Java C \\\n",
487 | "0 Data Engineer Big Data Technologies NaN NaN NaN NaN NaN \n",
488 | "1 Data Engineer Cloud Computing Platforms NaN NaN NaN NaN NaN \n",
489 | "2 Data Engineer Deep Learning Frameworks NaN NaN NaN NaN NaN \n",
490 | "3 Data Engineer Machine Learning Application NaN NaN NaN NaN NaN \n",
491 | "4 Data Engineer Other NaN NaN NaN NaN NaN \n",
492 | "\n",
493 | " C++ C# Scala ... Tableau Pandas Numpy Scipy Sklearn \\\n",
494 | "0 NaN NaN NaN ... NaN NaN NaN NaN NaN \n",
495 | "1 NaN NaN NaN ... NaN NaN NaN NaN NaN \n",
496 | "2 NaN NaN NaN ... NaN NaN NaN NaN NaN \n",
497 | "3 NaN NaN NaN ... NaN NaN NaN NaN NaN \n",
498 | "4 NaN NaN NaN ... NaN 3.0 0.0 2.0 0.0 \n",
499 | "\n",
500 | " Scikit-Learn Docker Git Jira Kaggle \n",
501 | "0 NaN NaN NaN NaN NaN \n",
502 | "1 NaN NaN NaN NaN NaN \n",
503 | "2 NaN NaN NaN NaN NaN \n",
504 | "3 NaN NaN NaN NaN NaN \n",
505 | "4 0.0 8.0 29.0 1.0 0.0 \n",
506 | "\n",
507 | "[5 rows x 89 columns]"
508 | ]
509 | },
510 | "execution_count": 6,
511 | "metadata": {},
512 | "output_type": "execute_result"
513 | }
514 | ],
515 | "source": [
516 | "# Reset the index to include both title and category as columns\n",
517 | "df = df.reset_index()\n",
518 | "df.head()"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": 7,
524 | "metadata": {},
525 | "outputs": [
526 | {
527 | "data": {
696 | "text/plain": [
697 | " title category Python R SQL Java C \\\n",
698 | "0 Data Engineer Big Data Technologies NaN NaN NaN NaN NaN \n",
699 | "1 Data Engineer Cloud Computing Platforms NaN NaN NaN NaN NaN \n",
700 | "2 Data Engineer Deep Learning Frameworks NaN NaN NaN NaN NaN \n",
701 | "3 Data Engineer Machine Learning Application NaN NaN NaN NaN NaN \n",
702 | "4 Data Engineer Other NaN NaN NaN NaN NaN \n",
703 | "\n",
704 | " C++ C# Scala ... Tableau Pandas Numpy Scipy Sklearn \\\n",
705 | "0 NaN NaN NaN ... NaN NaN NaN NaN NaN \n",
706 | "1 NaN NaN NaN ... NaN NaN NaN NaN NaN \n",
707 | "2 NaN NaN NaN ... NaN NaN NaN NaN NaN \n",
708 | "3 NaN NaN NaN ... NaN NaN NaN NaN NaN \n",
709 | "4 NaN NaN NaN ... NaN 3.0 0.0 2.0 0.0 \n",
710 | "\n",
711 | " Scikit-Learn Docker Git Jira Kaggle \n",
712 | "0 NaN NaN NaN NaN NaN \n",
713 | "1 NaN NaN NaN NaN NaN \n",
714 | "2 NaN NaN NaN NaN NaN \n",
715 | "3 NaN NaN NaN NaN NaN \n",
716 | "4 0.0 8.0 29.0 1.0 0.0 \n",
717 | "\n",
718 | "[5 rows x 89 columns]"
719 | ]
720 | },
721 | "execution_count": 7,
722 | "metadata": {},
723 | "output_type": "execute_result"
724 | }
725 | ],
726 | "source": [
727 | "# Rename the first two columns\n",
728 | "df.rename({'level_0': 'title', 'level_1': 'category'}, axis='columns', inplace=True)\n",
729 | "df.head()"
730 | ]
731 | },
732 | {
733 | "cell_type": "code",
734 | "execution_count": 8,
735 | "metadata": {},
736 | "outputs": [
737 | {
738 | "data": {
804 | "text/plain": [
805 | " title category variable value\n",
806 | "0 Data Engineer Big Data Technologies Python NaN\n",
807 | "1 Data Engineer Cloud Computing Platforms Python NaN\n",
808 | "2 Data Engineer Deep Learning Frameworks Python NaN\n",
809 | "3 Data Engineer Machine Learning Application Python NaN\n",
810 | "4 Data Engineer Other Python NaN"
811 | ]
812 | },
813 | "execution_count": 8,
814 | "metadata": {},
815 | "output_type": "execute_result"
816 | }
817 | ],
818 | "source": [
819 | "value_vars = df.columns.tolist()[2:] # the list of column names except the first two\n",
820 | "# Transform from wide to long for plotting\n",
821 | "df = pd.melt(df, id_vars=['title', 'category'], value_vars=value_vars)\n",
822 | "df.head()"
823 | ]
824 | },
825 | {
826 | "cell_type": "code",
827 | "execution_count": 9,
828 | "metadata": {},
829 | "outputs": [
830 | {
831 | "data": {
897 | "text/plain": [
898 | " title category skill frequency\n",
899 | "0 Data Engineer Big Data Technologies Python NaN\n",
900 | "1 Data Engineer Cloud Computing Platforms Python NaN\n",
901 | "2 Data Engineer Deep Learning Frameworks Python NaN\n",
902 | "3 Data Engineer Machine Learning Application Python NaN\n",
903 | "4 Data Engineer Other Python NaN"
904 | ]
905 | },
906 | "execution_count": 9,
907 | "metadata": {},
908 | "output_type": "execute_result"
909 | }
910 | ],
911 | "source": [
912 | "# Rename the last two columns\n",
913 | "df.rename({'variable': 'skill', 'value': 'frequency'}, axis='columns', inplace=True)\n",
914 | "df.head()"
915 | ]
916 | },
917 | {
918 | "cell_type": "code",
919 | "execution_count": 10,
920 | "metadata": {},
921 | "outputs": [
922 | {
923 | "data": {
989 | "text/plain": [
990 | " title category skill frequency\n",
991 | "5 Data Engineer Programming Languages Python 52.0\n",
992 | "12 Data Scientist Programming Languages Python 103.0\n",
993 | "19 Machine Learning Engineer Programming Languages Python 71.0\n",
994 | "26 Data Engineer Programming Languages R 5.0\n",
995 | "33 Data Scientist Programming Languages R 19.0"
996 | ]
997 | },
998 | "execution_count": 10,
999 | "metadata": {},
1000 | "output_type": "execute_result"
1001 | }
1002 | ],
1003 | "source": [
1004 | "# Subset to non null values in the freq column\n",
1005 | "df = df[df['frequency'].notnull()]\n",
1006 | "df.head()"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "code",
1011 | "execution_count": 11,
1012 | "metadata": {},
1013 | "outputs": [
1014 | {
1015 | "data": {
1081 | "text/plain": [
1082 | " title category skill frequency\n",
1083 | "0 Data Engineer Programming Languages Python 52.0\n",
1084 | "1 Data Scientist Programming Languages Python 103.0\n",
1085 | "2 Machine Learning Engineer Programming Languages Python 71.0\n",
1086 | "3 Data Engineer Programming Languages R 5.0\n",
1087 | "4 Data Scientist Programming Languages R 19.0"
1088 | ]
1089 | },
1090 | "execution_count": 11,
1091 | "metadata": {},
1092 | "output_type": "execute_result"
1093 | }
1094 | ],
1095 | "source": [
1096 | "# Reset the index\n",
1097 | "df.reset_index(drop=True, inplace=True)\n",
1098 | "df.head()"
1099 | ]
1100 | },
1101 | {
1102 | "cell_type": "code",
1103 | "execution_count": 12,
1104 | "metadata": {},
1105 | "outputs": [
1106 | {
1107 | "data": {
1108 | "text/plain": [
1109 | "title object\n",
1110 | "category object\n",
1111 | "skill object\n",
1112 | "frequency int32\n",
1113 | "dtype: object"
1114 | ]
1115 | },
1116 | "execution_count": 12,
1117 | "metadata": {},
1118 | "output_type": "execute_result"
1119 | }
1120 | ],
1121 | "source": [
1122 | "df = df.astype({'frequency': int})\n",
1123 | "df.dtypes"
1124 | ]
1125 | },
1126 | {
1127 | "cell_type": "code",
1128 | "execution_count": 13,
1129 | "metadata": {},
1130 | "outputs": [],
1131 | "source": [
1132 | "df.to_csv('skill_frequencies.csv')"
1133 | ]
1134 | }
1135 | ],
1136 | "metadata": {
1137 | "kernelspec": {
1138 | "display_name": "Python 3",
1139 | "language": "python",
1140 | "name": "python3"
1141 | },
1142 | "language_info": {
1143 | "codemirror_mode": {
1144 | "name": "ipython",
1145 | "version": 3
1146 | },
1147 | "file_extension": ".py",
1148 | "mimetype": "text/x-python",
1149 | "name": "python",
1150 | "nbconvert_exporter": "python",
1151 | "pygments_lexer": "ipython3",
1152 | "version": "3.6.5"
1153 | }
1154 | },
1155 | "nbformat": 4,
1156 | "nbformat_minor": 2
1157 | }
1158 |
--------------------------------------------------------------------------------
/env_ideal_profiles.yaml:
--------------------------------------------------------------------------------
1 | name: base
2 | channels:
3 | - conda-forge
4 | - anaconda-fusion
5 | - defaults
6 | dependencies:
7 | - conda=4.5.11=py36_1000
8 | - selenium=3.14.1=py36hfa6e2cd_1000
9 | - wordcloud=1.4.1=py36_0
10 | - _ipyw_jlab_nb_ext_conf=0.1.0=py36he6757f0_0
11 | - alabaster=0.7.10=py36hcd07829_0
12 | - anaconda=5.2.0=py36_3
13 | - anaconda-client=1.6.14=py36_0
14 | - anaconda-navigator=1.8.7=py36_0
15 | - anaconda-project=0.8.2=py36hfad2e28_0
16 | - asn1crypto=0.24.0=py36_0
17 | - astroid=1.6.3=py36_0
18 | - astropy=3.0.2=py36h452e1ab_1
19 | - attrs=18.1.0=py36_0
20 | - babel=2.5.3=py36_0
21 | - backcall=0.1.0=py36_0
22 | - backports=1.0=py36h81696a8_1
23 | - backports.shutil_get_terminal_size=1.0.0=py36h79ab834_2
24 | - beautifulsoup4=4.6.0=py36hd4cc5e8_1
25 | - bitarray=0.8.1=py36hfa6e2cd_1
26 | - bkcharts=0.2=py36h7e685f7_0
27 | - blas=1.0=mkl
28 | - blaze=0.11.3=py36h8a29ca5_0
29 | - bleach=2.1.3=py36_0
30 | - blosc=1.14.3=he51fdeb_0
31 | - bokeh=0.12.16=py36_0
32 | - boto=2.48.0=py36h1a776d2_1
33 | - bottleneck=1.2.1=py36hd119dfa_0
34 | - bzip2=1.0.6=hfa6e2cd_5
35 | - ca-certificates=2018.03.07=0
36 | - certifi=2018.4.16=py36_0
37 | - cffi=1.11.5=py36h945400d_0
38 | - chardet=3.0.4=py36h420ce6e_1
39 | - click=6.7=py36hec8c647_0
40 | - cloudpickle=0.5.3=py36_0
41 | - clyent=1.2.2=py36hb10d595_1
42 | - colorama=0.3.9=py36h029ae33_0
43 | - comtypes=1.1.4=py36_0
44 | - conda-build=3.10.5=py36_0
45 | - conda-env=2.6.0=h36134e3_1
46 | - conda-verify=2.0.0=py36h065de53_0
47 | - console_shortcut=0.1.1=h6bb2dd7_3
48 | - contextlib2=0.5.5=py36he5d52c0_0
49 | - cryptography=2.2.2=py36hfa6e2cd_0
50 | - curl=7.60.0=h7602738_0
51 | - cycler=0.10.0=py36h009560c_0
52 | - cython=0.28.2=py36hfa6e2cd_0
53 | - cytoolz=0.9.0.1=py36hfa6e2cd_0
54 | - dask=0.17.5=py36_0
55 | - dask-core=0.17.5=py36_0
56 | - datashape=0.5.4=py36h5770b85_0
57 | - decorator=4.3.0=py36_0
58 | - distributed=1.21.8=py36_0
59 | - docutils=0.14=py36h6012d8f_0
60 | - entrypoints=0.2.3=py36hfd66bb0_2
61 | - et_xmlfile=1.0.1=py36h3d2d736_0
62 | - fastcache=1.0.2=py36hfa6e2cd_2
63 | - filelock=3.0.4=py36_0
64 | - flask=1.0.2=py36_1
65 | - flask-cors=3.0.4=py36_0
66 | - freetype=2.8=h51f8f2c_1
67 | - get_terminal_size=1.0.0=h38e98db_0
68 | - gevent=1.3.0=py36hfa6e2cd_0
69 | - glob2=0.6=py36hdf76b57_0
70 | - greenlet=0.4.13=py36hfa6e2cd_0
71 | - h5py=2.7.1=py36h3bdd7fb_2
72 | - hdf5=1.10.2=hac2f561_1
73 | - heapdict=1.0.0=py36_2
74 | - html5lib=1.0.1=py36h047fa9f_0
75 | - icc_rt=2017.0.4=h97af966_0
76 | - icu=58.2=ha66f8fd_1
77 | - idna=2.6=py36h148d497_1
78 | - imageio=2.3.0=py36_0
79 | - imagesize=1.0.0=py36_0
80 | - intel-openmp=2018.0.0=8
81 | - ipykernel=4.8.2=py36_0
82 | - ipython=6.4.0=py36_0
83 | - ipython_genutils=0.2.0=py36h3c5d0ee_0
84 | - ipywidgets=7.2.1=py36_0
85 | - isort=4.3.4=py36_0
86 | - itsdangerous=0.24=py36hb6c5a24_1
87 | - jdcal=1.4=py36_0
88 | - jedi=0.12.0=py36_1
89 | - jinja2=2.10=py36h292fed1_0
90 | - jpeg=9b=hb83a4c4_2
91 | - jsonschema=2.6.0=py36h7636477_0
92 | - jupyter=1.0.0=py36_4
93 | - jupyter_client=5.2.3=py36_0
94 | - jupyter_console=5.2.0=py36h6d89b47_1
95 | - jupyter_core=4.4.0=py36h56e9d50_0
96 | - jupyterlab=0.32.1=py36_0
97 | - jupyterlab_launcher=0.10.5=py36_0
98 | - kiwisolver=1.0.1=py36h12c3424_0
99 | - lazy-object-proxy=1.3.1=py36hd1c21d2_0
100 | - libcurl=7.60.0=hc4dcbb0_0
101 | - libiconv=1.15=h1df5818_7
102 | - libpng=1.6.34=h79bbb47_0
103 | - libsodium=1.0.16=h9d3ae62_0
104 | - libssh2=1.8.0=hd619d38_4
105 | - libtiff=4.0.9=hb8ad9f9_1
106 | - libxml2=2.9.8=hadb2253_1
107 | - libxslt=1.1.32=hf6f1972_0
108 | - llvmlite=0.23.1=py36hcacf6c6_0
109 | - locket=0.2.0=py36hfed976d_1
110 | - lxml=4.2.1=py36heafd4d3_0
111 | - lzo=2.10=h6df0209_2
112 | - m2w64-gcc-libgfortran=5.3.0=6
113 | - m2w64-gcc-libs=5.3.0=7
114 | - m2w64-gcc-libs-core=5.3.0=7
115 | - m2w64-gmp=6.1.0=2
116 | - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
117 | - markupsafe=1.0=py36h0e26971_1
118 | - matplotlib=2.2.2=py36h153e9ff_1
119 | - mccabe=0.6.1=py36hb41005a_1
120 | - menuinst=1.4.14=py36hfa6e2cd_0
121 | - mistune=0.8.3=py36hfa6e2cd_1
122 | - mkl=2018.0.2=1
123 | - mkl-service=1.1.2=py36h57e144c_4
124 | - mkl_fft=1.0.1=py36h452e1ab_0
125 | - mkl_random=1.0.1=py36h9258bd6_0
126 | - more-itertools=4.1.0=py36_0
127 | - mpmath=1.0.0=py36hacc8adf_2
128 | - msgpack-python=0.5.6=py36he980bc4_0
129 | - msys2-conda-epoch=20160418=1
130 | - multipledispatch=0.5.0=py36_0
131 | - navigator-updater=0.2.1=py36_0
132 | - nbconvert=5.3.1=py36h8dc0fde_0
133 | - nbformat=4.4.0=py36h3a5bc1b_0
134 | - networkx=2.1=py36_0
135 | - nltk=3.3.0=py36_0
136 | - nose=1.3.7=py36h1c3779e_2
137 | - notebook=5.5.0=py36_0
138 | - numba=0.38.0=py36h830ac7b_0
139 | - numexpr=2.6.5=py36hcd2f87e_0
140 | - numpy=1.14.3=py36h9fa60d3_1
141 | - numpy-base=1.14.3=py36h555522e_1
142 | - numpydoc=0.8.0=py36_0
143 | - odo=0.5.1=py36h7560279_0
144 | - olefile=0.45.1=py36_0
145 | - openpyxl=2.5.3=py36_0
146 | - openssl=1.0.2o=h8ea7d77_0
147 | - packaging=17.1=py36_0
148 | - pandas=0.23.0=py36h830ac7b_0
149 | - pandoc=1.19.2.1=hb2460c7_1
150 | - pandocfilters=1.4.2=py36h3ef6317_1
151 | - parso=0.2.0=py36_0
152 | - partd=0.3.8=py36hc8e763b_0
153 | - path.py=11.0.1=py36_0
154 | - pathlib2=2.3.2=py36_0
155 | - patsy=0.5.0=py36_0
156 | - pep8=1.7.1=py36_0
157 | - pickleshare=0.7.4=py36h9de030f_0
158 | - pillow=5.1.0=py36h0738816_0
159 | - pip=10.0.1=py36_0
160 | - pkginfo=1.4.2=py36_1
161 | - plotly=3.4.1=py36h28b3542_0
162 | - pluggy=0.6.0=py36hc7daf1e_0
163 | - ply=3.11=py36_0
164 | - prompt_toolkit=1.0.15=py36h60b8f86_0
165 | - psutil=5.4.5=py36hfa6e2cd_0
166 | - py=1.5.3=py36_0
167 | - pycodestyle=2.4.0=py36_0
168 | - pycosat=0.6.3=py36h413d8a4_0
169 | - pycparser=2.18=py36hd053e01_1
170 | - pycrypto=2.6.1=py36hfa6e2cd_8
171 | - pycurl=7.43.0.1=py36h74b6da3_0
172 | - pyflakes=1.6.0=py36h0b975d6_0
173 | - pygments=2.2.0=py36hb010967_0
174 | - pylint=1.8.4=py36_0
175 | - pyodbc=4.0.23=py36h6538335_0
176 | - pyopenssl=18.0.0=py36_0
177 | - pyparsing=2.2.0=py36h785a196_1
178 | - pyqt=5.9.2=py36h1aa27d4_0
179 | - pysocks=1.6.8=py36_0
180 | - pytables=3.4.3=py36he6f6034_1
181 | - pytest=3.5.1=py36_0
182 | - pytest-arraydiff=0.2=py36_0
183 | - pytest-astropy=0.3.0=py36_0
184 | - pytest-doctestplus=0.1.3=py36_0
185 | - pytest-openfiles=0.3.0=py36_0
186 | - pytest-remotedata=0.2.1=py36_0
187 | - python=3.6.5=h0c2934d_0
188 | - python-dateutil=2.7.3=py36_0
189 | - pytz=2018.4=py36_0
190 | - pywavelets=0.5.2=py36hc649158_0
191 | - pywin32=223=py36hfa6e2cd_1
192 | - pywinpty=0.5.1=py36_0
193 | - pyyaml=3.12=py36h1d1928f_1
194 | - pyzmq=17.0.0=py36hfa6e2cd_1
195 | - qt=5.9.5=vc14he4a7d60_0
196 | - qtawesome=0.4.4=py36h5aa48f6_0
197 | - qtconsole=4.3.1=py36h99a29a9_0
198 | - qtpy=1.4.1=py36_0
199 | - requests=2.18.4=py36h4371aae_1
200 | - retrying=1.3.3=py36_2
201 | - rope=0.10.7=py36had63a69_0
202 | - ruamel_yaml=0.15.35=py36hfa6e2cd_1
203 | - scikit-image=0.13.1=py36hfa6e2cd_1
204 | - scikit-learn=0.19.1=py36h53aea1b_0
205 | - scipy=1.1.0=py36h672f292_0
206 | - seaborn=0.8.1=py36h9b69545_0
207 | - send2trash=1.5.0=py36_0
208 | - setuptools=39.1.0=py36_0
209 | - simplegeneric=0.8.1=py36_2
210 | - singledispatch=3.4.0.3=py36h17d0c80_0
211 | - sip=4.19.8=py36h6538335_0
212 | - six=1.11.0=py36h4db2310_1
213 | - snappy=1.1.7=h777316e_3
214 | - snowballstemmer=1.2.1=py36h763602f_0
215 | - sortedcollections=0.6.1=py36_0
216 | - sortedcontainers=1.5.10=py36_0
217 | - sphinx=1.7.4=py36_0
218 | - sphinxcontrib=1.0=py36hbbac3d2_1
219 | - sphinxcontrib-websupport=1.0.1=py36hb5e5916_1
220 | - spyder=3.2.8=py36_0
221 | - sqlalchemy=1.2.7=py36ha85dd04_0
222 | - sqlite=3.23.1=h35aae40_0
223 | - statsmodels=0.9.0=py36h452e1ab_0
224 | - sympy=1.1.1=py36h96708e0_0
225 | - tblib=1.3.2=py36h30f5020_0
226 | - terminado=0.8.1=py36_1
227 | - testpath=0.3.1=py36h2698cfe_0
228 | - tk=8.6.7=hcb92d03_3
229 | - toolz=0.9.0=py36_0
230 | - tornado=5.0.2=py36_0
231 | - traitlets=4.3.2=py36h096827d_0
232 | - typing=3.6.4=py36_0
233 | - unicodecsv=0.14.1=py36h6450c06_0
234 | - urllib3=1.22=py36h276f60a_0
235 | - vc=14=h0510ff6_3
236 | - vs2015_runtime=14.0.25123=3
237 | - wcwidth=0.1.7=py36h3d5aa90_0
238 | - webencodings=0.5.1=py36h67c50ae_1
239 | - werkzeug=0.14.1=py36_0
240 | - wheel=0.31.1=py36_0
241 | - widgetsnbextension=3.2.1=py36_0
242 | - win_inet_pton=1.0.1=py36he67d7fd_1
243 | - win_unicode_console=0.5=py36hcdbd4b5_0
244 | - wincertstore=0.2=py36h7fe50ca_0
245 | - winpty=0.4.3=4
246 | - wrapt=1.10.11=py36he5f5981_0
247 | - xlrd=1.1.0=py36h1cb58dc_1
248 | - xlsxwriter=1.0.4=py36_0
249 | - xlwings=0.11.8=py36_0
250 | - xlwt=1.3.0=py36h1a4751e_0
251 | - yaml=0.1.7=hc54c509_2
252 | - zeromq=4.2.5=hc6251cf_0
253 | - zict=0.1.3=py36h2d8e73e_0
254 | - zlib=1.2.11=h8395fce_2
255 | - pip:
256 | - tables==3.4.3
257 | prefix: D:\Anaconda3
258 |
259 |
--------------------------------------------------------------------------------
/helper.py:
--------------------------------------------------------------------------------
1 | import json
2 | import re, csv
3 | from wordcloud import WordCloud, STOPWORDS
4 | from matplotlib import pyplot as plt
5 | from process_text import *
6 | import pandas as pd
7 | import numpy as np
8 |
9 |
10 |
11 | def load_data(file_name):
12 | """
13 | Open the saved json data file and load the data into a dict.
14 |
15 | Parameters:
16 | file_name: the saved file name, e.g. "machine_learning_engineer.json"
17 |
18 | Returns:
19 | postings_dict: data in dict format
20 |
21 | """
22 |
23 | with open(file_name, 'r') as f:
24 | postings_dict = json.load(f)
25 | return postings_dict
26 |
27 |
28 |
29 | def plot_wc(text, max_words=200, stopwords_list=[], to_file_name=None):
30 | """
31 | Make a word cloud plot using the given text.
32 |
33 | Parameters:
34 | text -- the text as a string
35 |
36 | Returns:
37 | None
38 | """
39 |     # Combine the default stop words with the user-supplied list
40 |     stopwords = set(STOPWORDS)
41 |     stopwords.update(stopwords_list)
42 |
43 | wordcloud = WordCloud(background_color='white',
44 | stopwords=stopwords,
45 | #prefer_horizontal=1,
46 | max_words=max_words,
47 | min_font_size=6,
48 | scale=1,
49 | width = 800, height = 800,
50 | random_state=8).generate(text)
51 |
52 | plt.figure(figsize=[16,12])
53 | plt.imshow(wordcloud, interpolation="bilinear")
54 | plt.axis("off")
55 | plt.show()
56 |
57 | if to_file_name:
58 | to_file_name = to_file_name + ".png"
59 | wordcloud.to_file(to_file_name)
60 |
61 |
62 |
63 | def plot_profile(title,
64 | first_n_postings,
65 | max_words=200,
66 | return_posting=False,
67 | return_tokens=False,
68 | return_text_list=False):
69 | """
70 |     Loads the corresponding json file, extracts the first n job postings and plots the word cloud profile.
71 |
72 | Parameters:
73 | title: the job title such as "data scientist"
74 | first_n_postings: int, the first n job postings to use for the plot.
75 |
76 | Returns:
77 |         Depending on the return_* flags: the nth posting string, the raw text list, the token list, or None (when only plotting).
78 |
79 | """
80 | # Convert title to full file name then load the data
81 | file_name = '_'.join(title.split()) + '.json'
82 | data = load_data(file_name)
83 |
84 |     # At most one of the three return_* flags can be True
85 | if (return_posting + return_tokens + return_text_list) >= 2:
86 | print('You can only return one of these: a posting, tokens, text list! \nPlease try again.')
87 | return None
88 |
89 | if return_posting:
90 | n_posting = data[str(first_n_postings)]
91 | return n_posting
92 |
93 | text_list = make_text_list(data, first_n_postings)
94 |
95 | if return_text_list:
96 | return text_list
97 | elif return_tokens:
98 | tokens = tokenize_list(text_list, return_string=False)
99 | return tokens
100 | else:
101 | # Get the tokens joined as a string
102 | text = tokenize_list(text_list, return_string=True)
103 | # Get the stop words to use
104 | with open('stopwords.csv', 'r', newline='') as f:
105 | reader = csv.reader(f)
106 | stop_list = list(reader)[0]
107 | to_file_name = '_'.join(title.split())
108 | plot_wc(text, max_words, stopwords_list=stop_list, to_file_name=to_file_name)
109 |
110 |
111 |
112 | def plot_title(df, title, save_figure=False):
113 | """
114 | Plots the skill frequencies of all skill categories for a given title.
115 |
116 | Params:
117 | df: (pandas df) the frequency df
118 | title: (str) one of the three job titles:
119 | 'data scientist', 'machine learning engineer', 'data engineer'
120 |
121 | Returns:
122 | None
123 |
124 | """
125 | categories = df.category.unique()
126 | titles = list(df.title.unique())
127 |
128 | # Ensure input is valid
129 | if title.title() not in titles:
130 | print('Title invalid. Please try again!')
131 | return None
132 | title = title.title()
133 | # Subset df to the given title
134 | df_title = df.query('title==@title')
135 | # Set up the parameters for the plotting grid
136 | nrows=4
137 | ncols=2
138 | figsize = (15, 20)
139 | # Add a dummy category name to match the grid
140 | categories = np.append(categories, 'Empty').reshape(4, 2)
141 |
142 | # Generate the plotting objects
143 | fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
144 |
145 | # Loop thru the axes of the figure
146 | for row in range(nrows):
147 | for col in range(ncols):
148 | cat = categories[row, col]
149 | # Subset to one category for each subplot
150 | df_cat = df_title.query('category==@cat')
151 | df_cat = df_cat.sort_values(by='frequency', ascending=False)
152 |             # Find the corresponding axis in axes
153 | ax = axes[row, col]
154 | # Handle errors for the empty last subplot
155 | try:
156 | df_cat.plot(x='skill', y='frequency', kind='bar', ax=ax)
157 | ax.set(title=cat, xlabel='', ylabel='Frequency')
158 | ax.get_legend().remove() # remove legend
159 | for tick in ax.get_xticklabels():
160 | tick.set_rotation(60)
161 | except:
162 | fig.delaxes(ax)
163 |
164 | # Add the figure title
165 | fig_title = title + ' Skills Distribution'
166 | fig.suptitle(fig_title, y=0.92, verticalalignment='bottom', fontsize=30)
167 | plt.subplots_adjust(hspace=0.9) # make sure the figure title doesn't overlap with subplot titles
168 | plt.show()
169 |
170 | if save_figure:
171 | figure_name = fig_title + '.png'
172 | fig.savefig(figure_name)
173 |
174 |
175 |
176 | def plot_skill(df, cat, save_figure=False):
177 | """
178 | Plots the skill frequencies of all job titles for a given skill category.
179 |
180 | Params:
181 | df: (pandas df) the frequency df
182 | cat: (str) one of the seven skill categories:
183 | 'Programming Languages', 'Big Data Technologies'...
184 |
185 | Returns:
186 | None
187 |
188 | """
189 | categories = list(df.category.unique())
190 | titles = list(df.title.unique())
191 |
192 | if cat.title() not in categories:
193 | print('Category invalid. Please try again!')
194 | return None
195 | cat = cat.title()
196 |
197 | # Subset df to the given category
198 | df_cat = df.query('category==@cat')
199 |
200 | # Set up the parameters for the plotting grid
201 | nrows = len(titles)
202 | ncols = 1
203 | figsize = (10, 12)
204 |
205 | # Generate the plotting objects
206 | fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
207 |
208 | # Loop thru the axes of the figure
209 | for row in range(nrows):
210 | title = titles[row]
211 | # Subset to one title for each subplot
212 | df_title = df_cat.query('title==@title')
213 | df_title = df_title.sort_values(by='frequency', ascending=False)
214 |         # Find the corresponding axis in axes
215 | ax = axes[row]
216 | df_title.plot(x='skill', y='frequency', kind='bar', ax=ax)
217 | ax.set(title=title, xlabel='', ylabel='Frequency')
218 | ax.get_legend().remove() # remove legend
219 | for tick in ax.get_xticklabels():
220 | tick.set_rotation(30)
221 |
222 | # Add the figure title
223 | fig_title = cat + ' Distribution'
224 | fig.suptitle(fig_title, y=0.95, verticalalignment='baseline', fontsize=30)
225 | plt.subplots_adjust(hspace=0.36) # make sure the figure title doesn't overlap with subplot titles
226 | plt.show()
227 |
228 | if save_figure:
229 | figure_name = fig_title + '.png'
230 | fig.savefig(figure_name)
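231 | 
232 | 
233 | if __name__ == '__main__':
234 |     # Minimal usage sketch: assumes data_wrangling.ipynb has already written
235 |     # skill_frequencies.csv (title, category, skill, frequency) to this directory.
236 |     df = pd.read_csv('skill_frequencies.csv', index_col=0)
237 |     plot_title(df, 'data scientist', save_figure=False)
238 |     plot_skill(df, 'programming languages', save_figure=False)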
--------------------------------------------------------------------------------
/process_text.py:
--------------------------------------------------------------------------------
1 | from string import digits  # needed by remove_digits() below
2 | #from nltk import word_tokenize
3 | import re
4 | from nltk.corpus import stopwords
5 | from nltk.stem.snowball import SnowballStemmer
6 |
7 |
8 |
9 | def make_text_list(postings_dict, first_n_postings=100):
10 | """
11 | Extract the texts from postings_dict into a list of strings
12 |
13 | Parameters:
14 |         postings_dict: dict of postings loaded from the saved json file
15 |         first_n_postings: int, number of postings to extract from the dict
16 |
17 | Returns:
18 | text_list: list of job posting texts
19 |
20 | """
21 |
22 | text_list = []
23 | for i in range(0, first_n_postings+1):
24 |         # Since some posting numbers could be missing due to scraping errors,
25 |         # handle the exception here to keep the loop error-free
26 | try:
27 | text_list.append(postings_dict[str(i)]['posting'])
28 | except:
29 | continue
30 |
31 | return text_list
32 |
33 |
34 |
35 | def remove_digits(token):
36 | """
37 | Remove digits from a token
38 |
39 | Params:
40 | token: (str) a string token
41 |
42 | Returns:
43 | cleaned_token: (str) the cleaned token
44 |
45 | """
46 | # Remove digits from the token
47 | remove_digits = str.maketrans('', '', digits)
48 | token = token.translate(remove_digits)
49 | return token
50 |
51 |
52 |
53 | def tokenize_text(text, stem=False):
54 | """
55 | Tokenize, stem and remove stop words for the given text
56 |
57 | Parameters:
58 | text: a text string
59 |
60 | Returns:
61 | tokens: the processed text as a list of tokens
62 | """
63 | stop_words = set(stopwords.words('english'))
64 | #tokens = word_tokenize(text.lower())
65 |
66 | # Change "C++" to "Cpp" to avoid being removed below
67 | #tokens = ['cpp' if token=='c++' else token for token in tokens]
68 | # Same with C#
69 | #tokens = ['csharp' if token=='c#' else token for token in tokens]
70 | # Remove digits
71 | #tokens = [remove_digits(token) for token in tokens]
72 | # Remove non-alphabetic tokens and stopwords
73 | #tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
74 |
75 | # Use Regex to tokenize
76 | # Replace any non word characters except .+# with space
77 |     text = re.sub(r"[^\w.+#]", " ", text)
78 |     # Two cases to replace with a space
79 | # Case 1: \d+\.?\d+\s -- any number of digits followed by a space with or without
80 | # a dot in between
81 | # Case 2: \d+\+ -- any number of digits followed by a plus sign
82 |     text = re.sub(r"\d+\.?\d+\s|\d+\+", " ", text)
83 | tokens = text.lower().split()
84 | tokens = [token for token in tokens if token not in stop_words]
85 |
86 | # Stem tokens
87 | if stem:
88 | stemmer = SnowballStemmer("english")
89 | tokens = [stemmer.stem(i) for i in tokens]
90 |
91 | return tokens
92 |
93 |
94 |
95 | def tokenize_list(text_list, stem=False, return_string=False):
96 | """
97 | Tokenize the given list of text and then combine list of tokens into text for plotting
98 |
99 | Parameters:
100 | text_list -- list of job posting strings
101 |
102 | Returns:
103 | text -- a text string for word cloud plot
104 | """
105 | # Split the text based on slash, space and newline, then take set
106 | #text = [set(re.split('/| |\n|', i)) for i in text]
107 | #text = [set(re.split('\W', i)) for i in text_list]
108 |
109 | text_list_tokenized = [tokenize_text(text=i, stem=stem) for i in text_list]
110 |
111 | tokens = []
112 | # Combine all token lists into one big list of tokens
113 | for i in text_list_tokenized:
114 | tokens += i
115 |
116 | if return_string:
117 | text = ' '.join(tokens)
118 | return text
119 |
120 | # Return the list of all tokens
121 | return tokens
122 |
123 |
124 |
125 | def check_freq(dict_to_check, text_list):
126 | """
127 |     Checks each given skill keyword's frequency in a list of posting strings.
128 |
129 | Params:
130 |         dict_to_check: (dict) a dict of skill keywords to check frequency for, format:
131 | {'languages': ['Python', 'R'..],
132 | 'big data': ['AWS', 'Azure'...],
133 | ..}
134 | text_list: (list) a list of posting strings to search in
135 |
136 | Returns:
137 | freq: (dict) frequency counts
138 |
139 | """
140 | freq = {}
141 |
142 | # Join the text together and convert words to lowercase
143 | text = ' '.join(text_list).lower()
144 |
145 | for category, skill_list in dict_to_check.items():
146 | # Initialize each category as a dictionary
147 | freq[category] = {}
148 | for skill in skill_list:
149 | if len(skill) == 1: # pad single letter skills such as "R" with spaces
150 | skill_name = ' ' + skill.lower() + ' '
151 | else:
152 | skill_name = skill.lower()
153 | freq[category][skill] = text.count(skill_name)
154 |
155 | return freq
156 |
--------------------------------------------------------------------------------
/scrape_data.py:
--------------------------------------------------------------------------------
1 | import re
2 | import json
3 | from bs4 import BeautifulSoup
4 | from selenium import webdriver
5 |
6 |
7 |
8 | def get_soup(url):
9 | """
10 | Given the url of a page, this function returns the soup object.
11 |
12 | Parameters:
13 | url: the link to get soup object for
14 |
15 | Returns:
16 | soup: soup object
17 | """
18 | driver = webdriver.Firefox()
19 | driver.get(url)
20 | html = driver.page_source
21 | soup = BeautifulSoup(html, 'html.parser')
22 | driver.close()
23 |
24 | return soup
25 |
26 |
27 |
28 | def grab_job_links(soup):
29 | """
30 |     Grab all non-sponsored job posting links from an Indeed search result page using the given soup object
31 |
32 | Parameters:
33 | soup: the soup object corresponding to a search result page
34 | e.g. https://ca.indeed.com/jobs?q=data+scientist&l=Toronto&start=20
35 |
36 | Returns:
37 | urls: a python list of job posting urls
38 |
39 | """
40 | urls = []
41 |
42 | # Loop thru all the posting links
43 | for link in soup.find_all('h2', {'class': 'jobtitle'}):
44 | # Since sponsored job postings are represented by "a target" instead of "a href", no need to worry here
45 | partial_url = link.a.get('href')
46 | # This is a partial url, we need to attach the prefix
47 | url = 'https://ca.indeed.com' + partial_url
48 |         # Collect the full url of this posting
49 | urls.append(url)
50 |
51 | return urls
52 |
53 |
54 |
55 | def get_urls(query, num_pages, location):
56 | """
57 | Get all the job posting URLs resulted from a specific search.
58 |
59 | Parameters:
60 | query: job title to query
61 | num_pages: number of pages needed
62 | location: city to search in
63 |
64 | Returns:
65 |         urls: a list of job posting URLs (when num_pages is valid)
66 |         max_pages: maximum number of pages allowed (when num_pages is invalid)
67 | """
68 | # We always need the first page
69 | base_url = 'https://ca.indeed.com/jobs?q={}&l={}'.format(query, location)
70 | soup = get_soup(base_url)
71 | urls = grab_job_links(soup)
72 |
73 | # Get the total number of postings found
74 | posting_count_string = soup.find(name='div', attrs={'id':"searchCount"}).get_text()
75 | posting_count_string = posting_count_string[posting_count_string.find('of')+2:].strip()
76 | #print('posting_count_string: {}'.format(posting_count_string))
77 | #print('type is: {}'.format(type(posting_count_string)))
78 |
79 |     try:
80 |         posting_count = int(posting_count_string)
81 |     except ValueError: # deal with special case when parsed string is "360 jobs"
82 |         try:
83 |             posting_count = int(re.search(r'\d+', posting_count_string).group(0))
84 |         except (AttributeError, ValueError):
85 |             posting_count = 330 # fall back to 330 when unable to parse the total
86 |     #print('posting_count: {}'.format(posting_count))
87 |     #print('\ntype: {}'.format(type(posting_count)))
88 |
89 |     # Limit the number of pages to get
90 | max_pages = round(posting_count / 10) - 3
91 | if num_pages > max_pages:
92 | print('returning max_pages!!')
93 | return max_pages
94 |
95 | # Additional work is needed when more than 1 page is requested
96 | if num_pages >= 2:
97 | # Start loop from page 2 since page 1 has been dealt with above
98 | for i in range(2, num_pages+1):
99 | num = (i-1) * 10
100 | base_url = 'https://ca.indeed.com/jobs?q={}&l={}&start={}'.format(query, location, num)
101 | try:
102 | soup = get_soup(base_url)
103 | # We always combine the results back to the list
104 | urls += grab_job_links(soup)
105 | except:
106 | continue
107 |
108 | # Check to ensure the number of urls gotten is correct
109 | #assert len(urls) == num_pages * 10, "There are missing job links, check code!"
110 |
111 | return urls
112 |
113 |
114 |
115 | def get_posting(url):
116 | """
117 | Get the text portion including both title and job description of the job posting from a given url
118 |
119 | Parameters:
120 | url: The job posting link
121 |
122 | Returns:
123 | title: the job title (if "data scientist" is in the title)
124 | posting: the job posting content
125 | """
126 | # Get the url content as BS object
127 | soup = get_soup(url)
128 |
129 | # The job title is held in the h3 tag
130 | title = soup.find(name='h3').getText().lower()
131 | posting = soup.find(name='div', attrs={'class': "jobsearch-JobComponent"}).get_text()
132 |
133 | return title, posting.lower()
134 |
135 |
136 | #if 'data scientist' in title: # We'll proceed to grab the job posting text if the title is correct
137 | # All the text info is contained in the div element with the below class, extract the text.
138 | #posting = soup.find(name='div', attrs={'class': "jobsearch-JobComponent"}).get_text()
139 | #return title, posting.lower()
140 | #else:
141 | #return False
142 |
143 | # Get rid of numbers and symbols other than given
144 | #text = re.sub("[^a-zA-Z'+#&]", " ", text)
145 | # Convert to lower case and split to list and then set
146 | #text = text.lower().strip()
147 |
148 | #return text
149 |
150 |
151 |
152 | def get_data(query, num_pages, location='Toronto'):
153 | """
154 | Get all the job posting data and save in a json file using below structure:
155 |
156 |         {<posting_index>: {'title': ..., 'posting': ..., 'url': ...}, ...}
157 | 
158 |     The json file name has this format: "<query>.json"
159 |
160 | Parameters:
161 | query: Indeed query keyword such as 'Data Scientist'
162 | num_pages: Number of search results needed
163 | location: location to search for
164 |
165 | Returns:
166 | postings_dict: Python dict including all posting data
167 |
168 | """
169 | # Convert the queried title to Indeed format
170 | query = '+'.join(query.lower().split())
171 |
172 | postings_dict = {}
173 | urls = get_urls(query, num_pages, location)
174 |
175 | # Continue only if the requested number of pages is valid (when invalid, a number is returned instead of list)
176 | if isinstance(urls, list):
177 | num_urls = len(urls)
178 | for i, url in enumerate(urls):
179 | try:
180 | title, posting = get_posting(url)
181 | postings_dict[i] = {}
182 | postings_dict[i]['title'], postings_dict[i]['posting'], postings_dict[i]['url'] = \
183 | title, posting, url
184 | except:
185 | continue
186 |
187 | percent = (i+1) / num_urls
188 |             # Print the progress; the "end" arg keeps the message on the same line
189 | print("Progress: {:2.0f}%".format(100*percent), end='\r')
190 |
191 | # Save the dict as json file
192 | file_name = query.replace('+', '_') + '.json'
193 | with open(file_name, 'w') as f:
194 | json.dump(postings_dict, f)
195 |
196 | print('All {} postings have been scraped and saved!'.format(num_urls))
197 | #return postings_dict
198 | else:
199 |         print("The requested number of pages exceeds the search results; the maximum available is only {}. Please try again!".format(urls))
200 |
201 |
202 |
203 | # If script is run directly, we'll take input from the user
204 | if __name__ == "__main__":
205 | queries = ["data scientist", "machine learning engineer", "data engineer"]
206 |
207 | while True:
208 | query = input("Please enter the title to scrape data for: \n").lower()
209 | if query in queries:
210 | break
211 | else:
212 | print("Invalid title! Please try again.")
213 |
214 | while True:
215 | num_pages = input("Please enter the number of pages needed (integer only): \n")
216 | try:
217 | num_pages = int(num_pages)
218 | break
219 | except:
220 | print("Invalid number of pages! Please try again.")
221 |
222 | get_data(query, num_pages, location='Toronto')
223 |
224 |
--------------------------------------------------------------------------------
/stopwords.csv:
--------------------------------------------------------------------------------
1 | "experience","job","work","working","skills","new","company","years","technology","ago","save","jobapply","nowapply","using","strong","ability","days","knowledge","opportunity","tools","related","including","original","understanding","us","role","degree","one","requirements","canada","required","toronto","world","provide","industry","help","saying","reviewsread","looking","preferred","sitesave","applicants","applications","part","field","etc","apply","across","position","life","application","employment","best","key","use","well","following","please","like","opportunities","within","nowsave","drive","qualifications","responsibilities","employees","global","must","equal","able","various","join","candidate","high","needs","education","time","meet","need",,"status","accommodation","diverse","successful","may","background","candidates","language","good","excellent","career","also","level","employer","flexible","companies","canadian","want","culture","grow","closely","available","relevant","diversity","approaches","group","used","demonstrated","full","languages","top","professional","multiple","type","description","based","sources","disability","location","day","current","take","national","highly","events","gender","individuals","variety","better","order","similar","concepts","effectively","way","offer","record","great","sets","different","next","human","include","ensure","plus","ontario","minimum","every","disabilities","data","team","benefit","understand","onapply","applying","benefits","around","office","require","future","asset","real","contribute","review","hand","responsible"
2 |
--------------------------------------------------------------------------------