├── .DS_Store ├── .gitignore ├── Dockerfile ├── README.md ├── dags ├── airflow-log-cleanup.py ├── bigquery_pipeline.py ├── config │ └── .gitkeep ├── gcp_connection.py ├── modules │ ├── country.py │ ├── country_codes.csv │ ├── job_title.py │ └── youtube_views.csv ├── serpapi_bigquery.py └── sql │ ├── .project │ ├── cache_csv.sql │ ├── fact_build.sql │ ├── public_build.sql │ └── wide_build.sql ├── docker-compose.yaml ├── extra ├── airflow_graph.png ├── bigquery_schema.json ├── dashboard.png └── dataproc_files │ ├── README.md │ ├── Salary Table.py │ ├── Skill Table.py │ └── keywords │ ├── All Keywords.txt │ ├── Keywords Analyst Tools.txt │ ├── Keywords Cloud.txt │ ├── Keywords Databases.txt │ ├── Keywords Libraries.txt │ ├── Keywords OS.txt │ ├── Keywords Other.txt │ ├── Keywords Sync.txt │ ├── Keywords async.txt │ ├── Programming Keywords.txt │ └── Web Frameworks Keywords.txt ├── logs └── .gitkeep ├── plugins └── .gitkeep └── requirements.txt /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | config.py 2 | .env 3 | *.ini 4 | dags/config/* 5 | !dags/config/.gitkeep 6 | logs/* 7 | !logs/.gitkeep 8 | __pycache__/ 9 | token.json 10 | client_secret.json 11 | google_oauth.py -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM apache/airflow:2.5.0 2 | COPY requirements.txt . 3 | RUN pip install -r requirements.txt -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![Apache Airflow](https://img.shields.io/badge/Apache%20Airflow-017CEE?style=for-the-badge&logo=Apache%20Airflow&logoColor=white) 2 | 3 | # 🤓 Data Job Pipeline w/ Airflow 4 | ![Airflow DAG](/extra/airflow_graph.png) 5 | What up, data nerds! This is a data pipeline I built that moves Google job search data from [SerpApi](https://serpapi.com/) to a BigQuery database. 6 | 7 | 8 | ## Background 9 | I built an [app](https://jobdata.streamlit.app/) to open-source job requirements to aspiring data nerds so they can more efficiently focus on the skills they need to know for their job. This airflow pipeline collects the data for this app. 10 | [![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://jobdata.streamlit.app/) 11 | ![dashboard](/extra/dashboard.png) 12 | 13 | # ☝🏻 Prerequisites 14 | - [Docker](https://docs.docker.com/get-docker/) installed on your server/machine 15 | 16 | SERVER NOTE: I used a QNAP machine for my server, here's prereq's for that: 17 | - [Familiar with QNAP instructions](https://www.qnap.com/en/how-to/faq/article/how-do-i-access-my-qnap-nas-using-ssh) 18 | - Enable SSH via Control Panel 19 | - Setup *Container/* folder via Container Station 20 | 21 | # 📲 Install 22 | ## Docker-Compose install via SSH 23 | 24 | 25 | 1. Access server via SSH 26 | ``` 27 | ssh admin@192.168.1.131 28 | ``` 29 | 2. Change directory to main directory to add cloned directory 30 | ``` 31 | cd .. 32 | cd share/Container 33 | ``` 34 | 3. 
Add this repository to that directory
35 | ```
36 | git clone https://github.com/lukebarousse/Data_Job_Pipeline_Airflow.git
37 | ```
38 | 4. Change directory into the repo root and add the environment variable for start-up
39 | ```
40 | cd Data_Job_Pipeline_Airflow
41 | echo -e "AIRFLOW_UID=$(id -u)" > .env
42 | ```
43 | 
44 | # ㊙️ Secret Keys
45 | ## Prereq
46 | 1. Access server via SSH
47 | ```
48 | ssh admin@192.168.1.131
49 | ```
50 | 2. Change directory to the *config/* directory
51 | ```
52 | cd ..
53 | cd dags/config
54 | ```
55 | ## SerpApi key
56 | Prerequisite: SerpApi account with enough credits
57 | 1. Get your private API key from the [SerpApi Dashboard](https://serpapi.com/dashboard)
58 | 2. Create a Python file for the SerpApi key
59 | ```
60 | echo -e "serpapi_key = '{Insert your key here}'" > config.py # or dags/config/config.py if in root
61 | ```
62 | ## BigQuery Access
63 | Prerequisite: Empty BigQuery table created with [this schema](/extra/bigquery_schema.json)
64 | 1. Follow [Google Cloud detailed documentation](https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries) to:
65 | - Enable the BigQuery API
66 | - Create a service account
67 | - Create & download a service account key (JSON file)
68 | 2. Place the JSON file in the [/dags/config](/dags/config/) directory
69 | 3. Add the location of the JSON file to [docker-compose.yaml](docker-compose.yaml)
70 | ```
71 | environment:
72 | GOOGLE_APPLICATION_CREDENTIALS: './dags/config/{Insert name of JSON file}.json'
73 | ```
74 | 4. Add the table ID for the JSON table to the config file
75 | ```
76 | echo -e "table_id_json = '{PROJECT_ID}.{DATASET}.{TABLE}'" >> config.py
77 | ```
78 | I also keep this as a backup:
79 | ```
80 | echo -e "table_id = '{PROJECT_ID}.{DATASET}.{TABLE}'" >> config.py
81 | ```
82 | 
83 | # 🐳 Start & Stop
84 | Reference: [Airflow with Docker-Compose](https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html)
85 | ## Start-up of Docker-Compose
86 | NOTE: If you don't want to use SerpApi credits on the first run, set `TESTING_DAG = True` in 'serpapi_bigquery.py'
87 | 1. If necessary, SSH and cd into the root directory
88 | ```
89 | cd ..
90 | cd share/Container/Data_Job_Pipeline_Airflow
91 | ```
92 | 2. If running for the first time, initialize the Airflow database
93 | ```
94 | docker-compose up airflow-init
95 | ```
96 | 3. Start Airflow
97 | ```
98 | docker-compose up
99 | ```
100 | 
101 | ## Shutdown & removal of Docker-Compose
102 | 1. If necessary, SSH and cd into the root directory
103 | ```
104 | cd ..
105 | cd share/Container/Data_Job_Pipeline_Airflow
106 | ```
107 | 2. Stop and delete containers, delete volumes with database data, and remove downloaded images
108 | ```
109 | docker-compose down --volumes --rmi all
110 | ```
111 | 
112 | # 🫁 Appendix
113 | ### Want to contribute?
114 | - **Data Analysis:** Share any interesting insights you find from the [dataset](https://www.kaggle.com/datasets/lukebarousse/data-analyst-job-postings-google-search) to this [subreddit](https://www.reddit.com/r/DataNerd/) and/or [Kaggle](https://www.kaggle.com/code/lukebarousse/eda-of-job-posting-data).
115 | - **Dashboard Build:** Contribute changes to the dashboard by using [that repo to fork and open a pull request](https://github.com/lukebarousse/Data_Analyst_Streamlit_App_V1).
116 | - **Data Pipeline Build:** Contribute changes to this pipeline by using [this repo to fork and open a pull request](https://github.com/lukebarousse/Data_Job_Pipeline_Airflow) 117 | --- 118 | ### About the project 119 | - Background on app 📺 [YouTube](https://www.youtube.com/lukebarousse) 120 | - Data provided via 🤖 [SerpApi](https://serpapi.com/) 121 | -------------------------------------------------------------------------------- /dags/airflow-log-cleanup.py: -------------------------------------------------------------------------------- 1 | """ 2 | Source: https://github.com/teamclairvoyant/airflow-maintenance-dags/tree/master/log-cleanup 3 | 4 | A maintenance workflow that you can deploy into Airflow to periodically clean 5 | out the task logs to avoid those getting too big. 6 | """ 7 | import logging 8 | import os 9 | from datetime import timedelta 10 | 11 | import airflow 12 | import jinja2 13 | from airflow.configuration import conf 14 | from airflow.models import DAG, Variable 15 | from airflow.operators.bash_operator import BashOperator 16 | from airflow.operators.dummy_operator import DummyOperator 17 | 18 | # airflow-log-cleanup 19 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 20 | START_DATE = airflow.utils.dates.days_ago(1) 21 | try: 22 | BASE_LOG_FOLDER = conf.get("core", "BASE_LOG_FOLDER").rstrip("/") 23 | except Exception as e: 24 | BASE_LOG_FOLDER = conf.get("logging", "BASE_LOG_FOLDER").rstrip("/") 25 | # How often to Run. @daily - Once a day at Midnight 26 | SCHEDULE_INTERVAL = "@weekly" 27 | # Who is listed as the owner of this DAG in the Airflow Web Server 28 | DAG_OWNER_NAME = "airflow" 29 | # List of email address to send email alerts to if this job fails 30 | ALERT_EMAIL_ADDRESSES = ['luke@lukebarousse.com'] 31 | # Length to retain the log files if not already provided in the conf. If this 32 | # is set to 30, the job will remove those files that are 30 days old or older 33 | DEFAULT_MAX_LOG_AGE_IN_DAYS = 30 34 | # Can set as variable in U/I 35 | # Variable.get( 36 | # "airflow_log_cleanup__max_log_age_in_days", 30 37 | # ) 38 | # Whether the job should delete the logs or not. Included if you want to 39 | # temporarily avoid deleting the logs 40 | ENABLE_DELETE = True 41 | # The number of worker nodes you have in Airflow. Will attempt to run this 42 | # process for however many workers there are so that each worker gets its 43 | # logs cleared. 44 | NUMBER_OF_WORKERS = 1 45 | DIRECTORIES_TO_DELETE = [BASE_LOG_FOLDER] 46 | ENABLE_DELETE_CHILD_LOG = "False" 47 | # Can set as variable in U/I 48 | # Variable.get( 49 | # "airflow_log_cleanup__enable_delete_child_log", "False" 50 | # ) 51 | LOG_CLEANUP_PROCESS_LOCK_FILE = "/tmp/airflow_log_cleanup_worker.lock" 52 | logging.info("ENABLE_DELETE_CHILD_LOG " + ENABLE_DELETE_CHILD_LOG) 53 | 54 | if not BASE_LOG_FOLDER or BASE_LOG_FOLDER.strip() == "": 55 | raise ValueError( 56 | "BASE_LOG_FOLDER variable is empty in airflow.cfg. It can be found " 57 | "under the [core] (<2.0.0) section or [logging] (>=2.0.0) in the cfg file. " 58 | "Kindly provide an appropriate directory path." 
59 | ) 60 | 61 | if ENABLE_DELETE_CHILD_LOG.lower() == "true": 62 | try: 63 | CHILD_PROCESS_LOG_DIRECTORY = conf.get( 64 | "scheduler", "CHILD_PROCESS_LOG_DIRECTORY" 65 | ) 66 | if CHILD_PROCESS_LOG_DIRECTORY != ' ': 67 | DIRECTORIES_TO_DELETE.append(CHILD_PROCESS_LOG_DIRECTORY) 68 | except Exception as e: 69 | logging.exception( 70 | "Could not obtain CHILD_PROCESS_LOG_DIRECTORY from " + 71 | "Airflow Configurations: " + str(e) 72 | ) 73 | 74 | default_args = { 75 | 'owner': DAG_OWNER_NAME, 76 | 'depends_on_past': False, 77 | 'email': ALERT_EMAIL_ADDRESSES, 78 | 'email_on_failure': True, 79 | 'email_on_retry': False, 80 | 'start_date': START_DATE, 81 | 'retries': 1, 82 | 'retry_delay': timedelta(minutes=1) 83 | } 84 | 85 | dag = DAG( 86 | DAG_ID, 87 | default_args=default_args, 88 | schedule_interval=SCHEDULE_INTERVAL, 89 | start_date=START_DATE, 90 | template_undefined=jinja2.Undefined, 91 | tags=['maintenance-dag'], 92 | ) 93 | if hasattr(dag, 'doc_md'): 94 | dag.doc_md = __doc__ 95 | if hasattr(dag, 'catchup'): 96 | dag.catchup = False 97 | 98 | start = DummyOperator( 99 | task_id='start', 100 | dag=dag) 101 | 102 | log_cleanup = """ 103 | 104 | echo "Getting Configurations..." 105 | BASE_LOG_FOLDER="{{params.directory}}" 106 | WORKER_SLEEP_TIME="{{params.sleep_time}}" 107 | 108 | sleep ${WORKER_SLEEP_TIME}s 109 | 110 | MAX_LOG_AGE_IN_DAYS="{{dag_run.conf.maxLogAgeInDays}}" 111 | if [ "${MAX_LOG_AGE_IN_DAYS}" == "" ]; then 112 | echo "maxLogAgeInDays conf variable isn't included. Using Default '""" + str(DEFAULT_MAX_LOG_AGE_IN_DAYS) + """'." 113 | MAX_LOG_AGE_IN_DAYS='""" + str(DEFAULT_MAX_LOG_AGE_IN_DAYS) + """' 114 | fi 115 | ENABLE_DELETE=""" + str("true" if ENABLE_DELETE else "false") + """ 116 | echo "Finished Getting Configurations" 117 | echo "" 118 | 119 | echo "Configurations:" 120 | echo "BASE_LOG_FOLDER: '${BASE_LOG_FOLDER}'" 121 | echo "MAX_LOG_AGE_IN_DAYS: '${MAX_LOG_AGE_IN_DAYS}'" 122 | echo "ENABLE_DELETE: '${ENABLE_DELETE}'" 123 | 124 | cleanup() { 125 | echo "Executing Find Statement: $1" 126 | FILES_MARKED_FOR_DELETE=`eval $1` 127 | echo "Process will be Deleting the following File(s)/Directory(s):" 128 | echo "${FILES_MARKED_FOR_DELETE}" 129 | echo "Process will be Deleting `echo "${FILES_MARKED_FOR_DELETE}" | \ 130 | grep -v '^$' | wc -l` File(s)/Directory(s)" \ 131 | # "grep -v '^$'" - removes empty lines. 132 | # "wc -l" - Counts the number of lines 133 | echo "" 134 | if [ "${ENABLE_DELETE}" == "true" ]; 135 | then 136 | if [ "${FILES_MARKED_FOR_DELETE}" != "" ]; 137 | then 138 | echo "Executing Delete Statement: $2" 139 | eval $2 140 | DELETE_STMT_EXIT_CODE=$? 141 | if [ "${DELETE_STMT_EXIT_CODE}" != "0" ]; then 142 | echo "Delete process failed with exit code \ 143 | '${DELETE_STMT_EXIT_CODE}'" 144 | 145 | echo "Removing lock file..." 146 | rm -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 147 | if [ "${REMOVE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 148 | echo "Error removing the lock file. \ 149 | Check file permissions.\ 150 | To re-run the DAG, ensure that the lock file has been \ 151 | deleted (""" + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """)." 152 | exit ${REMOVE_LOCK_FILE_EXIT_CODE} 153 | fi 154 | exit ${DELETE_STMT_EXIT_CODE} 155 | fi 156 | else 157 | echo "WARN: No File(s)/Directory(s) to Delete" 158 | fi 159 | else 160 | echo "WARN: You're opted to skip deleting the File(s)/Directory(s)!!!" 161 | fi 162 | } 163 | 164 | 165 | if [ ! -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ ]; then 166 | 167 | echo "Lock file not found on this node! 
\ 168 | Creating it to prevent collisions..." 169 | touch """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 170 | CREATE_LOCK_FILE_EXIT_CODE=$? 171 | if [ "${CREATE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 172 | echo "Error creating the lock file. \ 173 | Check if the airflow user can create files under tmp directory. \ 174 | Exiting..." 175 | exit ${CREATE_LOCK_FILE_EXIT_CODE} 176 | fi 177 | 178 | echo "" 179 | echo "Running Cleanup Process..." 180 | 181 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type f -mtime \ 182 | +${MAX_LOG_AGE_IN_DAYS}" 183 | DELETE_STMT="${FIND_STATEMENT} -exec rm -f {} \;" 184 | 185 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 186 | CLEANUP_EXIT_CODE=$? 187 | 188 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type d -empty" 189 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 190 | 191 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 192 | CLEANUP_EXIT_CODE=$? 193 | 194 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/* -type d -empty" 195 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 196 | 197 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 198 | CLEANUP_EXIT_CODE=$? 199 | 200 | echo "Finished Running Cleanup Process" 201 | 202 | echo "Deleting lock file..." 203 | rm -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 204 | REMOVE_LOCK_FILE_EXIT_CODE=$? 205 | if [ "${REMOVE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 206 | echo "Error removing the lock file. Check file permissions. To re-run the DAG, ensure that the lock file has been deleted (""" + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """)." 207 | exit ${REMOVE_LOCK_FILE_EXIT_CODE} 208 | fi 209 | 210 | else 211 | echo "Another task is already deleting logs on this worker node. \ 212 | Skipping it!" 213 | echo "If you believe you're receiving this message in error, kindly check \ 214 | if """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ exists and delete it." 
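# Another worker on this node already holds the lock, so skip cleanup and exit 0 to mark the task successful rather than failed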
215 | exit 0 216 | fi 217 | 218 | """ 219 | 220 | for log_cleanup_id in range(1, NUMBER_OF_WORKERS + 1): 221 | 222 | for dir_id, directory in enumerate(DIRECTORIES_TO_DELETE): 223 | 224 | log_cleanup_op = BashOperator( 225 | task_id='log_cleanup_worker_num_' + 226 | str(log_cleanup_id) + '_dir_' + str(dir_id), 227 | bash_command=log_cleanup, 228 | params={ 229 | "directory": str(directory), 230 | "sleep_time": int(log_cleanup_id)*3}, 231 | dag=dag) 232 | 233 | log_cleanup_op.set_upstream(start) -------------------------------------------------------------------------------- /dags/bigquery_pipeline.py: -------------------------------------------------------------------------------- 1 | """ 2 | An operations workflow to clean up BigQuery table making fact and dimension tables 3 | """ 4 | import os 5 | import warnings 6 | warnings.simplefilter(action='ignore', category=FutureWarning) # stop getting Pandas FutureWarning's 7 | 8 | import airflow 9 | from airflow import DAG 10 | from airflow.operators.dummy_operator import DummyOperator 11 | from airflow.operators.python_operator import PythonOperator 12 | from config import config # contains secret keys in config.py 13 | from google.cloud import bigquery 14 | from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator 15 | from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator, DataprocDeleteClusterOperator, DataprocCreateClusterOperator, DataprocSubmitJobOperator 16 | from airflow.operators.http_operator import SimpleHttpOperator 17 | from datetime import timedelta 18 | 19 | from youtube.google_oauth import update_video 20 | from modules.job_title import transform_job_title 21 | 22 | 23 | # 'False' DAG is ready for operation; 'True' DAG only runs 'start' DummyOperator 24 | TESTING_DAG = False 25 | # Minutes to sleep on an error 26 | ERROR_SLEEP_MIN = 5 27 | # Who is listed as the owner of this DAG in the Airflow Web Server 28 | DAG_OWNER_NAME = "airflow" 29 | # List of email address to send email alerts to if this job fails 30 | ALERT_EMAIL_ADDRESSES = ['luke@lukebarousse.com'] 31 | START_DATE = airflow.utils.dates.days_ago(1) 32 | 33 | default_args = { 34 | 'owner': DAG_OWNER_NAME, 35 | 'depends_on_past': False, 36 | 'start_date': START_DATE, 37 | 'email': ALERT_EMAIL_ADDRESSES, 38 | 'email_on_failure': True, 39 | 'email_on_retry': True, 40 | 'retries': 1, 41 | 'retry_delay': timedelta(minutes=5), 42 | # 'queue': 'bash_queue', 43 | # 'pool': 'backfill', 44 | # 'priority_weight': 10, 45 | # 'end_date': datetime(2022, 1, 1), 46 | # 'wait_for_downstream': False, 47 | # 'dag': dag, 48 | # 'sla': timedelta(hours=2), 49 | # 'execution_timeout': timedelta(seconds=300), 50 | # 'on_failure_callback': some_function, 51 | # 'on_success_callback': some_other_function, 52 | # 'on_retry_callback': another_function, 53 | # 'sla_miss_callback': yet_another_function, 54 | # 'trigger_rule': 'all_success' 55 | } 56 | 57 | # Dataproc cluster variables 58 | PROJECT_ID = 'job-listings-366015' 59 | REGION = 'us-central1' 60 | ZONE = 'us-central1-a' 61 | CLUSTER_NAME = 'spark-cluster' 62 | BUCKET_NAME = 'dataproc-cluster-gsearch' 63 | 64 | CLUSTER_CONFIG = ClusterGenerator( 65 | task_id='start_cluster', 66 | gcp_conn_id='google_cloud_default', 67 | project_id=PROJECT_ID, 68 | cluster_name=CLUSTER_NAME, 69 | region=REGION, 70 | zone=ZONE, 71 | storage_bucket=BUCKET_NAME, 72 | num_workers=2, 73 | master_machine_type='n2-standard-2', 74 | worker_machine_type='n2-standard-2', 75 | image_version='1.5-debian10', 76 | 
optional_components=['ANACONDA', 'JUPYTER'], 77 | properties={'spark:spark.jars.packages': 'com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.6,com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1'}, 78 | metadata={'PIP_PACKAGES': 'spark-nlp spark-nlp-display'}, 79 | init_actions_uris=['gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh'], 80 | auto_delete_ttl=60*60*4, # 4 hours 81 | enable_component_gateway=True 82 | ).make() 83 | 84 | dag = DAG( 85 | 'bigquery_pipeline', 86 | description='Execute BigQuery & Dataproc jobs for data pipeline', 87 | default_args=default_args, 88 | schedule_interval='0 10 * * *', # want to run following serpapi_bigquery dag... best practice is to combine don't want to combine and make script so big 89 | catchup=False, 90 | tags=['data-pipeline-dag'], 91 | max_active_tasks = 3 92 | ) 93 | 94 | with dag: 95 | 96 | start = DummyOperator( 97 | task_id='start', 98 | dag=dag) 99 | 100 | if not TESTING_DAG: 101 | 102 | # Create fact table from JSON data from SerpApi 103 | fact_table_build = BigQueryInsertJobOperator( 104 | task_id='fact_table_build', 105 | gcp_conn_id='google_cloud_default', 106 | configuration={ 107 | "query": { 108 | "query": 'sql/fact_build.sql', 109 | "useLegacySql": False 110 | } 111 | }, 112 | dag=dag 113 | ) 114 | 115 | # Start up dataproc cluster with spark-nlp and spark-bigquery dependencies 116 | # NOTE: CLI commands in notes under 'Dataproc' section 117 | start_cluster = DataprocCreateClusterOperator( 118 | task_id='start_cluster', 119 | cluster_name=CLUSTER_NAME, 120 | project_id=PROJECT_ID, 121 | region=REGION, 122 | cluster_config=CLUSTER_CONFIG, 123 | dag=dag 124 | ) 125 | 126 | # Create (and/or replace) salary table 127 | python_file_salary = 'gs://dataproc-cluster-gsearch/notebooks/jupyter/salary_table.py' 128 | SALARY_TABLE = { 129 | "reference": {"project_id": PROJECT_ID}, 130 | "placement": {"cluster_name": CLUSTER_NAME}, 131 | "pyspark_job": {"main_python_file_uri": python_file_salary}, 132 | } 133 | 134 | salary_table = DataprocSubmitJobOperator( 135 | task_id='salary_table', 136 | job=SALARY_TABLE, 137 | project_id=PROJECT_ID, 138 | region=REGION, 139 | dag=dag 140 | ) 141 | 142 | # Append to skill table 143 | python_file_skill = 'gs://dataproc-cluster-gsearch/notebooks/jupyter/skill_table.py' 144 | SKILL_TABLE = { 145 | "reference": {"project_id": PROJECT_ID}, 146 | "placement": {"cluster_name": CLUSTER_NAME}, 147 | "pyspark_job": {"main_python_file_uri": python_file_skill}, 148 | } 149 | 150 | skill_table = DataprocSubmitJobOperator( 151 | task_id='skill_table', 152 | job=SKILL_TABLE, 153 | project_id=PROJECT_ID, 154 | region=REGION, 155 | dag=dag 156 | ) 157 | 158 | # Shut down dataproc cluster 159 | stop_cluster = DataprocDeleteClusterOperator( 160 | task_id='stop_cluster', 161 | project_id=PROJECT_ID, 162 | cluster_name=CLUSTER_NAME, 163 | region=REGION, 164 | dag=dag 165 | ) 166 | 167 | # Transform job title using BART 168 | transform_job = PythonOperator( 169 | task_id='transform_job_title', 170 | python_callable=transform_job_title, 171 | dag=dag 172 | ) 173 | 174 | # Combine fact table with dimension table 175 | wide_table_build = BigQueryInsertJobOperator( 176 | task_id='wide_table_build', 177 | gcp_conn_id='google_cloud_default', 178 | configuration={ 179 | "query": { 180 | "query": 'sql/wide_build.sql', 181 | "useLegacySql": False 182 | } 183 | }, 184 | dag=dag 185 | ) 186 | 187 | # Cache common BigQuery queries in CSV files 188 | cache_csv = BigQueryInsertJobOperator( 189 | 
task_id='cache_csv', 190 | gcp_conn_id='google_cloud_default', 191 | configuration={ 192 | "query": { 193 | "query": 'sql/cache_csv.sql', 194 | "useLegacySql": False 195 | } 196 | }, 197 | dag=dag 198 | ) 199 | 200 | # Update video title in YouTube with number of job listings 201 | update_video_title = PythonOperator( 202 | task_id='update_video_title', 203 | python_callable=update_video, 204 | dag=dag 205 | ) 206 | 207 | # Create public dataset w/ No duplicates (for ChatGPT course) 208 | public_table_build = BigQueryInsertJobOperator( 209 | task_id='public_table_build', 210 | gcp_conn_id='google_cloud_default', 211 | configuration={ 212 | "query": { 213 | "query": 'sql/public_build.sql', 214 | "useLegacySql": False 215 | } 216 | }, 217 | dag=dag 218 | ) 219 | 220 | 221 | start >> fact_table_build >> start_cluster >> salary_table >> skill_table >> stop_cluster >> transform_job >> wide_table_build >> cache_csv >> update_video_title >> public_table_build 222 | -------------------------------------------------------------------------------- /dags/config/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/dags/config/.gitkeep -------------------------------------------------------------------------------- /dags/gcp_connection.py: -------------------------------------------------------------------------------- 1 | """ 2 | An initial workflow to enter in the conn_id for google cloud from the service account JSON 3 | file (i.e. GOOGLE_APPLICATION_CREDENTIALS environment variable specified in docker-compose.yaml) 4 | """ 5 | import os 6 | import airflow 7 | from airflow import DAG, settings 8 | from airflow.operators.python_operator import PythonOperator 9 | from datetime import datetime, timedelta 10 | from airflow.models import Connection 11 | 12 | import json 13 | 14 | DAG_OWNER_NAME = "airflow" 15 | # List of email address to send email alerts to if this job fails 16 | ALERT_EMAIL_ADDRESSES = ['luke@lukebarousse.com'] 17 | START_DATE = airflow.utils.dates.days_ago(1) 18 | 19 | default_args = { 20 | "owner": DAG_OWNER_NAME, 21 | "depends_on_past": False, 22 | "start_date": START_DATE, 23 | "email": ALERT_EMAIL_ADDRESSES, 24 | "email_on_failure": False, 25 | "email_on_retry": False, 26 | "retries": 3, 27 | "retry_delay": timedelta(minutes=5) 28 | } 29 | 30 | def add_gcp_connection(**kwargs): 31 | new_conn = Connection( 32 | conn_id="google_cloud_default", 33 | conn_type='google_cloud_platform', 34 | ) 35 | extra_field = { 36 | "extra__google_cloud_platform__scope": "https://www.googleapis.com/auth/cloud-platform", 37 | "extra__google_cloud_platform__project": "job-listings-366015", 38 | "extra__google_cloud_platform__key_path": os.environ.get("GOOGLE_APPLICATION_CREDENTIALS") 39 | } 40 | 41 | session = settings.Session() 42 | 43 | #checking if connection exist 44 | if session.query(Connection).filter(Connection.conn_id == new_conn.conn_id).first(): 45 | my_connection = session.query(Connection).filter(Connection.conn_id == new_conn.conn_id).one() 46 | my_connection.set_extra(json.dumps(extra_field)) 47 | session.add(my_connection) 48 | session.commit() 49 | else: #if it doesn't exit create one 50 | new_conn.set_extra(json.dumps(extra_field)) 51 | session.add(new_conn) 52 | session.commit() 53 | 54 | dag = DAG( 55 | "gcp_connection", 56 | default_args=default_args, 57 | schedule_interval="@once", 58 | tags=['initial-config'], 59 | ) 60 | 61 | 
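# NOTE: downstream DAGs pick this connection up by its conn_id, e.g.
#   BigQueryInsertJobOperator(..., gcp_conn_id='google_cloud_default')
# as done in bigquery_pipeline.py. To confirm the connection was created, the Airflow CLI
# can be run from a container (service name assumes the stock docker-compose setup):
#   docker-compose exec airflow-webserver airflow connections list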
with dag: 62 | activateGCP = PythonOperator( 63 | task_id='add_gcp_connection', 64 | python_callable=add_gcp_connection, 65 | provide_context=True, 66 | ) 67 | 68 | activateGCP -------------------------------------------------------------------------------- /dags/modules/country.py: -------------------------------------------------------------------------------- 1 | """ 2 | Classify countries by code and sort them by percentage of views 3 | """ 4 | 5 | import pandas as pd 6 | 7 | def view_percent(): 8 | # import different country codes 9 | codes = pd.read_csv("/opt/airflow/dags/modules/country_codes.csv") 10 | 11 | # import youtube views for my channel and calculate percentage viewed 12 | views = pd.read_csv("/opt/airflow/dags/modules/youtube_views.csv") 13 | views = views.iloc[1: , :] 14 | views = views[views.Views != 0] # removing countries with no views 15 | views = views[views.Geography != 'US'] # pulling US already 16 | views["percent"] = views['Watch time (hours)'] / views['Watch time (hours)'].sum() 17 | 18 | # no results returned from SerpApi from these countries 19 | # may consider removing from search in future, but doesn' appear to use search credits for no results 20 | no_country_results = ["MO", "IR", "SD", "SY", "SZ", "SS" ] # "Macao", "Iran", "Sudan", "Syria", "Eswatini", "South Sudan" 21 | 22 | # merge dataframes for final dataframe 23 | percent = views.merge(codes, how='left', left_on='Geography', right_on='code') 24 | percent = percent[['country','percent']] 25 | 26 | # return the dataframe 27 | return percent -------------------------------------------------------------------------------- /dags/modules/country_codes.csv: -------------------------------------------------------------------------------- 1 | code,country,country_long,year,ccTLD,notes 2 | AD,Andorra,Andorra,1974,.ad, 3 | AE,United Arab Emirates,United Arab Emirates,1974,.ae, 4 | AF,Afghanistan,Afghanistan,1974,.af, 5 | AG,Antigua and Barbuda,Antigua and Barbuda,1974,.ag, 6 | AI,Anguilla,Anguilla,1985,.ai,AI previously represented French Afars and Issas 7 | AL,Albania,Albania,1974,.al, 8 | AM,Armenia,Armenia,1992,.am, 9 | AO,Angola,Angola,1974,.ao, 10 | AQ,Antarctica,Antarctica,1974,.aq,"Covers the territories south of 60° south latitude 11 | Code taken from name in French: Antarctique" 12 | AR,Argentina,Argentina,1974,.ar, 13 | AS,American Samoa,American Samoa,1974,.as, 14 | AT,Austria,Austria,1974,.at, 15 | AU,Australia,Australia,1974,.au,Includes the Ashmore and Cartier Islands and the Coral Sea Islands 16 | AW,Aruba,Aruba,1986,.aw, 17 | AX,Åland Islands,Åland Islands,2004,.ax,An autonomous county of Finland 18 | AZ,Azerbaijan,Azerbaijan,1992,.az, 19 | BA,Bosnia and Herzegovina,Bosnia and Herzegovina,1992,.ba, 20 | BB,Barbados,Barbados,1974,.bb, 21 | BD,Bangladesh,Bangladesh,1974,.bd, 22 | BE,Belgium,Belgium,1974,.be, 23 | BF,Burkina Faso,Burkina Faso,1984,.bf,Name changed from Upper Volta (HV) 24 | BG,Bulgaria,Bulgaria,1974,.bg, 25 | BH,Bahrain,Bahrain,1974,.bh, 26 | BI,Burundi,Burundi,1974,.bi, 27 | BJ,Benin,Benin,1977,.bj,Name changed from Dahomey (DY) 28 | BL,Saint Barthélemy,Saint Barthélemy,2007,.bl, 29 | BM,Bermuda,Bermuda,1974,.bm, 30 | BN,Brunei,Brunei Darussalam,1974,.bn,Previous ISO country name: Brunei 31 | BO,Bolivia,Bolivia (Plurinational State of),1974,.bo,Previous ISO country name: Bolivia 32 | BQ,"Bonaire, Sint Eustatius and Saba","Bonaire, Sint Eustatius and Saba",2010,.bq,"Consists of three Caribbean ""special municipalities"", which are part of the Netherlands proper: Bonaire, Sint 
Eustatius, and Saba (the BES Islands) 33 | Previous ISO country name: Bonaire, Saint Eustatius and Saba 34 | BQ previously represented British Antarctic Territory" 35 | BR,Brazil,Brazil,1974,.br, 36 | BS,Bahamas,Bahamas,1974,.bs, 37 | BT,Bhutan,Bhutan,1974,.bt, 38 | BV,Bouvet Island,Bouvet Island,1974,.bv,Belongs to Norway 39 | BW,Botswana,Botswana,1974,.bw, 40 | BY,Belarus,Belarus,1974,.by,"Code taken from previous ISO country name: Byelorussian SSR (now assigned ISO 3166-3 code BYAA) 41 | Code assigned as the country was already a UN member since 1945[15]" 42 | BZ,Belize,Belize,1974,.bz, 43 | CA,Canada,Canada,1974,.ca, 44 | CC,Cocos (Keeling) Islands,Cocos (Keeling) Islands,1974,.cc,Belongs to Australia 45 | CD,"Congo, Democratic Republic of the","Congo, Democratic Republic of the",1997,.cd,Name changed from Zaire (ZR) 46 | CF,Central African Republic,Central African Republic,1974,.cf, 47 | CG,Congo,Congo,1974,.cg, 48 | CH,Switzerland,Switzerland,1974,.ch,Code taken from name in Latin: Confoederatio Helvetica 49 | CI,Côte d'Ivoire,Côte d'Ivoire,1974,.ci,ISO country name follows UN designation (common name and previous ISO country name: Ivory Coast) 50 | CK,Cook Islands,Cook Islands,1974,.ck, 51 | CL,Chile,Chile,1974,.cl, 52 | CM,Cameroon,Cameroon,1974,.cm,"Previous ISO country name: Cameroon, United Republic of" 53 | CN,China,China,1974,.cn, 54 | CO,Colombia,Colombia,1974,.co, 55 | CR,Costa Rica,Costa Rica,1974,.cr, 56 | CU,Cuba,Cuba,1974,.cu, 57 | CV,Cabo Verde,Cabo Verde,1974,.cv,"ISO country name follows UN designation (common name and previous ISO country name: Cape Verde, another previous ISO country name: Cape Verde Islands)" 58 | CW,Curaçao,Curaçao,2010,.cw, 59 | CX,Christmas Island,Christmas Island,1974,.cx,Belongs to Australia 60 | CY,Cyprus,Cyprus,1974,.cy, 61 | CZ,Czechia,Czechia,1993,.cz,Previous ISO country name: Czech Republic 62 | DE,Germany,Germany,1974,.de,"Code taken from name in German: Deutschland 63 | Code used for West Germany before 1990 (previous ISO country name: Germany, Federal Republic of)" 64 | DJ,Djibouti,Djibouti,1977,.dj,Name changed from French Afars and Issas (AI) 65 | DK,Denmark,Denmark,1974,.dk, 66 | DM,Dominica,Dominica,1974,.dm, 67 | DO,Dominican Republic,Dominican Republic,1974,.do, 68 | DZ,Algeria,Algeria,1974,.dz,"Code taken from name in Arabic الجزائر al-Djazā'ir, Algerian Arabic الدزاير al-Dzāyīr, or Berber ⴷⵣⴰⵢⵔ Dzayer" 69 | EC,Ecuador,Ecuador,1974,.ec, 70 | EE,Estonia,Estonia,1992,.ee,Code taken from name in Estonian: Eesti 71 | EG,Egypt,Egypt,1974,.eg, 72 | EH,Western Sahara,Western Sahara,1974,,"Previous ISO country name: Spanish Sahara (code taken from name in Spanish: Sahara español) 73 | .eh ccTLD has not been implemented.[16]" 74 | ER,Eritrea,Eritrea,1993,.er, 75 | ES,Spain,Spain,1974,.es,Code taken from name in Spanish: España 76 | ET,Ethiopia,Ethiopia,1974,.et, 77 | FI,Finland,Finland,1974,.fi, 78 | FJ,Fiji,Fiji,1974,.fj, 79 | FK,Falkland Islands,Falkland Islands (Malvinas),1974,.fk,ISO country name follows UN designation due to the Falkland Islands sovereignty dispute (local common name: Falkland Islands)[17] 80 | FM,Federated States of Micronesia,Micronesia (Federated States of),1986,.fm,Previous ISO country name: Micronesia 81 | FO,Faroe Islands,Faroe Islands,1974,.fo,Code taken from name in Faroese: Føroyar 82 | FR,France,France,1974,.fr,Includes Clipperton Island 83 | GA,Gabon,Gabon,1974,.ga, 84 | GB,United Kingdom,United Kingdom of Great Britain and Northern Ireland,1974,".gb 85 | (.uk)","Includes Akrotiri and Dhekelia (Sovereign 
Base Areas) 86 | Code taken from Great Britain (from official name: United Kingdom of Great Britain and Northern Ireland)[18] 87 | Previous ISO country name: United Kingdom 88 | .uk is the primary ccTLD of the United Kingdom instead of .gb (see code UK, which is exceptionally reserved)" 89 | GD,Grenada,Grenada,1974,.gd, 90 | GE,Georgia,Georgia,1992,.ge,GE previously represented Gilbert and Ellice Islands 91 | GF,French Guiana,French Guiana,1974,.gf,Code taken from name in French: Guyane française 92 | GG,Guernsey,Guernsey,2006,.gg,A British Crown Dependency 93 | GH,Ghana,Ghana,1974,.gh, 94 | GI,Gibraltar,Gibraltar,1974,.gi, 95 | GL,Greenland,Greenland,1974,.gl, 96 | GM,Gambia,Gambia,1974,.gm, 97 | GN,Guinea,Guinea,1974,.gn, 98 | GP,Guadeloupe,Guadeloupe,1974,.gp, 99 | GQ,Equatorial Guinea,Equatorial Guinea,1974,.gq,Code taken from name in French: Guinée équatoriale 100 | GR,Greece,Greece,1974,.gr, 101 | GS,South Georgia and the South Sandwich Islands,South Georgia and the South Sandwich Islands,1993,.gs, 102 | GT,Guatemala,Guatemala,1974,.gt, 103 | GU,Guam,Guam,1974,.gu, 104 | GW,Guinea-Bissau,Guinea-Bissau,1974,.gw, 105 | GY,Guyana,Guyana,1974,.gy, 106 | HK,Hong Kong,Hong Kong,1974,.hk,Hong Kong is officially a Special Administrative Region of the People's Republic of China since 1 July 1997 107 | HM,Heard Island and McDonald Islands,Heard Island and McDonald Islands,1974,.hm,Belongs to Australia 108 | HN,Honduras,Honduras,1974,.hn, 109 | HR,Croatia,Croatia,1992,.hr,Code taken from name in Croatian: Hrvatska 110 | HT,Haiti,Haiti,1974,.ht, 111 | HU,Hungary,Hungary,1974,.hu, 112 | ID,Indonesia,Indonesia,1974,.id, 113 | IE,Ireland,Ireland,1974,.ie, 114 | IL,Israel,Israel,1974,.il, 115 | IM,Isle of Man,Isle of Man,2006,.im,A British Crown Dependency 116 | IN,India,India,1974,.in, 117 | IO,British Indian Ocean Territory,British Indian Ocean Territory,1974,.io, 118 | IQ,Iraq,Iraq,1974,.iq, 119 | IR,"Iran ",Iran (Islamic Republic of),1974,.ir,Previous ISO country name: Iran 120 | IS,Iceland,Iceland,1974,.is,Code taken from name in Icelandic: Ísland 121 | IT,Italy,Italy,1974,.it, 122 | JE,Jersey,Jersey,2006,.je,A British Crown Dependency 123 | JM,Jamaica,Jamaica,1974,.jm, 124 | JO,Jordan,Jordan,1974,.jo, 125 | JP,Japan,Japan,1974,.jp, 126 | KE,Kenya,Kenya,1974,.ke, 127 | KG,Kyrgyzstan,Kyrgyzstan,1992,.kg, 128 | KH,Cambodia,Cambodia,1974,.kh,"Code taken from former name: Khmer Republic 129 | Previous ISO country name: Kampuchea, Democratic" 130 | KI,Kiribati,Kiribati,1979,.ki,Name changed from Gilbert Islands (GE) 131 | KM,Comoros,Comoros,1974,.km,"Code taken from name in Comorian: Komori 132 | Previous ISO country name: Comoro Islands" 133 | KN,Saint Kitts and Nevis,Saint Kitts and Nevis,1974,.kn,Previous ISO country name: Saint Kitts-Nevis-Anguilla 134 | KP,North Korea,Korea (Democratic People's Republic of),1974,.kp,ISO country name follows UN designation (common name: North Korea) 135 | KR,South Korea,"Korea, Republic of",1974,.kr,ISO country name follows UN designation (common name: South Korea) 136 | KW,Kuwait,Kuwait,1974,.kw, 137 | KY,Cayman Islands,Cayman Islands,1974,.ky, 138 | KZ,Kazakhstan,Kazakhstan,1992,.kz,Previous ISO country name: Kazakstan 139 | LA,Laos,Lao People's Democratic Republic,1974,.la,ISO country name follows UN designation (common name and previous ISO country name: Laos) 140 | LB,Lebanon,Lebanon,1974,.lb, 141 | LC,Saint Lucia,Saint Lucia,1974,.lc, 142 | LI,Liechtenstein,Liechtenstein,1974,.li, 143 | LK,Sri Lanka,Sri Lanka,1974,.lk, 144 | LR,Liberia,Liberia,1974,.lr, 
145 | LS,Lesotho,Lesotho,1974,.ls, 146 | LT,Lithuania,Lithuania,1992,.lt, 147 | LU,Luxembourg,Luxembourg,1974,.lu, 148 | LV,Latvia,Latvia,1992,.lv, 149 | LY,Libya,Libya,1974,.ly,Previous ISO country name: Libyan Arab Jamahiriya 150 | MA,Morocco,Morocco,1974,.ma,Code taken from name in French: Maroc 151 | MC,Monaco,Monaco,1974,.mc, 152 | MD,Moldova,"Moldova, Republic of",1992,.md,Previous ISO country name: Moldova (briefly from 2008 to 2009) 153 | ME,Montenegro,Montenegro,2006,.me, 154 | MF,Saint Martin,Saint Martin (French part),2007,.mf,The Dutch part of Saint Martin island is assigned code SX 155 | MG,Madagascar,Madagascar,1974,.mg, 156 | MH,Marshall Islands,Marshall Islands,1986,.mh, 157 | MK,Macedonia (FYROM),North Macedonia,1993,.mk,"Code taken from name in Macedonian: Severna Makedonija 158 | Previous ISO country name: Macedonia, the former Yugoslav Republic of (designated as such due to Macedonia naming dispute)" 159 | ML,Mali,Mali,1974,.ml, 160 | MM,Myanmar,Myanmar,1989,.mm,Name changed from Burma (BU) 161 | MN,Mongolia,Mongolia,1974,.mn, 162 | MO,Macao,Macao,1974,.mo,Previous ISO country name: Macau; Macao is officially a Special Administrative Region of the People's Republic of China since 20 December 1999 163 | MP,Northern Mariana Islands,Northern Mariana Islands,1986,.mp, 164 | MQ,Martinique,Martinique,1974,.mq, 165 | MR,Mauritania,Mauritania,1974,.mr, 166 | MS,Montserrat,Montserrat,1974,.ms, 167 | MT,Malta,Malta,1974,.mt, 168 | MU,Mauritius,Mauritius,1974,.mu, 169 | MV,Maldives,Maldives,1974,.mv, 170 | MW,Malawi,Malawi,1974,.mw, 171 | MX,Mexico,Mexico,1974,.mx, 172 | MY,Malaysia,Malaysia,1974,.my, 173 | MZ,Mozambique,Mozambique,1974,.mz, 174 | NA,Namibia,Namibia,1974,.na, 175 | NC,New Caledonia,New Caledonia,1974,.nc, 176 | NE,Niger,Niger,1974,.ne, 177 | NF,Norfolk Island,Norfolk Island,1974,.nf,Belongs to Australia 178 | NG,Nigeria,Nigeria,1974,.ng, 179 | NI,Nicaragua,Nicaragua,1974,.ni, 180 | NL,Netherlands,Netherlands,1974,.nl,"Officially includes the islands Bonaire, Saint Eustatius and Saba, which also have code BQ in ISO 3166-1. 
Within ISO 3166-2, Aruba (AW), Curaçao (CW), and Sint Maarten (SX) are also coded as subdivisions of NL.[19]" 181 | NO,Norway,Norway,1974,.no, 182 | NP,Nepal,Nepal,1974,.np, 183 | NR,Nauru,Nauru,1974,.nr, 184 | NU,Niue,Niue,1974,.nu,Previous ISO country name: Niue Island 185 | NZ,New Zealand,New Zealand,1974,.nz, 186 | OM,Oman,Oman,1974,.om, 187 | PA,Panama,Panama,1974,.pa, 188 | PE,Peru,Peru,1974,.pe, 189 | PF,French Polynesia,French Polynesia,1974,.pf,Code taken from name in French: Polynésie française 190 | PG,Papua New Guinea,Papua New Guinea,1974,.pg, 191 | PH,Philippines,Philippines,1974,.ph, 192 | PK,Pakistan,Pakistan,1974,.pk, 193 | PL,Poland,Poland,1974,.pl, 194 | PM,Saint Pierre and Miquelon,Saint Pierre and Miquelon,1974,.pm, 195 | PN,Pitcairn,Pitcairn,1974,.pn,Previous ISO country name: Pitcairn Islands 196 | PR,Puerto Rico,Puerto Rico,1974,.pr, 197 | PS,Palestine,"Palestine, State of",1999,.ps,"Previous ISO country name: Palestinian Territory, Occupied 198 | Consists of the West Bank and the Gaza Strip" 199 | PT,Portugal,Portugal,1974,.pt, 200 | PW,Palau,Palau,1986,.pw, 201 | PY,Paraguay,Paraguay,1974,.py, 202 | QA,Qatar,Qatar,1974,.qa, 203 | RE,Réunion,Réunion,1974,.re, 204 | RO,Romania,Romania,1974,.ro, 205 | RS,Serbia,Serbia,2006,.rs,Republic of Serbia 206 | RU,Russia,Russian Federation,1992,.ru,ISO country name follows UN designation (common name: Russia) 207 | RW,Rwanda,Rwanda,1974,.rw, 208 | SA,Saudi Arabia,Saudi Arabia,1974,.sa, 209 | SB,Solomon Islands,Solomon Islands,1974,.sb,Code taken from former name: British Solomon Islands 210 | SC,Seychelles,Seychelles,1974,.sc, 211 | SD,Sudan,Sudan,1974,.sd, 212 | SE,Sweden,Sweden,1974,.se, 213 | SG,Singapore,Singapore,1974,.sg, 214 | SH,"Saint Helena, Ascension and Tristan da Cunha","Saint Helena, Ascension and Tristan da Cunha",1974,.sh,Previous ISO country name: Saint Helena. 
215 | SI,Slovenia,Slovenia,1992,.si, 216 | SJ,Svalbard and Jan Mayen,Svalbard and Jan Mayen,1974,.sj,"Previous ISO name: Svalbard and Jan Mayen Islands 217 | Consists of two Arctic territories of Norway: Svalbard and Jan Mayen" 218 | SK,Slovakia,Slovakia,1993,.sk,SK previously represented the Kingdom of Sikkim 219 | SL,Sierra Leone,Sierra Leone,1974,.sl, 220 | SM,San Marino,San Marino,1974,.sm, 221 | SN,Senegal,Senegal,1974,.sn, 222 | SO,Somalia,Somalia,1974,.so, 223 | SR,Suriname,Suriname,1974,.sr,Previous ISO country name: Surinam 224 | SS,South Sudan,South Sudan,2011,.ss, 225 | ST,Sao Tome and Principe,Sao Tome and Principe,1974,.st, 226 | SV,El Salvador,El Salvador,1974,.sv, 227 | SX,Sint Maarten,Sint Maarten (Dutch part),2010,.sx,The French part of Saint Martin island is assigned code MF 228 | SY,Syria,Syrian Arab Republic,1974,.sy,ISO country name follows UN designation (common name and previous ISO country name: Syria) 229 | SZ,Eswatini,Eswatini,1974,.sz,Previous ISO country name: Swaziland 230 | TC,Turks and Caicos Islands,Turks and Caicos Islands,1974,.tc, 231 | TD,Chad,Chad,1974,.td,Code taken from name in French: Tchad 232 | TF,French Southern Territories,French Southern Territories,1979,.tf,"Covers the French Southern and Antarctic Lands except Adélie Land 233 | Code taken from name in French: Terres australes françaises" 234 | TG,Togo,Togo,1974,.tg, 235 | TH,Thailand,Thailand,1974,.th, 236 | TJ,Tajikistan,Tajikistan,1992,.tj, 237 | TK,Tokelau,Tokelau,1974,.tk,Previous ISO country name: Tokelau Islands 238 | TL,Timor-Leste,Timor-Leste,2002,.tl,Name changed from East Timor (TP) 239 | TM,Turkmenistan,Turkmenistan,1992,.tm, 240 | TN,Tunisia,Tunisia,1974,.tn, 241 | TO,Tonga,Tonga,1974,.to, 242 | TR,Turkey,Türkiye,1974,.tr,Previous ISO country name: Turkey 243 | TT,Trinidad and Tobago,Trinidad and Tobago,1974,.tt, 244 | TV,Tuvalu,Tuvalu,1977,.tv, 245 | TW,Taiwan,"Taiwan, Province of China",1974,.tw,"Covers the current jurisdiction of the Republic of China 246 | ISO country name follows UN designation (due to political status of Taiwan within the UN)[18] (common name: Taiwan)" 247 | TZ,Tanzania,"Tanzania, United Republic of",1974,.tz, 248 | UA,Ukraine,Ukraine,1974,.ua,"Previous ISO country name: Ukrainian SSR 249 | Code assigned as the country was already a UN member since 1945[15]" 250 | UG,Uganda,Uganda,1974,.ug, 251 | UM,United States Minor Outlying Islands,United States Minor Outlying Islands,1986,,"Consists of nine minor insular areas of the United States: Baker Island, Howland Island, Jarvis Island, Johnston Atoll, Kingman Reef, Midway Islands, Navassa Island, Palmyra Atoll, and Wake Island 252 | .um ccTLD was revoked in 2007[20]The United States Department of State uses the following user assigned alpha-2 codes for the nine territories, respectively, XB, XH, XQ, XU, XM, QM, XV, XL, and QW.[21]" 253 | US,United States,United States of America,1974,.us,Previous ISO country name: United States 254 | UY,Uruguay,Uruguay,1974,.uy, 255 | UZ,Uzbekistan,Uzbekistan,1992,.uz, 256 | VA,Holy See,Holy See,1974,.va,"Covers Vatican City, territory of the Holy See 257 | Previous ISO country names: Vatican City State (Holy See) and Holy See (Vatican City State)" 258 | VC,Saint Vincent and the Grenadines,Saint Vincent and the Grenadines,1974,.vc, 259 | VE,Venezuela,Venezuela (Bolivarian Republic of),1974,.ve,Previous ISO country name: Venezuela 260 | VG,British Virgin Islands,Virgin Islands (British),1974,.vg, 261 | VI,U.S. 
Virgin Islands,Virgin Islands (U.S.),1974,.vi, 262 | VN,Vietnam,Viet Nam,1974,.vn,"ISO country name follows UN designation (common name: Vietnam) 263 | Code used for Republic of Viet Nam (common name: South Vietnam) before 1977" 264 | VU,Vanuatu,Vanuatu,1980,.vu,Name changed from New Hebrides (NH) 265 | WF,Wallis and Futuna,Wallis and Futuna,1974,.wf,Previous ISO country name: Wallis and Futuna Islands 266 | WS,Samoa,Samoa,1974,.ws,Code taken from former name: Western Samoa 267 | YE,Yemen,Yemen,1974,.ye,"Previous ISO country name: Yemen, Republic of (for three years after the unification) 268 | Code used for North Yemen before 1990" 269 | YT,Mayotte,Mayotte,1993,.yt, 270 | ZA,South Africa,South Africa,1974,.za,Code taken from name in Dutch: Zuid-Afrika 271 | ZM,Zambia,Zambia,1974,.zm, 272 | ZW,Zimbabwe,Zimbabwe,1980,.zw,Name changed from Southern Rhodesia (RH) -------------------------------------------------------------------------------- /dags/modules/job_title.py: -------------------------------------------------------------------------------- 1 | """ 2 | Transform non-unique job titles into a standard form using BART Zero Shot Text Classification 3 | Eg. 'SR. DATA ENGR' -> 'Senior Data Engineer' 4 | 5 | Reference: https://huggingface.co/facebook/bart-large-mnli?candidateLabels=Data+Engineer%2C+Data+Scientist%2C+Data+Analyst%2C+Software+Engineer%2C+Business+Analyst%2C+Machine+Learning+Engineer%2C+Senior+Data+Engineer%2C+Senior+Data+Scientist%2C+Senior+Data+Analyst%2C+Cloud+Engineer&multiClass=false&text=Software+%2F+Data+Engineer 6 | """ 7 | import pandas as pd 8 | from google.cloud import bigquery 9 | 10 | # for large datasets store data in google storage for improved speed using the following command 11 | # %bigquery --project job-listings-366015 --use_bqstorage_api 12 | 13 | # for apple silicone m1 macs, use the following to enable MPS 14 | # import torch 15 | # mps_device = torch.device("mps") 16 | # import os 17 | # os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" 18 | 19 | from transformers import pipeline 20 | 21 | def transform_job_title(): 22 | # Initialize the pipeline 23 | classifier = pipeline("zero-shot-classification", 24 | model="facebook/bart-large-mnli", 25 | # device=mps_device # couldn't get this to work 26 | ) 27 | 28 | # Query to get all unique job titles 29 | client = bigquery.Client() 30 | query_job = client.query( 31 | """ 32 | WITH jobs_all AS ( 33 | SELECT job_title, COUNT(job_title) AS job_title_count 34 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 35 | GROUP BY job_title 36 | ORDER BY job_title_count DESC 37 | ), jobs_clean AS ( 38 | SELECT job_title, job_title_clean 39 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_job_title 40 | ), jobs_unclean AS ( 41 | SELECT job_title, job_title_clean, job_title_count 42 | FROM jobs_all 43 | LEFT JOIN jobs_clean 44 | USING (job_title) 45 | WHERE job_title_clean IS NULL 46 | ) 47 | 48 | SELECT * 49 | FROM jobs_unclean 50 | ORDER BY job_title_count DESC 51 | """ 52 | ) 53 | 54 | jobs_df = query_job.to_dataframe() 55 | print("Starting BART pipeline to transform unique job titles: ", len(jobs_df)) 56 | 57 | candidate_labels = ['Data Engineer', 'Data Scientist', 'Data Analyst', 'Software Engineer', 'Business Analyst', 'Machine Learning Engineer', 'Senior Data Engineer', 'Senior Data Scientist', 'Senior Data Analyst', 'Cloud Engineer'] 58 | 59 | # Iterate over a dataframe 60 | for index, row in jobs_df.iterrows(): 61 | sequence_to_classify = row['job_title'] 62 | try: 63 | results = 
classifier(sequence_to_classify, candidate_labels) 64 | jobs_df.at[index, 'job_title_clean'] = results['labels'][0] 65 | except ValueError: #raised when the input is empty 66 | jobs_df.at[index, 'job_title_clean'] = None 67 | 68 | # Save to csv 69 | # jobs_df.to_csv('jobs_unclean.csv', index=False) 70 | 71 | # Clean up the data 72 | jobs_final = jobs_df[~jobs_df.job_title_clean.isnull()] 73 | 74 | jobs_final = jobs_final[['job_title', 'job_title_clean']] 75 | 76 | print("BART complete, uploading to BigQuery") 77 | 78 | # Write to BigQuery 79 | jobs_final.to_gbq(destination_table='gsearch_job_listings_clean.gsearch_job_title', 80 | project_id='job-listings-366015', 81 | if_exists='append') 82 | 83 | return -------------------------------------------------------------------------------- /dags/modules/youtube_views.csv: -------------------------------------------------------------------------------- 1 | Geography,Views,Average view duration,Watch time (hours) 2 | Total,12170978,0:03:05,626923.2823 3 | IN,3190180,0:01:53,100887.2259 4 | US,2204092,0:04:03,149228.0248 5 | GB,434350,0:03:37,26289.3889 6 | CA,376269,0:03:42,23254.0882 7 | DE,318350,0:03:33,18881.0136 8 | ID,267963,0:02:36,11651.5774 9 | PH,254697,0:03:12,13648.8765 10 | AU,211633,0:03:41,13035.0428 11 | PK,201570,0:02:10,7317.836 12 | BR,192415,0:03:31,11296.5917 13 | MY,159853,0:03:01,8043.9433 14 | NG,155099,0:04:03,10475.799 15 | SG,152557,0:03:14,8224.0907 16 | BD,150127,0:02:15,5635.1753 17 | MX,130549,0:03:56,8594.2263 18 | FR,129030,0:03:15,7012.8083 19 | PL,126229,0:03:19,6979.2326 20 | ZA,125037,0:03:53,8098.7245 21 | ES,113945,0:03:24,6460.713 22 | NL,113615,0:03:37,6872.1942 23 | IT,106855,0:03:08,5582.0602 24 | AE,103310,0:02:51,4911.8902 25 | VN,101248,0:02:58,5027.3882 26 | EG,94138,0:03:15,5114.5461 27 | TR,88895,0:02:59,4426.3234 28 | TH,88038,0:03:06,4552.2036 29 | RU,86039,0:02:49,4060.3313 30 | SA,79865,0:03:15,4331.0637 31 | KE,71150,0:03:58,4712.6249 32 | AR,69718,0:03:51,4475.5562 33 | MA,66406,0:03:05,3421.3642 34 | NP,63127,0:02:12,2330.9322 35 | HK,62449,0:03:15,3399.5587 36 | JP,60438,0:03:11,3219.2251 37 | CO,60109,0:03:51,3858.7173 38 | SE,59469,0:03:46,3742.9935 39 | LK,59209,0:02:12,2178.0745 40 | PT,55587,0:03:43,3453.9172 41 | KR,51519,0:02:27,2112.2578 42 | TW,50265,0:03:03,2557.0195 43 | GR,48726,0:03:22,2742.4291 44 | IL,42887,0:03:14,2318.9846 45 | RO,42570,0:03:16,2324.3448 46 | CL,42039,0:03:48,2672.3257 47 | IE,41831,0:03:47,2647.9054 48 | CH,41585,0:03:45,2603.8433 49 | PE,38164,0:03:53,2474.2765 50 | BE,37598,0:03:33,2232.1339 51 | GH,36178,0:04:07,2483.7241 52 | NZ,35185,0:03:45,2199.9395 53 | UA,35063,0:03:01,1768.1397 54 | DK,32240,0:03:48,2048.9427 55 | AT,29772,0:03:39,1815.4341 56 | NO,29201,0:03:53,1892.9639 57 | CZ,28820,0:03:38,1751.4573 58 | DZ,28394,0:03:05,1459.1451 59 | HU,28100,0:03:29,1633.8395 60 | FI,26480,0:03:52,1711.1062 61 | RS,22802,0:03:18,1257.8289 62 | TN,19754,0:03:09,1037.6083 63 | ET,19385,0:02:49,912.4085 64 | QA,17382,0:02:47,809.3941 65 | BG,16238,0:03:30,948.2542 66 | CN,14272,0:03:36,857.079 67 | CR,13853,0:04:19,1000.3142 68 | HR,12718,0:03:31,748.3247 69 | LT,12285,0:03:45,770.496 70 | DO,11992,0:04:02,806.6541 71 | KZ,11553,0:03:22,649.315 72 | SO,11023,0:02:33,469.1683 73 | MM,10875,0:02:36,471.566 74 | SK,10680,0:03:20,595.3328 75 | IQ,9999,0:02:46,461.804 76 | KW,9936,0:03:00,496.8222 77 | EC,9912,0:03:46,624.4362 78 | KH,9590,0:02:49,451.5871 79 | JM,9515,0:04:10,661.4757 80 | JO,9030,0:03:10,477.5539 81 | LB,9003,0:03:26,515.7938 82 | 
UG,8911,0:03:59,593.1003 83 | AZ,8752,0:02:53,421.3726 84 | OM,8695,0:02:32,369.5012 85 | TT,8132,0:03:44,507.585 86 | GE,6604,0:03:23,373.057 87 | ZW,6446,0:03:58,427.5399 88 | UZ,6338,0:02:44,290.0008 89 | CM,6078,0:03:41,374.2997 90 | VE,6073,0:04:21,441.1649 91 | GT,5975,0:04:00,398.7269 92 | MU,5883,0:02:24,236.3667 93 | PA,5601,0:04:16,398.6883 94 | TZ,5359,0:03:22,301.0375 95 | BH,5171,0:02:40,230.0382 96 | SI,5065,0:03:16,276.8601 97 | SD,5040,0:03:27,290.4987 98 | BO,4725,0:03:56,310.7617 99 | ZM,4517,0:03:55,295.029 100 | LV,4492,0:03:46,282.2924 101 | CY,4472,0:03:37,270.3176 102 | PR,4398,0:04:05,300.3237 103 | EE,4337,0:03:46,272.5391 104 | BA,4110,0:03:03,209.9061 105 | BY,3974,0:02:56,195.1243 106 | UY,3703,0:03:47,233.9219 107 | MK,3638,0:03:29,211.826 108 | IR,2859,0:02:49,134.9042 109 | AL,2789,0:02:28,114.8843 110 | BW,2686,0:03:48,170.6096 111 | MN,2594,0:02:36,112.7475 112 | SV,2027,0:04:14,143.2292 113 | RW,1934,0:03:44,120.4006 114 | NA,1719,0:04:01,115.2766 115 | HN,1495,0:03:54,97.224 116 | SN,1456,0:03:06,75.2545 117 | PY,1388,0:03:32,82.0983 118 | LU,1353,0:03:27,78.0674 119 | SY,1257,0:02:54,60.7869 120 | BN,1180,0:02:41,52.8582 121 | AM,1121,0:02:54,54.3131 122 | CI,1060,0:03:08,55.3612 123 | MT,944,0:03:39,57.4443 124 | MZ,922,0:04:03,62.3198 125 | NI,881,0:04:16,62.7468 126 | MD,861,0:02:55,42.0887 127 | LY,728,0:03:35,43.6405 128 | MV,700,0:02:27,28.6537 129 | PS,656,0:02:24,26.3179 130 | IS,552,0:03:17,30.311 131 | YE,550,0:03:05,28.414 132 | KG,440,0:03:20,24.459 133 | AO,436,0:04:28,32.5251 134 | BB,429,0:04:24,31.5008 135 | AF,400,0:02:12,14.7075 136 | MO,321,0:03:26,18.3997 137 | BS,287,0:03:40,17.5762 138 | MW,285,0:04:02,19.2154 139 | SL,281,0:03:58,18.6413 140 | GY,261,0:03:46,16.4035 141 | CD,207,0:03:14,11.1726 142 | TG,202,0:03:34,12.0335 143 | ME,198,0:03:42,12.2303 144 | HT,181,0:03:14,9.7964 145 | FJ,175,0:02:58,8.6736 146 | GM,119,0:04:24,8.7383 147 | GU,89,0:04:50,7.1788 148 | LC,89,0:05:43,8.4811 149 | BT,77,0:02:03,2.6463 150 | SR,71,0:03:29,4.1407 151 | ML,62,0:03:52,4.0025 152 | PG,58,0:04:10,4.0393 153 | MR,55,0:01:04,0.9891 154 | SZ,41,0:03:55,2.677 155 | TJ,40,0:01:59,1.3255 156 | LS,38,0:03:48,2.409 157 | LA,37,0:02:43,1.6791 158 | VI,33,0:02:11,1.2018 159 | AG,32,0:02:57,1.5744 160 | MG,24,0:02:44,1.0949 161 | GA,21,0:03:50,1.3448 162 | CW,20,0:05:28,1.8255 163 | KY,15,0:10:18,2.5775 164 | VU,14,0:00:01,0.0047 165 | EH,13,0:04:02,0.8765 166 | VC,13,0:02:58,0.6463 167 | GN,12,0:06:01,1.2046 168 | RE,12,0:03:51,0.7708 169 | SS,12,0:05:18,1.0615 170 | GP,11,0:04:07,0.7573 171 | MP,11,0:06:23,1.1706 172 | BF,10,0:05:34,0.93 173 | BJ,10,0:03:13,0.5369 174 | DJ,10,0:00:41,0.1158 175 | GD,10,0:01:28,0.2458 176 | LR,10,0:04:07,0.6873 177 | AD,0,,0 178 | AI,0,,0 179 | AW,0,,0 180 | BI,0,,0 181 | BQ,0,,0 182 | BZ,0,,0 183 | CG,0,,0 184 | CU,0,,0 185 | CV,0,,0 186 | DM,0,,0 187 | GF,0,,0 188 | GG,0,,0 189 | GI,0,,0 190 | GQ,0,,0 191 | IM,0,,0 192 | KN,0,,0 193 | MF,0,,0 194 | MQ,0,,0 195 | NC,0,,0 196 | NE,0,,0 197 | SC,0,,0 198 | TL,0,,0 199 | VG,0,,0 200 | -------------------------------------------------------------------------------- /dags/serpapi_bigquery.py: -------------------------------------------------------------------------------- 1 | """ 2 | An operations workflow to collect data science job postings from SerpApi and 3 | insert into BigQuery. 
4 | """ 5 | import warnings 6 | warnings.simplefilter(action='ignore', category=FutureWarning) # stop getting Pandas FutureWarning's 7 | 8 | import time 9 | import json 10 | from datetime import datetime, timedelta 11 | import pandas as pd 12 | from numpy.random import choice 13 | from serpapi import GoogleSearch 14 | from config import config # contains secret keys in config.py 15 | from google.cloud import bigquery 16 | import airflow 17 | from airflow import DAG 18 | from airflow.operators.dummy_operator import DummyOperator 19 | from airflow.operators.python_operator import PythonOperator 20 | from modules.country import view_percent 21 | 22 | # 'False' DAG is ready for operation; i.e., 'True' DAG runs using no SerpApi credits or BigQuery requests 23 | TESTING_DAG = False 24 | # Minutes to sleep on an error 25 | ERROR_SLEEP_MIN = 5 26 | # Max number of searches to perform daily 27 | MAX_SEARCHES = 1500 28 | # Who is listed as the owner of this DAG in the Airflow Web Server 29 | DAG_OWNER_NAME = "airflow" 30 | # List of email address to send email alerts to if this job fails 31 | ALERT_EMAIL_ADDRESSES = ['luke@lukebarousse.com'] 32 | START_DATE = airflow.utils.dates.days_ago(1) 33 | 34 | 35 | default_args = { 36 | 'owner': DAG_OWNER_NAME, 37 | 'depends_on_past': False, 38 | 'start_date': START_DATE, 39 | 'email': ALERT_EMAIL_ADDRESSES, 40 | 'email_on_failure': True, 41 | 'email_on_retry': False, 42 | 'retries': 0, # removing retries to not call insert duplicates into BigQuery 43 | 'retry_delay': timedelta(minutes=5), 44 | # 'queue': 'bash_queue', 45 | # 'pool': 'backfill', 46 | # 'priority_weight': 10, 47 | # 'end_date': datetime(2022, 1, 1), 48 | # 'wait_for_downstream': False, 49 | # 'dag': dag, 50 | # 'sla': timedelta(hours=2), 51 | # 'execution_timeout': timedelta(seconds=300), 52 | # 'on_failure_callback': some_function, 53 | # 'on_success_callback': some_other_function, 54 | # 'on_retry_callback': another_function, 55 | # 'sla_miss_callback': yet_another_function, 56 | # 'trigger_rule': 'all_success' 57 | } 58 | 59 | dag = DAG( 60 | 'serpapi_bigquery', 61 | description='Call SerpApi and inserts results into Bigquery', 62 | default_args=default_args, 63 | schedule_interval='0 6 * * *', 64 | catchup=False, 65 | tags=['data-pipeline-dag'], 66 | max_active_tasks = 3 67 | ) 68 | 69 | with dag: 70 | 71 | search_terms = ['Data Analyst', 'Data Scientist', 'Data Engineer'] 72 | search_locations_us = ["New York, United States", "California, United States", 73 | "Texas, United States", "Illinois, United States", "Florida, United States"] 74 | # data table from 'modules/country.py' of countries and relative weighted percentages to use 75 | country_percent = view_percent() 76 | 77 | start = DummyOperator( 78 | task_id='start', 79 | dag=dag) 80 | 81 | def _bigquery_json(results, search_term, search_location, result_offset, error): 82 | """ 83 | Submit JSON return from SerpAPI to BigQuery {gsearch_jobs_all_json} as a backup to hold the original data 84 | 85 | Args: 86 | results : json 87 | JSON return from SerpAPI 88 | search_term : str 89 | Search term 90 | search_location : str 91 | Search location 92 | result_offset : int 93 | Parameter to offset the results returned from SerpApi; used for pagination 94 | error : bool 95 | Flag to indicate if results where returned from SerpApi or not 96 | 97 | Returns: 98 | None 99 | """ 100 | try: 101 | # extract metadata from results 102 | try: 103 | search_id = results['search_metadata']['id'] 104 | search_time = results['search_metadata']['created_at'] 
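# 'created_at' comes back from SerpApi as a plain string (e.g. '2023-01-01 06:00:00 UTC', illustrative value);
# parse it into a datetime using the matching format below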
105 | search_time = datetime.strptime(search_time, "%Y-%m-%d %H:%M:%S %Z") 106 | search_time_taken = results['search_metadata']['total_time_taken'] 107 | search_language = results['search_parameters']['hl'] 108 | except Exception as e: 109 | search_id = None 110 | search_time = None 111 | search_time_taken = None 112 | search_language = None 113 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 114 | print(f"JSON - SerpAPI ERROR!!!: {search_term} in {search_location} JSON file fields have changed!!!") 115 | print("Following error returned:") 116 | print(e) 117 | 118 | # convert search results and metadata to a dataframe 119 | df = pd.DataFrame({'search_term': [search_term], 120 | 'search_location': [search_location], 121 | 'result_offset': [result_offset], 122 | 'error': [error], 123 | 'search_id': [search_id], 124 | 'search_time': [search_time], 125 | 'search_time_taken': [search_time_taken], 126 | 'search_language': [search_language], 127 | 'results': [json.dumps(results)] 128 | }) 129 | 130 | # submit dataframe to BigQuery 131 | table_id = config.table_id_json 132 | client = bigquery.Client() 133 | table = client.get_table(table_id) 134 | errors = client.insert_rows_from_dataframe(table, df) 135 | if errors == [[]]: 136 | print(f"JSON - DATA LOADED: {search_term} in {search_location} loaded into BigQuery {table_id}") 137 | else: 138 | print(f"JSON - ERROR!!!: {search_term} in {search_location} NOT loaded into BigQuery {table_id}!!!") 139 | print("Following error returned from the googles:") 140 | print(errors) 141 | except UnboundLocalError as ule: 142 | # TODO: Need to build something to catch this error sooner 143 | # GoogleSearch(params) code returns blank results and then get error for "'df' referenced before assignment" in 'errors = client.insert_rows_from_dataframe(table, df)' (i.e., SerpApi issue) 144 | print(f"JSON - SerpApi ERROR!!!: Search {result_offset} of {search_term} in {search_location} yielded no results from SerpApi and FAILED load into BigQuery!!!") 145 | print("Following error returned:") 146 | print(ule) 147 | # no sleep requirement as usually an issue with search term provided 148 | except TimeoutError as te: 149 | # client.get_table(table_id) code returns TimeOut Exception with no results... 
so also adding sleep (i.e., BigQuery issue) 150 | print(f"JSON - BigQuery ERROR!!!: {search_term} in {search_location} had TimeOutError and FAILED to load into BigQuery {table_id}!!!") 151 | print("Following error returned:") 152 | print(te) 153 | # sleep removed for first implementation 12/31/2022 154 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 155 | # time.sleep(ERROR_SLEEP_MIN * 60) 156 | except Exception as e: 157 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 158 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 159 | print(f"JSON - BigQuery ERROR!!!: {search_term} in {search_location} had an error that needs to be investigated!!!") 160 | print("Following error returned:") 161 | print(e) 162 | # sleep removed for testing 163 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 164 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 165 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 166 | # time.sleep(ERROR_SLEEP_MIN * 60) 167 | 168 | return 169 | 170 | def _serpapi_bigquery(search_term, search_location, search_time): 171 | """ 172 | Function to call SerpApi and insert results into BigQuery {gsearch_jobs_all} used by the us_jobs 173 | and non_us_jobs tasks 174 | 175 | Args: 176 | search_term : str 177 | Search term to search for 178 | search_location : str 179 | Search location to search in 180 | search_time : str 181 | Time period to search for (e.g. 'past 24 hours') 182 | 183 | Returns: 184 | num_searches : int 185 | Number of searches performed for this search term and location 186 | 187 | Source: 188 | https://serpapi.com/google-jobs-results 189 | https://cloud.google.com/bigquery/docs/reference/libraries 190 | """ 191 | if not TESTING_DAG: 192 | next_page_token = None 193 | num = 0 194 | has_more_results = True 195 | 196 | while has_more_results: 197 | print(f"START API CALL: {search_term} in {search_location} on search {num}") 198 | 199 | error = False 200 | params = { 201 | "api_key": config.serpapi_key, 202 | "device": "desktop", 203 | "engine": "google_jobs", 204 | "google_domain": "google.com", 205 | "q": search_term, 206 | "hl": "en", 207 | "gl": "us", 208 | "location": search_location, 209 | "chips": search_time, 210 | } 211 | 212 | if next_page_token: 213 | params["next_page_token"] = next_page_token 214 | 215 | try: 216 | search = GoogleSearch(params) 217 | results = search.get_dict() 218 | 219 | if 'error' in results: 220 | print(f"END SerpApi CALLS: {search_term} in {search_location} on search {num}") 221 | error = True 222 | _bigquery_json(results, search_term, search_location, num, error) 223 | break 224 | 225 | print(f"SUCCESS SerpApi CALL: {search_term} in {search_location} on search {num}") 226 | _bigquery_json(results, search_term, search_location, num, error) 227 | 228 | # Process results and insert into BigQuery 229 | jobs = results['jobs_results'] 230 | jobs = pd.DataFrame(jobs) 231 | jobs = pd.concat([pd.DataFrame(jobs), 232 | pd.json_normalize(jobs['detected_extensions'])], 233 | axis=1).drop('detected_extensions', axis=1) 234 | jobs['date_time'] = datetime.utcnow() 235 | 236 | if num == 0: 237 | jobs_all = jobs 238 | else: 239 | jobs_all = pd.concat([jobs_all, jobs]) 240 | 241 | jobs_all['search_term'] = search_term 242 | jobs_all['search_location'] = search_location 243 | 244 | # Check for next_page_token 245 | if 'serpapi_pagination' in results and 'next_page_token' in 
results['serpapi_pagination']: 246 | next_page_token = results['serpapi_pagination']['next_page_token'] 247 | else: 248 | print(f"END API CALLS: No more results for {search_term} in {search_location} on search {num}") 249 | has_more_results = False 250 | 251 | num += 1 252 | 253 | except Exception as e: 254 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 255 | print(f"SerpApi ERROR (Timeout)!!!: {search_term} in {search_location} had an error (most likely TimeOut)!!!") 256 | print("Following error returned:") 257 | print(e) 258 | print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 259 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 260 | time.sleep(ERROR_SLEEP_MIN * 60) 261 | error = True 262 | break 263 | 264 | # Insert data into BigQuery 265 | if num > 0 and not error: 266 | try: 267 | final_columns = ['title', 'company_name', 'location', 'via', 'description', 'extensions', 268 | 'job_id', 'thumbnail', 'posted_at', 'schedule_type', 'salary', 269 | 'work_from_home', 'date_time', 'search_term', 'search_location', 'commute_time'] 270 | jobs_all = jobs_all.loc[:, jobs_all.columns.isin(final_columns)] 271 | 272 | table_id = config.table_id 273 | client = bigquery.Client() 274 | table = client.get_table(table_id) 275 | errors = client.insert_rows_from_dataframe(table, jobs_all) 276 | if errors == [[]]: 277 | print(f"DATA LOADED: {len(jobs_all)} rows of {search_term} in {search_location} loaded into BigQuery") 278 | else: 279 | print(f"ERROR!!!: {len(jobs_all)} rows of {search_term} in {search_location} NOT loaded into BigQuery!!!") 280 | print("Following error returned from the googles:") 281 | print(errors) 282 | except Exception as e: 283 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 284 | print(f"non-JSON BigQuery ERROR!!!: {search_term} in {search_location} had an error that needs to be investigated!!!") 285 | print("Following error returned:") 286 | print(e) 287 | print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 288 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 289 | time.sleep(ERROR_SLEEP_MIN * 60) 290 | 291 | num_searches = num + 1 292 | 293 | else: # if testing 294 | print(f"END FAKE SEARCH: {search_term} in {search_location}") 295 | num_searches = 2 # low enough not to max out 1000 searches 296 | 297 | return num_searches 298 | 299 | ### (9/2024) start parmater was deprecated and removed from SerpApi 300 | # def _serpapi_bigquery(search_term, search_location, search_time): 301 | # if not TESTING_DAG: 302 | 303 | # for num in range(45): # SerpApi docs say max returns is ~45 pages 304 | 305 | # print(f"START API CALL: {search_term} in {search_location} on search {num}") 306 | 307 | # start = num * 10 308 | # error = False 309 | # params = { 310 | # "api_key": config.serpapi_key, 311 | # "device": "desktop", 312 | # "engine": "google_jobs", 313 | # "google_domain": "google.com", 314 | # "q": search_term, 315 | # "hl": "en", 316 | # "gl": "us", 317 | # "location": search_location, 318 | # "chips": search_time, 319 | # "start": start, 320 | # } 321 | 322 | # # try except statement to call SerpAPI and then handle results (inner try/except statement) or handle TimeOut errors 323 | # try: 324 | # search = GoogleSearch(params) 325 | # results = search.get_dict() 326 | 327 | # # try except statement needed to handle whether any results are returned 328 | # try: 329 | # if results['error'] == "Google hasn't returned any results for this query.": 
330 | # print(f"END SerpApi CALLS: {search_term} in {search_location} on search {num}") 331 | # error = True 332 | # # Send JSON request to BigQuery json table 333 | # _bigquery_json(results, search_term, search_location, num, error) 334 | # break 335 | # except KeyError: 336 | # print(f"SUCCESS SerpApi CALL: {search_term} in {search_location} on search {num}") 337 | # # Send JSON request to BigQuery json table 338 | # _bigquery_json(results, search_term, search_location, num, error) 339 | # else: 340 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 341 | # print(f"SerpApi Error on call!!!: No response on {search_term} in {search_location} on search {num}") 342 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 343 | # error = True 344 | # break 345 | # except Exception as e: # catching as 'TimeoutError' didn't work so resorted to catching all... 346 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 347 | # print(f"SerpApi ERROR (Timeout)!!!: {search_term} in {search_location} had an error (most likely TimeOut)!!!") 348 | # print("Following error returned:") 349 | # print(e) 350 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 351 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 352 | # time.sleep(ERROR_SLEEP_MIN * 60) 353 | # error = True 354 | # break 355 | 356 | # # create dataframe of 10 (or less) pulled results 357 | # jobs = results['jobs_results'] 358 | # jobs = pd.DataFrame(jobs) 359 | # jobs = pd.concat([pd.DataFrame(jobs), 360 | # pd.json_normalize(jobs['detected_extensions'])], 361 | # axis=1).drop('detected_extensions', 1) 362 | # jobs['date_time'] = datetime.utcnow() 363 | 364 | # if start == 0: 365 | # jobs_all = jobs 366 | # else: 367 | # jobs_all = pd.concat([jobs_all, jobs]) 368 | 369 | # jobs_all['search_term'] = search_term 370 | # jobs_all['search_location'] = search_location 371 | 372 | # # don't call api again (and waste a credit) if less than 10 results (i.e., end of search) 373 | # if len(jobs) != 10: 374 | # print(f"END API CALLS: Only {len(jobs)} jobs (<10) for {search_term} in {search_location} on search {num}") 375 | # break 376 | 377 | # # if no results returned on first try then will get error if try to insert 0 rows into BigQuery 378 | # if num == 0 and error: 379 | # print(f"NO DATA LOADED: {num} rows of {search_term} in {search_location} not loaded into BigQuery") 380 | # else: 381 | # try: 382 | # # 28Dec2022: Following added after SerpApi changed format of json file 383 | # # wanted to keep extra columns ['job_highlights' 'related_links'] added in json but reached bigquery resource limit 384 | # # tried to convert these columns to json but ran into error troubleshooting all day 385 | # # jobs_all['json'] = jobs_all.apply(lambda x: x.to_json(), axis=1) 386 | # final_columns = ['title', 387 | # 'company_name', 388 | # 'location', 389 | # 'via', 390 | # 'description', 391 | # 'extensions', 392 | # 'job_id', 393 | # 'thumbnail', 394 | # 'posted_at', 395 | # 'schedule_type', 396 | # 'salary', 397 | # 'work_from_home', 398 | # 'date_time', 399 | # 'search_term', 400 | # 'search_location', 401 | # 'commute_time'] 402 | # # select only columns from final_columns if they exist in jobs_all 403 | # jobs_all = jobs_all.loc[:, jobs_all.columns.isin(final_columns)] 404 | 405 | # table_id = config.table_id 406 | # client = bigquery.Client() 407 | # table = client.get_table(table_id) 408 | # errors = 
client.insert_rows_from_dataframe(table, jobs_all) 409 | # if errors == [[]]: 410 | # print(f"DATA LOADED: {len(jobs_all)} rows of {search_term} in {search_location} loaded into BigQuery") 411 | # else: 412 | # print(f"ERROR!!!: {len(jobs_all)} rows of {search_term} in {search_location} NOT loaded into BigQuery!!!") 413 | # print("Following error returned from the googles:") 414 | # print(errors) 415 | # except UnboundLocalError as ule: 416 | # # TODO: Need to build something to catch this error sooner 417 | # # GoogleSearch(params) code returns blank results and then get error for "'jobs_all' referenced before assignment" in 'errors = client.insert_rows_from_dataframe(table, jobs_all)' (i.e., SerpApi issue) 418 | # print(f"SerpApi ERROR!!!: Search {num} of {search_term} in {search_location} yielded no results from SerpApi and FAILED load into non-JSON BigQuery!!!") 419 | # print("Following error returned:") 420 | # print(ule) 421 | # # no sleep requirement as usually an issue with search term provided 422 | # except TimeoutError as te: 423 | # # client.get_table(table_id) code returns TimeOut Exception with no results... so also adding sleep (i.e., BigQuery issue) 424 | # print(f"BigQuery ERROR!!!: {search_term} in {search_location} had TimeOutError and FAILED to load into non-JSON BigQuery!!!") 425 | # print("Following error returned:") 426 | # print(te) 427 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 428 | # time.sleep(ERROR_SLEEP_MIN * 60) 429 | # except Exception as e: 430 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 431 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 432 | # print(f"non-JSON BigQuery ERROR!!!: {search_term} in {search_location} had an error that needs to be investigated!!!") 433 | # print("Following error returned:") 434 | # print(e) 435 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 436 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 437 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 438 | # time.sleep(ERROR_SLEEP_MIN * 60) 439 | 440 | # num_searches = num + 1 441 | 442 | # else: # if testing 443 | 444 | # print(f"END FAKE SEARCH: {search_term} in {search_location}") 445 | # num_searches = 2 # low enough not to max out 1000 searches 446 | 447 | # return num_searches 448 | 449 | def _us_jobs(search_terms, search_locations_us, **context): 450 | """ 451 | DAG to pull US job postings using the _serpapi_bigquery function 452 | 453 | Args: 454 | search_terms : list 455 | List of search terms to search for 456 | search_locations_us : list 457 | List of search locations to search for 458 | context : dict 459 | Context dictionary from Airflow 460 | 461 | Returns: 462 | None 463 | """ 464 | search_time = "date_posted:today" 465 | total_searches = 0 466 | 467 | for search_term in search_terms: 468 | for search_location in search_locations_us: 469 | print(f"START SEARCH: {total_searches} searches done, starting search...") 470 | num_searches = _serpapi_bigquery(search_term, search_location, search_time) 471 | total_searches += num_searches 472 | 473 | # push total_searches to xcom so can use in next task 474 | context['task_instance'].xcom_push(key='total_searches', value=total_searches) 475 | 476 | return 477 | 478 | us_jobs = PythonOperator( 479 | task_id='us_jobs', 480 | provide_context=True, 481 | op_kwargs={'search_terms': search_terms, 'search_locations_us': search_locations_us}, 482 | 
python_callable=_us_jobs 483 | ) 484 | 485 | def _non_us_jobs(search_terms, country_percent, **context): 486 | """ 487 | DAG to pull non-US job postings using the _serpapi_bigquery function 488 | 489 | Args: 490 | search_terms : list 491 | List of search terms to search for 492 | country_percent : pandas dataframe 493 | Dataframe of countries and their relative percent of total YouTube views for my channel 494 | context : dict 495 | Context dictionary from Airflow 496 | 497 | Returns: 498 | None 499 | 500 | Source: 501 | https://youtube.com/@lukebarousse 502 | """ 503 | search_time = "date_posted:today" 504 | total_searches = context['task_instance'].xcom_pull(task_ids='us_jobs', key='total_searches') 505 | search_countries = list(country_percent.country) 506 | search_probabilities = list(country_percent.percent) 507 | 508 | # create list of countries listed based on weighted probability to get random countries 509 | search_locations = list(choice(search_countries, size=len(search_countries), replace=False, p=search_probabilities)) 510 | 511 | for search_location in search_locations: 512 | if total_searches < MAX_SEARCHES: 513 | print("####################################") 514 | print(f"SEARCHING COUNTRY: {search_location} [{search_locations.index(search_location)+1} of {len(search_locations)}]") 515 | print("####################################") 516 | for search_term in search_terms: 517 | print("####################################") 518 | print(f"SEARCHING TERM: {search_term}") 519 | print(f"Starting search number {total_searches}...") 520 | num_searches = _serpapi_bigquery(search_term, search_location, search_time) 521 | total_searches += num_searches 522 | else: 523 | print(f"STICK A FORK IN ME, I'M DONE!!!!: {total_searches} searches complete") 524 | return 525 | 526 | non_us_jobs = PythonOperator( 527 | task_id='non_us_jobs', 528 | provide_context=True, 529 | op_kwargs={'search_terms': search_terms, 'country_percent': country_percent}, 530 | python_callable=_non_us_jobs 531 | ) 532 | 533 | finish = DummyOperator( 534 | task_id='finish', 535 | dag=dag) 536 | 537 | start >> us_jobs >> non_us_jobs >> finish 538 | -------------------------------------------------------------------------------- /dags/sql/.project: -------------------------------------------------------------------------------- 1 | <?xml version="1.0" encoding="UTF-8"?> 2 | <projectDescription> 3 | <name>Airflow_SQL</name> 4 | <comment></comment> 5 | <projects> 6 | </projects> 7 | <buildSpec> 8 | </buildSpec> 9 | <natures> 10 | <nature>org.jkiss.dbeaver.DBeaverNature</nature> 11 | </natures> 12 | </projectDescription> 13 | -------------------------------------------------------------------------------- /dags/sql/cache_csv.sql: -------------------------------------------------------------------------------- 1 | ---------------- 2 | -- 🛠️ Skill Page 3 | -- "Select All"/"Select All" export 4 | EXPORT DATA 5 | OPTIONS ( 6 | uri = 'gs://gsearch_share/cache/skills/skills-*.csv', 7 | format = 'CSV', 8 | overwrite = true, 9 | header = true, 10 | field_delimiter = ',') 11 | AS ( 12 | WITH all_time AS ( 13 | SELECT COUNT(*) as total 14 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 15 | ), 16 | last_7_days AS ( 17 | SELECT COUNT(*) as total 18 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 19 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 20 | ), 21 | last_30_days AS ( 22 | SELECT COUNT(*) as total 23 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 24 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 25 | ), 26 | ytd AS ( 27 | SELECT COUNT(*) as total 28 | FROM 
`job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 29 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 30 | ) 31 | 32 | -- All time 33 | SELECT 34 | keywords.element AS skill, 35 | COUNT(job_id) / (SELECT total FROM all_time) AS skill_percent, 36 | COUNT(job_id) AS skill_count, 37 | (SELECT total FROM all_time) AS total_jobs, 38 | 'All time' as timeframe 39 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 40 | UNNEST(keywords_all.list) AS keywords 41 | GROUP BY skill 42 | 43 | UNION ALL 44 | 45 | -- Last 7 days 46 | SELECT 47 | keywords.element AS skill, 48 | COUNT(job_id) / (SELECT total FROM last_7_days) AS skill_percent, 49 | COUNT(job_id) AS skill_count, 50 | (SELECT total FROM last_7_days) AS total_jobs, 51 | 'Last 7 days' as timeframe 52 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 53 | UNNEST(keywords_all.list) AS keywords 54 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 55 | GROUP BY skill 56 | 57 | UNION ALL 58 | 59 | -- Last 30 days 60 | SELECT 61 | keywords.element AS skill, 62 | COUNT(job_id) / (SELECT total FROM last_30_days) AS skill_percent, 63 | COUNT(job_id) AS skill_count, 64 | (SELECT total FROM last_30_days) AS total_jobs, 65 | 'Last 30 days' as timeframe 66 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 67 | UNNEST(keywords_all.list) AS keywords 68 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 69 | GROUP BY skill 70 | 71 | UNION ALL 72 | 73 | -- Year to date 74 | SELECT 75 | keywords.element AS skill, 76 | COUNT(job_id) / (SELECT total FROM ytd) AS skill_percent, 77 | COUNT(job_id) AS skill_count, 78 | (SELECT total FROM ytd) AS total_jobs, 79 | 'YTD' as timeframe 80 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 81 | UNNEST(keywords_all.list) AS keywords 82 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 83 | GROUP BY skill 84 | 85 | ORDER BY timeframe, skill_count DESC 86 | ); 87 | 88 | -- Slicer 89 | EXPORT DATA 90 | OPTIONS ( 91 | uri = 'gs://gsearch_share/cache/skills/slicer-*.csv', 92 | format = 'CSV', 93 | overwrite = true, 94 | header = true, 95 | field_delimiter = ',') 96 | AS ( 97 | SELECT 98 | job_title_final AS job_title, 99 | search_country, 100 | COUNT(*) AS job_count 101 | FROM 102 | `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 103 | WHERE search_country IS NOT NULL 104 | GROUP BY job_title, search_country 105 | ORDER BY job_count DESC 106 | ); 107 | 108 | -- Keywords 109 | EXPORT DATA 110 | OPTIONS ( 111 | uri = 'gs://gsearch_share/cache/skills/keywords-*.csv', 112 | format = 'CSV', 113 | overwrite = true, 114 | header = true, 115 | field_delimiter = ',') 116 | AS ( 117 | SELECT * FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_keywords 118 | ); 119 | 120 | EXPORT DATA 121 | OPTIONS ( 122 | uri = 'gs://gsearch_share/cache/skills/timeframes/alltime-*.csv', 123 | format = 'CSV', 124 | overwrite = true, 125 | header = true, 126 | field_delimiter = ',') 127 | AS ( 128 | WITH total_jobs_country_title AS ( 129 | SELECT job_title_final, search_country, COUNT(*) AS job_count 130 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 131 | GROUP BY job_title_final, search_country 132 | ), 133 | total_jobs_title AS ( 134 | SELECT job_title_final, COUNT(*) AS job_count 135 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 136 | GROUP BY job_title_final 137 | ), 138 | total_jobs_country AS ( 139 | SELECT 
search_country, COUNT(*) AS job_count 140 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 141 | GROUP BY search_country 142 | ) 143 | 144 | SELECT 145 | keywords.element AS skill, 146 | j.job_title_final, 147 | j.search_country, 148 | COUNT(j.job_id) / t.job_count AS skill_percent, 149 | COUNT(j.job_id) AS skill_count, 150 | t.job_count AS total_jobs 151 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 152 | UNNEST(keywords_all.list) AS keywords 153 | JOIN total_jobs_country_title t 154 | ON t.job_title_final = j.job_title_final 155 | AND t.search_country = j.search_country 156 | GROUP BY 157 | skill, 158 | j.job_title_final, 159 | j.search_country, 160 | t.job_count 161 | 162 | UNION ALL 163 | 164 | SELECT 165 | keywords.element AS skill, 166 | j.job_title_final, 167 | NULL AS search_country, 168 | COUNT(j.job_id) / t.job_count AS skill_percent, 169 | COUNT(j.job_id) AS skill_count, 170 | t.job_count AS total_jobs 171 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 172 | UNNEST(keywords_all.list) AS keywords 173 | JOIN total_jobs_title t 174 | ON t.job_title_final = j.job_title_final 175 | GROUP BY 176 | skill, 177 | j.job_title_final, 178 | t.job_count 179 | 180 | UNION ALL 181 | 182 | SELECT 183 | keywords.element AS skill, 184 | NULL AS job_title_final, 185 | j.search_country, 186 | COUNT(j.job_id) / t.job_count AS skill_percent, 187 | COUNT(j.job_id) AS skill_count, 188 | t.job_count AS total_jobs 189 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 190 | UNNEST(keywords_all.list) AS keywords 191 | JOIN total_jobs_country t 192 | ON t.search_country = j.search_country 193 | GROUP BY 194 | skill, 195 | j.search_country, 196 | t.job_count 197 | 198 | ORDER BY skill_count DESC 199 | ); 200 | 201 | EXPORT DATA 202 | OPTIONS ( 203 | uri = 'gs://gsearch_share/cache/skills/timeframes/7day-*.csv', 204 | format = 'CSV', 205 | overwrite = true, 206 | header = true, 207 | field_delimiter = ',') 208 | AS ( 209 | WITH total_jobs_country_title AS ( 210 | SELECT job_title_final, search_country, COUNT(*) AS job_count 211 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 212 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 213 | GROUP BY job_title_final, search_country 214 | ), 215 | total_jobs_title AS ( 216 | SELECT job_title_final, COUNT(*) AS job_count 217 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 218 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 219 | GROUP BY job_title_final 220 | ), 221 | total_jobs_country AS ( 222 | SELECT search_country, COUNT(*) AS job_count 223 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 224 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 225 | GROUP BY search_country 226 | ) 227 | 228 | SELECT 229 | keywords.element AS skill, 230 | j.job_title_final, 231 | j.search_country, 232 | COUNT(j.job_id) / t.job_count AS skill_percent, 233 | COUNT(j.job_id) AS skill_count, 234 | t.job_count AS total_jobs 235 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 236 | UNNEST(keywords_all.list) AS keywords 237 | JOIN total_jobs_country_title t 238 | ON t.job_title_final = j.job_title_final 239 | AND t.search_country = j.search_country 240 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 241 | GROUP BY 242 | skill, 243 | j.job_title_final, 244 | j.search_country, 245 | t.job_count 246 | 247 | UNION ALL 
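-- The two SELECTs below repeat the same 7-day aggregation as roll-ups: first grouped by job title only (search_country returned as NULL), then by country only (job_title_final returned as NULL), so either slicer can be applied on its own.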
248 | 249 | SELECT 250 | keywords.element AS skill, 251 | j.job_title_final, 252 | NULL AS search_country, 253 | COUNT(j.job_id) / t.job_count AS skill_percent, 254 | COUNT(j.job_id) AS skill_count, 255 | t.job_count AS total_jobs 256 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 257 | UNNEST(keywords_all.list) AS keywords 258 | JOIN total_jobs_title t 259 | ON t.job_title_final = j.job_title_final 260 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 261 | GROUP BY 262 | skill, 263 | j.job_title_final, 264 | t.job_count 265 | 266 | UNION ALL 267 | 268 | SELECT 269 | keywords.element AS skill, 270 | NULL AS job_title_final, 271 | j.search_country, 272 | COUNT(j.job_id) / t.job_count AS skill_percent, 273 | COUNT(j.job_id) AS skill_count, 274 | t.job_count AS total_jobs 275 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 276 | UNNEST(keywords_all.list) AS keywords 277 | JOIN total_jobs_country t 278 | ON t.search_country = j.search_country 279 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 280 | GROUP BY 281 | skill, 282 | j.search_country, 283 | t.job_count 284 | 285 | ORDER BY skill_count DESC 286 | ); 287 | 288 | EXPORT DATA 289 | OPTIONS ( 290 | uri = 'gs://gsearch_share/cache/skills/timeframes/30day-*.csv', 291 | format = 'CSV', 292 | overwrite = true, 293 | header = true, 294 | field_delimiter = ',') 295 | AS ( 296 | WITH total_jobs_country_title AS ( 297 | SELECT job_title_final, search_country, COUNT(*) AS job_count 298 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 299 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 300 | GROUP BY job_title_final, search_country 301 | ), 302 | total_jobs_title AS ( 303 | SELECT job_title_final, COUNT(*) AS job_count 304 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 305 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 306 | GROUP BY job_title_final 307 | ), 308 | total_jobs_country AS ( 309 | SELECT search_country, COUNT(*) AS job_count 310 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 311 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 312 | GROUP BY search_country 313 | ) 314 | 315 | SELECT 316 | keywords.element AS skill, 317 | j.job_title_final, 318 | j.search_country, 319 | COUNT(j.job_id) / t.job_count AS skill_percent, 320 | COUNT(j.job_id) AS skill_count, 321 | t.job_count AS total_jobs 322 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 323 | UNNEST(keywords_all.list) AS keywords 324 | JOIN total_jobs_country_title t 325 | ON t.job_title_final = j.job_title_final 326 | AND t.search_country = j.search_country 327 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 328 | GROUP BY 329 | skill, 330 | j.job_title_final, 331 | j.search_country, 332 | t.job_count 333 | 334 | UNION ALL 335 | 336 | SELECT 337 | keywords.element AS skill, 338 | j.job_title_final, 339 | NULL AS search_country, 340 | COUNT(j.job_id) / t.job_count AS skill_percent, 341 | COUNT(j.job_id) AS skill_count, 342 | t.job_count AS total_jobs 343 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 344 | UNNEST(keywords_all.list) AS keywords 345 | JOIN total_jobs_title t 346 | ON t.job_title_final = j.job_title_final 347 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 348 | GROUP BY 349 | skill, 350 | j.job_title_final, 351 | t.job_count 352 | 353 | UNION ALL 354 | 355 
| SELECT 356 | keywords.element AS skill, 357 | NULL AS job_title_final, 358 | j.search_country, 359 | COUNT(j.job_id) / t.job_count AS skill_percent, 360 | COUNT(j.job_id) AS skill_count, 361 | t.job_count AS total_jobs 362 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 363 | UNNEST(keywords_all.list) AS keywords 364 | JOIN total_jobs_country t 365 | ON t.search_country = j.search_country 366 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 367 | GROUP BY 368 | skill, 369 | j.search_country, 370 | t.job_count 371 | 372 | ORDER BY skill_count DESC 373 | ); 374 | 375 | EXPORT DATA 376 | OPTIONS ( 377 | uri = 'gs://gsearch_share/cache/skills/timeframes/ytd-*.csv', 378 | format = 'CSV', 379 | overwrite = true, 380 | header = true, 381 | field_delimiter = ',') 382 | AS ( 383 | WITH total_jobs_country_title AS ( 384 | SELECT job_title_final, search_country, COUNT(*) AS job_count 385 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 386 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 387 | GROUP BY job_title_final, search_country 388 | ), 389 | total_jobs_title AS ( 390 | SELECT job_title_final, COUNT(*) AS job_count 391 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 392 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 393 | GROUP BY job_title_final 394 | ), 395 | total_jobs_country AS ( 396 | SELECT search_country, COUNT(*) AS job_count 397 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 398 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 399 | GROUP BY search_country 400 | ) 401 | 402 | SELECT 403 | keywords.element AS skill, 404 | j.job_title_final, 405 | j.search_country, 406 | COUNT(j.job_id) / t.job_count AS skill_percent, 407 | COUNT(j.job_id) AS skill_count, 408 | t.job_count AS total_jobs 409 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 410 | UNNEST(keywords_all.list) AS keywords 411 | JOIN total_jobs_country_title t 412 | ON t.job_title_final = j.job_title_final 413 | AND t.search_country = j.search_country 414 | WHERE j.search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 415 | GROUP BY 416 | skill, 417 | j.job_title_final, 418 | j.search_country, 419 | t.job_count 420 | 421 | UNION ALL 422 | 423 | SELECT 424 | keywords.element AS skill, 425 | j.job_title_final, 426 | NULL AS search_country, 427 | COUNT(j.job_id) / t.job_count AS skill_percent, 428 | COUNT(j.job_id) AS skill_count, 429 | t.job_count AS total_jobs 430 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 431 | UNNEST(keywords_all.list) AS keywords 432 | JOIN total_jobs_title t 433 | ON t.job_title_final = j.job_title_final 434 | WHERE j.search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 435 | GROUP BY 436 | skill, 437 | j.job_title_final, 438 | t.job_count 439 | 440 | UNION ALL 441 | 442 | SELECT 443 | keywords.element AS skill, 444 | NULL AS job_title_final, 445 | j.search_country, 446 | COUNT(j.job_id) / t.job_count AS skill_percent, 447 | COUNT(j.job_id) AS skill_count, 448 | t.job_count AS total_jobs 449 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 450 | UNNEST(keywords_all.list) AS keywords 451 | JOIN total_jobs_country t 452 | ON t.search_country = j.search_country 453 | WHERE j.search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 454 | GROUP BY 455 | skill, 456 | j.search_country, 457 | t.job_count 458 | 459 | ORDER BY skill_count DESC 460 | ); 461 | 462 | ---------------- 463 | -- 🕒 Skills 
Trend Page 464 | -- "Select All"/"Select All" export 465 | EXPORT DATA 466 | OPTIONS ( 467 | uri = 'gs://gsearch_share/cache/skill-trend/skill-trend-*.csv', 468 | format = 'CSV', 469 | overwrite = true, 470 | header = true, 471 | field_delimiter = ',') 472 | AS ( 473 | WITH top_skills AS ( 474 | SELECT keywords.element AS skill 475 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 476 | UNNEST(keywords_all.list) AS keywords 477 | WHERE 1 = 1 478 | GROUP BY skill 479 | ORDER BY COUNT(*) DESC 480 | LIMIT 5 481 | ), 482 | 483 | skill_counts AS ( 484 | SELECT date, skill, SUM(daily_skill_count) AS daily_skill_count 485 | FROM ( 486 | SELECT DATE(search_time) AS date, 487 | keywords.element AS skill, 488 | COUNT(*) as daily_skill_count 489 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 490 | UNNEST(keywords_all.list) AS keywords 491 | WHERE 1 = 1 492 | AND keywords.element IN (SELECT skill FROM top_skills) 493 | GROUP BY date, skill 494 | ) 495 | GROUP BY date, skill 496 | ), 497 | 498 | total_jobs AS ( 499 | SELECT COUNT(*) 500 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 501 | WHERE 1 = 1 502 | ), 503 | 504 | total_jobs_grouped AS ( 505 | SELECT date, SUM(daily_total_count) OVER ( 506 | ORDER BY date 507 | ROWS BETWEEN 13 PRECEDING AND CURRENT ROW 508 | ) as rolling_total_count 509 | FROM ( 510 | SELECT DATE(search_time) AS date, COUNT(*) AS daily_total_count 511 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 512 | WHERE 1 = 1 513 | GROUP BY date 514 | ) 515 | ) 516 | 517 | SELECT sc.date, 518 | sc.skill, 519 | (SUM(sc.daily_skill_count) OVER ( 520 | PARTITION BY sc.skill 521 | ORDER BY sc.date 522 | ROWS BETWEEN 13 PRECEDING AND CURRENT ROW 523 | ) / tjg.rolling_total_count) as skill_percentage, 524 | (SELECT * FROM total_jobs) AS total_jobs 525 | FROM skill_counts sc 526 | JOIN total_jobs_grouped tjg ON sc.date = tjg.date 527 | ORDER BY sc.date DESC, skill_percentage DESC 528 | ); 529 | 530 | ---------------- 531 | -- 💰 Skill-Pay Page 532 | -- "Select All"/"Select All" export 533 | EXPORT DATA 534 | OPTIONS ( 535 | uri = 'gs://gsearch_share/cache/skill-pay/skill-pay-*.csv', 536 | format = 'CSV', 537 | overwrite = true, 538 | header = true, 539 | field_delimiter = ',') 540 | AS ( 541 | WITH total_jobs AS ( 542 | SELECT COUNT(*) 543 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 544 | -- {job_choice_query} AND salary_year IS NOT NULL 545 | WHERE salary_year IS NOT NULL 546 | ) 547 | 548 | SELECT 549 | keywords.element AS skill, 550 | AVG(salary_year) AS avg, 551 | MIN(salary_year) AS min, 552 | MAX(salary_year) AS max, 553 | APPROX_QUANTILES(salary_year,2)[OFFSET(1)] AS median, 554 | COUNT(job_id) AS count, 555 | (SELECT * FROM total_jobs) AS total_jobs 556 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 557 | UNNEST(keywords_all.list) AS keywords 558 | -- {job_choice_query} AND salary_year IS NOT NULL 559 | WHERE salary_year IS NOT NULL 560 | GROUP BY skill 561 | ORDER BY count DESC 562 | ); 563 | 564 | -- Slicer 565 | EXPORT DATA 566 | OPTIONS ( 567 | uri = 'gs://gsearch_share/cache/skill-pay/slicer-*.csv', 568 | format = 'CSV', 569 | overwrite = true, 570 | header = true, 571 | field_delimiter = ',') 572 | AS ( 573 | SELECT 574 | job_title_final AS job_title, 575 | search_country, 576 | COUNT(*) AS job_count 577 | FROM 578 | `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 579 | WHERE search_country IS NOT NULL 
AND salary_year IS NOT NULL 580 | GROUP BY job_title, search_country 581 | ORDER BY job_count DESC 582 | ); 583 | 584 | -- Selection export 585 | EXPORT DATA 586 | OPTIONS ( 587 | uri = 'gs://gsearch_share/cache/skill-pay/skill-pay-all-*.csv', 588 | format = 'CSV', 589 | overwrite = true, 590 | header = true, 591 | field_delimiter = ',') 592 | AS ( 593 | WITH numbered_jobs AS ( 594 | SELECT DISTINCT job_id, 595 | ROW_NUMBER() OVER (ORDER BY job_id) as job_number 596 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 597 | ) 598 | 599 | SELECT 600 | n.job_number, 601 | j.job_title_final, 602 | j.search_country, 603 | j.search_time, 604 | keywords.element AS skill, 605 | APPROX_QUANTILES(j.salary_year, 2)[OFFSET(1)] AS median 606 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 607 | UNNEST(keywords_all.list) AS keywords 608 | JOIN numbered_jobs n ON j.job_id = n.job_id 609 | WHERE j.salary_year IS NOT NULL 610 | GROUP BY n.job_number, j.job_title_final, j.search_country, j.search_time, skill, j.job_id 611 | ORDER BY j.search_time DESC 612 | ); 613 | 614 | ---------------- 615 | -- 💸 Job Salaries Page 616 | -- (No Selections) export 617 | EXPORT DATA 618 | OPTIONS ( 619 | uri = 'gs://gsearch_share/cache/salary/salary-*.csv', 620 | format = 'CSV', 621 | overwrite = true, 622 | header = true, 623 | field_delimiter = ',') 624 | AS ( 625 | SELECT * FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_salary_wide 626 | ); 627 | 628 | ---------------- 629 | -- 🏥 Health Page exports 630 | -- Calculate num_jobs 631 | EXPORT DATA 632 | OPTIONS ( 633 | uri = 'gs://gsearch_share/cache/health/num-jobs-*.csv', 634 | format = 'CSV', 635 | overwrite = true, 636 | header = true, 637 | field_delimiter = ',') 638 | AS ( 639 | SELECT COUNT(*) AS num_jobs, 640 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 641 | ); 642 | 643 | -- Calculate dates and missing dates 644 | EXPORT DATA 645 | OPTIONS ( 646 | uri = 'gs://gsearch_share/cache/health/dates-*.csv', 647 | format = 'CSV', 648 | overwrite = true, 649 | header = true, 650 | field_delimiter = ',') 651 | AS ( 652 | SELECT DISTINCT CAST(search_time AS DATE) AS search_date, 653 | COUNT(job_id) AS jobs_daily 654 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 655 | GROUP BY search_date 656 | ORDER BY search_date 657 | ); 658 | 659 | -- Find last update 660 | EXPORT DATA 661 | OPTIONS ( 662 | uri = 'gs://gsearch_share/cache/health/last-update-*.csv', 663 | format = 'CSV', 664 | overwrite = true, 665 | header = true, 666 | field_delimiter = ',') 667 | AS ( 668 | SELECT MAX(search_time) as last_update 669 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 670 | ); 671 | -------------------------------------------------------------------------------- /dags/sql/fact_build.sql: -------------------------------------------------------------------------------- 1 | -- pre-JSON table create 2 | -- create table in format needed to combine with JSON data; only need to run once 3 | -- BKGD: 'gsearch_job_listings.gsearch_jobs_all' was first used to collect data before collecting full JSON 4 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings.gsearch_jobs_all_json_version AS 5 | (SELECT title AS job_title, 6 | company_name, 7 | location AS job_location, 8 | via AS job_via, 9 | description AS job_description, 10 | extensions as job_extensions, 11 | job_id, 12 | thumbnail AS company_thumbnail, 13 | posted_at AS job_posted_at, 14 | 
schedule_type AS job_schedule_type, 15 | work_from_home AS job_work_from_home, 16 | salary AS job_salary, 17 | search_term, 18 | search_location, 19 | date_time AS search_time, 20 | commute_time AS job_commute_time, 21 | FROM `job-listings-366015`.gsearch_job_listings.gsearch_jobs_all 22 | -- started collecting data in 'gsearch_jobs_all_json' on 1-1-2023 (142,416 records) 23 | -- No longer need to use data from this 'back-up' table after this period 24 | WHERE date_time < (SELECT MIN(search_time) 25 | FROM `job-listings-366015`.gsearch_job_listings_json.gsearch_jobs_all_json)); 26 | 27 | -- JSON table create 28 | -- 43 second run-time (175K rows) on 05Jan23, 4 mins (450K rows) on 09Mar2023 29 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact AS 30 | (WITH gsearch_json_all AS 31 | -- Extract JSON data and combine with pre-JSON data 32 | (SELECT JSON_EXTRACT_SCALAR(jobs_results.job_id) AS job_id, 33 | JSON_EXTRACT_SCALAR(jobs_results.title) AS job_title, 34 | JSON_EXTRACT_SCALAR(jobs_results.company_name) AS company_name, 35 | JSON_EXTRACT_SCALAR(jobs_results.location) AS job_location, 36 | JSON_EXTRACT_SCALAR(jobs_results.via) AS job_via, 37 | JSON_EXTRACT_SCALAR(jobs_results.description) AS job_description, 38 | -- how to get all 'job_highlights' 39 | -- JSON_EXTRACT(jobs_results.job_highlights, '$') AS job_highlights_all, 40 | CASE 41 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[0].title')) = 42 | 'Qualifications' 43 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[0].items') 44 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[1].title')) = 45 | 'Qualifications' 46 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[1].items') 47 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[2].title')) = 48 | 'Qualifications' 49 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[2].items') 50 | END AS job_highlights_qualifications, 51 | CASE 52 | 53 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[0].title')) = 54 | 'Responsibilities' 55 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[0].items') 56 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[1].title')) = 57 | 'Responsibilities' 58 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[1].items') 59 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[2].title')) = 60 | 'Responsibilities' 61 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[2].items') 62 | END AS job_highlights_responsibilities, 63 | CASE 64 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[0].title')) = 65 | 'Benefits' 66 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[0].items') 67 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[1].title')) = 68 | 'Benefits' 69 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[1].items') 70 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[2].title')) = 71 | 'Benefits' 72 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[2].items') 73 | END AS job_highlights_benefits, 74 | JSON_EXTRACT_SCALAR(jobs_results.detected_extensions.posted_at) AS job_posted_at, 75 | JSON_EXTRACT_SCALAR(jobs_results.detected_extensions.salary) AS job_salary, 76 | JSON_EXTRACT_SCALAR(jobs_results.detected_extensions.schedule_type) AS job_schedule_type, 77 | CAST(JSON_EXTRACT_SCALAR( 78 | jobs_results.detected_extensions.work_from_home) AS BOOL) AS job_work_from_home, 79 | 
JSON_EXTRACT_SCALAR(jobs_results.detected_extensions.commute_time) AS job_commute_time, 80 | JSON_VALUE_ARRAY(jobs_results, '$.extensions') AS job_extensions, 81 | CASE 82 | WHEN LEFT( 83 | JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[0].link')), 84 | 22) != 85 | 'https://www.google.com' 86 | THEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[0].link')) 87 | END 88 | AS company_link, 89 | CASE 90 | WHEN LEFT( 91 | JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[0].text')), 92 | 7) = 'See web' 93 | THEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[0].link')) 94 | WHEN LEFT( 95 | JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[1].text')), 96 | 7) = 'See web' 97 | THEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[1].link')) 98 | END 99 | AS company_link_google, 100 | JSON_EXTRACT_SCALAR(jobs_results.thumbnail) AS company_thumbnail, 101 | error, 102 | search_term, 103 | search_location, 104 | search_time, 105 | search_id, 106 | FROM `job-listings-366015`.gsearch_job_listings_json.gsearch_jobs_all_json 107 | -- Used to unpack original JSON 108 | LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(results.jobs_results)) AS jobs_results 109 | WHERE error = false 110 | -- Union with previous data not previously saved as JSON 111 | UNION ALL 112 | SELECT job_id, 113 | job_title, 114 | company_name, 115 | job_location, 116 | job_via, 117 | job_description, 118 | null as job_highlights_qualifications, 119 | null as job_highlights_responsibilities, 120 | null as job_highlights_benefits, 121 | job_posted_at, 122 | job_salary, 123 | job_schedule_type, 124 | job_work_from_home, 125 | job_commute_time, 126 | job_extensions, 127 | null as company_link, 128 | null as company_link_google, 129 | company_thumbnail, 130 | null as error, 131 | search_term, 132 | search_location, 133 | search_time, 134 | null as search_id, 135 | FROM `job-listings-366015`.gsearch_job_listings.gsearch_jobs_all_json_version) 136 | -- clean table once combined for those with common fields 137 | SELECT *, 138 | CASE 139 | WHEN "No degree mentioned" IN UNNEST(job_extensions) 140 | THEN true 141 | END AS job_no_degree_mention, 142 | CASE 143 | WHEN "Health insurance" IN UNNEST(job_extensions) 144 | THEN true 145 | END AS job_health_insurance, 146 | FROM gsearch_json_all); 147 | -------------------------------------------------------------------------------- /dags/sql/public_build.sql: -------------------------------------------------------------------------------- 1 | -- Public table build for Langchain interaction 2 | -- 8 second run-time (1.2M rows) on 5APR24 3 | -- Does some minor cleanup like: 4 | -- 1. Removing duplicates based on PARTITION BY company_name, job_title, job_schedule_type, job_description, job_location 5 | -- 2. removing the "via" from the job_via column (I did this in the course so want to be consistent) 6 | -- 3. removing the job_title_clean column (as it may be blank for some entries); job_title_final is the one to use 7 | -- 4. converting the keywords_all.list array to a simple array 8 | -- 5. 
converting the job_work_from_home, job_no_degree_mention, job_health_insurance to boolean (fill in na as false) 9 | -- columns not used: job_via, job_title_clean, job_description, job_posted_at, job_salary, job_commute_time, company_link, company_link_google, company_thumbnail, job_highlights_qualifications, job_highlights_responsibilities, job_highlights_benefits, job_extensions, error, search_id, search_term, search_location, keywords_programming, keywords_databases, keywords_cloud, keywords_libraries, keywords_webframeworks, keywords_os, keywords_analyst_tools, keywords_other, keywords_async, keywords_sync, salary_pay, salary_avg, salary_min, salary_max 10 | 11 | CREATE OR REPLACE TABLE `job-listings-366015`.public_job_listings.data_nerd_jobs AS 12 | SELECT 13 | job_title_final, 14 | job_title AS job_title_original, 15 | company_name, 16 | job_location, 17 | search_time AS job_posted_at, 18 | REPLACE(job_via, 'via ', '') AS job_posting_site, 19 | job_schedule_type, 20 | IFNULL(job_work_from_home, FALSE) AS job_work_from_home, 21 | IFNULL(job_no_degree_mention, FALSE) AS job_no_degree_mention, 22 | IFNULL(job_health_insurance, FALSE) AS job_health_insurance, 23 | ARRAY( 24 | SELECT x.element 25 | FROM UNNEST(keywords_all.list) AS x 26 | ) AS job_keywords 27 | FROM ( 28 | SELECT 29 | *, 30 | ROW_NUMBER() OVER (PARTITION BY company_name, job_title, job_schedule_type, job_description, job_location) AS rn 31 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 32 | ) 33 | WHERE rn = 1; -------------------------------------------------------------------------------- /dags/sql/wide_build.sql: -------------------------------------------------------------------------------- 1 | -- Final wide table build that combines fact and dimension tables 2 | -- 77 second run-time (550K rows) on 09Mar23 3 | 4 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide AS 5 | (SELECT 6 | CASE 7 | WHEN job_title_clean IS NULL THEN search_term 8 | ELSE job_title_clean 9 | END AS job_title_final, 10 | t.* EXCEPT (job_title), 11 | f.*, 12 | s.* EXCEPT (job_id, job_description), 13 | c.* EXCEPT (search_location), 14 | d.* EXCEPT (job_id) 15 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact AS f 16 | LEFT JOIN `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills AS s ON f.job_id = s.job_id 17 | LEFT JOIN `job-listings-366015`.gsearch_job_listings_clean.gsearch_country AS c ON c.search_location = f.search_location 18 | LEFT JOIN `job-listings-366015`.gsearch_job_listings_clean.gsearch_salary AS d ON d.job_id = f.job_id 19 | LEFT JOIN `job-listings-366015`.gsearch_job_listings_clean.gsearch_job_title AS t ON t.job_title = f.job_title 20 | ); 21 | 22 | -- Keywords table 23 | -- Had to create a physical table as unable to figure out how to export this query to one CSV for cache of website 24 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings_clean.gsearch_keywords AS 25 | (WITH keywords AS ( 26 | SELECT DISTINCT keywords_all AS element, 27 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 28 | FROM ( 29 | SELECT DISTINCT keywords_all.element AS keywords_all 30 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 31 | UNNEST(keywords_all.list) AS keywords_all 32 | ) AS k, 33 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 34 | ), keywords_programming AS ( 35 | SELECT DISTINCT keywords_programming AS element, 36 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 37 | FROM ( 38 | SELECT DISTINCT 
keywords_programming.element AS keywords_programming 39 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 40 | UNNEST(keywords_programming.list) AS keywords_programming 41 | ) AS k, 42 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 43 | ), keywords_databases AS ( 44 | SELECT DISTINCT keywords_databases AS element, 45 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 46 | FROM ( 47 | SELECT DISTINCT keywords_databases.element AS keywords_databases 48 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 49 | UNNEST(keywords_databases.list) AS keywords_databases 50 | ) AS k, 51 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 52 | ), keywords_cloud AS ( 53 | SELECT DISTINCT keywords_cloud AS element, 54 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 55 | FROM ( 56 | SELECT DISTINCT keywords_cloud.element AS keywords_cloud 57 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 58 | UNNEST(keywords_cloud.list) AS keywords_cloud 59 | ) AS k, 60 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 61 | ), keywords_libraries AS ( 62 | SELECT DISTINCT keywords_libraries AS element, 63 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 64 | FROM ( 65 | SELECT DISTINCT keywords_libraries.element AS keywords_libraries 66 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 67 | UNNEST(keywords_libraries.list) AS keywords_libraries 68 | ) AS k, 69 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 70 | ), keywords_webframeworks AS ( 71 | SELECT DISTINCT keywords_webframeworks AS element, 72 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 73 | FROM ( 74 | SELECT DISTINCT keywords_webframeworks.element AS keywords_webframeworks 75 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 76 | UNNEST(keywords_webframeworks.list) AS keywords_webframeworks 77 | ) AS k, 78 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 79 | ), keywords_os AS ( 80 | SELECT DISTINCT keywords_os AS element, 81 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 82 | FROM ( 83 | SELECT DISTINCT keywords_os.element AS keywords_os 84 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 85 | UNNEST(keywords_os.list) AS keywords_os 86 | ) AS k, 87 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 88 | ), keywords_analyst_tools AS ( 89 | SELECT DISTINCT keywords_analyst_tools AS element, 90 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 91 | FROM ( 92 | SELECT DISTINCT keywords_analyst_tools.element AS keywords_analyst_tools 93 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 94 | UNNEST(keywords_analyst_tools.list) AS keywords_analyst_tools 95 | ) AS k, 96 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 97 | ), keywords_other AS ( 98 | SELECT DISTINCT keywords_other AS element, 99 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 100 | FROM ( 101 | SELECT DISTINCT keywords_other.element AS keywords_other 102 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 103 | UNNEST(keywords_other.list) AS keywords_other 104 | ) AS k, 105 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 106 | ), keywords_async AS ( 107 | SELECT DISTINCT keywords_async AS element, 108 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 109 | FROM ( 110 | SELECT DISTINCT keywords_async.element AS keywords_async 111 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 112 | UNNEST(keywords_async.list) AS keywords_async 113 | ) AS k, 114 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), 
'"{}', ''))) kv 115 | ) 116 | 117 | SELECT * FROM keywords 118 | UNION ALL 119 | SELECT * FROM keywords_programming 120 | UNION ALL 121 | SELECT * FROM keywords_databases 122 | UNION ALL 123 | SELECT * FROM keywords_cloud 124 | UNION ALL 125 | SELECT * FROM keywords_libraries 126 | UNION ALL 127 | SELECT * FROM keywords_webframeworks 128 | UNION ALL 129 | SELECT * FROM keywords_os 130 | UNION ALL 131 | SELECT * FROM keywords_analyst_tools 132 | UNION ALL 133 | SELECT * FROM keywords_other 134 | UNION ALL 135 | SELECT * FROM keywords_async 136 | ); 137 | 138 | 139 | -- Salary table for salary page 140 | -- Had to create a physical table as unable to figure out how to export this query to one CSV for cache of website 141 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings_clean.gsearch_salary_wide AS 142 | (SELECT 143 | job_title_final AS job_title, 144 | search_term, 145 | salary_avg, 146 | salary_min, 147 | salary_max, 148 | salary_year, 149 | salary_hour, 150 | search_location, 151 | job_location, 152 | job_schedule_type, 153 | job_via, 154 | search_country, 155 | search_time, 156 | FROM 157 | `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 158 | WHERE salary_avg IS NOT NULL 159 | ); -------------------------------------------------------------------------------- /docker-compose.yaml: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, 12 | # software distributed under the License is distributed on an 13 | # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 14 | # KIND, either express or implied. See the License for the 15 | # specific language governing permissions and limitations 16 | # under the License. 17 | # 18 | 19 | # Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL. 20 | # 21 | # WARNING: This configuration is for local development. Do not use it in a production deployment. 22 | # 23 | # This configuration supports basic configuration using environment variables or an .env file 24 | # The following variables are supported: 25 | # 26 | # AIRFLOW_IMAGE_NAME - Docker image name used to run Airflow. 27 | # Default: apache/airflow:2.5.0 28 | # AIRFLOW_UID - User ID in Airflow containers 29 | # Default: 50000 30 | # Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode 31 | # 32 | # _AIRFLOW_WWW_USER_USERNAME - Username for the administrator account (if requested). 33 | # Default: airflow 34 | # _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator account (if requested). 35 | # Default: airflow 36 | # _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers. 37 | # Default: '' 38 | # 39 | # Feel free to modify this file to suit your needs. 40 | --- 41 | version: '3' 42 | x-airflow-common: 43 | &airflow-common 44 | # In order to add custom dependencies or upgrade provider packages you can use your extended image. 
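# (That is the approach taken here: the stock image line is replaced below and "build: ." builds the image from the local Dockerfile.)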
45 | # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml 46 | # and uncomment the "build" line below, then run `docker-compose build` to build the images. 47 | # https://stackoverflow.com/questions/67887138/how-to-install-packages-in-airflow-docker-compose 48 | # REPLACED # image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.5.0} 49 | build: . 50 | environment: 51 | &airflow-common-env 52 | AIRFLOW__CORE__EXECUTOR: CeleryExecutor 53 | AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow 54 | # For backward compatibility with Airflow <2.3 55 | AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow 56 | AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow 57 | AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0 58 | AIRFLOW__CORE__FERNET_KEY: '' 59 | AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'false' 60 | # Stopped loading core examples as they cluttered up the UI 61 | AIRFLOW__CORE__LOAD_EXAMPLES: 'false' 62 | AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth' 63 | # For email 64 | AIRFLOW__SMTP__SMTP_HOST: smtp.gmail.com 65 | AIRFLOW__SMTP__SMTP_PORT: 587 66 | AIRFLOW__SMTP__SMTP_USER: ${AIRFLOW__SMTP__SMTP_USER} 67 | AIRFLOW__SMTP__SMTP_PASSWORD: ${AIRFLOW__SMTP__SMTP_PASSWORD} 68 | AIRFLOW__SMTP__SMTP_MAIL_FROM: ${AIRFLOW__SMTP__SMTP_USER} 69 | # 'docker-compose up' failed with additional pip requirements here; it's also bad practice, so build a separate image instead (see the link above) 70 | _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-} # pandas google-search-results google-cloud-bigquery numpy matplotlib configparser 71 | GOOGLE_APPLICATION_CREDENTIALS: './dags/config/job-listings-366015-9ac668151dc3.json' 72 | volumes: 73 | - ./dags:/opt/airflow/dags 74 | - ./logs:/opt/airflow/logs 75 | - ./plugins:/opt/airflow/plugins 76 | user: "${AIRFLOW_UID:-50000}:0" 77 | depends_on: 78 | &airflow-common-depends-on 79 | redis: 80 | condition: service_healthy 81 | postgres: 82 | condition: service_healthy 83 | 84 | services: 85 | postgres: 86 | image: postgres:13 87 | environment: 88 | POSTGRES_USER: airflow 89 | POSTGRES_PASSWORD: airflow 90 | POSTGRES_DB: airflow 91 | volumes: 92 | - postgres-db-volume:/var/lib/postgresql/data 93 | healthcheck: 94 | test: ["CMD", "pg_isready", "-U", "airflow"] 95 | interval: 5s 96 | retries: 5 97 | restart: always 98 | 99 | redis: 100 | image: redis:latest 101 | expose: 102 | - 6379 103 | healthcheck: 104 | test: ["CMD", "redis-cli", "ping"] 105 | interval: 5s 106 | timeout: 30s 107 | retries: 50 108 | restart: always 109 | 110 | airflow-webserver: 111 | <<: *airflow-common 112 | command: webserver 113 | # Changed the published port to 8081 to not conflict with the QNAP port of 8080 114 | ports: 115 | - 8081:8080 116 | healthcheck: 117 | test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] # healthcheck runs inside the container, so it targets the internal port 8080, not the published 8081 118 | interval: 10s 119 | timeout: 10s 120 | retries: 5 121 | restart: always 122 | depends_on: 123 | <<: *airflow-common-depends-on 124 | airflow-init: 125 | condition: service_completed_successfully 126 | 127 | airflow-scheduler: 128 | <<: *airflow-common 129 | command: scheduler 130 | healthcheck: 131 | test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"'] 132 | interval: 10s 133 | timeout: 10s 134 | retries: 5 135 | restart: always 136 | depends_on: 137 | <<: *airflow-common-depends-on 138 | airflow-init: 139 | condition: service_completed_successfully 140 | 
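  # The ${...} references in the common environment above (SMTP user/password, UID, extra pip
  # requirements) are resolved by docker-compose from the shell environment or from a .env file
  # next to this docker-compose.yaml. Illustrative .env entries for the email settings
  # (placeholder values, not from this repo; Gmail SMTP typically requires an app password):
  #   AIRFLOW__SMTP__SMTP_USER=you@example.com
  #   AIRFLOW__SMTP__SMTP_PASSWORD=your-gmail-app-password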
141 | airflow-worker: 142 | <<: *airflow-common 143 | command: celery worker 144 | healthcheck: 145 | test: 146 | - "CMD-SHELL" 147 | - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"' 148 | interval: 10s 149 | timeout: 10s 150 | retries: 5 151 | environment: 152 | <<: *airflow-common-env 153 | # Required to handle warm shutdown of the celery workers properly 154 | # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation 155 | DUMB_INIT_SETSID: "0" 156 | restart: always 157 | depends_on: 158 | <<: *airflow-common-depends-on 159 | airflow-init: 160 | condition: service_completed_successfully 161 | 162 | airflow-triggerer: 163 | <<: *airflow-common 164 | command: triggerer 165 | healthcheck: 166 | test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"'] 167 | interval: 10s 168 | timeout: 10s 169 | retries: 5 170 | restart: always 171 | depends_on: 172 | <<: *airflow-common-depends-on 173 | airflow-init: 174 | condition: service_completed_successfully 175 | 176 | airflow-init: 177 | <<: *airflow-common 178 | entrypoint: /bin/bash 179 | # yamllint disable rule:line-length 180 | command: 181 | - -c 182 | - | 183 | function ver() { 184 | printf "%04d%04d%04d%04d" $${1//./ } 185 | } 186 | airflow_version=$$(AIRFLOW__LOGGING__LOGGING_LEVEL=INFO && gosu airflow airflow version) 187 | airflow_version_comparable=$$(ver $${airflow_version}) 188 | min_airflow_version=2.2.0 189 | min_airflow_version_comparable=$$(ver $${min_airflow_version}) 190 | if (( airflow_version_comparable < min_airflow_version_comparable )); then 191 | echo 192 | echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m" 193 | echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!" 194 | echo 195 | exit 1 196 | fi 197 | if [[ -z "${AIRFLOW_UID}" ]]; then 198 | echo 199 | echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m" 200 | echo "If you are on Linux, you SHOULD follow the instructions below to set " 201 | echo "AIRFLOW_UID environment variable, otherwise files will be owned by root." 202 | echo "For other operating systems you can get rid of the warning with manually created .env file:" 203 | echo " See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user" 204 | echo 205 | fi 206 | one_meg=1048576 207 | mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg)) 208 | cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat) 209 | disk_available=$$(df / | tail -1 | awk '{print $$4}') 210 | warning_resources="false" 211 | if (( mem_available < 4000 )) ; then 212 | echo 213 | echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m" 214 | echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))" 215 | echo 216 | warning_resources="true" 217 | fi 218 | if (( cpus_available < 2 )); then 219 | echo 220 | echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m" 221 | echo "At least 2 CPUs recommended. You have $${cpus_available}" 222 | echo 223 | warning_resources="true" 224 | fi 225 | if (( disk_available < one_meg * 10 )); then 226 | echo 227 | echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m" 228 | echo "At least 10 GBs recommended. 
You have $$(numfmt --to iec $$((disk_available * 1024 )))" 229 | echo 230 | warning_resources="true" 231 | fi 232 | if [[ $${warning_resources} == "true" ]]; then 233 | echo 234 | echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m" 235 | echo "Please follow the instructions to increase amount of resources available:" 236 | echo " https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin" 237 | echo 238 | fi 239 | mkdir -p /sources/logs /sources/dags /sources/plugins 240 | chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins} 241 | exec /entrypoint airflow version 242 | # yamllint enable rule:line-length 243 | environment: 244 | <<: *airflow-common-env 245 | _AIRFLOW_DB_UPGRADE: 'true' 246 | _AIRFLOW_WWW_USER_CREATE: 'true' 247 | _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow} 248 | _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow} 249 | _PIP_ADDITIONAL_REQUIREMENTS: '' 250 | user: "0:0" 251 | volumes: 252 | - .:/sources 253 | 254 | airflow-cli: 255 | <<: *airflow-common 256 | profiles: 257 | - debug 258 | environment: 259 | <<: *airflow-common-env 260 | CONNECTION_CHECK_MAX_COUNT: "0" 261 | # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252 262 | command: 263 | - bash 264 | - -c 265 | - airflow 266 | 267 | # You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up 268 | # or by explicitly targeted on the command line e.g. docker-compose up flower. 269 | # See: https://docs.docker.com/compose/profiles/ 270 | flower: 271 | <<: *airflow-common 272 | command: celery flower 273 | profiles: 274 | - flower 275 | ports: 276 | - 5555:5555 277 | healthcheck: 278 | test: ["CMD", "curl", "--fail", "http://localhost:5555/"] 279 | interval: 10s 280 | timeout: 10s 281 | retries: 5 282 | restart: always 283 | depends_on: 284 | <<: *airflow-common-depends-on 285 | airflow-init: 286 | condition: service_completed_successfully 287 | 288 | volumes: 289 | postgres-db-volume: 290 | -------------------------------------------------------------------------------- /extra/airflow_graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/extra/airflow_graph.png -------------------------------------------------------------------------------- /extra/bigquery_schema.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "name": "title", 4 | "mode": "NULLABLE", 5 | "type": "STRING", 6 | "fields": [] 7 | }, 8 | { 9 | "name": "company_name", 10 | "mode": "NULLABLE", 11 | "type": "STRING", 12 | "fields": [] 13 | }, 14 | { 15 | "name": "location", 16 | "mode": "NULLABLE", 17 | "type": "STRING", 18 | "fields": [] 19 | }, 20 | { 21 | "name": "via", 22 | "mode": "NULLABLE", 23 | "type": "STRING", 24 | "fields": [] 25 | }, 26 | { 27 | "name": "description", 28 | "mode": "NULLABLE", 29 | "type": "STRING", 30 | "fields": [] 31 | }, 32 | { 33 | "name": "extensions", 34 | "mode": "REPEATED", 35 | "type": "STRING", 36 | "fields": [] 37 | }, 38 | { 39 | "name": "job_id", 40 | "mode": "NULLABLE", 41 | "type": "STRING", 42 | "fields": [] 43 | }, 44 | { 45 | "name": "thumbnail", 46 | "mode": "NULLABLE", 47 | "type": "STRING", 48 | "fields": [] 49 | }, 50 | { 51 | "name": "posted_at", 52 | "mode": "NULLABLE", 53 | "type": "STRING", 54 | 
"fields": [] 55 | }, 56 | { 57 | "name": "schedule_type", 58 | "mode": "NULLABLE", 59 | "type": "STRING", 60 | "fields": [] 61 | }, 62 | { 63 | "name": "work_from_home", 64 | "mode": "NULLABLE", 65 | "type": "BOOLEAN", 66 | "fields": [] 67 | }, 68 | { 69 | "name": "salary", 70 | "mode": "NULLABLE", 71 | "type": "STRING", 72 | "fields": [] 73 | }, 74 | { 75 | "name": "search_term", 76 | "mode": "NULLABLE", 77 | "type": "STRING", 78 | "fields": [] 79 | }, 80 | { 81 | "name": "date_time", 82 | "mode": "NULLABLE", 83 | "type": "DATETIME", 84 | "fields": [] 85 | }, 86 | { 87 | "name": "search_location", 88 | "mode": "NULLABLE", 89 | "type": "STRING", 90 | "fields": [] 91 | }, 92 | { 93 | "name": "commute_time", 94 | "mode": "NULLABLE", 95 | "type": "STRING", 96 | "fields": [] 97 | } 98 | ] -------------------------------------------------------------------------------- /extra/dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/extra/dashboard.png -------------------------------------------------------------------------------- /extra/dataproc_files/README.md: -------------------------------------------------------------------------------- 1 | # Dataproc Python Files 2 | 3 | Note: These files are included for example purposes only and may or may not match the current versions of the files in the Dataproc cluster. 4 | 5 | --- 6 | ### Spark Cluster Creation 7 | 8 | ``` 9 | REGION=us-central1 10 | ZONE=us-central1-a 11 | CLUSTER_NAME=spark-cluster 12 | BUCKET_NAME=dataproc-cluster-gsearch 13 | 14 | gcloud dataproc clusters create ${CLUSTER_NAME} \ 15 | --enable-component-gateway \ 16 | --region ${REGION} \ 17 | --zone ${ZONE} \ 18 | --bucket ${BUCKET_NAME} \ 19 | --master-machine-type n2-standard-2 \ 20 | --master-boot-disk-size 500 \ 21 | --num-workers 2 \ 22 | --worker-machine-type n2-standard-2 \ 23 | --worker-boot-disk-size 500 \ 24 | --image-version 1.5-debian10 \ 25 | --optional-components ANACONDA,JUPYTER \ 26 | --project job-listings-366015 \ 27 | --properties=^#^spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.6,com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1 \ 28 | --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display' \ 29 | --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh 30 | ``` -------------------------------------------------------------------------------- /extra/dataproc_files/Salary Table.py: -------------------------------------------------------------------------------- 1 | from google.cloud import bigquery 2 | from google.cloud import storage 3 | 4 | import sparknlp 5 | from sparknlp.annotator import Lemmatizer, Stemmer, Tokenizer, Normalizer, TextMatcher, DocumentNormalizer 6 | from sparknlp.base import DocumentAssembler, Finisher 7 | 8 | from pyspark.sql import SparkSession 9 | from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF 10 | from pyspark.ml.clustering import LDA 11 | from pyspark.sql.functions import split, regexp_replace, when, lower, array_distinct, monotonically_increasing_id, col, size 12 | from pyspark.ml import Pipeline 13 | 14 | import pandas as pd 15 | 16 | # Start Spark Session 17 | spark = SparkSession \ 18 | .builder \ 19 | .appName('BigQuery Storage & Spark DataFrames') \ 20 | .getOrCreate() 21 | 22 | # BigQuery Setup 23 | dataset = 'gsearch_job_listings_clean' 24 | table = 
'job-listings-366015.gsearch_job_listings_clean.gsearch_jobs_fact' 25 | BUCKET_NAME = 'dataproc-cluster-gsearch' 26 | # need to read from the table with SQL 27 | spark.conf.set("viewsEnabled","true") 28 | spark.conf.set("materializationDataset", dataset) 29 | # need to set temp bucket for writing 30 | spark.conf.set("temporaryGcsBucket", BUCKET_NAME) 31 | 32 | # Salary Cleanup 33 | sql = """ 34 | SELECT job_id, job_salary 35 | FROM `job-listings-366015.gsearch_job_listings_clean.gsearch_jobs_fact` 36 | WHERE job_salary IS NOT null 37 | """ 38 | salary = spark.read \ 39 | .format("bigquery") \ 40 | .load(sql) 41 | # drop duplicates so the final dimension table has only one row per job_id to match on 42 | salary = salary.dropDuplicates(['job_id']) 43 | 44 | salary_clean = salary 45 | 46 | # split the salary string on ' ' into the pay amount and the pay rate 47 | split_col = split(salary_clean.job_salary, ' ',) 48 | salary_clean = salary_clean.withColumn('salary_pay', split_col.getItem(0))\ 49 | .withColumn('salary_rate', split_col.getItem(2))\ 50 | .drop('job_salary') 51 | 52 | # remove commas, dollar signs, and spaces 53 | salary_clean = salary_clean.withColumn('salary_pay', regexp_replace('salary_pay', ',', '')) 54 | salary_clean = salary_clean.withColumn('salary_pay', regexp_replace('salary_pay', '\\$', ''))  # '$' is escaped because regexp_replace treats a bare '$' as a regex anchor 55 | salary_clean = salary_clean.withColumn('salary_pay', regexp_replace('salary_pay', ' ', '')) 56 | 57 | # start creating 'salary_avg' from rows whose pay has no '–' (a single value rather than a range) 58 | # The character U+2013 "–" could be confused with the character U+002d "-", which is more common in source code. 59 | salary_clean = salary_clean.withColumn( 60 | 'salary_avg', 61 | when(salary_clean.salary_pay.contains("–"), None)\ 62 | .otherwise(salary_clean.salary_pay)) 63 | 64 | # create 'salary_min' & 'salary_max' columns for cleaning the ranged values 65 | salary_clean = salary_clean.withColumn( 66 | 'salary_', 67 | when(salary_clean.salary_pay.contains("–"), salary_clean.salary_pay)\ 68 | .otherwise(None)) 69 | split_col = split(salary_clean.salary_, "–",) 70 | salary_clean = salary_clean.withColumn('salary_min', split_col.getItem(0))\ 71 | .withColumn('salary_max', split_col.getItem(1))\ 72 | .drop('salary_') 73 | 74 | # remove 'K' and multiply those values by 1000 75 | for column in ['salary_avg', 'salary_min', 'salary_max']: 76 | salary_clean = salary_clean.withColumn( 77 | column, 78 | when(salary_clean[column].contains('K'), regexp_replace(column, 'K', '').cast('float')*1000)\ 79 | .otherwise(salary_clean[column])) 80 | salary_clean = salary_clean.withColumn(column,salary_clean[column].cast('float')) 81 | 82 | # update 'salary_avg' column to take the average of 'salary_min' and 'salary_max' 83 | salary_clean = salary_clean.withColumn( 84 | 'salary_avg', 85 | when(salary_clean.salary_min.isNotNull(), 86 | (salary_clean.salary_min + salary_clean.salary_max)/2)\ 87 | .otherwise(salary_clean.salary_avg)) 88 | salary_clean = salary_clean.withColumn('salary_avg',salary_clean.salary_avg.cast('float')) 89 | 90 | for rate in ['year', 'hour']: 91 | salary_clean = salary_clean.withColumn( 92 | 'salary_'+rate, 93 | when(salary_clean.salary_rate.contains(rate), salary_clean.salary_avg)\ 94 | .otherwise(None)) 95 | # salary_clean = salary_clean.withColumn('salary_'+rate,salary_clean['salary_'+rate].cast('float')) 96 | 97 | print("Number of jobs with salary: ", salary_clean.count()) 98 | salary_clean.printSchema() 99 | 100 | # Write to BigQuery 101 | salary_clean.write.format('bigquery') \ 102 | .option('table', 'job-listings-366015.gsearch_job_listings_clean.gsearch_salary') \ 103 | .option('materializationExpirationTimeInMinutes', 
60) \ 104 | .mode('overwrite') \ 105 | .save() -------------------------------------------------------------------------------- /extra/dataproc_files/Skill Table.py: -------------------------------------------------------------------------------- 1 | from google.cloud import bigquery 2 | from google.cloud import storage 3 | 4 | import sparknlp 5 | from sparknlp.annotator import Lemmatizer, Stemmer, Tokenizer, Normalizer, TextMatcher, DocumentNormalizer 6 | from sparknlp.base import DocumentAssembler, Finisher 7 | 8 | from pyspark.sql import SparkSession 9 | from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF 10 | from pyspark.ml.clustering import LDA 11 | from pyspark.sql.functions import split, regexp_replace, when, lower, array_distinct, monotonically_increasing_id, col, size 12 | from pyspark.ml import Pipeline 13 | 14 | import pandas as pd 15 | 16 | """ 17 | Start Spark Session 18 | """ 19 | spark = SparkSession \ 20 | .builder \ 21 | .appName('BigQuery Storage & Spark DataFrames') \ 22 | .getOrCreate() 23 | 24 | 25 | """ 26 | #BigQuery Setup 27 | """ 28 | dataset = 'gsearch_job_listings_clean' 29 | table = 'job-listings-366015.gsearch_job_listings_clean.gsearch_jobs_fact' 30 | BUCKET_NAME = 'dataproc-cluster-gsearch' 31 | # need to read with SQL from table 32 | spark.conf.set("viewsEnabled","true") 33 | spark.conf.set("materializationDataset", dataset) 34 | # need to set temp bucket for writing 35 | spark.conf.set("temporaryGcsBucket", BUCKET_NAME) 36 | 37 | """ 38 | Keywords Setup 39 | """ 40 | # Keywords from hand picking from data and Stack Overflow survey https://survey.stackoverflow.co/2022/#technology-most-popular-technologies 41 | keywords_programming = [ 42 | 'sql', 'python', 'r', 'c', 'c#', 'javascript', 'java', 'scala', 'sas', 'matlab', 43 | 'c++', 'perl', 'go', 'typescript', 'bash', 'html', 'css', 'php', 'powershell', 'rust', 44 | 'kotlin', 'ruby', 'dart', 'assembly', 'swift', 'vba', 'lua', 'groovy', 'delphi', 'objective-c', 45 | 'haskell', 'elixir', 'julia', 'clojure', 'solidity', 'lisp', 'f#', 'fortran', 'erlang', 'apl', 46 | 'cobol', 'ocaml', 'crystal', 'golang', 'nosql', 'mongodb', 't-sql', 'no-sql', 47 | 'pascal', 'mongo', 'sass', 'vb.net', 'shell', 'visual basic', 48 | ] 49 | # 'js', 'c/c++', 'pl/sql', 'javascript/typescript', 'visualbasic', 'objective c', 50 | 51 | keywords_databases = [ 52 | 'mysql', 'sql server', 'postgresql', 'sqlite', 'mongodb', 'redis', 'mariadb', 53 | 'elasticsearch', 'firebase', 'dynamodb', 'firestore', 'cassandra', 'neo4j', 'db2', 54 | 'couchbase', 'couchdb', 55 | ] 56 | # 'mssql', 'sqlserver', 'postgres', 57 | 58 | keywords_cloud = [ 59 | 'aws', 'azure', 'gcp', 'firebase', 'heroku', 'digitalocean', 'vmware', 'managedhosting', 60 | 'linode', 'ovh', 'oracle', 'openstack', 'watson', 'colocation', 61 | 'snowflake', 'redshift', 'bigquery', 'aurora', 'databricks', 'ibm cloud', 62 | ] 63 | # 'googlecloud', 'google cloud', 'oraclecloud', 'oracle cloud' 'amazonweb', 'amazon web', 'ibmcloud', 64 | 65 | keywords_libraries = [ 66 | 'scikit-learn', 'jupyter', 'theano', 'openCV', 'pyspark', 'nltk', 'mlpack', 'chainer', 'fann', 'shogun', 67 | 'dlib', 'mxnet', 'keras', '.net', 'numpy', 'pandas', 'matplotlib', 'spring', 'tensorflow', 'flutter', 68 | 'react', 'kafka', 'electron', 'pytorch', 'qt', 'ionic', 'xamarin', 'spark', 'cordova', 'hadoop', 'gtx', 69 | 'capacitor', 'tidyverse', 'unoplatform', 'dplyr', 'tidyr', 'ggplot2', 'plotly', 'rshiny', 'mlr', 70 | 'airflow', 'seaborn', 'gdpr', 'graphql', 'selenium', 'hugging face', 'uno platform' 
71 | 72 | ] 73 | # 'huggingface', 74 | 75 | keywords_webframeworks = [ 76 | 'node.js', 'vue', 'vue.js', 'ember.js', 'node', 'jquery', 'asp.net', 'react.js', 'express', 77 | 'angular', 'asp.netcore', 'django', 'flask', 'next.js', 'laravel', 'angular.js', 'fastapi', 'ruby', 78 | 'svelte', 'blazor', 'nuxt.js', 'symfony', 'gatsby', 'drupal', 'phoenix', 'fastify', 'deno', 79 | 'asp.net core', 'ruby on rails', 'play framework', 80 | ] 81 | # 'jse/jee', 'rubyonrails', 'playframework', 82 | 83 | keywords_os = [ 84 | 'unix', 'linux', 'windows', 'macos', 'wsl', 'ubuntu', 'centos', 'debian', 'redhat', 85 | 'suse', 'fedora', 'kali', 'arch', 86 | ] 87 | # 'unix/linux', 'linux/unix', 88 | 89 | keywords_analyst_tools = [ 90 | 'excel', 'tableau', 'word', 'powerpoint', 'looker', 'power bi', 'outlook', 'sas', 'sharepoint', 'visio', 91 | 'spreadsheet', 'alteryx', 'ssis', 'spss', 'ssrs', 'microstrategy', 'cognos', 'dax', 92 | 'esquisse', 'sap', 'splunk', 'qlik', 'nuix', 'datarobot', 'ms access', 'sheets', 93 | ] 94 | # 'powerbi', 'powerpoints', 'spreadsheets', 95 | 96 | keywords_other = [ 97 | 'npm', 'docker', 'yarn', 'homebrew', 'kubernetes', 'terraform', 'unity', 'ansible', 'unreal', 'puppet', 98 | 'chef', 'pulumi', 'flow', 'git', 'svn', 'gitlab', 'github', 'jenkins', 'bitbucket', 'terminal', 'atlassian', 99 | 'codecommit', 100 | ] 101 | 102 | keywords_async = [ 103 | 'jira', 'confluence', 'trello', 'notion', 'asana', 'clickup', 'planner', 'monday.com', 'airtable', 'smartsheet', 104 | 'wrike', 'workfront', 'dingtalk', 'swit', 'workzone', 'projectplace', 'cerri', 'wimi', 'leankor', 'microsoft lists' 105 | ] 106 | # 'microsoftlists', 107 | 108 | keywords_sync = [ 109 | 'slack', 'microsoft teams', 'twilio', 'zoom', 'webex', 'mattermost', 'rocketchat', 'ringcentral', 110 | 'symphony', 'wire', 'wickr', 'unify', 'coolfire', 'google chat', 111 | ] 112 | 113 | # keywords_skills = [ 114 | # 'coding', 'server', 'database', 'cloud', 'warehousing', 'scrum', 'devops', 'programming', 'saas', 'ci/cd', 'cicd', 115 | # 'ml', 'data_lake', 'frontend',' front-end', 'back-end', 'backend', 'json', 'xml', 'ios', 'kanban', 'nlp', 116 | # 'iot', 'codebase', 'agile/scrum', 'agile', 'ai/ml', 'ai', 'paas', 'machine_learning', 'macros', 'iaas', 117 | # 'fullstack', 'dataops', 'scrum/agile', 'ssas', 'mlops', 'debug', 'etl', 'a/b', 'slack', 'erp', 'oop', 118 | # 'object-oriented', 'etl/elt', 'elt', 'dashboarding', 'big-data', 'twilio', 'ui/ux', 'ux/ui', 'vlookup', 119 | # 'crossover', 'data_lake', 'data_lakes', 'bi', 120 | # ] 121 | 122 | # put all keywords in a dict 123 | keywords_dict = { 124 | 'keywords_programming': keywords_programming, 'keywords_databases': keywords_databases, 'keywords_cloud': keywords_cloud, 125 | 'keywords_libraries': keywords_libraries, 'keywords_webframeworks': keywords_webframeworks, 'keywords_os': keywords_os, 126 | 'keywords_analyst_tools': keywords_analyst_tools, 'keywords_other': keywords_other, 'keywords_async': keywords_async, 127 | 'keywords_sync': keywords_sync 128 | } 129 | 130 | # create a list of all keywords 131 | keywords_all = [item for sublist in keywords_dict.values() for item in sublist] 132 | # add keywords_all to dict 133 | keywords_dict['keywords_all'] = keywords_all 134 | 135 | """ 136 | Keywords Save 137 | """ 138 | # write keywords to google storage bucket 139 | client = storage.Client() 140 | bucket = client.get_bucket(BUCKET_NAME) 141 | 142 | # save all keywords variations to different files 143 | for key, value in keywords_dict.items(): 144 | keywords_df = pd.DataFrame(value, 
index=None) 145 | bucket.blob(f'notebooks/jupyter/keywords/{key}.txt').upload_from_string(keywords_df.to_csv(index=False, header=False) , content_type='text/csv') 146 | 147 | """ 148 | Create Keywords Dataframe 149 | """ 150 | # using SQL to read data to ensure JSON values not selected (throws error) 151 | # delete where statement to process all jobs over again 152 | sql = """ 153 | SELECT job_id, job_description 154 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 155 | WHERE job_id NOT IN 156 | (SELECT job_id FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills) 157 | """ 158 | skills = spark.read \ 159 | .format("bigquery") \ 160 | .load(sql) 161 | 162 | # drop duplicates so when create final dimension table only one job_id to match on 163 | skills = skills.dropDuplicates(['job_id']) 164 | 165 | # lowercase the description here for pre-cleanup before model (couldn't get to work with SparkNLP lib) 166 | skills = skills.withColumn('job_description', lower(skills.job_description)) 167 | 168 | # make final dataframe to append to 169 | skills_final = skills 170 | skills_final = skills_final.withColumn("id", monotonically_increasing_id()) 171 | skills_final = skills_final.alias('skills_final') 172 | 173 | for keyword_name, keyword_list in keywords_dict.items(): 174 | 175 | # Makes document tokenizable 176 | document_assembler = DocumentAssembler() \ 177 | .setInputCol("job_description") \ 178 | .setOutputCol("document") 179 | 180 | # Capture multi-words as one word 181 | # can't have space in between or won't pick up multiple word 182 | multi_words = [ 183 | 'power_bi', 'sql_server', 'google_cloud', 'visual_basic', 'oracle_cloud', 184 | 'ibm_cloud', 'hugging_face', 'uno_platform', 'microsoft_lists', 'ms_access', 185 | 'microsoft_teams', 'google_chat', 'asp.net_core', 'ruby_on_rails', 'play_framework', 186 | ] 187 | # 'objective_c', 'amazon_web' 188 | 189 | # Tokenizes document with multi_word exceptions 190 | tokenizer = Tokenizer() \ 191 | .setInputCols(["document"]) \ 192 | .setOutputCol("token") \ 193 | .setCaseSensitiveExceptions(False) \ 194 | .setExceptions(multi_words) 195 | 196 | # Get all the keywords we defined above 197 | keywords_file = f'gs://dataproc-cluster-gsearch/notebooks/jupyter/keywords/{keyword_name}.txt' 198 | keywords_all = TextMatcher() \ 199 | .setInputCols(["document", "token"])\ 200 | .setOutputCol("matcher")\ 201 | .setEntities(keywords_file)\ 202 | .setCaseSensitive(False)\ 203 | .setEntityValue(keyword_name) 204 | 205 | # Make output machine readable 206 | finisher = Finisher() \ 207 | .setInputCols(["matcher"]) \ 208 | .setOutputCols([keyword_name]) \ 209 | .setValueSplitSymbol(" ") 210 | 211 | pipeline = Pipeline( 212 | stages = [ 213 | document_assembler, 214 | tokenizer, 215 | keywords_all, 216 | finisher, 217 | ] 218 | ) 219 | 220 | # Fit the data to the model. 
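    # Expectation (illustrative, based on the TextMatcher/Finisher config above): after the
    # fit/transform below, each category column (e.g. keywords_programming) should hold an array
    # of the matched terms from the job description, roughly ['python', 'sql'] for a posting that
    # mentions both. The array_distinct/size steps further down deduplicate matches and null out
    # rows with no hits. A quick sanity check once `model` exists:
    #   model.select('job_id', keyword_name).show(5, truncate=False)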
221 | model = pipeline.fit(skills).transform(skills) 222 | 223 | # Remove duplicate tokens, was unable to figure out how to do this in the model 224 | model = model.withColumn(keyword_name, array_distinct(keyword_name)) 225 | 226 | from pyspark.sql.types import StringType, ArrayType 227 | model = model.withColumn(keyword_name, when(size(col(keyword_name))==0 , None).otherwise(col(keyword_name))) 228 | 229 | # Drop unnecessary columns prior to join to final df 230 | model = model.drop('job_id', 'job_description') 231 | 232 | # Assign index number for joining 233 | model = model.withColumn("id", monotonically_increasing_id()) 234 | 235 | # Join model df with the final df 236 | skills_final = skills_final.join(model, ['id']) 237 | 238 | # No longer need id column 239 | skills_final = skills_final.drop('id') 240 | print("Number of jobs with skills: ", skills_final.count()) 241 | 242 | """ 243 | Write to BigQuery 244 | """ 245 | skills_final.write.format('bigquery') \ 246 | .option('table', 'job-listings-366015.gsearch_job_listings_clean.gsearch_skills') \ 247 | .option('materializationExpirationTimeInMinutes', 180) \ 248 | .mode('append') \ 249 | .save() 250 | # 'overwrite' or 'append' -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/All Keywords.txt: -------------------------------------------------------------------------------- 1 | sql 2 | python 3 | r 4 | c 5 | c# 6 | javascript 7 | java 8 | scala 9 | sas 10 | matlab 11 | c++ 12 | perl 13 | go 14 | typescript 15 | bash 16 | html 17 | css 18 | php 19 | powershell 20 | rust 21 | kotlin 22 | ruby 23 | dart 24 | assembly 25 | swift 26 | vba 27 | lua 28 | groovy 29 | delphi 30 | objective-c 31 | haskell 32 | elixir 33 | julia 34 | clojure 35 | solidity 36 | lisp 37 | f# 38 | fortran 39 | erlang 40 | apl 41 | cobol 42 | ocaml 43 | crystal 44 | golang 45 | nosql 46 | mongodb 47 | t-sql 48 | no-sql 49 | pascal 50 | mongo 51 | sass 52 | vb.net 53 | shell 54 | visual basic 55 | mysql 56 | sql server 57 | postgresql 58 | sqlite 59 | mongodb 60 | redis 61 | mariadb 62 | elasticsearch 63 | firebase 64 | dynamodb 65 | firestore 66 | cassandra 67 | neo4j 68 | db2 69 | couchbase 70 | couchdb 71 | aws 72 | azure 73 | gcp 74 | firebase 75 | heroku 76 | digitalocean 77 | vmware 78 | managedhosting 79 | linode 80 | ovh 81 | oracle 82 | openstack 83 | watson 84 | colocation 85 | snowflake 86 | redshift 87 | bigquery 88 | aurora 89 | databricks 90 | ibm cloud 91 | scikit-learn 92 | jupyter 93 | theano 94 | openCV 95 | pyspark 96 | nltk 97 | mlpack 98 | chainer 99 | fann 100 | shogun 101 | dlib 102 | mxnet 103 | keras 104 | .net 105 | numpy 106 | pandas 107 | matplotlib 108 | spring 109 | tensorflow 110 | flutter 111 | react 112 | kafka 113 | electron 114 | pytorch 115 | qt 116 | ionic 117 | xamarin 118 | spark 119 | cordova 120 | hadoop 121 | gtx 122 | capacitor 123 | tidyverse 124 | unoplatform 125 | dplyr 126 | tidyr 127 | ggplot2 128 | plotly 129 | rshiny 130 | mlr 131 | airflow 132 | seaborn 133 | gdpr 134 | graphql 135 | selenium 136 | hugging face 137 | uno platform 138 | node.js 139 | vue 140 | vue.js 141 | ember.js 142 | node 143 | jquery 144 | asp.net 145 | react.js 146 | express 147 | angular 148 | asp.netcore 149 | django 150 | flask 151 | next.js 152 | laravel 153 | angular.js 154 | fastapi 155 | ruby 156 | svelte 157 | blazor 158 | nuxt.js 159 | symfony 160 | gatsby 161 | drupal 162 | phoenix 163 | fastify 164 | deno 165 | asp.net core 166 | ruby on rails 167 | play framework 168 | unix 
169 | linux 170 | windows 171 | macos 172 | wsl 173 | ubuntu 174 | centos 175 | debian 176 | redhat 177 | suse 178 | fedora 179 | kali 180 | arch 181 | excel 182 | tableau 183 | word 184 | powerpoint 185 | looker 186 | power bi 187 | outlook 188 | sas 189 | sharepoint 190 | visio 191 | spreadsheet 192 | alteryx 193 | ssis 194 | spss 195 | ssrs 196 | microstrategy 197 | cognos 198 | dax 199 | esquisse 200 | sap 201 | splunk 202 | qlik 203 | nuix 204 | datarobot 205 | ms access 206 | sheets 207 | npm 208 | docker 209 | yarn 210 | homebrew 211 | kubernetes 212 | terraform 213 | unity 214 | ansible 215 | unreal 216 | puppet 217 | chef 218 | pulumi 219 | flow 220 | git 221 | svn 222 | gitlab 223 | github 224 | jenkins 225 | bitbucket 226 | terminal 227 | atlassian 228 | codecommit 229 | jira 230 | confluence 231 | trello 232 | notion 233 | asana 234 | clickup 235 | planner 236 | monday.com 237 | airtable 238 | smartsheet 239 | wrike 240 | workfront 241 | dingtalk 242 | swit 243 | workzone 244 | projectplace 245 | cerri 246 | wimi 247 | leankor 248 | microsoft lists 249 | slack 250 | microsoft teams 251 | twilio 252 | zoom 253 | webex 254 | mattermost 255 | rocketchat 256 | ringcentral 257 | symphony 258 | wire 259 | wickr 260 | unify 261 | coolfire 262 | google chat 263 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Analyst Tools.txt: -------------------------------------------------------------------------------- 1 | excel 2 | tableau 3 | word 4 | powerpoint 5 | looker 6 | power bi 7 | outlook 8 | sas 9 | sharepoint 10 | visio 11 | spreadsheet 12 | alteryx 13 | ssis 14 | spss 15 | ssrs 16 | microstrategy 17 | cognos 18 | dax 19 | esquisse 20 | sap 21 | splunk 22 | qlik 23 | nuix 24 | datarobot 25 | ms access 26 | sheets 27 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Cloud.txt: -------------------------------------------------------------------------------- 1 | aws 2 | azure 3 | gcp 4 | firebase 5 | heroku 6 | digitalocean 7 | vmware 8 | managedhosting 9 | linode 10 | ovh 11 | oracle 12 | openstack 13 | watson 14 | colocation 15 | snowflake 16 | redshift 17 | bigquery 18 | aurora 19 | databricks 20 | ibm cloud 21 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Databases.txt: -------------------------------------------------------------------------------- 1 | mysql 2 | sql server 3 | postgresql 4 | sqlite 5 | mongodb 6 | redis 7 | mariadb 8 | elasticsearch 9 | firebase 10 | dynamodb 11 | firestore 12 | cassandra 13 | neo4j 14 | db2 15 | couchbase 16 | couchdb 17 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Libraries.txt: -------------------------------------------------------------------------------- 1 | scikit-learn 2 | jupyter 3 | theano 4 | openCV 5 | pyspark 6 | nltk 7 | mlpack 8 | chainer 9 | fann 10 | shogun 11 | dlib 12 | mxnet 13 | keras 14 | .net 15 | numpy 16 | pandas 17 | matplotlib 18 | spring 19 | tensorflow 20 | flutter 21 | react 22 | kafka 23 | electron 24 | pytorch 25 | qt 26 | ionic 27 | xamarin 28 | spark 29 | cordova 30 | hadoop 31 | gtx 32 | capacitor 33 | tidyverse 34 | unoplatform 35 | dplyr 36 | tidyr 37 | ggplot2 38 | plotly 39 | rshiny 40 | mlr 41 | airflow 42 | seaborn 43 | gdpr 44 | graphql 45 | selenium 46 | hugging face 47 | uno platform 48 | 
-------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords OS.txt: -------------------------------------------------------------------------------- 1 | unix 2 | linux 3 | windows 4 | macos 5 | wsl 6 | ubuntu 7 | centos 8 | debian 9 | redhat 10 | suse 11 | fedora 12 | kali 13 | arch 14 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Other.txt: -------------------------------------------------------------------------------- 1 | npm 2 | docker 3 | yarn 4 | homebrew 5 | kubernetes 6 | terraform 7 | unity 8 | ansible 9 | unreal 10 | puppet 11 | chef 12 | pulumi 13 | flow 14 | git 15 | svn 16 | gitlab 17 | github 18 | jenkins 19 | bitbucket 20 | terminal 21 | atlassian 22 | codecommit 23 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Sync.txt: -------------------------------------------------------------------------------- 1 | slack 2 | microsoft teams 3 | twilio 4 | zoom 5 | webex 6 | mattermost 7 | rocketchat 8 | ringcentral 9 | symphony 10 | wire 11 | wickr 12 | unify 13 | coolfire 14 | google chat 15 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords async.txt: -------------------------------------------------------------------------------- 1 | jira 2 | confluence 3 | trello 4 | notion 5 | asana 6 | clickup 7 | planner 8 | monday.com 9 | airtable 10 | smartsheet 11 | wrike 12 | workfront 13 | dingtalk 14 | swit 15 | workzone 16 | projectplace 17 | cerri 18 | wimi 19 | leankor 20 | microsoft lists 21 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Programming Keywords.txt: -------------------------------------------------------------------------------- 1 | sql 2 | python 3 | r 4 | c 5 | c# 6 | javascript 7 | java 8 | scala 9 | sas 10 | matlab 11 | c++ 12 | perl 13 | go 14 | typescript 15 | bash 16 | html 17 | css 18 | php 19 | powershell 20 | rust 21 | kotlin 22 | ruby 23 | dart 24 | assembly 25 | swift 26 | vba 27 | lua 28 | groovy 29 | delphi 30 | objective-c 31 | haskell 32 | elixir 33 | julia 34 | clojure 35 | solidity 36 | lisp 37 | f# 38 | fortran 39 | erlang 40 | apl 41 | cobol 42 | ocaml 43 | crystal 44 | golang 45 | nosql 46 | mongodb 47 | t-sql 48 | no-sql 49 | pascal 50 | mongo 51 | sass 52 | vb.net 53 | shell 54 | visual basic 55 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Web Frameworks Keywords.txt: -------------------------------------------------------------------------------- 1 | node.js 2 | vue 3 | vue.js 4 | ember.js 5 | node 6 | jquery 7 | asp.net 8 | react.js 9 | express 10 | angular 11 | asp.netcore 12 | django 13 | flask 14 | next.js 15 | laravel 16 | angular.js 17 | fastapi 18 | ruby 19 | svelte 20 | blazor 21 | nuxt.js 22 | symfony 23 | gatsby 24 | drupal 25 | phoenix 26 | fastify 27 | deno 28 | asp.net core 29 | ruby on rails 30 | play framework 31 | -------------------------------------------------------------------------------- /logs/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/logs/.gitkeep -------------------------------------------------------------------------------- /plugins/.gitkeep: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/plugins/.gitkeep -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | pandas 3 | matplotlib 4 | configparser 5 | google-search-results 6 | google-cloud-bigquery 7 | google-cloud-storage 8 | google-api-python-client 9 | google-auth-oauthlib 10 | google-auth-httplib2 11 | torch 12 | torchvision 13 | transformers 14 | pandas-gbq --------------------------------------------------------------------------------
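For a quick end-to-end check of what this pipeline produces, here is a minimal sketch (not part of the repository) that reads the `gsearch_salary_wide` table created by the salary query above, using the `google-cloud-bigquery` package from requirements.txt; it assumes `GOOGLE_APPLICATION_CREDENTIALS` points at the service-account JSON referenced in docker-compose.yaml:
```
from google.cloud import bigquery

# Assumes the service-account key is available via GOOGLE_APPLICATION_CREDENTIALS
# and that gsearch_salary_wide has been built by the SQL earlier in this repo.
client = bigquery.Client(project="job-listings-366015")

query = """
    SELECT job_title, salary_avg, salary_year, search_country
    FROM `job-listings-366015.gsearch_job_listings_clean.gsearch_salary_wide`
    WHERE salary_year IS NOT NULL
    ORDER BY salary_avg DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.job_title, row.salary_avg, row.search_country)
```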