├── .DS_Store ├── .gitignore ├── Dockerfile ├── README.md ├── dags ├── airflow-log-cleanup.py ├── bigquery_pipeline.py ├── config │ └── .gitkeep ├── gcp_connection.py ├── modules │ ├── country.py │ ├── country_codes.csv │ ├── job_title.py │ └── youtube_views.csv ├── serpapi_bigquery.py └── sql │ ├── .project │ ├── cache_csv.sql │ ├── fact_build.sql │ ├── public_build.sql │ └── wide_build.sql ├── docker-compose.yaml ├── extra ├── airflow_graph.png ├── bigquery_schema.json ├── dashboard.png └── dataproc_files │ ├── README.md │ ├── Salary Table.py │ ├── Skill Table.py │ └── keywords │ ├── All Keywords.txt │ ├── Keywords Analyst Tools.txt │ ├── Keywords Cloud.txt │ ├── Keywords Databases.txt │ ├── Keywords Libraries.txt │ ├── Keywords OS.txt │ ├── Keywords Other.txt │ ├── Keywords Sync.txt │ ├── Keywords async.txt │ ├── Programming Keywords.txt │ └── Web Frameworks Keywords.txt ├── logs └── .gitkeep ├── plugins └── .gitkeep └── requirements.txt /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | config.py 2 | .env 3 | *.ini 4 | dags/config/* 5 | !dags/config/.gitkeep 6 | logs/* 7 | !logs/.gitkeep 8 | __pycache__/ 9 | token.json 10 | client_secret.json 11 | google_oauth.py -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM apache/airflow:2.5.0 2 | COPY requirements.txt . 3 | RUN pip install -r requirements.txt -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![Apache Airflow](https://img.shields.io/badge/Apache%20Airflow-017CEE?style=for-the-badge&logo=Apache%20Airflow&logoColor=white) 2 | 3 | # 🤓 Data Job Pipeline w/ Airflow 4 | ![Airflow DAG](/extra/airflow_graph.png) 5 | What up, data nerds! This is a data pipeline I built that moves Google job search data from [SerpApi](https://serpapi.com/) to a BigQuery database. 6 | 7 | 8 | ## Background 9 | I built an [app](https://jobdata.streamlit.app/) to open-source job requirements to aspiring data nerds so they can more efficiently focus on the skills they need to know for their job. This airflow pipeline collects the data for this app. 10 | [![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://jobdata.streamlit.app/) 11 | ![dashboard](/extra/dashboard.png) 12 | 13 | # ☝🏻 Prerequisites 14 | - [Docker](https://docs.docker.com/get-docker/) installed on your server/machine 15 | 16 | SERVER NOTE: I used a QNAP machine for my server, here's prereq's for that: 17 | - [Familiar with QNAP instructions](https://www.qnap.com/en/how-to/faq/article/how-do-i-access-my-qnap-nas-using-ssh) 18 | - Enable SSH via Control Panel 19 | - Setup *Container/* folder via Container Station 20 | 21 | # 📲 Install 22 | ## Docker-Compose install via SSH 23 | 24 | 25 | 1. Access server via SSH 26 | ``` 27 | ssh admin@192.168.1.131 28 | ``` 29 | 2. Change directory to main directory to add cloned directory 30 | ``` 31 | cd .. 32 | cd share/Container 33 | ``` 34 | 3. 
Add this repository to that directory
35 | ```
36 | git clone https://github.com/lukebarousse/Data_Job_Pipeline_Airflow.git
37 | ```
38 | 4. Change directory into the repo root and add the environment variable for start-up
39 | ```
40 | cd Data_Job_Pipeline_Airflow
41 | echo -e "AIRFLOW_UID=$(id -u)" > .env
42 | ```
43 | 
44 | # ㊙️ Secret Keys
45 | ## Prereq
46 | 1. Access server via SSH
47 | ```
48 | ssh admin@192.168.1.131
49 | ```
50 | 2. Change directory to the *config/* directory
51 | ```
52 | cd ..
53 | cd dags/config
54 | ```
55 | ## SerpApi key
56 | Prerequisite: SerpApi account with enough credits
57 | 1. Get your private API key from the [SerpApi Dashboard](https://serpapi.com/dashboard)
58 | 2. Create a Python file for the SerpApi key
59 | ```
60 | echo -e "serpapi_key = '{Insert your key here}'" > config.py # or dags/config/config.py if in root
61 | ```
62 | ## BigQuery Access
63 | Prerequisite: Empty BigQuery table created with [this schema](/extra/bigquery_schema.json)
64 | 1. Follow [Google Cloud detailed documentation](https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries) to:
65 | - Enable the BigQuery API
66 | - Create a service account
67 | - Create & download a service account key (JSON file)
68 | 2. Place the JSON file in the [/dags/config](/dags/config/) directory
69 | 3. Add the location of the JSON file to [docker-compose.yaml](docker-compose.yaml)
70 | ```
71 | environment:
72 | GOOGLE_APPLICATION_CREDENTIALS: './dags/config/{Insert name of JSON file}.json'
73 | ```
74 | 4. Add the table ID for the JSON table to the config file
75 | ```
76 | echo -e "table_id_json = '{PROJECT_ID}.{DATASET}.{TABLE}'" >> config.py
77 | ```
78 | I also keep this as a backup:
79 | ```
80 | echo -e "table_id = '{PROJECT_ID}.{DATASET}.{TABLE}'" >> config.py
81 | ```
82 | 
83 | # 🐳 Start & Stop
84 | Reference: [Airflow with Docker-Compose](https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html)
85 | ## Start-up of Docker-Compose
86 | NOTE: If you don't want to use SerpApi credits on the first run, set `TESTING_DAG = True` in 'serpapi_bigquery.py'
87 | 1. If necessary, SSH and cd into the root directory
88 | ```
89 | cd ..
90 | cd share/Container/Data_Job_Pipeline_Airflow
91 | ```
92 | 2. If running for the first time, initialize the Airflow database
93 | ```
94 | docker-compose up airflow-init
95 | ```
96 | 3. Start Airflow
97 | ```
98 | docker-compose up
99 | ```
100 | 
101 | ## Shutdown & removal of Docker-Compose
102 | 1. If necessary, SSH and cd into the root directory
103 | ```
104 | cd ..
105 | cd share/Container/Data_Job_Pipeline_Airflow
106 | ```
107 | 2. Stop and delete containers, delete volumes with database data, and remove downloaded images
108 | ```
109 | docker-compose down --volumes --rmi all
110 | ```
111 | 
112 | # 🫁 Appendix
113 | ### Want to contribute?
114 | - **Data Analysis:** Share any interesting insights you find from the [dataset](https://www.kaggle.com/datasets/lukebarousse/data-analyst-job-postings-google-search) to this [subreddit](https://www.reddit.com/r/DataNerd/) and/or [Kaggle](https://www.kaggle.com/code/lukebarousse/eda-of-job-posting-data).
115 | - **Dashboard Build:** Contribute changes to the dashboard by using [that repo to fork and open a pull request](https://github.com/lukebarousse/Data_Analyst_Streamlit_App_V1).
116 | - **Data Pipeline Build:** Contribute changes to this pipeline by using [this repo to fork and open a pull request](https://github.com/lukebarousse/Data_Job_Pipeline_Airflow) 117 | --- 118 | ### About the project 119 | - Background on app 📺 [YouTube](https://www.youtube.com/lukebarousse) 120 | - Data provided via 🤖 [SerpApi](https://serpapi.com/) 121 | -------------------------------------------------------------------------------- /dags/airflow-log-cleanup.py: -------------------------------------------------------------------------------- 1 | """ 2 | Source: https://github.com/teamclairvoyant/airflow-maintenance-dags/tree/master/log-cleanup 3 | 4 | A maintenance workflow that you can deploy into Airflow to periodically clean 5 | out the task logs to avoid those getting too big. 6 | """ 7 | import logging 8 | import os 9 | from datetime import timedelta 10 | 11 | import airflow 12 | import jinja2 13 | from airflow.configuration import conf 14 | from airflow.models import DAG, Variable 15 | from airflow.operators.bash_operator import BashOperator 16 | from airflow.operators.dummy_operator import DummyOperator 17 | 18 | # airflow-log-cleanup 19 | DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "") 20 | START_DATE = airflow.utils.dates.days_ago(1) 21 | try: 22 | BASE_LOG_FOLDER = conf.get("core", "BASE_LOG_FOLDER").rstrip("/") 23 | except Exception as e: 24 | BASE_LOG_FOLDER = conf.get("logging", "BASE_LOG_FOLDER").rstrip("/") 25 | # How often to Run. @daily - Once a day at Midnight 26 | SCHEDULE_INTERVAL = "@weekly" 27 | # Who is listed as the owner of this DAG in the Airflow Web Server 28 | DAG_OWNER_NAME = "airflow" 29 | # List of email address to send email alerts to if this job fails 30 | ALERT_EMAIL_ADDRESSES = ['luke@lukebarousse.com'] 31 | # Length to retain the log files if not already provided in the conf. If this 32 | # is set to 30, the job will remove those files that are 30 days old or older 33 | DEFAULT_MAX_LOG_AGE_IN_DAYS = 30 34 | # Can set as variable in U/I 35 | # Variable.get( 36 | # "airflow_log_cleanup__max_log_age_in_days", 30 37 | # ) 38 | # Whether the job should delete the logs or not. Included if you want to 39 | # temporarily avoid deleting the logs 40 | ENABLE_DELETE = True 41 | # The number of worker nodes you have in Airflow. Will attempt to run this 42 | # process for however many workers there are so that each worker gets its 43 | # logs cleared. 44 | NUMBER_OF_WORKERS = 1 45 | DIRECTORIES_TO_DELETE = [BASE_LOG_FOLDER] 46 | ENABLE_DELETE_CHILD_LOG = "False" 47 | # Can set as variable in U/I 48 | # Variable.get( 49 | # "airflow_log_cleanup__enable_delete_child_log", "False" 50 | # ) 51 | LOG_CLEANUP_PROCESS_LOCK_FILE = "/tmp/airflow_log_cleanup_worker.lock" 52 | logging.info("ENABLE_DELETE_CHILD_LOG " + ENABLE_DELETE_CHILD_LOG) 53 | 54 | if not BASE_LOG_FOLDER or BASE_LOG_FOLDER.strip() == "": 55 | raise ValueError( 56 | "BASE_LOG_FOLDER variable is empty in airflow.cfg. It can be found " 57 | "under the [core] (<2.0.0) section or [logging] (>=2.0.0) in the cfg file. " 58 | "Kindly provide an appropriate directory path." 
59 | ) 60 | 61 | if ENABLE_DELETE_CHILD_LOG.lower() == "true": 62 | try: 63 | CHILD_PROCESS_LOG_DIRECTORY = conf.get( 64 | "scheduler", "CHILD_PROCESS_LOG_DIRECTORY" 65 | ) 66 | if CHILD_PROCESS_LOG_DIRECTORY != ' ': 67 | DIRECTORIES_TO_DELETE.append(CHILD_PROCESS_LOG_DIRECTORY) 68 | except Exception as e: 69 | logging.exception( 70 | "Could not obtain CHILD_PROCESS_LOG_DIRECTORY from " + 71 | "Airflow Configurations: " + str(e) 72 | ) 73 | 74 | default_args = { 75 | 'owner': DAG_OWNER_NAME, 76 | 'depends_on_past': False, 77 | 'email': ALERT_EMAIL_ADDRESSES, 78 | 'email_on_failure': True, 79 | 'email_on_retry': False, 80 | 'start_date': START_DATE, 81 | 'retries': 1, 82 | 'retry_delay': timedelta(minutes=1) 83 | } 84 | 85 | dag = DAG( 86 | DAG_ID, 87 | default_args=default_args, 88 | schedule_interval=SCHEDULE_INTERVAL, 89 | start_date=START_DATE, 90 | template_undefined=jinja2.Undefined, 91 | tags=['maintenance-dag'], 92 | ) 93 | if hasattr(dag, 'doc_md'): 94 | dag.doc_md = __doc__ 95 | if hasattr(dag, 'catchup'): 96 | dag.catchup = False 97 | 98 | start = DummyOperator( 99 | task_id='start', 100 | dag=dag) 101 | 102 | log_cleanup = """ 103 | 104 | echo "Getting Configurations..." 105 | BASE_LOG_FOLDER="{{params.directory}}" 106 | WORKER_SLEEP_TIME="{{params.sleep_time}}" 107 | 108 | sleep ${WORKER_SLEEP_TIME}s 109 | 110 | MAX_LOG_AGE_IN_DAYS="{{dag_run.conf.maxLogAgeInDays}}" 111 | if [ "${MAX_LOG_AGE_IN_DAYS}" == "" ]; then 112 | echo "maxLogAgeInDays conf variable isn't included. Using Default '""" + str(DEFAULT_MAX_LOG_AGE_IN_DAYS) + """'." 113 | MAX_LOG_AGE_IN_DAYS='""" + str(DEFAULT_MAX_LOG_AGE_IN_DAYS) + """' 114 | fi 115 | ENABLE_DELETE=""" + str("true" if ENABLE_DELETE else "false") + """ 116 | echo "Finished Getting Configurations" 117 | echo "" 118 | 119 | echo "Configurations:" 120 | echo "BASE_LOG_FOLDER: '${BASE_LOG_FOLDER}'" 121 | echo "MAX_LOG_AGE_IN_DAYS: '${MAX_LOG_AGE_IN_DAYS}'" 122 | echo "ENABLE_DELETE: '${ENABLE_DELETE}'" 123 | 124 | cleanup() { 125 | echo "Executing Find Statement: $1" 126 | FILES_MARKED_FOR_DELETE=`eval $1` 127 | echo "Process will be Deleting the following File(s)/Directory(s):" 128 | echo "${FILES_MARKED_FOR_DELETE}" 129 | echo "Process will be Deleting `echo "${FILES_MARKED_FOR_DELETE}" | \ 130 | grep -v '^$' | wc -l` File(s)/Directory(s)" \ 131 | # "grep -v '^$'" - removes empty lines. 132 | # "wc -l" - Counts the number of lines 133 | echo "" 134 | if [ "${ENABLE_DELETE}" == "true" ]; 135 | then 136 | if [ "${FILES_MARKED_FOR_DELETE}" != "" ]; 137 | then 138 | echo "Executing Delete Statement: $2" 139 | eval $2 140 | DELETE_STMT_EXIT_CODE=$? 141 | if [ "${DELETE_STMT_EXIT_CODE}" != "0" ]; then 142 | echo "Delete process failed with exit code \ 143 | '${DELETE_STMT_EXIT_CODE}'" 144 | 145 | echo "Removing lock file..." 146 | rm -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 147 | if [ "${REMOVE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 148 | echo "Error removing the lock file. \ 149 | Check file permissions.\ 150 | To re-run the DAG, ensure that the lock file has been \ 151 | deleted (""" + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """)." 152 | exit ${REMOVE_LOCK_FILE_EXIT_CODE} 153 | fi 154 | exit ${DELETE_STMT_EXIT_CODE} 155 | fi 156 | else 157 | echo "WARN: No File(s)/Directory(s) to Delete" 158 | fi 159 | else 160 | echo "WARN: You're opted to skip deleting the File(s)/Directory(s)!!!" 161 | fi 162 | } 163 | 164 | 165 | if [ ! -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ ]; then 166 | 167 | echo "Lock file not found on this node! 
\ 168 | Creating it to prevent collisions..." 169 | touch """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 170 | CREATE_LOCK_FILE_EXIT_CODE=$? 171 | if [ "${CREATE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 172 | echo "Error creating the lock file. \ 173 | Check if the airflow user can create files under tmp directory. \ 174 | Exiting..." 175 | exit ${CREATE_LOCK_FILE_EXIT_CODE} 176 | fi 177 | 178 | echo "" 179 | echo "Running Cleanup Process..." 180 | 181 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type f -mtime \ 182 | +${MAX_LOG_AGE_IN_DAYS}" 183 | DELETE_STMT="${FIND_STATEMENT} -exec rm -f {} \;" 184 | 185 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 186 | CLEANUP_EXIT_CODE=$? 187 | 188 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/*/* -type d -empty" 189 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 190 | 191 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 192 | CLEANUP_EXIT_CODE=$? 193 | 194 | FIND_STATEMENT="find ${BASE_LOG_FOLDER}/* -type d -empty" 195 | DELETE_STMT="${FIND_STATEMENT} -prune -exec rm -rf {} \;" 196 | 197 | cleanup "${FIND_STATEMENT}" "${DELETE_STMT}" 198 | CLEANUP_EXIT_CODE=$? 199 | 200 | echo "Finished Running Cleanup Process" 201 | 202 | echo "Deleting lock file..." 203 | rm -f """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ 204 | REMOVE_LOCK_FILE_EXIT_CODE=$? 205 | if [ "${REMOVE_LOCK_FILE_EXIT_CODE}" != "0" ]; then 206 | echo "Error removing the lock file. Check file permissions. To re-run the DAG, ensure that the lock file has been deleted (""" + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """)." 207 | exit ${REMOVE_LOCK_FILE_EXIT_CODE} 208 | fi 209 | 210 | else 211 | echo "Another task is already deleting logs on this worker node. \ 212 | Skipping it!" 213 | echo "If you believe you're receiving this message in error, kindly check \ 214 | if """ + str(LOG_CLEANUP_PROCESS_LOCK_FILE) + """ exists and delete it." 
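# Another worker on this node already holds the lock, so skip cleanup and exit 0 to mark the task successful rather than failed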
215 | exit 0 216 | fi 217 | 218 | """ 219 | 220 | for log_cleanup_id in range(1, NUMBER_OF_WORKERS + 1): 221 | 222 | for dir_id, directory in enumerate(DIRECTORIES_TO_DELETE): 223 | 224 | log_cleanup_op = BashOperator( 225 | task_id='log_cleanup_worker_num_' + 226 | str(log_cleanup_id) + '_dir_' + str(dir_id), 227 | bash_command=log_cleanup, 228 | params={ 229 | "directory": str(directory), 230 | "sleep_time": int(log_cleanup_id)*3}, 231 | dag=dag) 232 | 233 | log_cleanup_op.set_upstream(start) -------------------------------------------------------------------------------- /dags/bigquery_pipeline.py: -------------------------------------------------------------------------------- 1 | """ 2 | An operations workflow to clean up BigQuery table making fact and dimension tables 3 | """ 4 | import os 5 | import warnings 6 | warnings.simplefilter(action='ignore', category=FutureWarning) # stop getting Pandas FutureWarning's 7 | 8 | import airflow 9 | from airflow import DAG 10 | from airflow.operators.dummy_operator import DummyOperator 11 | from airflow.operators.python_operator import PythonOperator 12 | from config import config # contains secret keys in config.py 13 | from google.cloud import bigquery 14 | from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator 15 | from airflow.providers.google.cloud.operators.dataproc import ClusterGenerator, DataprocDeleteClusterOperator, DataprocCreateClusterOperator, DataprocSubmitJobOperator 16 | from airflow.operators.http_operator import SimpleHttpOperator 17 | from datetime import timedelta 18 | 19 | from youtube.google_oauth import update_video 20 | from modules.job_title import transform_job_title 21 | 22 | 23 | # 'False' DAG is ready for operation; 'True' DAG only runs 'start' DummyOperator 24 | TESTING_DAG = False 25 | # Minutes to sleep on an error 26 | ERROR_SLEEP_MIN = 5 27 | # Who is listed as the owner of this DAG in the Airflow Web Server 28 | DAG_OWNER_NAME = "airflow" 29 | # List of email address to send email alerts to if this job fails 30 | ALERT_EMAIL_ADDRESSES = ['luke@lukebarousse.com'] 31 | START_DATE = airflow.utils.dates.days_ago(1) 32 | 33 | default_args = { 34 | 'owner': DAG_OWNER_NAME, 35 | 'depends_on_past': False, 36 | 'start_date': START_DATE, 37 | 'email': ALERT_EMAIL_ADDRESSES, 38 | 'email_on_failure': True, 39 | 'email_on_retry': True, 40 | 'retries': 1, 41 | 'retry_delay': timedelta(minutes=5), 42 | # 'queue': 'bash_queue', 43 | # 'pool': 'backfill', 44 | # 'priority_weight': 10, 45 | # 'end_date': datetime(2022, 1, 1), 46 | # 'wait_for_downstream': False, 47 | # 'dag': dag, 48 | # 'sla': timedelta(hours=2), 49 | # 'execution_timeout': timedelta(seconds=300), 50 | # 'on_failure_callback': some_function, 51 | # 'on_success_callback': some_other_function, 52 | # 'on_retry_callback': another_function, 53 | # 'sla_miss_callback': yet_another_function, 54 | # 'trigger_rule': 'all_success' 55 | } 56 | 57 | # Dataproc cluster variables 58 | PROJECT_ID = 'job-listings-366015' 59 | REGION = 'us-central1' 60 | ZONE = 'us-central1-a' 61 | CLUSTER_NAME = 'spark-cluster' 62 | BUCKET_NAME = 'dataproc-cluster-gsearch' 63 | 64 | CLUSTER_CONFIG = ClusterGenerator( 65 | task_id='start_cluster', 66 | gcp_conn_id='google_cloud_default', 67 | project_id=PROJECT_ID, 68 | cluster_name=CLUSTER_NAME, 69 | region=REGION, 70 | zone=ZONE, 71 | storage_bucket=BUCKET_NAME, 72 | num_workers=2, 73 | master_machine_type='n2-standard-2', 74 | worker_machine_type='n2-standard-2', 75 | image_version='1.5-debian10', 76 | 
optional_components=['ANACONDA', 'JUPYTER'], 77 | properties={'spark:spark.jars.packages': 'com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.6,com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1'}, 78 | metadata={'PIP_PACKAGES': 'spark-nlp spark-nlp-display'}, 79 | init_actions_uris=['gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh'], 80 | auto_delete_ttl=60*60*4, # 4 hours 81 | enable_component_gateway=True 82 | ).make() 83 | 84 | dag = DAG( 85 | 'bigquery_pipeline', 86 | description='Execute BigQuery & Dataproc jobs for data pipeline', 87 | default_args=default_args, 88 | schedule_interval='0 10 * * *', # want to run following serpapi_bigquery dag... best practice is to combine don't want to combine and make script so big 89 | catchup=False, 90 | tags=['data-pipeline-dag'], 91 | max_active_tasks = 3 92 | ) 93 | 94 | with dag: 95 | 96 | start = DummyOperator( 97 | task_id='start', 98 | dag=dag) 99 | 100 | if not TESTING_DAG: 101 | 102 | # Create fact table from JSON data from SerpApi 103 | fact_table_build = BigQueryInsertJobOperator( 104 | task_id='fact_table_build', 105 | gcp_conn_id='google_cloud_default', 106 | configuration={ 107 | "query": { 108 | "query": 'sql/fact_build.sql', 109 | "useLegacySql": False 110 | } 111 | }, 112 | dag=dag 113 | ) 114 | 115 | # Start up dataproc cluster with spark-nlp and spark-bigquery dependencies 116 | # NOTE: CLI commands in notes under 'Dataproc' section 117 | start_cluster = DataprocCreateClusterOperator( 118 | task_id='start_cluster', 119 | cluster_name=CLUSTER_NAME, 120 | project_id=PROJECT_ID, 121 | region=REGION, 122 | cluster_config=CLUSTER_CONFIG, 123 | dag=dag 124 | ) 125 | 126 | # Create (and/or replace) salary table 127 | python_file_salary = 'gs://dataproc-cluster-gsearch/notebooks/jupyter/salary_table.py' 128 | SALARY_TABLE = { 129 | "reference": {"project_id": PROJECT_ID}, 130 | "placement": {"cluster_name": CLUSTER_NAME}, 131 | "pyspark_job": {"main_python_file_uri": python_file_salary}, 132 | } 133 | 134 | salary_table = DataprocSubmitJobOperator( 135 | task_id='salary_table', 136 | job=SALARY_TABLE, 137 | project_id=PROJECT_ID, 138 | region=REGION, 139 | dag=dag 140 | ) 141 | 142 | # Append to skill table 143 | python_file_skill = 'gs://dataproc-cluster-gsearch/notebooks/jupyter/skill_table.py' 144 | SKILL_TABLE = { 145 | "reference": {"project_id": PROJECT_ID}, 146 | "placement": {"cluster_name": CLUSTER_NAME}, 147 | "pyspark_job": {"main_python_file_uri": python_file_skill}, 148 | } 149 | 150 | skill_table = DataprocSubmitJobOperator( 151 | task_id='skill_table', 152 | job=SKILL_TABLE, 153 | project_id=PROJECT_ID, 154 | region=REGION, 155 | dag=dag 156 | ) 157 | 158 | # Shut down dataproc cluster 159 | stop_cluster = DataprocDeleteClusterOperator( 160 | task_id='stop_cluster', 161 | project_id=PROJECT_ID, 162 | cluster_name=CLUSTER_NAME, 163 | region=REGION, 164 | dag=dag 165 | ) 166 | 167 | # Transform job title using BART 168 | transform_job = PythonOperator( 169 | task_id='transform_job_title', 170 | python_callable=transform_job_title, 171 | dag=dag 172 | ) 173 | 174 | # Combine fact table with dimension table 175 | wide_table_build = BigQueryInsertJobOperator( 176 | task_id='wide_table_build', 177 | gcp_conn_id='google_cloud_default', 178 | configuration={ 179 | "query": { 180 | "query": 'sql/wide_build.sql', 181 | "useLegacySql": False 182 | } 183 | }, 184 | dag=dag 185 | ) 186 | 187 | # Cache common BigQuery queries in CSV files 188 | cache_csv = BigQueryInsertJobOperator( 189 | 
task_id='cache_csv', 190 | gcp_conn_id='google_cloud_default', 191 | configuration={ 192 | "query": { 193 | "query": 'sql/cache_csv.sql', 194 | "useLegacySql": False 195 | } 196 | }, 197 | dag=dag 198 | ) 199 | 200 | # Update video title in YouTube with number of job listings 201 | update_video_title = PythonOperator( 202 | task_id='update_video_title', 203 | python_callable=update_video, 204 | dag=dag 205 | ) 206 | 207 | # Create public dataset w/ No duplicates (for ChatGPT course) 208 | public_table_build = BigQueryInsertJobOperator( 209 | task_id='public_table_build', 210 | gcp_conn_id='google_cloud_default', 211 | configuration={ 212 | "query": { 213 | "query": 'sql/public_build.sql', 214 | "useLegacySql": False 215 | } 216 | }, 217 | dag=dag 218 | ) 219 | 220 | 221 | start >> fact_table_build >> start_cluster >> salary_table >> skill_table >> stop_cluster >> transform_job >> wide_table_build >> cache_csv >> update_video_title >> public_table_build 222 | -------------------------------------------------------------------------------- /dags/config/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/dags/config/.gitkeep -------------------------------------------------------------------------------- /dags/gcp_connection.py: -------------------------------------------------------------------------------- 1 | """ 2 | An initial workflow to enter in the conn_id for google cloud from the service account JSON 3 | file (i.e. GOOGLE_APPLICATION_CREDENTIALS environment variable specified in docker-compose.yaml) 4 | """ 5 | import os 6 | import airflow 7 | from airflow import DAG, settings 8 | from airflow.operators.python_operator import PythonOperator 9 | from datetime import datetime, timedelta 10 | from airflow.models import Connection 11 | 12 | import json 13 | 14 | DAG_OWNER_NAME = "airflow" 15 | # List of email address to send email alerts to if this job fails 16 | ALERT_EMAIL_ADDRESSES = ['luke@lukebarousse.com'] 17 | START_DATE = airflow.utils.dates.days_ago(1) 18 | 19 | default_args = { 20 | "owner": DAG_OWNER_NAME, 21 | "depends_on_past": False, 22 | "start_date": START_DATE, 23 | "email": ALERT_EMAIL_ADDRESSES, 24 | "email_on_failure": False, 25 | "email_on_retry": False, 26 | "retries": 3, 27 | "retry_delay": timedelta(minutes=5) 28 | } 29 | 30 | def add_gcp_connection(**kwargs): 31 | new_conn = Connection( 32 | conn_id="google_cloud_default", 33 | conn_type='google_cloud_platform', 34 | ) 35 | extra_field = { 36 | "extra__google_cloud_platform__scope": "https://www.googleapis.com/auth/cloud-platform", 37 | "extra__google_cloud_platform__project": "job-listings-366015", 38 | "extra__google_cloud_platform__key_path": os.environ.get("GOOGLE_APPLICATION_CREDENTIALS") 39 | } 40 | 41 | session = settings.Session() 42 | 43 | #checking if connection exist 44 | if session.query(Connection).filter(Connection.conn_id == new_conn.conn_id).first(): 45 | my_connection = session.query(Connection).filter(Connection.conn_id == new_conn.conn_id).one() 46 | my_connection.set_extra(json.dumps(extra_field)) 47 | session.add(my_connection) 48 | session.commit() 49 | else: #if it doesn't exit create one 50 | new_conn.set_extra(json.dumps(extra_field)) 51 | session.add(new_conn) 52 | session.commit() 53 | 54 | dag = DAG( 55 | "gcp_connection", 56 | default_args=default_args, 57 | schedule_interval="@once", 58 | tags=['initial-config'], 59 | ) 60 | 61 | 
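# NOTE: downstream DAGs pick this connection up by its conn_id, e.g.
#   BigQueryInsertJobOperator(..., gcp_conn_id='google_cloud_default')
# as done in bigquery_pipeline.py. To confirm the connection was created, the Airflow CLI
# can be run from a container (service name assumes the stock docker-compose setup):
#   docker-compose exec airflow-webserver airflow connections list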
with dag: 62 | activateGCP = PythonOperator( 63 | task_id='add_gcp_connection', 64 | python_callable=add_gcp_connection, 65 | provide_context=True, 66 | ) 67 | 68 | activateGCP -------------------------------------------------------------------------------- /dags/modules/country.py: -------------------------------------------------------------------------------- 1 | """ 2 | Classify countries by code and sort them by percentage of views 3 | """ 4 | 5 | import pandas as pd 6 | 7 | def view_percent(): 8 | # import different country codes 9 | codes = pd.read_csv("/opt/airflow/dags/modules/country_codes.csv") 10 | 11 | # import youtube views for my channel and calculate percentage viewed 12 | views = pd.read_csv("/opt/airflow/dags/modules/youtube_views.csv") 13 | views = views.iloc[1: , :] 14 | views = views[views.Views != 0] # removing countries with no views 15 | views = views[views.Geography != 'US'] # pulling US already 16 | views["percent"] = views['Watch time (hours)'] / views['Watch time (hours)'].sum() 17 | 18 | # no results returned from SerpApi from these countries 19 | # may consider removing from search in future, but doesn' appear to use search credits for no results 20 | no_country_results = ["MO", "IR", "SD", "SY", "SZ", "SS" ] # "Macao", "Iran", "Sudan", "Syria", "Eswatini", "South Sudan" 21 | 22 | # merge dataframes for final dataframe 23 | percent = views.merge(codes, how='left', left_on='Geography', right_on='code') 24 | percent = percent[['country','percent']] 25 | 26 | # return the dataframe 27 | return percent -------------------------------------------------------------------------------- /dags/modules/country_codes.csv: -------------------------------------------------------------------------------- 1 | code,country,country_long,year,ccTLD,notes 2 | AD,Andorra,Andorra,1974,.ad, 3 | AE,United Arab Emirates,United Arab Emirates,1974,.ae, 4 | AF,Afghanistan,Afghanistan,1974,.af, 5 | AG,Antigua and Barbuda,Antigua and Barbuda,1974,.ag, 6 | AI,Anguilla,Anguilla,1985,.ai,AI previously represented French Afars and Issas 7 | AL,Albania,Albania,1974,.al, 8 | AM,Armenia,Armenia,1992,.am, 9 | AO,Angola,Angola,1974,.ao, 10 | AQ,Antarctica,Antarctica,1974,.aq,"Covers the territories south of 60° south latitude 11 | Code taken from name in French: Antarctique" 12 | AR,Argentina,Argentina,1974,.ar, 13 | AS,American Samoa,American Samoa,1974,.as, 14 | AT,Austria,Austria,1974,.at, 15 | AU,Australia,Australia,1974,.au,Includes the Ashmore and Cartier Islands and the Coral Sea Islands 16 | AW,Aruba,Aruba,1986,.aw, 17 | AX,Åland Islands,Åland Islands,2004,.ax,An autonomous county of Finland 18 | AZ,Azerbaijan,Azerbaijan,1992,.az, 19 | BA,Bosnia and Herzegovina,Bosnia and Herzegovina,1992,.ba, 20 | BB,Barbados,Barbados,1974,.bb, 21 | BD,Bangladesh,Bangladesh,1974,.bd, 22 | BE,Belgium,Belgium,1974,.be, 23 | BF,Burkina Faso,Burkina Faso,1984,.bf,Name changed from Upper Volta (HV) 24 | BG,Bulgaria,Bulgaria,1974,.bg, 25 | BH,Bahrain,Bahrain,1974,.bh, 26 | BI,Burundi,Burundi,1974,.bi, 27 | BJ,Benin,Benin,1977,.bj,Name changed from Dahomey (DY) 28 | BL,Saint Barthélemy,Saint Barthélemy,2007,.bl, 29 | BM,Bermuda,Bermuda,1974,.bm, 30 | BN,Brunei,Brunei Darussalam,1974,.bn,Previous ISO country name: Brunei 31 | BO,Bolivia,Bolivia (Plurinational State of),1974,.bo,Previous ISO country name: Bolivia 32 | BQ,"Bonaire, Sint Eustatius and Saba","Bonaire, Sint Eustatius and Saba",2010,.bq,"Consists of three Caribbean ""special municipalities"", which are part of the Netherlands proper: Bonaire, Sint 
Eustatius, and Saba (the BES Islands) 33 | Previous ISO country name: Bonaire, Saint Eustatius and Saba 34 | BQ previously represented British Antarctic Territory" 35 | BR,Brazil,Brazil,1974,.br, 36 | BS,Bahamas,Bahamas,1974,.bs, 37 | BT,Bhutan,Bhutan,1974,.bt, 38 | BV,Bouvet Island,Bouvet Island,1974,.bv,Belongs to Norway 39 | BW,Botswana,Botswana,1974,.bw, 40 | BY,Belarus,Belarus,1974,.by,"Code taken from previous ISO country name: Byelorussian SSR (now assigned ISO 3166-3 code BYAA) 41 | Code assigned as the country was already a UN member since 1945[15]" 42 | BZ,Belize,Belize,1974,.bz, 43 | CA,Canada,Canada,1974,.ca, 44 | CC,Cocos (Keeling) Islands,Cocos (Keeling) Islands,1974,.cc,Belongs to Australia 45 | CD,"Congo, Democratic Republic of the","Congo, Democratic Republic of the",1997,.cd,Name changed from Zaire (ZR) 46 | CF,Central African Republic,Central African Republic,1974,.cf, 47 | CG,Congo,Congo,1974,.cg, 48 | CH,Switzerland,Switzerland,1974,.ch,Code taken from name in Latin: Confoederatio Helvetica 49 | CI,Côte d'Ivoire,Côte d'Ivoire,1974,.ci,ISO country name follows UN designation (common name and previous ISO country name: Ivory Coast) 50 | CK,Cook Islands,Cook Islands,1974,.ck, 51 | CL,Chile,Chile,1974,.cl, 52 | CM,Cameroon,Cameroon,1974,.cm,"Previous ISO country name: Cameroon, United Republic of" 53 | CN,China,China,1974,.cn, 54 | CO,Colombia,Colombia,1974,.co, 55 | CR,Costa Rica,Costa Rica,1974,.cr, 56 | CU,Cuba,Cuba,1974,.cu, 57 | CV,Cabo Verde,Cabo Verde,1974,.cv,"ISO country name follows UN designation (common name and previous ISO country name: Cape Verde, another previous ISO country name: Cape Verde Islands)" 58 | CW,Curaçao,Curaçao,2010,.cw, 59 | CX,Christmas Island,Christmas Island,1974,.cx,Belongs to Australia 60 | CY,Cyprus,Cyprus,1974,.cy, 61 | CZ,Czechia,Czechia,1993,.cz,Previous ISO country name: Czech Republic 62 | DE,Germany,Germany,1974,.de,"Code taken from name in German: Deutschland 63 | Code used for West Germany before 1990 (previous ISO country name: Germany, Federal Republic of)" 64 | DJ,Djibouti,Djibouti,1977,.dj,Name changed from French Afars and Issas (AI) 65 | DK,Denmark,Denmark,1974,.dk, 66 | DM,Dominica,Dominica,1974,.dm, 67 | DO,Dominican Republic,Dominican Republic,1974,.do, 68 | DZ,Algeria,Algeria,1974,.dz,"Code taken from name in Arabic الجزائر al-Djazā'ir, Algerian Arabic الدزاير al-Dzāyīr, or Berber ⴷⵣⴰⵢⵔ Dzayer" 69 | EC,Ecuador,Ecuador,1974,.ec, 70 | EE,Estonia,Estonia,1992,.ee,Code taken from name in Estonian: Eesti 71 | EG,Egypt,Egypt,1974,.eg, 72 | EH,Western Sahara,Western Sahara,1974,,"Previous ISO country name: Spanish Sahara (code taken from name in Spanish: Sahara español) 73 | .eh ccTLD has not been implemented.[16]" 74 | ER,Eritrea,Eritrea,1993,.er, 75 | ES,Spain,Spain,1974,.es,Code taken from name in Spanish: España 76 | ET,Ethiopia,Ethiopia,1974,.et, 77 | FI,Finland,Finland,1974,.fi, 78 | FJ,Fiji,Fiji,1974,.fj, 79 | FK,Falkland Islands,Falkland Islands (Malvinas),1974,.fk,ISO country name follows UN designation due to the Falkland Islands sovereignty dispute (local common name: Falkland Islands)[17] 80 | FM,Federated States of Micronesia,Micronesia (Federated States of),1986,.fm,Previous ISO country name: Micronesia 81 | FO,Faroe Islands,Faroe Islands,1974,.fo,Code taken from name in Faroese: Føroyar 82 | FR,France,France,1974,.fr,Includes Clipperton Island 83 | GA,Gabon,Gabon,1974,.ga, 84 | GB,United Kingdom,United Kingdom of Great Britain and Northern Ireland,1974,".gb 85 | (.uk)","Includes Akrotiri and Dhekelia (Sovereign 
Base Areas) 86 | Code taken from Great Britain (from official name: United Kingdom of Great Britain and Northern Ireland)[18] 87 | Previous ISO country name: United Kingdom 88 | .uk is the primary ccTLD of the United Kingdom instead of .gb (see code UK, which is exceptionally reserved)" 89 | GD,Grenada,Grenada,1974,.gd, 90 | GE,Georgia,Georgia,1992,.ge,GE previously represented Gilbert and Ellice Islands 91 | GF,French Guiana,French Guiana,1974,.gf,Code taken from name in French: Guyane française 92 | GG,Guernsey,Guernsey,2006,.gg,A British Crown Dependency 93 | GH,Ghana,Ghana,1974,.gh, 94 | GI,Gibraltar,Gibraltar,1974,.gi, 95 | GL,Greenland,Greenland,1974,.gl, 96 | GM,Gambia,Gambia,1974,.gm, 97 | GN,Guinea,Guinea,1974,.gn, 98 | GP,Guadeloupe,Guadeloupe,1974,.gp, 99 | GQ,Equatorial Guinea,Equatorial Guinea,1974,.gq,Code taken from name in French: Guinée équatoriale 100 | GR,Greece,Greece,1974,.gr, 101 | GS,South Georgia and the South Sandwich Islands,South Georgia and the South Sandwich Islands,1993,.gs, 102 | GT,Guatemala,Guatemala,1974,.gt, 103 | GU,Guam,Guam,1974,.gu, 104 | GW,Guinea-Bissau,Guinea-Bissau,1974,.gw, 105 | GY,Guyana,Guyana,1974,.gy, 106 | HK,Hong Kong,Hong Kong,1974,.hk,Hong Kong is officially a Special Administrative Region of the People's Republic of China since 1 July 1997 107 | HM,Heard Island and McDonald Islands,Heard Island and McDonald Islands,1974,.hm,Belongs to Australia 108 | HN,Honduras,Honduras,1974,.hn, 109 | HR,Croatia,Croatia,1992,.hr,Code taken from name in Croatian: Hrvatska 110 | HT,Haiti,Haiti,1974,.ht, 111 | HU,Hungary,Hungary,1974,.hu, 112 | ID,Indonesia,Indonesia,1974,.id, 113 | IE,Ireland,Ireland,1974,.ie, 114 | IL,Israel,Israel,1974,.il, 115 | IM,Isle of Man,Isle of Man,2006,.im,A British Crown Dependency 116 | IN,India,India,1974,.in, 117 | IO,British Indian Ocean Territory,British Indian Ocean Territory,1974,.io, 118 | IQ,Iraq,Iraq,1974,.iq, 119 | IR,"Iran ",Iran (Islamic Republic of),1974,.ir,Previous ISO country name: Iran 120 | IS,Iceland,Iceland,1974,.is,Code taken from name in Icelandic: Ísland 121 | IT,Italy,Italy,1974,.it, 122 | JE,Jersey,Jersey,2006,.je,A British Crown Dependency 123 | JM,Jamaica,Jamaica,1974,.jm, 124 | JO,Jordan,Jordan,1974,.jo, 125 | JP,Japan,Japan,1974,.jp, 126 | KE,Kenya,Kenya,1974,.ke, 127 | KG,Kyrgyzstan,Kyrgyzstan,1992,.kg, 128 | KH,Cambodia,Cambodia,1974,.kh,"Code taken from former name: Khmer Republic 129 | Previous ISO country name: Kampuchea, Democratic" 130 | KI,Kiribati,Kiribati,1979,.ki,Name changed from Gilbert Islands (GE) 131 | KM,Comoros,Comoros,1974,.km,"Code taken from name in Comorian: Komori 132 | Previous ISO country name: Comoro Islands" 133 | KN,Saint Kitts and Nevis,Saint Kitts and Nevis,1974,.kn,Previous ISO country name: Saint Kitts-Nevis-Anguilla 134 | KP,North Korea,Korea (Democratic People's Republic of),1974,.kp,ISO country name follows UN designation (common name: North Korea) 135 | KR,South Korea,"Korea, Republic of",1974,.kr,ISO country name follows UN designation (common name: South Korea) 136 | KW,Kuwait,Kuwait,1974,.kw, 137 | KY,Cayman Islands,Cayman Islands,1974,.ky, 138 | KZ,Kazakhstan,Kazakhstan,1992,.kz,Previous ISO country name: Kazakstan 139 | LA,Laos,Lao People's Democratic Republic,1974,.la,ISO country name follows UN designation (common name and previous ISO country name: Laos) 140 | LB,Lebanon,Lebanon,1974,.lb, 141 | LC,Saint Lucia,Saint Lucia,1974,.lc, 142 | LI,Liechtenstein,Liechtenstein,1974,.li, 143 | LK,Sri Lanka,Sri Lanka,1974,.lk, 144 | LR,Liberia,Liberia,1974,.lr, 
145 | LS,Lesotho,Lesotho,1974,.ls, 146 | LT,Lithuania,Lithuania,1992,.lt, 147 | LU,Luxembourg,Luxembourg,1974,.lu, 148 | LV,Latvia,Latvia,1992,.lv, 149 | LY,Libya,Libya,1974,.ly,Previous ISO country name: Libyan Arab Jamahiriya 150 | MA,Morocco,Morocco,1974,.ma,Code taken from name in French: Maroc 151 | MC,Monaco,Monaco,1974,.mc, 152 | MD,Moldova,"Moldova, Republic of",1992,.md,Previous ISO country name: Moldova (briefly from 2008 to 2009) 153 | ME,Montenegro,Montenegro,2006,.me, 154 | MF,Saint Martin,Saint Martin (French part),2007,.mf,The Dutch part of Saint Martin island is assigned code SX 155 | MG,Madagascar,Madagascar,1974,.mg, 156 | MH,Marshall Islands,Marshall Islands,1986,.mh, 157 | MK,Macedonia (FYROM),North Macedonia,1993,.mk,"Code taken from name in Macedonian: Severna Makedonija 158 | Previous ISO country name: Macedonia, the former Yugoslav Republic of (designated as such due to Macedonia naming dispute)" 159 | ML,Mali,Mali,1974,.ml, 160 | MM,Myanmar,Myanmar,1989,.mm,Name changed from Burma (BU) 161 | MN,Mongolia,Mongolia,1974,.mn, 162 | MO,Macao,Macao,1974,.mo,Previous ISO country name: Macau; Macao is officially a Special Administrative Region of the People's Republic of China since 20 December 1999 163 | MP,Northern Mariana Islands,Northern Mariana Islands,1986,.mp, 164 | MQ,Martinique,Martinique,1974,.mq, 165 | MR,Mauritania,Mauritania,1974,.mr, 166 | MS,Montserrat,Montserrat,1974,.ms, 167 | MT,Malta,Malta,1974,.mt, 168 | MU,Mauritius,Mauritius,1974,.mu, 169 | MV,Maldives,Maldives,1974,.mv, 170 | MW,Malawi,Malawi,1974,.mw, 171 | MX,Mexico,Mexico,1974,.mx, 172 | MY,Malaysia,Malaysia,1974,.my, 173 | MZ,Mozambique,Mozambique,1974,.mz, 174 | NA,Namibia,Namibia,1974,.na, 175 | NC,New Caledonia,New Caledonia,1974,.nc, 176 | NE,Niger,Niger,1974,.ne, 177 | NF,Norfolk Island,Norfolk Island,1974,.nf,Belongs to Australia 178 | NG,Nigeria,Nigeria,1974,.ng, 179 | NI,Nicaragua,Nicaragua,1974,.ni, 180 | NL,Netherlands,Netherlands,1974,.nl,"Officially includes the islands Bonaire, Saint Eustatius and Saba, which also have code BQ in ISO 3166-1. 
Within ISO 3166-2, Aruba (AW), Curaçao (CW), and Sint Maarten (SX) are also coded as subdivisions of NL.[19]" 181 | NO,Norway,Norway,1974,.no, 182 | NP,Nepal,Nepal,1974,.np, 183 | NR,Nauru,Nauru,1974,.nr, 184 | NU,Niue,Niue,1974,.nu,Previous ISO country name: Niue Island 185 | NZ,New Zealand,New Zealand,1974,.nz, 186 | OM,Oman,Oman,1974,.om, 187 | PA,Panama,Panama,1974,.pa, 188 | PE,Peru,Peru,1974,.pe, 189 | PF,French Polynesia,French Polynesia,1974,.pf,Code taken from name in French: Polynésie française 190 | PG,Papua New Guinea,Papua New Guinea,1974,.pg, 191 | PH,Philippines,Philippines,1974,.ph, 192 | PK,Pakistan,Pakistan,1974,.pk, 193 | PL,Poland,Poland,1974,.pl, 194 | PM,Saint Pierre and Miquelon,Saint Pierre and Miquelon,1974,.pm, 195 | PN,Pitcairn,Pitcairn,1974,.pn,Previous ISO country name: Pitcairn Islands 196 | PR,Puerto Rico,Puerto Rico,1974,.pr, 197 | PS,Palestine,"Palestine, State of",1999,.ps,"Previous ISO country name: Palestinian Territory, Occupied 198 | Consists of the West Bank and the Gaza Strip" 199 | PT,Portugal,Portugal,1974,.pt, 200 | PW,Palau,Palau,1986,.pw, 201 | PY,Paraguay,Paraguay,1974,.py, 202 | QA,Qatar,Qatar,1974,.qa, 203 | RE,Réunion,Réunion,1974,.re, 204 | RO,Romania,Romania,1974,.ro, 205 | RS,Serbia,Serbia,2006,.rs,Republic of Serbia 206 | RU,Russia,Russian Federation,1992,.ru,ISO country name follows UN designation (common name: Russia) 207 | RW,Rwanda,Rwanda,1974,.rw, 208 | SA,Saudi Arabia,Saudi Arabia,1974,.sa, 209 | SB,Solomon Islands,Solomon Islands,1974,.sb,Code taken from former name: British Solomon Islands 210 | SC,Seychelles,Seychelles,1974,.sc, 211 | SD,Sudan,Sudan,1974,.sd, 212 | SE,Sweden,Sweden,1974,.se, 213 | SG,Singapore,Singapore,1974,.sg, 214 | SH,"Saint Helena, Ascension and Tristan da Cunha","Saint Helena, Ascension and Tristan da Cunha",1974,.sh,Previous ISO country name: Saint Helena. 
215 | SI,Slovenia,Slovenia,1992,.si, 216 | SJ,Svalbard and Jan Mayen,Svalbard and Jan Mayen,1974,.sj,"Previous ISO name: Svalbard and Jan Mayen Islands 217 | Consists of two Arctic territories of Norway: Svalbard and Jan Mayen" 218 | SK,Slovakia,Slovakia,1993,.sk,SK previously represented the Kingdom of Sikkim 219 | SL,Sierra Leone,Sierra Leone,1974,.sl, 220 | SM,San Marino,San Marino,1974,.sm, 221 | SN,Senegal,Senegal,1974,.sn, 222 | SO,Somalia,Somalia,1974,.so, 223 | SR,Suriname,Suriname,1974,.sr,Previous ISO country name: Surinam 224 | SS,South Sudan,South Sudan,2011,.ss, 225 | ST,Sao Tome and Principe,Sao Tome and Principe,1974,.st, 226 | SV,El Salvador,El Salvador,1974,.sv, 227 | SX,Sint Maarten,Sint Maarten (Dutch part),2010,.sx,The French part of Saint Martin island is assigned code MF 228 | SY,Syria,Syrian Arab Republic,1974,.sy,ISO country name follows UN designation (common name and previous ISO country name: Syria) 229 | SZ,Eswatini,Eswatini,1974,.sz,Previous ISO country name: Swaziland 230 | TC,Turks and Caicos Islands,Turks and Caicos Islands,1974,.tc, 231 | TD,Chad,Chad,1974,.td,Code taken from name in French: Tchad 232 | TF,French Southern Territories,French Southern Territories,1979,.tf,"Covers the French Southern and Antarctic Lands except Adélie Land 233 | Code taken from name in French: Terres australes françaises" 234 | TG,Togo,Togo,1974,.tg, 235 | TH,Thailand,Thailand,1974,.th, 236 | TJ,Tajikistan,Tajikistan,1992,.tj, 237 | TK,Tokelau,Tokelau,1974,.tk,Previous ISO country name: Tokelau Islands 238 | TL,Timor-Leste,Timor-Leste,2002,.tl,Name changed from East Timor (TP) 239 | TM,Turkmenistan,Turkmenistan,1992,.tm, 240 | TN,Tunisia,Tunisia,1974,.tn, 241 | TO,Tonga,Tonga,1974,.to, 242 | TR,Turkey,Türkiye,1974,.tr,Previous ISO country name: Turkey 243 | TT,Trinidad and Tobago,Trinidad and Tobago,1974,.tt, 244 | TV,Tuvalu,Tuvalu,1977,.tv, 245 | TW,Taiwan,"Taiwan, Province of China",1974,.tw,"Covers the current jurisdiction of the Republic of China 246 | ISO country name follows UN designation (due to political status of Taiwan within the UN)[18] (common name: Taiwan)" 247 | TZ,Tanzania,"Tanzania, United Republic of",1974,.tz, 248 | UA,Ukraine,Ukraine,1974,.ua,"Previous ISO country name: Ukrainian SSR 249 | Code assigned as the country was already a UN member since 1945[15]" 250 | UG,Uganda,Uganda,1974,.ug, 251 | UM,United States Minor Outlying Islands,United States Minor Outlying Islands,1986,,"Consists of nine minor insular areas of the United States: Baker Island, Howland Island, Jarvis Island, Johnston Atoll, Kingman Reef, Midway Islands, Navassa Island, Palmyra Atoll, and Wake Island 252 | .um ccTLD was revoked in 2007[20]The United States Department of State uses the following user assigned alpha-2 codes for the nine territories, respectively, XB, XH, XQ, XU, XM, QM, XV, XL, and QW.[21]" 253 | US,United States,United States of America,1974,.us,Previous ISO country name: United States 254 | UY,Uruguay,Uruguay,1974,.uy, 255 | UZ,Uzbekistan,Uzbekistan,1992,.uz, 256 | VA,Holy See,Holy See,1974,.va,"Covers Vatican City, territory of the Holy See 257 | Previous ISO country names: Vatican City State (Holy See) and Holy See (Vatican City State)" 258 | VC,Saint Vincent and the Grenadines,Saint Vincent and the Grenadines,1974,.vc, 259 | VE,Venezuela,Venezuela (Bolivarian Republic of),1974,.ve,Previous ISO country name: Venezuela 260 | VG,British Virgin Islands,Virgin Islands (British),1974,.vg, 261 | VI,U.S. 
Virgin Islands,Virgin Islands (U.S.),1974,.vi, 262 | VN,Vietnam,Viet Nam,1974,.vn,"ISO country name follows UN designation (common name: Vietnam) 263 | Code used for Republic of Viet Nam (common name: South Vietnam) before 1977" 264 | VU,Vanuatu,Vanuatu,1980,.vu,Name changed from New Hebrides (NH) 265 | WF,Wallis and Futuna,Wallis and Futuna,1974,.wf,Previous ISO country name: Wallis and Futuna Islands 266 | WS,Samoa,Samoa,1974,.ws,Code taken from former name: Western Samoa 267 | YE,Yemen,Yemen,1974,.ye,"Previous ISO country name: Yemen, Republic of (for three years after the unification) 268 | Code used for North Yemen before 1990" 269 | YT,Mayotte,Mayotte,1993,.yt, 270 | ZA,South Africa,South Africa,1974,.za,Code taken from name in Dutch: Zuid-Afrika 271 | ZM,Zambia,Zambia,1974,.zm, 272 | ZW,Zimbabwe,Zimbabwe,1980,.zw,Name changed from Southern Rhodesia (RH) -------------------------------------------------------------------------------- /dags/modules/job_title.py: -------------------------------------------------------------------------------- 1 | """ 2 | Transform non-unique job titles into a standard form using BART Zero Shot Text Classification 3 | Eg. 'SR. DATA ENGR' -> 'Senior Data Engineer' 4 | 5 | Reference: https://huggingface.co/facebook/bart-large-mnli?candidateLabels=Data+Engineer%2C+Data+Scientist%2C+Data+Analyst%2C+Software+Engineer%2C+Business+Analyst%2C+Machine+Learning+Engineer%2C+Senior+Data+Engineer%2C+Senior+Data+Scientist%2C+Senior+Data+Analyst%2C+Cloud+Engineer&multiClass=false&text=Software+%2F+Data+Engineer 6 | """ 7 | import pandas as pd 8 | from google.cloud import bigquery 9 | 10 | # for large datasets store data in google storage for improved speed using the following command 11 | # %bigquery --project job-listings-366015 --use_bqstorage_api 12 | 13 | # for apple silicone m1 macs, use the following to enable MPS 14 | # import torch 15 | # mps_device = torch.device("mps") 16 | # import os 17 | # os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" 18 | 19 | from transformers import pipeline 20 | 21 | def transform_job_title(): 22 | # Initialize the pipeline 23 | classifier = pipeline("zero-shot-classification", 24 | model="facebook/bart-large-mnli", 25 | # device=mps_device # couldn't get this to work 26 | ) 27 | 28 | # Query to get all unique job titles 29 | client = bigquery.Client() 30 | query_job = client.query( 31 | """ 32 | WITH jobs_all AS ( 33 | SELECT job_title, COUNT(job_title) AS job_title_count 34 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 35 | GROUP BY job_title 36 | ORDER BY job_title_count DESC 37 | ), jobs_clean AS ( 38 | SELECT job_title, job_title_clean 39 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_job_title 40 | ), jobs_unclean AS ( 41 | SELECT job_title, job_title_clean, job_title_count 42 | FROM jobs_all 43 | LEFT JOIN jobs_clean 44 | USING (job_title) 45 | WHERE job_title_clean IS NULL 46 | ) 47 | 48 | SELECT * 49 | FROM jobs_unclean 50 | ORDER BY job_title_count DESC 51 | """ 52 | ) 53 | 54 | jobs_df = query_job.to_dataframe() 55 | print("Starting BART pipeline to transform unique job titles: ", len(jobs_df)) 56 | 57 | candidate_labels = ['Data Engineer', 'Data Scientist', 'Data Analyst', 'Software Engineer', 'Business Analyst', 'Machine Learning Engineer', 'Senior Data Engineer', 'Senior Data Scientist', 'Senior Data Analyst', 'Cloud Engineer'] 58 | 59 | # Iterate over a dataframe 60 | for index, row in jobs_df.iterrows(): 61 | sequence_to_classify = row['job_title'] 62 | try: 63 | results = 
classifier(sequence_to_classify, candidate_labels) 64 | jobs_df.at[index, 'job_title_clean'] = results['labels'][0] 65 | except ValueError: #raised when the input is empty 66 | jobs_df.at[index, 'job_title_clean'] = None 67 | 68 | # Save to csv 69 | # jobs_df.to_csv('jobs_unclean.csv', index=False) 70 | 71 | # Clean up the data 72 | jobs_final = jobs_df[~jobs_df.job_title_clean.isnull()] 73 | 74 | jobs_final = jobs_final[['job_title', 'job_title_clean']] 75 | 76 | print("BART complete, uploading to BigQuery") 77 | 78 | # Write to BigQuery 79 | jobs_final.to_gbq(destination_table='gsearch_job_listings_clean.gsearch_job_title', 80 | project_id='job-listings-366015', 81 | if_exists='append') 82 | 83 | return -------------------------------------------------------------------------------- /dags/modules/youtube_views.csv: -------------------------------------------------------------------------------- 1 | Geography,Views,Average view duration,Watch time (hours) 2 | Total,12170978,0:03:05,626923.2823 3 | IN,3190180,0:01:53,100887.2259 4 | US,2204092,0:04:03,149228.0248 5 | GB,434350,0:03:37,26289.3889 6 | CA,376269,0:03:42,23254.0882 7 | DE,318350,0:03:33,18881.0136 8 | ID,267963,0:02:36,11651.5774 9 | PH,254697,0:03:12,13648.8765 10 | AU,211633,0:03:41,13035.0428 11 | PK,201570,0:02:10,7317.836 12 | BR,192415,0:03:31,11296.5917 13 | MY,159853,0:03:01,8043.9433 14 | NG,155099,0:04:03,10475.799 15 | SG,152557,0:03:14,8224.0907 16 | BD,150127,0:02:15,5635.1753 17 | MX,130549,0:03:56,8594.2263 18 | FR,129030,0:03:15,7012.8083 19 | PL,126229,0:03:19,6979.2326 20 | ZA,125037,0:03:53,8098.7245 21 | ES,113945,0:03:24,6460.713 22 | NL,113615,0:03:37,6872.1942 23 | IT,106855,0:03:08,5582.0602 24 | AE,103310,0:02:51,4911.8902 25 | VN,101248,0:02:58,5027.3882 26 | EG,94138,0:03:15,5114.5461 27 | TR,88895,0:02:59,4426.3234 28 | TH,88038,0:03:06,4552.2036 29 | RU,86039,0:02:49,4060.3313 30 | SA,79865,0:03:15,4331.0637 31 | KE,71150,0:03:58,4712.6249 32 | AR,69718,0:03:51,4475.5562 33 | MA,66406,0:03:05,3421.3642 34 | NP,63127,0:02:12,2330.9322 35 | HK,62449,0:03:15,3399.5587 36 | JP,60438,0:03:11,3219.2251 37 | CO,60109,0:03:51,3858.7173 38 | SE,59469,0:03:46,3742.9935 39 | LK,59209,0:02:12,2178.0745 40 | PT,55587,0:03:43,3453.9172 41 | KR,51519,0:02:27,2112.2578 42 | TW,50265,0:03:03,2557.0195 43 | GR,48726,0:03:22,2742.4291 44 | IL,42887,0:03:14,2318.9846 45 | RO,42570,0:03:16,2324.3448 46 | CL,42039,0:03:48,2672.3257 47 | IE,41831,0:03:47,2647.9054 48 | CH,41585,0:03:45,2603.8433 49 | PE,38164,0:03:53,2474.2765 50 | BE,37598,0:03:33,2232.1339 51 | GH,36178,0:04:07,2483.7241 52 | NZ,35185,0:03:45,2199.9395 53 | UA,35063,0:03:01,1768.1397 54 | DK,32240,0:03:48,2048.9427 55 | AT,29772,0:03:39,1815.4341 56 | NO,29201,0:03:53,1892.9639 57 | CZ,28820,0:03:38,1751.4573 58 | DZ,28394,0:03:05,1459.1451 59 | HU,28100,0:03:29,1633.8395 60 | FI,26480,0:03:52,1711.1062 61 | RS,22802,0:03:18,1257.8289 62 | TN,19754,0:03:09,1037.6083 63 | ET,19385,0:02:49,912.4085 64 | QA,17382,0:02:47,809.3941 65 | BG,16238,0:03:30,948.2542 66 | CN,14272,0:03:36,857.079 67 | CR,13853,0:04:19,1000.3142 68 | HR,12718,0:03:31,748.3247 69 | LT,12285,0:03:45,770.496 70 | DO,11992,0:04:02,806.6541 71 | KZ,11553,0:03:22,649.315 72 | SO,11023,0:02:33,469.1683 73 | MM,10875,0:02:36,471.566 74 | SK,10680,0:03:20,595.3328 75 | IQ,9999,0:02:46,461.804 76 | KW,9936,0:03:00,496.8222 77 | EC,9912,0:03:46,624.4362 78 | KH,9590,0:02:49,451.5871 79 | JM,9515,0:04:10,661.4757 80 | JO,9030,0:03:10,477.5539 81 | LB,9003,0:03:26,515.7938 82 | 
UG,8911,0:03:59,593.1003 83 | AZ,8752,0:02:53,421.3726 84 | OM,8695,0:02:32,369.5012 85 | TT,8132,0:03:44,507.585 86 | GE,6604,0:03:23,373.057 87 | ZW,6446,0:03:58,427.5399 88 | UZ,6338,0:02:44,290.0008 89 | CM,6078,0:03:41,374.2997 90 | VE,6073,0:04:21,441.1649 91 | GT,5975,0:04:00,398.7269 92 | MU,5883,0:02:24,236.3667 93 | PA,5601,0:04:16,398.6883 94 | TZ,5359,0:03:22,301.0375 95 | BH,5171,0:02:40,230.0382 96 | SI,5065,0:03:16,276.8601 97 | SD,5040,0:03:27,290.4987 98 | BO,4725,0:03:56,310.7617 99 | ZM,4517,0:03:55,295.029 100 | LV,4492,0:03:46,282.2924 101 | CY,4472,0:03:37,270.3176 102 | PR,4398,0:04:05,300.3237 103 | EE,4337,0:03:46,272.5391 104 | BA,4110,0:03:03,209.9061 105 | BY,3974,0:02:56,195.1243 106 | UY,3703,0:03:47,233.9219 107 | MK,3638,0:03:29,211.826 108 | IR,2859,0:02:49,134.9042 109 | AL,2789,0:02:28,114.8843 110 | BW,2686,0:03:48,170.6096 111 | MN,2594,0:02:36,112.7475 112 | SV,2027,0:04:14,143.2292 113 | RW,1934,0:03:44,120.4006 114 | NA,1719,0:04:01,115.2766 115 | HN,1495,0:03:54,97.224 116 | SN,1456,0:03:06,75.2545 117 | PY,1388,0:03:32,82.0983 118 | LU,1353,0:03:27,78.0674 119 | SY,1257,0:02:54,60.7869 120 | BN,1180,0:02:41,52.8582 121 | AM,1121,0:02:54,54.3131 122 | CI,1060,0:03:08,55.3612 123 | MT,944,0:03:39,57.4443 124 | MZ,922,0:04:03,62.3198 125 | NI,881,0:04:16,62.7468 126 | MD,861,0:02:55,42.0887 127 | LY,728,0:03:35,43.6405 128 | MV,700,0:02:27,28.6537 129 | PS,656,0:02:24,26.3179 130 | IS,552,0:03:17,30.311 131 | YE,550,0:03:05,28.414 132 | KG,440,0:03:20,24.459 133 | AO,436,0:04:28,32.5251 134 | BB,429,0:04:24,31.5008 135 | AF,400,0:02:12,14.7075 136 | MO,321,0:03:26,18.3997 137 | BS,287,0:03:40,17.5762 138 | MW,285,0:04:02,19.2154 139 | SL,281,0:03:58,18.6413 140 | GY,261,0:03:46,16.4035 141 | CD,207,0:03:14,11.1726 142 | TG,202,0:03:34,12.0335 143 | ME,198,0:03:42,12.2303 144 | HT,181,0:03:14,9.7964 145 | FJ,175,0:02:58,8.6736 146 | GM,119,0:04:24,8.7383 147 | GU,89,0:04:50,7.1788 148 | LC,89,0:05:43,8.4811 149 | BT,77,0:02:03,2.6463 150 | SR,71,0:03:29,4.1407 151 | ML,62,0:03:52,4.0025 152 | PG,58,0:04:10,4.0393 153 | MR,55,0:01:04,0.9891 154 | SZ,41,0:03:55,2.677 155 | TJ,40,0:01:59,1.3255 156 | LS,38,0:03:48,2.409 157 | LA,37,0:02:43,1.6791 158 | VI,33,0:02:11,1.2018 159 | AG,32,0:02:57,1.5744 160 | MG,24,0:02:44,1.0949 161 | GA,21,0:03:50,1.3448 162 | CW,20,0:05:28,1.8255 163 | KY,15,0:10:18,2.5775 164 | VU,14,0:00:01,0.0047 165 | EH,13,0:04:02,0.8765 166 | VC,13,0:02:58,0.6463 167 | GN,12,0:06:01,1.2046 168 | RE,12,0:03:51,0.7708 169 | SS,12,0:05:18,1.0615 170 | GP,11,0:04:07,0.7573 171 | MP,11,0:06:23,1.1706 172 | BF,10,0:05:34,0.93 173 | BJ,10,0:03:13,0.5369 174 | DJ,10,0:00:41,0.1158 175 | GD,10,0:01:28,0.2458 176 | LR,10,0:04:07,0.6873 177 | AD,0,,0 178 | AI,0,,0 179 | AW,0,,0 180 | BI,0,,0 181 | BQ,0,,0 182 | BZ,0,,0 183 | CG,0,,0 184 | CU,0,,0 185 | CV,0,,0 186 | DM,0,,0 187 | GF,0,,0 188 | GG,0,,0 189 | GI,0,,0 190 | GQ,0,,0 191 | IM,0,,0 192 | KN,0,,0 193 | MF,0,,0 194 | MQ,0,,0 195 | NC,0,,0 196 | NE,0,,0 197 | SC,0,,0 198 | TL,0,,0 199 | VG,0,,0 200 | -------------------------------------------------------------------------------- /dags/serpapi_bigquery.py: -------------------------------------------------------------------------------- 1 | """ 2 | An operations workflow to collect data science job postings from SerpApi and 3 | insert into BigQuery. 
4 | """ 5 | import warnings 6 | warnings.simplefilter(action='ignore', category=FutureWarning) # stop getting Pandas FutureWarning's 7 | 8 | import time 9 | import json 10 | from datetime import datetime, timedelta 11 | import pandas as pd 12 | from numpy.random import choice 13 | from serpapi import GoogleSearch 14 | from config import config # contains secret keys in config.py 15 | from google.cloud import bigquery 16 | import airflow 17 | from airflow import DAG 18 | from airflow.operators.dummy_operator import DummyOperator 19 | from airflow.operators.python_operator import PythonOperator 20 | from modules.country import view_percent 21 | 22 | # 'False' DAG is ready for operation; i.e., 'True' DAG runs using no SerpApi credits or BigQuery requests 23 | TESTING_DAG = False 24 | # Minutes to sleep on an error 25 | ERROR_SLEEP_MIN = 5 26 | # Max number of searches to perform daily 27 | MAX_SEARCHES = 1500 28 | # Who is listed as the owner of this DAG in the Airflow Web Server 29 | DAG_OWNER_NAME = "airflow" 30 | # List of email address to send email alerts to if this job fails 31 | ALERT_EMAIL_ADDRESSES = ['luke@lukebarousse.com'] 32 | START_DATE = airflow.utils.dates.days_ago(1) 33 | 34 | 35 | default_args = { 36 | 'owner': DAG_OWNER_NAME, 37 | 'depends_on_past': False, 38 | 'start_date': START_DATE, 39 | 'email': ALERT_EMAIL_ADDRESSES, 40 | 'email_on_failure': True, 41 | 'email_on_retry': False, 42 | 'retries': 0, # removing retries to not call insert duplicates into BigQuery 43 | 'retry_delay': timedelta(minutes=5), 44 | # 'queue': 'bash_queue', 45 | # 'pool': 'backfill', 46 | # 'priority_weight': 10, 47 | # 'end_date': datetime(2022, 1, 1), 48 | # 'wait_for_downstream': False, 49 | # 'dag': dag, 50 | # 'sla': timedelta(hours=2), 51 | # 'execution_timeout': timedelta(seconds=300), 52 | # 'on_failure_callback': some_function, 53 | # 'on_success_callback': some_other_function, 54 | # 'on_retry_callback': another_function, 55 | # 'sla_miss_callback': yet_another_function, 56 | # 'trigger_rule': 'all_success' 57 | } 58 | 59 | dag = DAG( 60 | 'serpapi_bigquery', 61 | description='Call SerpApi and inserts results into Bigquery', 62 | default_args=default_args, 63 | schedule_interval='0 6 * * *', 64 | catchup=False, 65 | tags=['data-pipeline-dag'], 66 | max_active_tasks = 3 67 | ) 68 | 69 | with dag: 70 | 71 | search_terms = ['Data Analyst', 'Data Scientist', 'Data Engineer'] 72 | search_locations_us = ["New York, United States", "California, United States", 73 | "Texas, United States", "Illinois, United States", "Florida, United States"] 74 | # data table from 'modules/country.py' of countries and relative weighted percentages to use 75 | country_percent = view_percent() 76 | 77 | start = DummyOperator( 78 | task_id='start', 79 | dag=dag) 80 | 81 | def _bigquery_json(results, search_term, search_location, result_offset, error): 82 | """ 83 | Submit JSON return from SerpAPI to BigQuery {gsearch_jobs_all_json} as a backup to hold the original data 84 | 85 | Args: 86 | results : json 87 | JSON return from SerpAPI 88 | search_term : str 89 | Search term 90 | search_location : str 91 | Search location 92 | result_offset : int 93 | Parameter to offset the results returned from SerpApi; used for pagination 94 | error : bool 95 | Flag to indicate if results where returned from SerpApi or not 96 | 97 | Returns: 98 | None 99 | """ 100 | try: 101 | # extract metadata from results 102 | try: 103 | search_id = results['search_metadata']['id'] 104 | search_time = results['search_metadata']['created_at'] 
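# 'created_at' comes back from SerpApi as a plain string (e.g. '2023-01-01 06:00:00 UTC', illustrative value);
# parse it into a datetime using the matching format below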
105 | search_time = datetime.strptime(search_time, "%Y-%m-%d %H:%M:%S %Z") 106 | search_time_taken = results['search_metadata']['total_time_taken'] 107 | search_language = results['search_parameters']['hl'] 108 | except Exception as e: 109 | search_id = None 110 | search_time = None 111 | search_time_taken = None 112 | search_language = None 113 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 114 | print(f"JSON - SerpAPI ERROR!!!: {search_term} in {search_location} JSON file fields have changed!!!") 115 | print("Following error returned:") 116 | print(e) 117 | 118 | # convert search results and metadata to a dataframe 119 | df = pd.DataFrame({'search_term': [search_term], 120 | 'search_location': [search_location], 121 | 'result_offset': [result_offset], 122 | 'error': [error], 123 | 'search_id': [search_id], 124 | 'search_time': [search_time], 125 | 'search_time_taken': [search_time_taken], 126 | 'search_language': [search_language], 127 | 'results': [json.dumps(results)] 128 | }) 129 | 130 | # submit dataframe to BigQuery 131 | table_id = config.table_id_json 132 | client = bigquery.Client() 133 | table = client.get_table(table_id) 134 | errors = client.insert_rows_from_dataframe(table, df) 135 | if errors == [[]]: 136 | print(f"JSON - DATA LOADED: {search_term} in {search_location} loaded into BigQuery {table_id}") 137 | else: 138 | print(f"JSON - ERROR!!!: {search_term} in {search_location} NOT loaded into BigQuery {table_id}!!!") 139 | print("Following error returned from the googles:") 140 | print(errors) 141 | except UnboundLocalError as ule: 142 | # TODO: Need to build something to catch this error sooner 143 | # GoogleSearch(params) code returns blank results and then get error for "'df' referenced before assignment" in 'errors = client.insert_rows_from_dataframe(table, df)' (i.e., SerpApi issue) 144 | print(f"JSON - SerpApi ERROR!!!: Search {result_offset} of {search_term} in {search_location} yielded no results from SerpApi and FAILED load into BigQuery!!!") 145 | print("Following error returned:") 146 | print(ule) 147 | # no sleep requirement as usually an issue with search term provided 148 | except TimeoutError as te: 149 | # client.get_table(table_id) code returns TimeOut Exception with no results... 
so also adding sleep (i.e., BigQuery issue) 150 | print(f"JSON - BigQuery ERROR!!!: {search_term} in {search_location} had TimeOutError and FAILED to load into BigQuery {table_id}!!!") 151 | print("Following error returned:") 152 | print(te) 153 | # sleep removed for first implementation 12/31/2022 154 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 155 | # time.sleep(ERROR_SLEEP_MIN * 60) 156 | except Exception as e: 157 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 158 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 159 | print(f"JSON - BigQuery ERROR!!!: {search_term} in {search_location} had an error that needs to be investigated!!!") 160 | print("Following error returned:") 161 | print(e) 162 | # sleep removed for testing 163 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 164 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 165 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 166 | # time.sleep(ERROR_SLEEP_MIN * 60) 167 | 168 | return 169 | 170 | def _serpapi_bigquery(search_term, search_location, search_time): 171 | """ 172 | Function to call SerpApi and insert results into BigQuery {gsearch_jobs_all} used by the us_jobs 173 | and non_us_jobs tasks 174 | 175 | Args: 176 | search_term : str 177 | Search term to search for 178 | search_location : str 179 | Search location to search in 180 | search_time : str 181 | Time period to search for (e.g. 'past 24 hours') 182 | 183 | Returns: 184 | num_searches : int 185 | Number of searches performed for this search term and location 186 | 187 | Source: 188 | https://serpapi.com/google-jobs-results 189 | https://cloud.google.com/bigquery/docs/reference/libraries 190 | """ 191 | if not TESTING_DAG: 192 | next_page_token = None 193 | num = 0 194 | has_more_results = True 195 | 196 | while has_more_results: 197 | print(f"START API CALL: {search_term} in {search_location} on search {num}") 198 | 199 | error = False 200 | params = { 201 | "api_key": config.serpapi_key, 202 | "device": "desktop", 203 | "engine": "google_jobs", 204 | "google_domain": "google.com", 205 | "q": search_term, 206 | "hl": "en", 207 | "gl": "us", 208 | "location": search_location, 209 | "chips": search_time, 210 | } 211 | 212 | if next_page_token: 213 | params["next_page_token"] = next_page_token 214 | 215 | try: 216 | search = GoogleSearch(params) 217 | results = search.get_dict() 218 | 219 | if 'error' in results: 220 | print(f"END SerpApi CALLS: {search_term} in {search_location} on search {num}") 221 | error = True 222 | _bigquery_json(results, search_term, search_location, num, error) 223 | break 224 | 225 | print(f"SUCCESS SerpApi CALL: {search_term} in {search_location} on search {num}") 226 | _bigquery_json(results, search_term, search_location, num, error) 227 | 228 | # Process results and insert into BigQuery 229 | jobs = results['jobs_results'] 230 | jobs = pd.DataFrame(jobs) 231 | jobs = pd.concat([pd.DataFrame(jobs), 232 | pd.json_normalize(jobs['detected_extensions'])], 233 | axis=1).drop('detected_extensions', axis=1) 234 | jobs['date_time'] = datetime.utcnow() 235 | 236 | if num == 0: 237 | jobs_all = jobs 238 | else: 239 | jobs_all = pd.concat([jobs_all, jobs]) 240 | 241 | jobs_all['search_term'] = search_term 242 | jobs_all['search_location'] = search_location 243 | 244 | # Check for next_page_token 245 | if 'serpapi_pagination' in results and 'next_page_token' in 
results['serpapi_pagination']: 246 | next_page_token = results['serpapi_pagination']['next_page_token'] 247 | else: 248 | print(f"END API CALLS: No more results for {search_term} in {search_location} on search {num}") 249 | has_more_results = False 250 | 251 | num += 1 252 | 253 | except Exception as e: 254 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 255 | print(f"SerpApi ERROR (Timeout)!!!: {search_term} in {search_location} had an error (most likely TimeOut)!!!") 256 | print("Following error returned:") 257 | print(e) 258 | print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 259 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 260 | time.sleep(ERROR_SLEEP_MIN * 60) 261 | error = True 262 | break 263 | 264 | # Insert data into BigQuery 265 | if num > 0 and not error: 266 | try: 267 | final_columns = ['title', 'company_name', 'location', 'via', 'description', 'extensions', 268 | 'job_id', 'thumbnail', 'posted_at', 'schedule_type', 'salary', 269 | 'work_from_home', 'date_time', 'search_term', 'search_location', 'commute_time'] 270 | jobs_all = jobs_all.loc[:, jobs_all.columns.isin(final_columns)] 271 | 272 | table_id = config.table_id 273 | client = bigquery.Client() 274 | table = client.get_table(table_id) 275 | errors = client.insert_rows_from_dataframe(table, jobs_all) 276 | if errors == [[]]: 277 | print(f"DATA LOADED: {len(jobs_all)} rows of {search_term} in {search_location} loaded into BigQuery") 278 | else: 279 | print(f"ERROR!!!: {len(jobs_all)} rows of {search_term} in {search_location} NOT loaded into BigQuery!!!") 280 | print("Following error returned from the googles:") 281 | print(errors) 282 | except Exception as e: 283 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 284 | print(f"non-JSON BigQuery ERROR!!!: {search_term} in {search_location} had an error that needs to be investigated!!!") 285 | print("Following error returned:") 286 | print(e) 287 | print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 288 | print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 289 | time.sleep(ERROR_SLEEP_MIN * 60) 290 | 291 | num_searches = num + 1 292 | 293 | else: # if testing 294 | print(f"END FAKE SEARCH: {search_term} in {search_location}") 295 | num_searches = 2 # low enough not to max out 1000 searches 296 | 297 | return num_searches 298 | 299 | ### (9/2024) start parmater was deprecated and removed from SerpApi 300 | # def _serpapi_bigquery(search_term, search_location, search_time): 301 | # if not TESTING_DAG: 302 | 303 | # for num in range(45): # SerpApi docs say max returns is ~45 pages 304 | 305 | # print(f"START API CALL: {search_term} in {search_location} on search {num}") 306 | 307 | # start = num * 10 308 | # error = False 309 | # params = { 310 | # "api_key": config.serpapi_key, 311 | # "device": "desktop", 312 | # "engine": "google_jobs", 313 | # "google_domain": "google.com", 314 | # "q": search_term, 315 | # "hl": "en", 316 | # "gl": "us", 317 | # "location": search_location, 318 | # "chips": search_time, 319 | # "start": start, 320 | # } 321 | 322 | # # try except statement to call SerpAPI and then handle results (inner try/except statement) or handle TimeOut errors 323 | # try: 324 | # search = GoogleSearch(params) 325 | # results = search.get_dict() 326 | 327 | # # try except statement needed to handle whether any results are returned 328 | # try: 329 | # if results['error'] == "Google hasn't returned any results for this query.": 
330 | # print(f"END SerpApi CALLS: {search_term} in {search_location} on search {num}") 331 | # error = True 332 | # # Send JSON request to BigQuery json table 333 | # _bigquery_json(results, search_term, search_location, num, error) 334 | # break 335 | # except KeyError: 336 | # print(f"SUCCESS SerpApi CALL: {search_term} in {search_location} on search {num}") 337 | # # Send JSON request to BigQuery json table 338 | # _bigquery_json(results, search_term, search_location, num, error) 339 | # else: 340 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 341 | # print(f"SerpApi Error on call!!!: No response on {search_term} in {search_location} on search {num}") 342 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 343 | # error = True 344 | # break 345 | # except Exception as e: # catching as 'TimeoutError' didn't work so resorted to catching all... 346 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 347 | # print(f"SerpApi ERROR (Timeout)!!!: {search_term} in {search_location} had an error (most likely TimeOut)!!!") 348 | # print("Following error returned:") 349 | # print(e) 350 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 351 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 352 | # time.sleep(ERROR_SLEEP_MIN * 60) 353 | # error = True 354 | # break 355 | 356 | # # create dataframe of 10 (or less) pulled results 357 | # jobs = results['jobs_results'] 358 | # jobs = pd.DataFrame(jobs) 359 | # jobs = pd.concat([pd.DataFrame(jobs), 360 | # pd.json_normalize(jobs['detected_extensions'])], 361 | # axis=1).drop('detected_extensions', 1) 362 | # jobs['date_time'] = datetime.utcnow() 363 | 364 | # if start == 0: 365 | # jobs_all = jobs 366 | # else: 367 | # jobs_all = pd.concat([jobs_all, jobs]) 368 | 369 | # jobs_all['search_term'] = search_term 370 | # jobs_all['search_location'] = search_location 371 | 372 | # # don't call api again (and waste a credit) if less than 10 results (i.e., end of search) 373 | # if len(jobs) != 10: 374 | # print(f"END API CALLS: Only {len(jobs)} jobs (<10) for {search_term} in {search_location} on search {num}") 375 | # break 376 | 377 | # # if no results returned on first try then will get error if try to insert 0 rows into BigQuery 378 | # if num == 0 and error: 379 | # print(f"NO DATA LOADED: {num} rows of {search_term} in {search_location} not loaded into BigQuery") 380 | # else: 381 | # try: 382 | # # 28Dec2022: Following added after SerpApi changed format of json file 383 | # # wanted to keep extra columns ['job_highlights' 'related_links'] added in json but reached bigquery resource limit 384 | # # tried to convert these columns to json but ran into error troubleshooting all day 385 | # # jobs_all['json'] = jobs_all.apply(lambda x: x.to_json(), axis=1) 386 | # final_columns = ['title', 387 | # 'company_name', 388 | # 'location', 389 | # 'via', 390 | # 'description', 391 | # 'extensions', 392 | # 'job_id', 393 | # 'thumbnail', 394 | # 'posted_at', 395 | # 'schedule_type', 396 | # 'salary', 397 | # 'work_from_home', 398 | # 'date_time', 399 | # 'search_term', 400 | # 'search_location', 401 | # 'commute_time'] 402 | # # select only columns from final_columns if they exist in jobs_all 403 | # jobs_all = jobs_all.loc[:, jobs_all.columns.isin(final_columns)] 404 | 405 | # table_id = config.table_id 406 | # client = bigquery.Client() 407 | # table = client.get_table(table_id) 408 | # errors = 
client.insert_rows_from_dataframe(table, jobs_all) 409 | # if errors == [[]]: 410 | # print(f"DATA LOADED: {len(jobs_all)} rows of {search_term} in {search_location} loaded into BigQuery") 411 | # else: 412 | # print(f"ERROR!!!: {len(jobs_all)} rows of {search_term} in {search_location} NOT loaded into BigQuery!!!") 413 | # print("Following error returned from the googles:") 414 | # print(errors) 415 | # except UnboundLocalError as ule: 416 | # # TODO: Need to build something to catch this error sooner 417 | # # GoogleSearch(params) code returns blank results and then get error for "'jobs_all' referenced before assignment" in 'errors = client.insert_rows_from_dataframe(table, jobs_all)' (i.e., SerpApi issue) 418 | # print(f"SerpApi ERROR!!!: Search {num} of {search_term} in {search_location} yielded no results from SerpApi and FAILED load into non-JSON BigQuery!!!") 419 | # print("Following error returned:") 420 | # print(ule) 421 | # # no sleep requirement as usually an issue with search term provided 422 | # except TimeoutError as te: 423 | # # client.get_table(table_id) code returns TimeOut Exception with no results... so also adding sleep (i.e., BigQuery issue) 424 | # print(f"BigQuery ERROR!!!: {search_term} in {search_location} had TimeOutError and FAILED to load into non-JSON BigQuery!!!") 425 | # print("Following error returned:") 426 | # print(te) 427 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 428 | # time.sleep(ERROR_SLEEP_MIN * 60) 429 | # except Exception as e: 430 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 431 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 432 | # print(f"non-JSON BigQuery ERROR!!!: {search_term} in {search_location} had an error that needs to be investigated!!!") 433 | # print("Following error returned:") 434 | # print(e) 435 | # print(f"Sleeping for {ERROR_SLEEP_MIN} minutes") 436 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 437 | # print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!") 438 | # time.sleep(ERROR_SLEEP_MIN * 60) 439 | 440 | # num_searches = num + 1 441 | 442 | # else: # if testing 443 | 444 | # print(f"END FAKE SEARCH: {search_term} in {search_location}") 445 | # num_searches = 2 # low enough not to max out 1000 searches 446 | 447 | # return num_searches 448 | 449 | def _us_jobs(search_terms, search_locations_us, **context): 450 | """ 451 | DAG to pull US job postings using the _serpapi_bigquery function 452 | 453 | Args: 454 | search_terms : list 455 | List of search terms to search for 456 | search_locations_us : list 457 | List of search locations to search for 458 | context : dict 459 | Context dictionary from Airflow 460 | 461 | Returns: 462 | None 463 | """ 464 | search_time = "date_posted:today" 465 | total_searches = 0 466 | 467 | for search_term in search_terms: 468 | for search_location in search_locations_us: 469 | print(f"START SEARCH: {total_searches} searches done, starting search...") 470 | num_searches = _serpapi_bigquery(search_term, search_location, search_time) 471 | total_searches += num_searches 472 | 473 | # push total_searches to xcom so can use in next task 474 | context['task_instance'].xcom_push(key='total_searches', value=total_searches) 475 | 476 | return 477 | 478 | us_jobs = PythonOperator( 479 | task_id='us_jobs', 480 | provide_context=True, 481 | op_kwargs={'search_terms': search_terms, 'search_locations_us': search_locations_us}, 482 | 
python_callable=_us_jobs 483 | ) 484 | 485 | def _non_us_jobs(search_terms, country_percent, **context): 486 | """ 487 | DAG to pull non-US job postings using the _serpapi_bigquery function 488 | 489 | Args: 490 | search_terms : list 491 | List of search terms to search for 492 | country_percent : pandas dataframe 493 | Dataframe of countries and their relative percent of total YouTube views for my channel 494 | context : dict 495 | Context dictionary from Airflow 496 | 497 | Returns: 498 | None 499 | 500 | Source: 501 | https://youtube.com/@lukebarousse 502 | """ 503 | search_time = "date_posted:today" 504 | total_searches = context['task_instance'].xcom_pull(task_ids='us_jobs', key='total_searches') 505 | search_countries = list(country_percent.country) 506 | search_probabilities = list(country_percent.percent) 507 | 508 | # create list of countries listed based on weighted probability to get random countries 509 | search_locations = list(choice(search_countries, size=len(search_countries), replace=False, p=search_probabilities)) 510 | 511 | for search_location in search_locations: 512 | if total_searches < MAX_SEARCHES: 513 | print("####################################") 514 | print(f"SEARCHING COUNTRY: {search_location} [{search_locations.index(search_location)+1} of {len(search_locations)}]") 515 | print("####################################") 516 | for search_term in search_terms: 517 | print("####################################") 518 | print(f"SEARCHING TERM: {search_term}") 519 | print(f"Starting search number {total_searches}...") 520 | num_searches = _serpapi_bigquery(search_term, search_location, search_time) 521 | total_searches += num_searches 522 | else: 523 | print(f"STICK A FORK IN ME, I'M DONE!!!!: {total_searches} searches complete") 524 | return 525 | 526 | non_us_jobs = PythonOperator( 527 | task_id='non_us_jobs', 528 | provide_context=True, 529 | op_kwargs={'search_terms': search_terms, 'country_percent': country_percent}, 530 | python_callable=_non_us_jobs 531 | ) 532 | 533 | finish = DummyOperator( 534 | task_id='finish', 535 | dag=dag) 536 | 537 | start >> us_jobs >> non_us_jobs >> finish 538 | -------------------------------------------------------------------------------- /dags/sql/.project: -------------------------------------------------------------------------------- 1 | <?xml version="1.0" encoding="UTF-8"?> 2 | <projectDescription> 3 | <name>Airflow_SQL</name> 4 | <comment></comment> 5 | <projects> 6 | </projects> 7 | <buildSpec> 8 | </buildSpec> 9 | <natures> 10 | <nature>org.jkiss.dbeaver.DBeaverNature</nature> 11 | </natures> 12 | </projectDescription> 13 | -------------------------------------------------------------------------------- /dags/sql/cache_csv.sql: -------------------------------------------------------------------------------- 1 | ---------------- 2 | -- 🛠️ Skill Page 3 | -- "Select All"/"Select All" export 4 | EXPORT DATA 5 | OPTIONS ( 6 | uri = 'gs://gsearch_share/cache/skills/skills-*.csv', 7 | format = 'CSV', 8 | overwrite = true, 9 | header = true, 10 | field_delimiter = ',') 11 | AS ( 12 | WITH all_time AS ( 13 | SELECT COUNT(*) as total 14 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 15 | ), 16 | last_7_days AS ( 17 | SELECT COUNT(*) as total 18 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 19 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 20 | ), 21 | last_30_days AS ( 22 | SELECT COUNT(*) as total 23 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 24 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 25 | ), 26 | ytd AS ( 27 | SELECT COUNT(*) as total 28 | FROM 
`job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 29 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 30 | ) 31 | 32 | -- All time 33 | SELECT 34 | keywords.element AS skill, 35 | COUNT(job_id) / (SELECT total FROM all_time) AS skill_percent, 36 | COUNT(job_id) AS skill_count, 37 | (SELECT total FROM all_time) AS total_jobs, 38 | 'All time' as timeframe 39 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 40 | UNNEST(keywords_all.list) AS keywords 41 | GROUP BY skill 42 | 43 | UNION ALL 44 | 45 | -- Last 7 days 46 | SELECT 47 | keywords.element AS skill, 48 | COUNT(job_id) / (SELECT total FROM last_7_days) AS skill_percent, 49 | COUNT(job_id) AS skill_count, 50 | (SELECT total FROM last_7_days) AS total_jobs, 51 | 'Last 7 days' as timeframe 52 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 53 | UNNEST(keywords_all.list) AS keywords 54 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 55 | GROUP BY skill 56 | 57 | UNION ALL 58 | 59 | -- Last 30 days 60 | SELECT 61 | keywords.element AS skill, 62 | COUNT(job_id) / (SELECT total FROM last_30_days) AS skill_percent, 63 | COUNT(job_id) AS skill_count, 64 | (SELECT total FROM last_30_days) AS total_jobs, 65 | 'Last 30 days' as timeframe 66 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 67 | UNNEST(keywords_all.list) AS keywords 68 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 69 | GROUP BY skill 70 | 71 | UNION ALL 72 | 73 | -- Year to date 74 | SELECT 75 | keywords.element AS skill, 76 | COUNT(job_id) / (SELECT total FROM ytd) AS skill_percent, 77 | COUNT(job_id) AS skill_count, 78 | (SELECT total FROM ytd) AS total_jobs, 79 | 'YTD' as timeframe 80 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 81 | UNNEST(keywords_all.list) AS keywords 82 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 83 | GROUP BY skill 84 | 85 | ORDER BY timeframe, skill_count DESC 86 | ); 87 | 88 | -- Slicer 89 | EXPORT DATA 90 | OPTIONS ( 91 | uri = 'gs://gsearch_share/cache/skills/slicer-*.csv', 92 | format = 'CSV', 93 | overwrite = true, 94 | header = true, 95 | field_delimiter = ',') 96 | AS ( 97 | SELECT 98 | job_title_final AS job_title, 99 | search_country, 100 | COUNT(*) AS job_count 101 | FROM 102 | `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 103 | WHERE search_country IS NOT NULL 104 | GROUP BY job_title, search_country 105 | ORDER BY job_count DESC 106 | ); 107 | 108 | -- Keywords 109 | EXPORT DATA 110 | OPTIONS ( 111 | uri = 'gs://gsearch_share/cache/skills/keywords-*.csv', 112 | format = 'CSV', 113 | overwrite = true, 114 | header = true, 115 | field_delimiter = ',') 116 | AS ( 117 | SELECT * FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_keywords 118 | ); 119 | 120 | EXPORT DATA 121 | OPTIONS ( 122 | uri = 'gs://gsearch_share/cache/skills/timeframes/alltime-*.csv', 123 | format = 'CSV', 124 | overwrite = true, 125 | header = true, 126 | field_delimiter = ',') 127 | AS ( 128 | WITH total_jobs_country_title AS ( 129 | SELECT job_title_final, search_country, COUNT(*) AS job_count 130 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 131 | GROUP BY job_title_final, search_country 132 | ), 133 | total_jobs_title AS ( 134 | SELECT job_title_final, COUNT(*) AS job_count 135 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 136 | GROUP BY job_title_final 137 | ), 138 | total_jobs_country AS ( 139 | SELECT 
search_country, COUNT(*) AS job_count 140 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 141 | GROUP BY search_country 142 | ) 143 | 144 | SELECT 145 | keywords.element AS skill, 146 | j.job_title_final, 147 | j.search_country, 148 | COUNT(j.job_id) / t.job_count AS skill_percent, 149 | COUNT(j.job_id) AS skill_count, 150 | t.job_count AS total_jobs 151 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 152 | UNNEST(keywords_all.list) AS keywords 153 | JOIN total_jobs_country_title t 154 | ON t.job_title_final = j.job_title_final 155 | AND t.search_country = j.search_country 156 | GROUP BY 157 | skill, 158 | j.job_title_final, 159 | j.search_country, 160 | t.job_count 161 | 162 | UNION ALL 163 | 164 | SELECT 165 | keywords.element AS skill, 166 | j.job_title_final, 167 | NULL AS search_country, 168 | COUNT(j.job_id) / t.job_count AS skill_percent, 169 | COUNT(j.job_id) AS skill_count, 170 | t.job_count AS total_jobs 171 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 172 | UNNEST(keywords_all.list) AS keywords 173 | JOIN total_jobs_title t 174 | ON t.job_title_final = j.job_title_final 175 | GROUP BY 176 | skill, 177 | j.job_title_final, 178 | t.job_count 179 | 180 | UNION ALL 181 | 182 | SELECT 183 | keywords.element AS skill, 184 | NULL AS job_title_final, 185 | j.search_country, 186 | COUNT(j.job_id) / t.job_count AS skill_percent, 187 | COUNT(j.job_id) AS skill_count, 188 | t.job_count AS total_jobs 189 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 190 | UNNEST(keywords_all.list) AS keywords 191 | JOIN total_jobs_country t 192 | ON t.search_country = j.search_country 193 | GROUP BY 194 | skill, 195 | j.search_country, 196 | t.job_count 197 | 198 | ORDER BY skill_count DESC 199 | ); 200 | 201 | EXPORT DATA 202 | OPTIONS ( 203 | uri = 'gs://gsearch_share/cache/skills/timeframes/7day-*.csv', 204 | format = 'CSV', 205 | overwrite = true, 206 | header = true, 207 | field_delimiter = ',') 208 | AS ( 209 | WITH total_jobs_country_title AS ( 210 | SELECT job_title_final, search_country, COUNT(*) AS job_count 211 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 212 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 213 | GROUP BY job_title_final, search_country 214 | ), 215 | total_jobs_title AS ( 216 | SELECT job_title_final, COUNT(*) AS job_count 217 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 218 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 219 | GROUP BY job_title_final 220 | ), 221 | total_jobs_country AS ( 222 | SELECT search_country, COUNT(*) AS job_count 223 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 224 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 225 | GROUP BY search_country 226 | ) 227 | 228 | SELECT 229 | keywords.element AS skill, 230 | j.job_title_final, 231 | j.search_country, 232 | COUNT(j.job_id) / t.job_count AS skill_percent, 233 | COUNT(j.job_id) AS skill_count, 234 | t.job_count AS total_jobs 235 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 236 | UNNEST(keywords_all.list) AS keywords 237 | JOIN total_jobs_country_title t 238 | ON t.job_title_final = j.job_title_final 239 | AND t.search_country = j.search_country 240 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 241 | GROUP BY 242 | skill, 243 | j.job_title_final, 244 | j.search_country, 245 | t.job_count 246 | 247 | UNION ALL 
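-- The two SELECTs below repeat the same 7-day aggregation as roll-ups: first grouped by job title only (search_country returned as NULL), then by country only (job_title_final returned as NULL), so either slicer can be applied on its own.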
248 | 249 | SELECT 250 | keywords.element AS skill, 251 | j.job_title_final, 252 | NULL AS search_country, 253 | COUNT(j.job_id) / t.job_count AS skill_percent, 254 | COUNT(j.job_id) AS skill_count, 255 | t.job_count AS total_jobs 256 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 257 | UNNEST(keywords_all.list) AS keywords 258 | JOIN total_jobs_title t 259 | ON t.job_title_final = j.job_title_final 260 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 261 | GROUP BY 262 | skill, 263 | j.job_title_final, 264 | t.job_count 265 | 266 | UNION ALL 267 | 268 | SELECT 269 | keywords.element AS skill, 270 | NULL AS job_title_final, 271 | j.search_country, 272 | COUNT(j.job_id) / t.job_count AS skill_percent, 273 | COUNT(j.job_id) AS skill_count, 274 | t.job_count AS total_jobs 275 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 276 | UNNEST(keywords_all.list) AS keywords 277 | JOIN total_jobs_country t 278 | ON t.search_country = j.search_country 279 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) 280 | GROUP BY 281 | skill, 282 | j.search_country, 283 | t.job_count 284 | 285 | ORDER BY skill_count DESC 286 | ); 287 | 288 | EXPORT DATA 289 | OPTIONS ( 290 | uri = 'gs://gsearch_share/cache/skills/timeframes/30day-*.csv', 291 | format = 'CSV', 292 | overwrite = true, 293 | header = true, 294 | field_delimiter = ',') 295 | AS ( 296 | WITH total_jobs_country_title AS ( 297 | SELECT job_title_final, search_country, COUNT(*) AS job_count 298 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 299 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 300 | GROUP BY job_title_final, search_country 301 | ), 302 | total_jobs_title AS ( 303 | SELECT job_title_final, COUNT(*) AS job_count 304 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 305 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 306 | GROUP BY job_title_final 307 | ), 308 | total_jobs_country AS ( 309 | SELECT search_country, COUNT(*) AS job_count 310 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 311 | WHERE search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 312 | GROUP BY search_country 313 | ) 314 | 315 | SELECT 316 | keywords.element AS skill, 317 | j.job_title_final, 318 | j.search_country, 319 | COUNT(j.job_id) / t.job_count AS skill_percent, 320 | COUNT(j.job_id) AS skill_count, 321 | t.job_count AS total_jobs 322 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 323 | UNNEST(keywords_all.list) AS keywords 324 | JOIN total_jobs_country_title t 325 | ON t.job_title_final = j.job_title_final 326 | AND t.search_country = j.search_country 327 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 328 | GROUP BY 329 | skill, 330 | j.job_title_final, 331 | j.search_country, 332 | t.job_count 333 | 334 | UNION ALL 335 | 336 | SELECT 337 | keywords.element AS skill, 338 | j.job_title_final, 339 | NULL AS search_country, 340 | COUNT(j.job_id) / t.job_count AS skill_percent, 341 | COUNT(j.job_id) AS skill_count, 342 | t.job_count AS total_jobs 343 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 344 | UNNEST(keywords_all.list) AS keywords 345 | JOIN total_jobs_title t 346 | ON t.job_title_final = j.job_title_final 347 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 348 | GROUP BY 349 | skill, 350 | j.job_title_final, 351 | t.job_count 352 | 353 | UNION ALL 354 | 355 
| SELECT 356 | keywords.element AS skill, 357 | NULL AS job_title_final, 358 | j.search_country, 359 | COUNT(j.job_id) / t.job_count AS skill_percent, 360 | COUNT(j.job_id) AS skill_count, 361 | t.job_count AS total_jobs 362 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 363 | UNNEST(keywords_all.list) AS keywords 364 | JOIN total_jobs_country t 365 | ON t.search_country = j.search_country 366 | WHERE j.search_time >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 367 | GROUP BY 368 | skill, 369 | j.search_country, 370 | t.job_count 371 | 372 | ORDER BY skill_count DESC 373 | ); 374 | 375 | EXPORT DATA 376 | OPTIONS ( 377 | uri = 'gs://gsearch_share/cache/skills/timeframes/ytd-*.csv', 378 | format = 'CSV', 379 | overwrite = true, 380 | header = true, 381 | field_delimiter = ',') 382 | AS ( 383 | WITH total_jobs_country_title AS ( 384 | SELECT job_title_final, search_country, COUNT(*) AS job_count 385 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 386 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 387 | GROUP BY job_title_final, search_country 388 | ), 389 | total_jobs_title AS ( 390 | SELECT job_title_final, COUNT(*) AS job_count 391 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 392 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 393 | GROUP BY job_title_final 394 | ), 395 | total_jobs_country AS ( 396 | SELECT search_country, COUNT(*) AS job_count 397 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 398 | WHERE search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 399 | GROUP BY search_country 400 | ) 401 | 402 | SELECT 403 | keywords.element AS skill, 404 | j.job_title_final, 405 | j.search_country, 406 | COUNT(j.job_id) / t.job_count AS skill_percent, 407 | COUNT(j.job_id) AS skill_count, 408 | t.job_count AS total_jobs 409 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 410 | UNNEST(keywords_all.list) AS keywords 411 | JOIN total_jobs_country_title t 412 | ON t.job_title_final = j.job_title_final 413 | AND t.search_country = j.search_country 414 | WHERE j.search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 415 | GROUP BY 416 | skill, 417 | j.job_title_final, 418 | j.search_country, 419 | t.job_count 420 | 421 | UNION ALL 422 | 423 | SELECT 424 | keywords.element AS skill, 425 | j.job_title_final, 426 | NULL AS search_country, 427 | COUNT(j.job_id) / t.job_count AS skill_percent, 428 | COUNT(j.job_id) AS skill_count, 429 | t.job_count AS total_jobs 430 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 431 | UNNEST(keywords_all.list) AS keywords 432 | JOIN total_jobs_title t 433 | ON t.job_title_final = j.job_title_final 434 | WHERE j.search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 435 | GROUP BY 436 | skill, 437 | j.job_title_final, 438 | t.job_count 439 | 440 | UNION ALL 441 | 442 | SELECT 443 | keywords.element AS skill, 444 | NULL AS job_title_final, 445 | j.search_country, 446 | COUNT(j.job_id) / t.job_count AS skill_percent, 447 | COUNT(j.job_id) AS skill_count, 448 | t.job_count AS total_jobs 449 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 450 | UNNEST(keywords_all.list) AS keywords 451 | JOIN total_jobs_country t 452 | ON t.search_country = j.search_country 453 | WHERE j.search_time >= DATE_TRUNC(CURRENT_DATE(), YEAR) 454 | GROUP BY 455 | skill, 456 | j.search_country, 457 | t.job_count 458 | 459 | ORDER BY skill_count DESC 460 | ); 461 | 462 | ---------------- 463 | -- 🕒 Skills 
Trend Page 464 | -- "Select All"/"Select All" export 465 | EXPORT DATA 466 | OPTIONS ( 467 | uri = 'gs://gsearch_share/cache/skill-trend/skill-trend-*.csv', 468 | format = 'CSV', 469 | overwrite = true, 470 | header = true, 471 | field_delimiter = ',') 472 | AS ( 473 | WITH top_skills AS ( 474 | SELECT keywords.element AS skill 475 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 476 | UNNEST(keywords_all.list) AS keywords 477 | WHERE 1 = 1 478 | GROUP BY skill 479 | ORDER BY COUNT(*) DESC 480 | LIMIT 5 481 | ), 482 | 483 | skill_counts AS ( 484 | SELECT date, skill, SUM(daily_skill_count) AS daily_skill_count 485 | FROM ( 486 | SELECT DATE(search_time) AS date, 487 | keywords.element AS skill, 488 | COUNT(*) as daily_skill_count 489 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 490 | UNNEST(keywords_all.list) AS keywords 491 | WHERE 1 = 1 492 | AND keywords.element IN (SELECT skill FROM top_skills) 493 | GROUP BY date, skill 494 | ) 495 | GROUP BY date, skill 496 | ), 497 | 498 | total_jobs AS ( 499 | SELECT COUNT(*) 500 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 501 | WHERE 1 = 1 502 | ), 503 | 504 | total_jobs_grouped AS ( 505 | SELECT date, SUM(daily_total_count) OVER ( 506 | ORDER BY date 507 | ROWS BETWEEN 13 PRECEDING AND CURRENT ROW 508 | ) as rolling_total_count 509 | FROM ( 510 | SELECT DATE(search_time) AS date, COUNT(*) AS daily_total_count 511 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 512 | WHERE 1 = 1 513 | GROUP BY date 514 | ) 515 | ) 516 | 517 | SELECT sc.date, 518 | sc.skill, 519 | (SUM(sc.daily_skill_count) OVER ( 520 | PARTITION BY sc.skill 521 | ORDER BY sc.date 522 | ROWS BETWEEN 13 PRECEDING AND CURRENT ROW 523 | ) / tjg.rolling_total_count) as skill_percentage, 524 | (SELECT * FROM total_jobs) AS total_jobs 525 | FROM skill_counts sc 526 | JOIN total_jobs_grouped tjg ON sc.date = tjg.date 527 | ORDER BY sc.date DESC, skill_percentage DESC 528 | ); 529 | 530 | ---------------- 531 | -- 💰 Skill-Pay Page 532 | -- "Select All"/"Select All" export 533 | EXPORT DATA 534 | OPTIONS ( 535 | uri = 'gs://gsearch_share/cache/skill-pay/skill-pay-*.csv', 536 | format = 'CSV', 537 | overwrite = true, 538 | header = true, 539 | field_delimiter = ',') 540 | AS ( 541 | WITH total_jobs AS ( 542 | SELECT COUNT(*) 543 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 544 | -- {job_choice_query} AND salary_year IS NOT NULL 545 | WHERE salary_year IS NOT NULL 546 | ) 547 | 548 | SELECT 549 | keywords.element AS skill, 550 | AVG(salary_year) AS avg, 551 | MIN(salary_year) AS min, 552 | MAX(salary_year) AS max, 553 | APPROX_QUANTILES(salary_year,2)[OFFSET(1)] AS median, 554 | COUNT(job_id) AS count, 555 | (SELECT * FROM total_jobs) AS total_jobs 556 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide, 557 | UNNEST(keywords_all.list) AS keywords 558 | -- {job_choice_query} AND salary_year IS NOT NULL 559 | WHERE salary_year IS NOT NULL 560 | GROUP BY skill 561 | ORDER BY count DESC 562 | ); 563 | 564 | -- Slicer 565 | EXPORT DATA 566 | OPTIONS ( 567 | uri = 'gs://gsearch_share/cache/skill-pay/slicer-*.csv', 568 | format = 'CSV', 569 | overwrite = true, 570 | header = true, 571 | field_delimiter = ',') 572 | AS ( 573 | SELECT 574 | job_title_final AS job_title, 575 | search_country, 576 | COUNT(*) AS job_count 577 | FROM 578 | `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 579 | WHERE search_country IS NOT NULL 
AND salary_year IS NOT NULL 580 | GROUP BY job_title, search_country 581 | ORDER BY job_count DESC 582 | ); 583 | 584 | -- Selection export 585 | EXPORT DATA 586 | OPTIONS ( 587 | uri = 'gs://gsearch_share/cache/skill-pay/skill-pay-all-*.csv', 588 | format = 'CSV', 589 | overwrite = true, 590 | header = true, 591 | field_delimiter = ',') 592 | AS ( 593 | WITH numbered_jobs AS ( 594 | SELECT DISTINCT job_id, 595 | ROW_NUMBER() OVER (ORDER BY job_id) as job_number 596 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 597 | ) 598 | 599 | SELECT 600 | n.job_number, 601 | j.job_title_final, 602 | j.search_country, 603 | j.search_time, 604 | keywords.element AS skill, 605 | APPROX_QUANTILES(j.salary_year, 2)[OFFSET(1)] AS median 606 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide j, 607 | UNNEST(keywords_all.list) AS keywords 608 | JOIN numbered_jobs n ON j.job_id = n.job_id 609 | WHERE j.salary_year IS NOT NULL 610 | GROUP BY n.job_number, j.job_title_final, j.search_country, j.search_time, skill, j.job_id 611 | ORDER BY j.search_time DESC 612 | ); 613 | 614 | ---------------- 615 | -- 💸 Job Salaries Page 616 | -- (No Selections) export 617 | EXPORT DATA 618 | OPTIONS ( 619 | uri = 'gs://gsearch_share/cache/salary/salary-*.csv', 620 | format = 'CSV', 621 | overwrite = true, 622 | header = true, 623 | field_delimiter = ',') 624 | AS ( 625 | SELECT * FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_salary_wide 626 | ); 627 | 628 | ---------------- 629 | -- 🏥 Health Page exports 630 | -- Calculate num_jobs 631 | EXPORT DATA 632 | OPTIONS ( 633 | uri = 'gs://gsearch_share/cache/health/num-jobs-*.csv', 634 | format = 'CSV', 635 | overwrite = true, 636 | header = true, 637 | field_delimiter = ',') 638 | AS ( 639 | SELECT COUNT(*) AS num_jobs, 640 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 641 | ); 642 | 643 | -- Calculate dates and missing dates 644 | EXPORT DATA 645 | OPTIONS ( 646 | uri = 'gs://gsearch_share/cache/health/dates-*.csv', 647 | format = 'CSV', 648 | overwrite = true, 649 | header = true, 650 | field_delimiter = ',') 651 | AS ( 652 | SELECT DISTINCT CAST(search_time AS DATE) AS search_date, 653 | COUNT(job_id) AS jobs_daily 654 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 655 | GROUP BY search_date 656 | ORDER BY search_date 657 | ); 658 | 659 | -- Find last update 660 | EXPORT DATA 661 | OPTIONS ( 662 | uri = 'gs://gsearch_share/cache/health/last-update-*.csv', 663 | format = 'CSV', 664 | overwrite = true, 665 | header = true, 666 | field_delimiter = ',') 667 | AS ( 668 | SELECT MAX(search_time) as last_update 669 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 670 | ); 671 | -------------------------------------------------------------------------------- /dags/sql/fact_build.sql: -------------------------------------------------------------------------------- 1 | -- pre-JSON table create 2 | -- create table in format needed to combine with JSON data; only need to run once 3 | -- BKGD: 'gsearch_job_listings.gsearch_jobs_all' was first used to collect data before collecting full JSON 4 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings.gsearch_jobs_all_json_version AS 5 | (SELECT title AS job_title, 6 | company_name, 7 | location AS job_location, 8 | via AS job_via, 9 | description AS job_description, 10 | extensions as job_extensions, 11 | job_id, 12 | thumbnail AS company_thumbnail, 13 | posted_at AS job_posted_at, 14 | 
schedule_type AS job_schedule_type, 15 | work_from_home AS job_work_from_home, 16 | salary AS job_salary, 17 | search_term, 18 | search_location, 19 | date_time AS search_time, 20 | commute_time AS job_commute_time, 21 | FROM `job-listings-366015`.gsearch_job_listings.gsearch_jobs_all 22 | -- started collecting data in 'gsearch_jobs_all_json' on 1-1-2023 (142,416 records) 23 | -- No longer need to use data from this 'back-up' table after this period 24 | WHERE date_time < (SELECT MIN(search_time) 25 | FROM `job-listings-366015`.gsearch_job_listings_json.gsearch_jobs_all_json)); 26 | 27 | -- JSON table create 28 | -- 43 second run-time (175K rows) on 05Jan23, 4 mins (450K rows) on 09Mar2023 29 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact AS 30 | (WITH gsearch_json_all AS 31 | -- Extract JSON data and combine with pre-JSON data 32 | (SELECT JSON_EXTRACT_SCALAR(jobs_results.job_id) AS job_id, 33 | JSON_EXTRACT_SCALAR(jobs_results.title) AS job_title, 34 | JSON_EXTRACT_SCALAR(jobs_results.company_name) AS company_name, 35 | JSON_EXTRACT_SCALAR(jobs_results.location) AS job_location, 36 | JSON_EXTRACT_SCALAR(jobs_results.via) AS job_via, 37 | JSON_EXTRACT_SCALAR(jobs_results.description) AS job_description, 38 | -- how to get all 'job_highlights' 39 | -- JSON_EXTRACT(jobs_results.job_highlights, '$') AS job_highlights_all, 40 | CASE 41 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[0].title')) = 42 | 'Qualifications' 43 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[0].items') 44 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[1].title')) = 45 | 'Qualifications' 46 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[1].items') 47 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[2].title')) = 48 | 'Qualifications' 49 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[2].items') 50 | END AS job_highlights_qualifications, 51 | CASE 52 | 53 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[0].title')) = 54 | 'Responsibilities' 55 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[0].items') 56 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[1].title')) = 57 | 'Responsibilities' 58 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[1].items') 59 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[2].title')) = 60 | 'Responsibilities' 61 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[2].items') 62 | END AS job_highlights_responsibilities, 63 | CASE 64 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[0].title')) = 65 | 'Benefits' 66 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[0].items') 67 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[1].title')) = 68 | 'Benefits' 69 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[1].items') 70 | WHEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.job_highlights, '$[2].title')) = 71 | 'Benefits' 72 | THEN JSON_EXTRACT(jobs_results.job_highlights, '$[2].items') 73 | END AS job_highlights_benefits, 74 | JSON_EXTRACT_SCALAR(jobs_results.detected_extensions.posted_at) AS job_posted_at, 75 | JSON_EXTRACT_SCALAR(jobs_results.detected_extensions.salary) AS job_salary, 76 | JSON_EXTRACT_SCALAR(jobs_results.detected_extensions.schedule_type) AS job_schedule_type, 77 | CAST(JSON_EXTRACT_SCALAR( 78 | jobs_results.detected_extensions.work_from_home) AS BOOL) AS job_work_from_home, 79 | 
JSON_EXTRACT_SCALAR(jobs_results.detected_extensions.commute_time) AS job_commute_time, 80 | JSON_VALUE_ARRAY(jobs_results, '$.extensions') AS job_extensions, 81 | CASE 82 | WHEN LEFT( 83 | JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[0].link')), 84 | 22) != 85 | 'https://www.google.com' 86 | THEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[0].link')) 87 | END 88 | AS company_link, 89 | CASE 90 | WHEN LEFT( 91 | JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[0].text')), 92 | 7) = 'See web' 93 | THEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[0].link')) 94 | WHEN LEFT( 95 | JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[1].text')), 96 | 7) = 'See web' 97 | THEN JSON_EXTRACT_SCALAR(JSON_EXTRACT(jobs_results.related_links, '$[1].link')) 98 | END 99 | AS company_link_google, 100 | JSON_EXTRACT_SCALAR(jobs_results.thumbnail) AS company_thumbnail, 101 | error, 102 | search_term, 103 | search_location, 104 | search_time, 105 | search_id, 106 | FROM `job-listings-366015`.gsearch_job_listings_json.gsearch_jobs_all_json 107 | -- Used to unpack original JSON 108 | LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(results.jobs_results)) AS jobs_results 109 | WHERE error = false 110 | -- Union with previous data not previously saved as JSON 111 | UNION ALL 112 | SELECT job_id, 113 | job_title, 114 | company_name, 115 | job_location, 116 | job_via, 117 | job_description, 118 | null as job_highlights_qualifications, 119 | null as job_highlights_responsibilities, 120 | null as job_highlights_benefits, 121 | job_posted_at, 122 | job_salary, 123 | job_schedule_type, 124 | job_work_from_home, 125 | job_commute_time, 126 | job_extensions, 127 | null as company_link, 128 | null as company_link_google, 129 | company_thumbnail, 130 | null as error, 131 | search_term, 132 | search_location, 133 | search_time, 134 | null as search_id, 135 | FROM `job-listings-366015`.gsearch_job_listings.gsearch_jobs_all_json_version) 136 | -- clean table once combined for those with common fields 137 | SELECT *, 138 | CASE 139 | WHEN "No degree mentioned" IN UNNEST(job_extensions) 140 | THEN true 141 | END AS job_no_degree_mention, 142 | CASE 143 | WHEN "Health insurance" IN UNNEST(job_extensions) 144 | THEN true 145 | END AS job_health_insurance, 146 | FROM gsearch_json_all); 147 | -------------------------------------------------------------------------------- /dags/sql/public_build.sql: -------------------------------------------------------------------------------- 1 | -- Public table build for Langchain interaction 2 | -- 8 second run-time (1.2M rows) on 5APR24 3 | -- Does some minor cleanup like: 4 | -- 1. Removing duplicates based on PARTITION BY company_name, job_title, job_schedule_type, job_description, job_location 5 | -- 2. removing the "via" from the job_via column (I did this in the course so want to be consistent) 6 | -- 3. removing the job_title_clean column (as it may be blank for some entries); job_title_final is the one to use 7 | -- 4. converting the keywords_all.list array to a simple array 8 | -- 5. 
converting the job_work_from_home, job_no_degree_mention, job_health_insurance to boolean (fill in na as false) 9 | -- columns not used: job_via, job_title_clean, job_description, job_posted_at, job_salary, job_commute_time, company_link, company_link_google, company_thumbnail, job_highlights_qualifications, job_highlights_responsibilities, job_highlights_benefits, job_extensions, error, search_id, search_term, search_location, keywords_programming, keywords_databases, keywords_cloud, keywords_libraries, keywords_webframeworks, keywords_os, keywords_analyst_tools, keywords_other, keywords_async, keywords_sync, salary_pay, salary_avg, salary_min, salary_max 10 | 11 | CREATE OR REPLACE TABLE `job-listings-366015`.public_job_listings.data_nerd_jobs AS 12 | SELECT 13 | job_title_final, 14 | job_title AS job_title_original, 15 | company_name, 16 | job_location, 17 | search_time AS job_posted_at, 18 | REPLACE(job_via, 'via ', '') AS job_posting_site, 19 | job_schedule_type, 20 | IFNULL(job_work_from_home, FALSE) AS job_work_from_home, 21 | IFNULL(job_no_degree_mention, FALSE) AS job_no_degree_mention, 22 | IFNULL(job_health_insurance, FALSE) AS job_health_insurance, 23 | ARRAY( 24 | SELECT x.element 25 | FROM UNNEST(keywords_all.list) AS x 26 | ) AS job_keywords 27 | FROM ( 28 | SELECT 29 | *, 30 | ROW_NUMBER() OVER (PARTITION BY company_name, job_title, job_schedule_type, job_description, job_location) AS rn 31 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 32 | ) 33 | WHERE rn = 1; -------------------------------------------------------------------------------- /dags/sql/wide_build.sql: -------------------------------------------------------------------------------- 1 | -- Final wide table build that combines fact and dimension tables 2 | -- 77 second run-time (550K rows) on 09Mar23 3 | 4 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide AS 5 | (SELECT 6 | CASE 7 | WHEN job_title_clean IS NULL THEN search_term 8 | ELSE job_title_clean 9 | END AS job_title_final, 10 | t.* EXCEPT (job_title), 11 | f.*, 12 | s.* EXCEPT (job_id, job_description), 13 | c.* EXCEPT (search_location), 14 | d.* EXCEPT (job_id) 15 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact AS f 16 | LEFT JOIN `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills AS s ON f.job_id = s.job_id 17 | LEFT JOIN `job-listings-366015`.gsearch_job_listings_clean.gsearch_country AS c ON c.search_location = f.search_location 18 | LEFT JOIN `job-listings-366015`.gsearch_job_listings_clean.gsearch_salary AS d ON d.job_id = f.job_id 19 | LEFT JOIN `job-listings-366015`.gsearch_job_listings_clean.gsearch_job_title AS t ON t.job_title = f.job_title 20 | ); 21 | 22 | -- Keywords table 23 | -- Had to create a physical table as unable to figure out how to export this query to one CSV for cache of website 24 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings_clean.gsearch_keywords AS 25 | (WITH keywords AS ( 26 | SELECT DISTINCT keywords_all AS element, 27 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 28 | FROM ( 29 | SELECT DISTINCT keywords_all.element AS keywords_all 30 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 31 | UNNEST(keywords_all.list) AS keywords_all 32 | ) AS k, 33 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 34 | ), keywords_programming AS ( 35 | SELECT DISTINCT keywords_programming AS element, 36 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 37 | FROM ( 38 | SELECT DISTINCT 
keywords_programming.element AS keywords_programming 39 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 40 | UNNEST(keywords_programming.list) AS keywords_programming 41 | ) AS k, 42 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 43 | ), keywords_databases AS ( 44 | SELECT DISTINCT keywords_databases AS element, 45 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 46 | FROM ( 47 | SELECT DISTINCT keywords_databases.element AS keywords_databases 48 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 49 | UNNEST(keywords_databases.list) AS keywords_databases 50 | ) AS k, 51 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 52 | ), keywords_cloud AS ( 53 | SELECT DISTINCT keywords_cloud AS element, 54 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 55 | FROM ( 56 | SELECT DISTINCT keywords_cloud.element AS keywords_cloud 57 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 58 | UNNEST(keywords_cloud.list) AS keywords_cloud 59 | ) AS k, 60 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 61 | ), keywords_libraries AS ( 62 | SELECT DISTINCT keywords_libraries AS element, 63 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 64 | FROM ( 65 | SELECT DISTINCT keywords_libraries.element AS keywords_libraries 66 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 67 | UNNEST(keywords_libraries.list) AS keywords_libraries 68 | ) AS k, 69 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 70 | ), keywords_webframeworks AS ( 71 | SELECT DISTINCT keywords_webframeworks AS element, 72 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 73 | FROM ( 74 | SELECT DISTINCT keywords_webframeworks.element AS keywords_webframeworks 75 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 76 | UNNEST(keywords_webframeworks.list) AS keywords_webframeworks 77 | ) AS k, 78 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 79 | ), keywords_os AS ( 80 | SELECT DISTINCT keywords_os AS element, 81 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 82 | FROM ( 83 | SELECT DISTINCT keywords_os.element AS keywords_os 84 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 85 | UNNEST(keywords_os.list) AS keywords_os 86 | ) AS k, 87 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 88 | ), keywords_analyst_tools AS ( 89 | SELECT DISTINCT keywords_analyst_tools AS element, 90 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 91 | FROM ( 92 | SELECT DISTINCT keywords_analyst_tools.element AS keywords_analyst_tools 93 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 94 | UNNEST(keywords_analyst_tools.list) AS keywords_analyst_tools 95 | ) AS k, 96 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 97 | ), keywords_other AS ( 98 | SELECT DISTINCT keywords_other AS element, 99 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 100 | FROM ( 101 | SELECT DISTINCT keywords_other.element AS keywords_other 102 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 103 | UNNEST(keywords_other.list) AS keywords_other 104 | ) AS k, 105 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), '"{}', ''))) kv 106 | ), keywords_async AS ( 107 | SELECT DISTINCT keywords_async AS element, 108 | SPLIT(kv, ':')[OFFSET(0)] as keyword, 109 | FROM ( 110 | SELECT DISTINCT keywords_async.element AS keywords_async 111 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills, 112 | UNNEST(keywords_async.list) AS keywords_async 113 | ) AS k, 114 | UNNEST(SPLIT(TRANSLATE(TO_JSON_STRING(k), 
'"{}', ''))) kv 115 | ) 116 | 117 | SELECT * FROM keywords 118 | UNION ALL 119 | SELECT * FROM keywords_programming 120 | UNION ALL 121 | SELECT * FROM keywords_databases 122 | UNION ALL 123 | SELECT * FROM keywords_cloud 124 | UNION ALL 125 | SELECT * FROM keywords_libraries 126 | UNION ALL 127 | SELECT * FROM keywords_webframeworks 128 | UNION ALL 129 | SELECT * FROM keywords_os 130 | UNION ALL 131 | SELECT * FROM keywords_analyst_tools 132 | UNION ALL 133 | SELECT * FROM keywords_other 134 | UNION ALL 135 | SELECT * FROM keywords_async 136 | ); 137 | 138 | 139 | -- Salary table for salary page 140 | -- Had to create a physical table as unable to figure out how to export this query to one CSV for cache of website 141 | CREATE OR REPLACE TABLE `job-listings-366015`.gsearch_job_listings_clean.gsearch_salary_wide AS 142 | (SELECT 143 | job_title_final AS job_title, 144 | search_term, 145 | salary_avg, 146 | salary_min, 147 | salary_max, 148 | salary_year, 149 | salary_hour, 150 | search_location, 151 | job_location, 152 | job_schedule_type, 153 | job_via, 154 | search_country, 155 | search_time, 156 | FROM 157 | `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_wide 158 | WHERE salary_avg IS NOT NULL 159 | ); -------------------------------------------------------------------------------- /docker-compose.yaml: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, 12 | # software distributed under the License is distributed on an 13 | # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY 14 | # KIND, either express or implied. See the License for the 15 | # specific language governing permissions and limitations 16 | # under the License. 17 | # 18 | 19 | # Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL. 20 | # 21 | # WARNING: This configuration is for local development. Do not use it in a production deployment. 22 | # 23 | # This configuration supports basic configuration using environment variables or an .env file 24 | # The following variables are supported: 25 | # 26 | # AIRFLOW_IMAGE_NAME - Docker image name used to run Airflow. 27 | # Default: apache/airflow:2.5.0 28 | # AIRFLOW_UID - User ID in Airflow containers 29 | # Default: 50000 30 | # Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode 31 | # 32 | # _AIRFLOW_WWW_USER_USERNAME - Username for the administrator account (if requested). 33 | # Default: airflow 34 | # _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator account (if requested). 35 | # Default: airflow 36 | # _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers. 37 | # Default: '' 38 | # 39 | # Feel free to modify this file to suit your needs. 40 | --- 41 | version: '3' 42 | x-airflow-common: 43 | &airflow-common 44 | # In order to add custom dependencies or upgrade provider packages you can use your extended image. 
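# (That is the approach taken here: the stock image line is replaced below and "build: ." builds the image from the local Dockerfile.)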
45 | # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml 46 | # and uncomment the "build" line below, then run `docker-compose build` to build the images. 47 | # https://stackoverflow.com/questions/67887138/how-to-install-packages-in-airflow-docker-compose 48 | # REPLACED # image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.5.0} 49 | build: . 50 | environment: 51 | &airflow-common-env 52 | AIRFLOW__CORE__EXECUTOR: CeleryExecutor 53 | AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow 54 | # For backward compatibility with Airflow <2.3 55 | AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow 56 | AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow 57 | AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0 58 | AIRFLOW__CORE__FERNET_KEY: '' 59 | AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'false' 60 | # Stopped loading core examples as they cluttered up the UI 61 | AIRFLOW__CORE__LOAD_EXAMPLES: 'false' 62 | AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth' 63 | # For email 64 | AIRFLOW__SMTP__SMTP_HOST: smtp.gmail.com 65 | AIRFLOW__SMTP__SMTP_PORT: 587 66 | AIRFLOW__SMTP__SMTP_USER: ${AIRFLOW__SMTP__SMTP_USER} 67 | AIRFLOW__SMTP__SMTP_PASSWORD: ${AIRFLOW__SMTP__SMTP_PASSWORD} 68 | AIRFLOW__SMTP__SMTP_MAIL_FROM: ${AIRFLOW__SMTP__SMTP_USER} 69 | # 'docker-compose up' failed with additional pip requirements here; it's also bad practice, so build a separate image instead (see the link above) 70 | _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-} # pandas google-search-results google-cloud-bigquery numpy matplotlib configparser 71 | GOOGLE_APPLICATION_CREDENTIALS: './dags/config/job-listings-366015-9ac668151dc3.json' 72 | volumes: 73 | - ./dags:/opt/airflow/dags 74 | - ./logs:/opt/airflow/logs 75 | - ./plugins:/opt/airflow/plugins 76 | user: "${AIRFLOW_UID:-50000}:0" 77 | depends_on: 78 | &airflow-common-depends-on 79 | redis: 80 | condition: service_healthy 81 | postgres: 82 | condition: service_healthy 83 | 84 | services: 85 | postgres: 86 | image: postgres:13 87 | environment: 88 | POSTGRES_USER: airflow 89 | POSTGRES_PASSWORD: airflow 90 | POSTGRES_DB: airflow 91 | volumes: 92 | - postgres-db-volume:/var/lib/postgresql/data 93 | healthcheck: 94 | test: ["CMD", "pg_isready", "-U", "airflow"] 95 | interval: 5s 96 | retries: 5 97 | restart: always 98 | 99 | redis: 100 | image: redis:latest 101 | expose: 102 | - 6379 103 | healthcheck: 104 | test: ["CMD", "redis-cli", "ping"] 105 | interval: 5s 106 | timeout: 30s 107 | retries: 50 108 | restart: always 109 | 110 | airflow-webserver: 111 | <<: *airflow-common 112 | command: webserver 113 | # Changed the published port to 8081 to not conflict with the QNAP port of 8080 114 | ports: 115 | - 8081:8080 116 | healthcheck: 117 | test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] # healthcheck runs inside the container, so it targets the internal port 8080, not the published 8081 118 | interval: 10s 119 | timeout: 10s 120 | retries: 5 121 | restart: always 122 | depends_on: 123 | <<: *airflow-common-depends-on 124 | airflow-init: 125 | condition: service_completed_successfully 126 | 127 | airflow-scheduler: 128 | <<: *airflow-common 129 | command: scheduler 130 | healthcheck: 131 | test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"'] 132 | interval: 10s 133 | timeout: 10s 134 | retries: 5 135 | restart: always 136 | depends_on: 137 | <<: *airflow-common-depends-on 138 | airflow-init: 139 | condition: service_completed_successfully 140 | 
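  # The ${...} references in the common environment above (SMTP user/password, UID, extra pip
  # requirements) are resolved by docker-compose from the shell environment or from a .env file
  # next to this docker-compose.yaml. Illustrative .env entries for the email settings
  # (placeholder values, not from this repo; Gmail SMTP typically requires an app password):
  #   AIRFLOW__SMTP__SMTP_USER=you@example.com
  #   AIRFLOW__SMTP__SMTP_PASSWORD=your-gmail-app-password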
141 | airflow-worker: 142 | <<: *airflow-common 143 | command: celery worker 144 | healthcheck: 145 | test: 146 | - "CMD-SHELL" 147 | - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"' 148 | interval: 10s 149 | timeout: 10s 150 | retries: 5 151 | environment: 152 | <<: *airflow-common-env 153 | # Required to handle warm shutdown of the celery workers properly 154 | # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation 155 | DUMB_INIT_SETSID: "0" 156 | restart: always 157 | depends_on: 158 | <<: *airflow-common-depends-on 159 | airflow-init: 160 | condition: service_completed_successfully 161 | 162 | airflow-triggerer: 163 | <<: *airflow-common 164 | command: triggerer 165 | healthcheck: 166 | test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"'] 167 | interval: 10s 168 | timeout: 10s 169 | retries: 5 170 | restart: always 171 | depends_on: 172 | <<: *airflow-common-depends-on 173 | airflow-init: 174 | condition: service_completed_successfully 175 | 176 | airflow-init: 177 | <<: *airflow-common 178 | entrypoint: /bin/bash 179 | # yamllint disable rule:line-length 180 | command: 181 | - -c 182 | - | 183 | function ver() { 184 | printf "%04d%04d%04d%04d" $${1//./ } 185 | } 186 | airflow_version=$$(AIRFLOW__LOGGING__LOGGING_LEVEL=INFO && gosu airflow airflow version) 187 | airflow_version_comparable=$$(ver $${airflow_version}) 188 | min_airflow_version=2.2.0 189 | min_airflow_version_comparable=$$(ver $${min_airflow_version}) 190 | if (( airflow_version_comparable < min_airflow_version_comparable )); then 191 | echo 192 | echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m" 193 | echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!" 194 | echo 195 | exit 1 196 | fi 197 | if [[ -z "${AIRFLOW_UID}" ]]; then 198 | echo 199 | echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m" 200 | echo "If you are on Linux, you SHOULD follow the instructions below to set " 201 | echo "AIRFLOW_UID environment variable, otherwise files will be owned by root." 202 | echo "For other operating systems you can get rid of the warning with manually created .env file:" 203 | echo " See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user" 204 | echo 205 | fi 206 | one_meg=1048576 207 | mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg)) 208 | cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat) 209 | disk_available=$$(df / | tail -1 | awk '{print $$4}') 210 | warning_resources="false" 211 | if (( mem_available < 4000 )) ; then 212 | echo 213 | echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m" 214 | echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))" 215 | echo 216 | warning_resources="true" 217 | fi 218 | if (( cpus_available < 2 )); then 219 | echo 220 | echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m" 221 | echo "At least 2 CPUs recommended. You have $${cpus_available}" 222 | echo 223 | warning_resources="true" 224 | fi 225 | if (( disk_available < one_meg * 10 )); then 226 | echo 227 | echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m" 228 | echo "At least 10 GBs recommended. 
You have $$(numfmt --to iec $$((disk_available * 1024 )))" 229 | echo 230 | warning_resources="true" 231 | fi 232 | if [[ $${warning_resources} == "true" ]]; then 233 | echo 234 | echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m" 235 | echo "Please follow the instructions to increase amount of resources available:" 236 | echo " https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin" 237 | echo 238 | fi 239 | mkdir -p /sources/logs /sources/dags /sources/plugins 240 | chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins} 241 | exec /entrypoint airflow version 242 | # yamllint enable rule:line-length 243 | environment: 244 | <<: *airflow-common-env 245 | _AIRFLOW_DB_UPGRADE: 'true' 246 | _AIRFLOW_WWW_USER_CREATE: 'true' 247 | _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow} 248 | _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow} 249 | _PIP_ADDITIONAL_REQUIREMENTS: '' 250 | user: "0:0" 251 | volumes: 252 | - .:/sources 253 | 254 | airflow-cli: 255 | <<: *airflow-common 256 | profiles: 257 | - debug 258 | environment: 259 | <<: *airflow-common-env 260 | CONNECTION_CHECK_MAX_COUNT: "0" 261 | # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252 262 | command: 263 | - bash 264 | - -c 265 | - airflow 266 | 267 | # You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up 268 | # or by explicitly targeted on the command line e.g. docker-compose up flower. 269 | # See: https://docs.docker.com/compose/profiles/ 270 | flower: 271 | <<: *airflow-common 272 | command: celery flower 273 | profiles: 274 | - flower 275 | ports: 276 | - 5555:5555 277 | healthcheck: 278 | test: ["CMD", "curl", "--fail", "http://localhost:5555/"] 279 | interval: 10s 280 | timeout: 10s 281 | retries: 5 282 | restart: always 283 | depends_on: 284 | <<: *airflow-common-depends-on 285 | airflow-init: 286 | condition: service_completed_successfully 287 | 288 | volumes: 289 | postgres-db-volume: 290 | -------------------------------------------------------------------------------- /extra/airflow_graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/extra/airflow_graph.png -------------------------------------------------------------------------------- /extra/bigquery_schema.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "name": "title", 4 | "mode": "NULLABLE", 5 | "type": "STRING", 6 | "fields": [] 7 | }, 8 | { 9 | "name": "company_name", 10 | "mode": "NULLABLE", 11 | "type": "STRING", 12 | "fields": [] 13 | }, 14 | { 15 | "name": "location", 16 | "mode": "NULLABLE", 17 | "type": "STRING", 18 | "fields": [] 19 | }, 20 | { 21 | "name": "via", 22 | "mode": "NULLABLE", 23 | "type": "STRING", 24 | "fields": [] 25 | }, 26 | { 27 | "name": "description", 28 | "mode": "NULLABLE", 29 | "type": "STRING", 30 | "fields": [] 31 | }, 32 | { 33 | "name": "extensions", 34 | "mode": "REPEATED", 35 | "type": "STRING", 36 | "fields": [] 37 | }, 38 | { 39 | "name": "job_id", 40 | "mode": "NULLABLE", 41 | "type": "STRING", 42 | "fields": [] 43 | }, 44 | { 45 | "name": "thumbnail", 46 | "mode": "NULLABLE", 47 | "type": "STRING", 48 | "fields": [] 49 | }, 50 | { 51 | "name": "posted_at", 52 | "mode": "NULLABLE", 53 | "type": "STRING", 54 | 
"fields": [] 55 | }, 56 | { 57 | "name": "schedule_type", 58 | "mode": "NULLABLE", 59 | "type": "STRING", 60 | "fields": [] 61 | }, 62 | { 63 | "name": "work_from_home", 64 | "mode": "NULLABLE", 65 | "type": "BOOLEAN", 66 | "fields": [] 67 | }, 68 | { 69 | "name": "salary", 70 | "mode": "NULLABLE", 71 | "type": "STRING", 72 | "fields": [] 73 | }, 74 | { 75 | "name": "search_term", 76 | "mode": "NULLABLE", 77 | "type": "STRING", 78 | "fields": [] 79 | }, 80 | { 81 | "name": "date_time", 82 | "mode": "NULLABLE", 83 | "type": "DATETIME", 84 | "fields": [] 85 | }, 86 | { 87 | "name": "search_location", 88 | "mode": "NULLABLE", 89 | "type": "STRING", 90 | "fields": [] 91 | }, 92 | { 93 | "name": "commute_time", 94 | "mode": "NULLABLE", 95 | "type": "STRING", 96 | "fields": [] 97 | } 98 | ] -------------------------------------------------------------------------------- /extra/dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/extra/dashboard.png -------------------------------------------------------------------------------- /extra/dataproc_files/README.md: -------------------------------------------------------------------------------- 1 | # Dataproc Python Files 2 | 3 | Note: These files are included for example purposes only and may or may not match the current versions of the files in the Dataproc cluster. 4 | 5 | --- 6 | ### Spark Cluster Creation 7 | 8 | ``` 9 | REGION=us-central1 10 | ZONE=us-central1-a 11 | CLUSTER_NAME=spark-cluster 12 | BUCKET_NAME=dataproc-cluster-gsearch 13 | 14 | gcloud dataproc clusters create ${CLUSTER_NAME} \ 15 | --enable-component-gateway \ 16 | --region ${REGION} \ 17 | --zone ${ZONE} \ 18 | --bucket ${BUCKET_NAME} \ 19 | --master-machine-type n2-standard-2 \ 20 | --master-boot-disk-size 500 \ 21 | --num-workers 2 \ 22 | --worker-machine-type n2-standard-2 \ 23 | --worker-boot-disk-size 500 \ 24 | --image-version 1.5-debian10 \ 25 | --optional-components ANACONDA,JUPYTER \ 26 | --project job-listings-366015 \ 27 | --properties=^#^spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.6,com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.27.1 \ 28 | --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display' \ 29 | --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh 30 | ``` -------------------------------------------------------------------------------- /extra/dataproc_files/Salary Table.py: -------------------------------------------------------------------------------- 1 | from google.cloud import bigquery 2 | from google.cloud import storage 3 | 4 | import sparknlp 5 | from sparknlp.annotator import Lemmatizer, Stemmer, Tokenizer, Normalizer, TextMatcher, DocumentNormalizer 6 | from sparknlp.base import DocumentAssembler, Finisher 7 | 8 | from pyspark.sql import SparkSession 9 | from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF 10 | from pyspark.ml.clustering import LDA 11 | from pyspark.sql.functions import split, regexp_replace, when, lower, array_distinct, monotonically_increasing_id, col, size 12 | from pyspark.ml import Pipeline 13 | 14 | import pandas as pd 15 | 16 | # Start Spark Session 17 | spark = SparkSession \ 18 | .builder \ 19 | .appName('BigQuery Storage & Spark DataFrames') \ 20 | .getOrCreate() 21 | 22 | # BigQuery Setup 23 | dataset = 'gsearch_job_listings_clean' 24 | table = 
'job-listings-366015.gsearch_job_listings_clean.gsearch_jobs_fact' 25 | BUCKET_NAME = 'dataproc-cluster-gsearch' 26 | # need to read from the table with SQL 27 | spark.conf.set("viewsEnabled","true") 28 | spark.conf.set("materializationDataset", dataset) 29 | # need to set temp bucket for writing 30 | spark.conf.set("temporaryGcsBucket", BUCKET_NAME) 31 | 32 | # Salary Cleanup 33 | sql = """ 34 | SELECT job_id, job_salary 35 | FROM `job-listings-366015.gsearch_job_listings_clean.gsearch_jobs_fact` 36 | WHERE job_salary IS NOT null 37 | """ 38 | salary = spark.read \ 39 | .format("bigquery") \ 40 | .load(sql) 41 | # drop duplicates so the final dimension table has only one row per job_id to match on 42 | salary = salary.dropDuplicates(['job_id']) 43 | 44 | salary_clean = salary 45 | 46 | # split the salary string on ' ' into the pay amount and the pay rate 47 | split_col = split(salary_clean.job_salary, ' ',) 48 | salary_clean = salary_clean.withColumn('salary_pay', split_col.getItem(0))\ 49 | .withColumn('salary_rate', split_col.getItem(2))\ 50 | .drop('job_salary') 51 | 52 | # remove commas, dollar signs, and spaces 53 | salary_clean = salary_clean.withColumn('salary_pay', regexp_replace('salary_pay', ',', '')) 54 | salary_clean = salary_clean.withColumn('salary_pay', regexp_replace('salary_pay', '\\$', ''))  # '$' is escaped because regexp_replace treats a bare '$' as a regex anchor 55 | salary_clean = salary_clean.withColumn('salary_pay', regexp_replace('salary_pay', ' ', '')) 56 | 57 | # start creating 'salary_avg' from rows whose pay has no '–' (a single value rather than a range) 58 | # The character U+2013 "–" could be confused with the character U+002d "-", which is more common in source code. 59 | salary_clean = salary_clean.withColumn( 60 | 'salary_avg', 61 | when(salary_clean.salary_pay.contains("–"), None)\ 62 | .otherwise(salary_clean.salary_pay)) 63 | 64 | # create 'salary_min' & 'salary_max' columns for cleaning the ranged values 65 | salary_clean = salary_clean.withColumn( 66 | 'salary_', 67 | when(salary_clean.salary_pay.contains("–"), salary_clean.salary_pay)\ 68 | .otherwise(None)) 69 | split_col = split(salary_clean.salary_, "–",) 70 | salary_clean = salary_clean.withColumn('salary_min', split_col.getItem(0))\ 71 | .withColumn('salary_max', split_col.getItem(1))\ 72 | .drop('salary_') 73 | 74 | # remove 'K' and multiply those values by 1000 75 | for column in ['salary_avg', 'salary_min', 'salary_max']: 76 | salary_clean = salary_clean.withColumn( 77 | column, 78 | when(salary_clean[column].contains('K'), regexp_replace(column, 'K', '').cast('float')*1000)\ 79 | .otherwise(salary_clean[column])) 80 | salary_clean = salary_clean.withColumn(column,salary_clean[column].cast('float')) 81 | 82 | # update 'salary_avg' column to take the average of 'salary_min' and 'salary_max' 83 | salary_clean = salary_clean.withColumn( 84 | 'salary_avg', 85 | when(salary_clean.salary_min.isNotNull(), 86 | (salary_clean.salary_min + salary_clean.salary_max)/2)\ 87 | .otherwise(salary_clean.salary_avg)) 88 | salary_clean = salary_clean.withColumn('salary_avg',salary_clean.salary_avg.cast('float')) 89 | 90 | for rate in ['year', 'hour']: 91 | salary_clean = salary_clean.withColumn( 92 | 'salary_'+rate, 93 | when(salary_clean.salary_rate.contains(rate), salary_clean.salary_avg)\ 94 | .otherwise(None)) 95 | # salary_clean = salary_clean.withColumn('salary_'+rate,salary_clean['salary_'+rate].cast('float')) 96 | 97 | print("Number of jobs with salary: ", salary_clean.count()) 98 | salary_clean.printSchema() 99 | 100 | # Write to BigQuery 101 | salary_clean.write.format('bigquery') \ 102 | .option('table', 'job-listings-366015.gsearch_job_listings_clean.gsearch_salary') \ 103 | .option('materializationExpirationTimeInMinutes', 
60) \ 104 | .mode('overwrite') \ 105 | .save() -------------------------------------------------------------------------------- /extra/dataproc_files/Skill Table.py: -------------------------------------------------------------------------------- 1 | from google.cloud import bigquery 2 | from google.cloud import storage 3 | 4 | import sparknlp 5 | from sparknlp.annotator import Lemmatizer, Stemmer, Tokenizer, Normalizer, TextMatcher, DocumentNormalizer 6 | from sparknlp.base import DocumentAssembler, Finisher 7 | 8 | from pyspark.sql import SparkSession 9 | from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF 10 | from pyspark.ml.clustering import LDA 11 | from pyspark.sql.functions import split, regexp_replace, when, lower, array_distinct, monotonically_increasing_id, col, size 12 | from pyspark.ml import Pipeline 13 | 14 | import pandas as pd 15 | 16 | """ 17 | Start Spark Session 18 | """ 19 | spark = SparkSession \ 20 | .builder \ 21 | .appName('BigQuery Storage & Spark DataFrames') \ 22 | .getOrCreate() 23 | 24 | 25 | """ 26 | #BigQuery Setup 27 | """ 28 | dataset = 'gsearch_job_listings_clean' 29 | table = 'job-listings-366015.gsearch_job_listings_clean.gsearch_jobs_fact' 30 | BUCKET_NAME = 'dataproc-cluster-gsearch' 31 | # need to read with SQL from table 32 | spark.conf.set("viewsEnabled","true") 33 | spark.conf.set("materializationDataset", dataset) 34 | # need to set temp bucket for writing 35 | spark.conf.set("temporaryGcsBucket", BUCKET_NAME) 36 | 37 | """ 38 | Keywords Setup 39 | """ 40 | # Keywords from hand picking from data and Stack Overflow survey https://survey.stackoverflow.co/2022/#technology-most-popular-technologies 41 | keywords_programming = [ 42 | 'sql', 'python', 'r', 'c', 'c#', 'javascript', 'java', 'scala', 'sas', 'matlab', 43 | 'c++', 'perl', 'go', 'typescript', 'bash', 'html', 'css', 'php', 'powershell', 'rust', 44 | 'kotlin', 'ruby', 'dart', 'assembly', 'swift', 'vba', 'lua', 'groovy', 'delphi', 'objective-c', 45 | 'haskell', 'elixir', 'julia', 'clojure', 'solidity', 'lisp', 'f#', 'fortran', 'erlang', 'apl', 46 | 'cobol', 'ocaml', 'crystal', 'golang', 'nosql', 'mongodb', 't-sql', 'no-sql', 47 | 'pascal', 'mongo', 'sass', 'vb.net', 'shell', 'visual basic', 48 | ] 49 | # 'js', 'c/c++', 'pl/sql', 'javascript/typescript', 'visualbasic', 'objective c', 50 | 51 | keywords_databases = [ 52 | 'mysql', 'sql server', 'postgresql', 'sqlite', 'mongodb', 'redis', 'mariadb', 53 | 'elasticsearch', 'firebase', 'dynamodb', 'firestore', 'cassandra', 'neo4j', 'db2', 54 | 'couchbase', 'couchdb', 55 | ] 56 | # 'mssql', 'sqlserver', 'postgres', 57 | 58 | keywords_cloud = [ 59 | 'aws', 'azure', 'gcp', 'firebase', 'heroku', 'digitalocean', 'vmware', 'managedhosting', 60 | 'linode', 'ovh', 'oracle', 'openstack', 'watson', 'colocation', 61 | 'snowflake', 'redshift', 'bigquery', 'aurora', 'databricks', 'ibm cloud', 62 | ] 63 | # 'googlecloud', 'google cloud', 'oraclecloud', 'oracle cloud' 'amazonweb', 'amazon web', 'ibmcloud', 64 | 65 | keywords_libraries = [ 66 | 'scikit-learn', 'jupyter', 'theano', 'openCV', 'pyspark', 'nltk', 'mlpack', 'chainer', 'fann', 'shogun', 67 | 'dlib', 'mxnet', 'keras', '.net', 'numpy', 'pandas', 'matplotlib', 'spring', 'tensorflow', 'flutter', 68 | 'react', 'kafka', 'electron', 'pytorch', 'qt', 'ionic', 'xamarin', 'spark', 'cordova', 'hadoop', 'gtx', 69 | 'capacitor', 'tidyverse', 'unoplatform', 'dplyr', 'tidyr', 'ggplot2', 'plotly', 'rshiny', 'mlr', 70 | 'airflow', 'seaborn', 'gdpr', 'graphql', 'selenium', 'hugging face', 'uno platform' 
71 | 72 | ] 73 | # 'huggingface', 74 | 75 | keywords_webframeworks = [ 76 | 'node.js', 'vue', 'vue.js', 'ember.js', 'node', 'jquery', 'asp.net', 'react.js', 'express', 77 | 'angular', 'asp.netcore', 'django', 'flask', 'next.js', 'laravel', 'angular.js', 'fastapi', 'ruby', 78 | 'svelte', 'blazor', 'nuxt.js', 'symfony', 'gatsby', 'drupal', 'phoenix', 'fastify', 'deno', 79 | 'asp.net core', 'ruby on rails', 'play framework', 80 | ] 81 | # 'jse/jee', 'rubyonrails', 'playframework', 82 | 83 | keywords_os = [ 84 | 'unix', 'linux', 'windows', 'macos', 'wsl', 'ubuntu', 'centos', 'debian', 'redhat', 85 | 'suse', 'fedora', 'kali', 'arch', 86 | ] 87 | # 'unix/linux', 'linux/unix', 88 | 89 | keywords_analyst_tools = [ 90 | 'excel', 'tableau', 'word', 'powerpoint', 'looker', 'power bi', 'outlook', 'sas', 'sharepoint', 'visio', 91 | 'spreadsheet', 'alteryx', 'ssis', 'spss', 'ssrs', 'microstrategy', 'cognos', 'dax', 92 | 'esquisse', 'sap', 'splunk', 'qlik', 'nuix', 'datarobot', 'ms access', 'sheets', 93 | ] 94 | # 'powerbi', 'powerpoints', 'spreadsheets', 95 | 96 | keywords_other = [ 97 | 'npm', 'docker', 'yarn', 'homebrew', 'kubernetes', 'terraform', 'unity', 'ansible', 'unreal', 'puppet', 98 | 'chef', 'pulumi', 'flow', 'git', 'svn', 'gitlab', 'github', 'jenkins', 'bitbucket', 'terminal', 'atlassian', 99 | 'codecommit', 100 | ] 101 | 102 | keywords_async = [ 103 | 'jira', 'confluence', 'trello', 'notion', 'asana', 'clickup', 'planner', 'monday.com', 'airtable', 'smartsheet', 104 | 'wrike', 'workfront', 'dingtalk', 'swit', 'workzone', 'projectplace', 'cerri', 'wimi', 'leankor', 'microsoft lists' 105 | ] 106 | # 'microsoftlists', 107 | 108 | keywords_sync = [ 109 | 'slack', 'microsoft teams', 'twilio', 'zoom', 'webex', 'mattermost', 'rocketchat', 'ringcentral', 110 | 'symphony', 'wire', 'wickr', 'unify', 'coolfire', 'google chat', 111 | ] 112 | 113 | # keywords_skills = [ 114 | # 'coding', 'server', 'database', 'cloud', 'warehousing', 'scrum', 'devops', 'programming', 'saas', 'ci/cd', 'cicd', 115 | # 'ml', 'data_lake', 'frontend',' front-end', 'back-end', 'backend', 'json', 'xml', 'ios', 'kanban', 'nlp', 116 | # 'iot', 'codebase', 'agile/scrum', 'agile', 'ai/ml', 'ai', 'paas', 'machine_learning', 'macros', 'iaas', 117 | # 'fullstack', 'dataops', 'scrum/agile', 'ssas', 'mlops', 'debug', 'etl', 'a/b', 'slack', 'erp', 'oop', 118 | # 'object-oriented', 'etl/elt', 'elt', 'dashboarding', 'big-data', 'twilio', 'ui/ux', 'ux/ui', 'vlookup', 119 | # 'crossover', 'data_lake', 'data_lakes', 'bi', 120 | # ] 121 | 122 | # put all keywords in a dict 123 | keywords_dict = { 124 | 'keywords_programming': keywords_programming, 'keywords_databases': keywords_databases, 'keywords_cloud': keywords_cloud, 125 | 'keywords_libraries': keywords_libraries, 'keywords_webframeworks': keywords_webframeworks, 'keywords_os': keywords_os, 126 | 'keywords_analyst_tools': keywords_analyst_tools, 'keywords_other': keywords_other, 'keywords_async': keywords_async, 127 | 'keywords_sync': keywords_sync 128 | } 129 | 130 | # create a list of all keywords 131 | keywords_all = [item for sublist in keywords_dict.values() for item in sublist] 132 | # add keywords_all to dict 133 | keywords_dict['keywords_all'] = keywords_all 134 | 135 | """ 136 | Keywords Save 137 | """ 138 | # write keywords to google storage bucket 139 | client = storage.Client() 140 | bucket = client.get_bucket(BUCKET_NAME) 141 | 142 | # save all keywords variations to different files 143 | for key, value in keywords_dict.items(): 144 | keywords_df = pd.DataFrame(value, 
index=None) 145 | bucket.blob(f'notebooks/jupyter/keywords/{key}.txt').upload_from_string(keywords_df.to_csv(index=False, header=False) , content_type='text/csv') 146 | 147 | """ 148 | Create Keywords Dataframe 149 | """ 150 | # using SQL to read data to ensure JSON values not selected (throws error) 151 | # delete where statement to process all jobs over again 152 | sql = """ 153 | SELECT job_id, job_description 154 | FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_jobs_fact 155 | WHERE job_id NOT IN 156 | (SELECT job_id FROM `job-listings-366015`.gsearch_job_listings_clean.gsearch_skills) 157 | """ 158 | skills = spark.read \ 159 | .format("bigquery") \ 160 | .load(sql) 161 | 162 | # drop duplicates so when create final dimension table only one job_id to match on 163 | skills = skills.dropDuplicates(['job_id']) 164 | 165 | # lowercase the description here for pre-cleanup before model (couldn't get to work with SparkNLP lib) 166 | skills = skills.withColumn('job_description', lower(skills.job_description)) 167 | 168 | # make final dataframe to append to 169 | skills_final = skills 170 | skills_final = skills_final.withColumn("id", monotonically_increasing_id()) 171 | skills_final = skills_final.alias('skills_final') 172 | 173 | for keyword_name, keyword_list in keywords_dict.items(): 174 | 175 | # Makes document tokenizable 176 | document_assembler = DocumentAssembler() \ 177 | .setInputCol("job_description") \ 178 | .setOutputCol("document") 179 | 180 | # Capture multi-words as one word 181 | # can't have space in between or won't pick up multiple word 182 | multi_words = [ 183 | 'power_bi', 'sql_server', 'google_cloud', 'visual_basic', 'oracle_cloud', 184 | 'ibm_cloud', 'hugging_face', 'uno_platform', 'microsoft_lists', 'ms_access', 185 | 'microsoft_teams', 'google_chat', 'asp.net_core', 'ruby_on_rails', 'play_framework', 186 | ] 187 | # 'objective_c', 'amazon_web' 188 | 189 | # Tokenizes document with multi_word exceptions 190 | tokenizer = Tokenizer() \ 191 | .setInputCols(["document"]) \ 192 | .setOutputCol("token") \ 193 | .setCaseSensitiveExceptions(False) \ 194 | .setExceptions(multi_words) 195 | 196 | # Get all the keywords we defined above 197 | keywords_file = f'gs://dataproc-cluster-gsearch/notebooks/jupyter/keywords/{keyword_name}.txt' 198 | keywords_all = TextMatcher() \ 199 | .setInputCols(["document", "token"])\ 200 | .setOutputCol("matcher")\ 201 | .setEntities(keywords_file)\ 202 | .setCaseSensitive(False)\ 203 | .setEntityValue(keyword_name) 204 | 205 | # Make output machine readable 206 | finisher = Finisher() \ 207 | .setInputCols(["matcher"]) \ 208 | .setOutputCols([keyword_name]) \ 209 | .setValueSplitSymbol(" ") 210 | 211 | pipeline = Pipeline( 212 | stages = [ 213 | document_assembler, 214 | tokenizer, 215 | keywords_all, 216 | finisher, 217 | ] 218 | ) 219 | 220 | # Fit the data to the model. 
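    # Expectation (illustrative, based on the TextMatcher/Finisher config above): after the
    # fit/transform below, each category column (e.g. keywords_programming) should hold an array
    # of the matched terms from the job description, roughly ['python', 'sql'] for a posting that
    # mentions both. The array_distinct/size steps further down deduplicate matches and null out
    # rows with no hits. A quick sanity check once `model` exists:
    #   model.select('job_id', keyword_name).show(5, truncate=False)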
221 | model = pipeline.fit(skills).transform(skills) 222 | 223 | # Remove duplicate tokens, was unable to figure out how to do this in the model 224 | model = model.withColumn(keyword_name, array_distinct(keyword_name)) 225 | 226 | from pyspark.sql.types import StringType, ArrayType 227 | model = model.withColumn(keyword_name, when(size(col(keyword_name))==0 , None).otherwise(col(keyword_name))) 228 | 229 | # Drop unnecessary columns prior to join to final df 230 | model = model.drop('job_id', 'job_description') 231 | 232 | # Assign index number for joining 233 | model = model.withColumn("id", monotonically_increasing_id()) 234 | 235 | # Join model df with the final df 236 | skills_final = skills_final.join(model, ['id']) 237 | 238 | # No longer need id column 239 | skills_final = skills_final.drop('id') 240 | print("Number of jobs with skills: ", skills_final.count()) 241 | 242 | """ 243 | Write to BigQuery 244 | """ 245 | skills_final.write.format('bigquery') \ 246 | .option('table', 'job-listings-366015.gsearch_job_listings_clean.gsearch_skills') \ 247 | .option('materializationExpirationTimeInMinutes', 180) \ 248 | .mode('append') \ 249 | .save() 250 | # 'overwrite' or 'append' -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/All Keywords.txt: -------------------------------------------------------------------------------- 1 | sql 2 | python 3 | r 4 | c 5 | c# 6 | javascript 7 | java 8 | scala 9 | sas 10 | matlab 11 | c++ 12 | perl 13 | go 14 | typescript 15 | bash 16 | html 17 | css 18 | php 19 | powershell 20 | rust 21 | kotlin 22 | ruby 23 | dart 24 | assembly 25 | swift 26 | vba 27 | lua 28 | groovy 29 | delphi 30 | objective-c 31 | haskell 32 | elixir 33 | julia 34 | clojure 35 | solidity 36 | lisp 37 | f# 38 | fortran 39 | erlang 40 | apl 41 | cobol 42 | ocaml 43 | crystal 44 | golang 45 | nosql 46 | mongodb 47 | t-sql 48 | no-sql 49 | pascal 50 | mongo 51 | sass 52 | vb.net 53 | shell 54 | visual basic 55 | mysql 56 | sql server 57 | postgresql 58 | sqlite 59 | mongodb 60 | redis 61 | mariadb 62 | elasticsearch 63 | firebase 64 | dynamodb 65 | firestore 66 | cassandra 67 | neo4j 68 | db2 69 | couchbase 70 | couchdb 71 | aws 72 | azure 73 | gcp 74 | firebase 75 | heroku 76 | digitalocean 77 | vmware 78 | managedhosting 79 | linode 80 | ovh 81 | oracle 82 | openstack 83 | watson 84 | colocation 85 | snowflake 86 | redshift 87 | bigquery 88 | aurora 89 | databricks 90 | ibm cloud 91 | scikit-learn 92 | jupyter 93 | theano 94 | openCV 95 | pyspark 96 | nltk 97 | mlpack 98 | chainer 99 | fann 100 | shogun 101 | dlib 102 | mxnet 103 | keras 104 | .net 105 | numpy 106 | pandas 107 | matplotlib 108 | spring 109 | tensorflow 110 | flutter 111 | react 112 | kafka 113 | electron 114 | pytorch 115 | qt 116 | ionic 117 | xamarin 118 | spark 119 | cordova 120 | hadoop 121 | gtx 122 | capacitor 123 | tidyverse 124 | unoplatform 125 | dplyr 126 | tidyr 127 | ggplot2 128 | plotly 129 | rshiny 130 | mlr 131 | airflow 132 | seaborn 133 | gdpr 134 | graphql 135 | selenium 136 | hugging face 137 | uno platform 138 | node.js 139 | vue 140 | vue.js 141 | ember.js 142 | node 143 | jquery 144 | asp.net 145 | react.js 146 | express 147 | angular 148 | asp.netcore 149 | django 150 | flask 151 | next.js 152 | laravel 153 | angular.js 154 | fastapi 155 | ruby 156 | svelte 157 | blazor 158 | nuxt.js 159 | symfony 160 | gatsby 161 | drupal 162 | phoenix 163 | fastify 164 | deno 165 | asp.net core 166 | ruby on rails 167 | play framework 168 | unix 
169 | linux 170 | windows 171 | macos 172 | wsl 173 | ubuntu 174 | centos 175 | debian 176 | redhat 177 | suse 178 | fedora 179 | kali 180 | arch 181 | excel 182 | tableau 183 | word 184 | powerpoint 185 | looker 186 | power bi 187 | outlook 188 | sas 189 | sharepoint 190 | visio 191 | spreadsheet 192 | alteryx 193 | ssis 194 | spss 195 | ssrs 196 | microstrategy 197 | cognos 198 | dax 199 | esquisse 200 | sap 201 | splunk 202 | qlik 203 | nuix 204 | datarobot 205 | ms access 206 | sheets 207 | npm 208 | docker 209 | yarn 210 | homebrew 211 | kubernetes 212 | terraform 213 | unity 214 | ansible 215 | unreal 216 | puppet 217 | chef 218 | pulumi 219 | flow 220 | git 221 | svn 222 | gitlab 223 | github 224 | jenkins 225 | bitbucket 226 | terminal 227 | atlassian 228 | codecommit 229 | jira 230 | confluence 231 | trello 232 | notion 233 | asana 234 | clickup 235 | planner 236 | monday.com 237 | airtable 238 | smartsheet 239 | wrike 240 | workfront 241 | dingtalk 242 | swit 243 | workzone 244 | projectplace 245 | cerri 246 | wimi 247 | leankor 248 | microsoft lists 249 | slack 250 | microsoft teams 251 | twilio 252 | zoom 253 | webex 254 | mattermost 255 | rocketchat 256 | ringcentral 257 | symphony 258 | wire 259 | wickr 260 | unify 261 | coolfire 262 | google chat 263 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Analyst Tools.txt: -------------------------------------------------------------------------------- 1 | excel 2 | tableau 3 | word 4 | powerpoint 5 | looker 6 | power bi 7 | outlook 8 | sas 9 | sharepoint 10 | visio 11 | spreadsheet 12 | alteryx 13 | ssis 14 | spss 15 | ssrs 16 | microstrategy 17 | cognos 18 | dax 19 | esquisse 20 | sap 21 | splunk 22 | qlik 23 | nuix 24 | datarobot 25 | ms access 26 | sheets 27 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Cloud.txt: -------------------------------------------------------------------------------- 1 | aws 2 | azure 3 | gcp 4 | firebase 5 | heroku 6 | digitalocean 7 | vmware 8 | managedhosting 9 | linode 10 | ovh 11 | oracle 12 | openstack 13 | watson 14 | colocation 15 | snowflake 16 | redshift 17 | bigquery 18 | aurora 19 | databricks 20 | ibm cloud 21 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Databases.txt: -------------------------------------------------------------------------------- 1 | mysql 2 | sql server 3 | postgresql 4 | sqlite 5 | mongodb 6 | redis 7 | mariadb 8 | elasticsearch 9 | firebase 10 | dynamodb 11 | firestore 12 | cassandra 13 | neo4j 14 | db2 15 | couchbase 16 | couchdb 17 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Libraries.txt: -------------------------------------------------------------------------------- 1 | scikit-learn 2 | jupyter 3 | theano 4 | openCV 5 | pyspark 6 | nltk 7 | mlpack 8 | chainer 9 | fann 10 | shogun 11 | dlib 12 | mxnet 13 | keras 14 | .net 15 | numpy 16 | pandas 17 | matplotlib 18 | spring 19 | tensorflow 20 | flutter 21 | react 22 | kafka 23 | electron 24 | pytorch 25 | qt 26 | ionic 27 | xamarin 28 | spark 29 | cordova 30 | hadoop 31 | gtx 32 | capacitor 33 | tidyverse 34 | unoplatform 35 | dplyr 36 | tidyr 37 | ggplot2 38 | plotly 39 | rshiny 40 | mlr 41 | airflow 42 | seaborn 43 | gdpr 44 | graphql 45 | selenium 46 | hugging face 47 | uno platform 48 | 
-------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords OS.txt: -------------------------------------------------------------------------------- 1 | unix 2 | linux 3 | windows 4 | macos 5 | wsl 6 | ubuntu 7 | centos 8 | debian 9 | redhat 10 | suse 11 | fedora 12 | kali 13 | arch 14 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Other.txt: -------------------------------------------------------------------------------- 1 | npm 2 | docker 3 | yarn 4 | homebrew 5 | kubernetes 6 | terraform 7 | unity 8 | ansible 9 | unreal 10 | puppet 11 | chef 12 | pulumi 13 | flow 14 | git 15 | svn 16 | gitlab 17 | github 18 | jenkins 19 | bitbucket 20 | terminal 21 | atlassian 22 | codecommit 23 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords Sync.txt: -------------------------------------------------------------------------------- 1 | slack 2 | microsoft teams 3 | twilio 4 | zoom 5 | webex 6 | mattermost 7 | rocketchat 8 | ringcentral 9 | symphony 10 | wire 11 | wickr 12 | unify 13 | coolfire 14 | google chat 15 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Keywords async.txt: -------------------------------------------------------------------------------- 1 | jira 2 | confluence 3 | trello 4 | notion 5 | asana 6 | clickup 7 | planner 8 | monday.com 9 | airtable 10 | smartsheet 11 | wrike 12 | workfront 13 | dingtalk 14 | swit 15 | workzone 16 | projectplace 17 | cerri 18 | wimi 19 | leankor 20 | microsoft lists 21 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Programming Keywords.txt: -------------------------------------------------------------------------------- 1 | sql 2 | python 3 | r 4 | c 5 | c# 6 | javascript 7 | java 8 | scala 9 | sas 10 | matlab 11 | c++ 12 | perl 13 | go 14 | typescript 15 | bash 16 | html 17 | css 18 | php 19 | powershell 20 | rust 21 | kotlin 22 | ruby 23 | dart 24 | assembly 25 | swift 26 | vba 27 | lua 28 | groovy 29 | delphi 30 | objective-c 31 | haskell 32 | elixir 33 | julia 34 | clojure 35 | solidity 36 | lisp 37 | f# 38 | fortran 39 | erlang 40 | apl 41 | cobol 42 | ocaml 43 | crystal 44 | golang 45 | nosql 46 | mongodb 47 | t-sql 48 | no-sql 49 | pascal 50 | mongo 51 | sass 52 | vb.net 53 | shell 54 | visual basic 55 | -------------------------------------------------------------------------------- /extra/dataproc_files/keywords/Web Frameworks Keywords.txt: -------------------------------------------------------------------------------- 1 | node.js 2 | vue 3 | vue.js 4 | ember.js 5 | node 6 | jquery 7 | asp.net 8 | react.js 9 | express 10 | angular 11 | asp.netcore 12 | django 13 | flask 14 | next.js 15 | laravel 16 | angular.js 17 | fastapi 18 | ruby 19 | svelte 20 | blazor 21 | nuxt.js 22 | symfony 23 | gatsby 24 | drupal 25 | phoenix 26 | fastify 27 | deno 28 | asp.net core 29 | ruby on rails 30 | play framework 31 | -------------------------------------------------------------------------------- /logs/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/logs/.gitkeep -------------------------------------------------------------------------------- /plugins/.gitkeep: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/lukebarousse/Data_Job_Pipeline_Airflow/b69a5609c8fcd40c11b2fb6eb06123ca414dccdd/plugins/.gitkeep -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | pandas 3 | matplotlib 4 | configparser 5 | google-search-results 6 | google-cloud-bigquery 7 | google-cloud-storage 8 | google-api-python-client 9 | google-auth-oauthlib 10 | google-auth-httplib2 11 | torch 12 | torchvision 13 | transformers 14 | pandas-gbq --------------------------------------------------------------------------------
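For a quick end-to-end check of what this pipeline produces, here is a minimal sketch (not part of the repository) that reads the `gsearch_salary_wide` table created by the salary query above, using the `google-cloud-bigquery` package from requirements.txt; it assumes `GOOGLE_APPLICATION_CREDENTIALS` points at the service-account JSON referenced in docker-compose.yaml:
```
from google.cloud import bigquery

# Assumes the service-account key is available via GOOGLE_APPLICATION_CREDENTIALS
# and that gsearch_salary_wide has been built by the SQL earlier in this repo.
client = bigquery.Client(project="job-listings-366015")

query = """
    SELECT job_title, salary_avg, salary_year, search_country
    FROM `job-listings-366015.gsearch_job_listings_clean.gsearch_salary_wide`
    WHERE salary_year IS NOT NULL
    ORDER BY salary_avg DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.job_title, row.salary_avg, row.search_country)
```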