├── .gitignore
├── README.md
├── bigquery
│   ├── automating_gdelt_query_overwriting_1st_table_weekly.sql
│   ├── automating_gdelt_query_updating_dashboard_tables.sql
│   ├── gdelt_1st_query_themes_and_sentiment.sql
│   ├── gdelt_2nd_query_creating_tables.sql
│   ├── gdelt_list_themes.sql
│   ├── spanish_newspapers.py
│   └── tables_python_api.sql
├── cloud_function
│   ├── backup
│   │   └── data_pytrends_backup.csv
│   ├── main.py
│   ├── pytrends_request.py
│   ├── remove_files.py
│   ├── requirements.txt
│   └── upload_gcs_example.py
├── pics
│   ├── automated_dashboard_preview.gif
│   ├── budget_alert.png
│   ├── cloud_function_error.png
│   ├── gdelt_sketch.png
│   ├── interactive_dashboard.png
│   ├── list_of_scheduled_queries.png
│   ├── monguerteam.png
│   ├── mrdogscience.jpeg
│   ├── trends_example.png
│   ├── twitter_fees.png
│   └── updating_gdelt_tables.png
└── twitter_folder
    └── twitter_merging.py

/.gitignore:
--------------------------------------------------------------------------------
ipynb_checkpoints/
twitter_folder/.ipynb_checkpoints/
*.ipynb
.env
cloud_function_backup
tmp
output/
input/
*.json
*.zip
upload_gcs_real.py
spanish_newspapers_SourceCommonName.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Socioeconomic Portrait Project.

![alt](pics/automated_dashboard_preview.gif "looking good")

- The buttons above enable a better visualization: select the keywords you want to study and exclude the others:
![alt](pics/interactive_dashboard.png " ")

The final dashboard:
- https://datastudio.google.com/reporting/755f3183-dd44-4073-804e-9f7d3d993315


-----------------------------

Motivation:

Is there a way of monitoring some aspects of the global crisis in Spain? I believe so, and that is the motivation for developing this **automated** ETL process in Google Cloud, involving **Google Trends**, sentiment analysis and influence in the news through the **Gdelt Project**, and **Twitter**, from raw data acquisition to the final dashboard. Thanks to it, I have been fighting with credentials, permissions, storage locations, processing locations, 3rd-party authentications, Cloud Functions, pipelines, trigger schedulers with different time formats, Dataprep global updates, etc... And I learned a lot along the way. Quarantine fun! :D

Thanks to Patricia, who worked on very interesting methodologies with Twitter. Here you can find her content => https://medium.com/@patricrp/buscando-contenido-con-la-api-de-twitter-f3c12994a77f

-------------------------------

1. **Introduction**
2. **How to use**
3. **Results**
4. **Analyzing results**
5. **Conclusion**
6. **Further improvements**
7. **Documentation**

------------------------------

# 1. Introduction.
<details>
Click to expand

-----------------

Let's show a social point of view of the pandemic's impact in Spain, through an automated ETL involving the following tools in Google Cloud:

- Billing: create a maximum budget to avoid unexpected fees (10€, 20€, 300€... whatever suits your project).
- Cloud Function: Python script sending data to Cloud Storage.
    - Requests data from the Google Trends API.
- Pub/Sub topic: triggers (activates) the Cloud Function.
- Cloud Scheduler: schedules a Pub/Sub topic.
- Cloud Storage: Data Lake. Gets fed periodically by the Python script.
- Transfer: loads data periodically from Cloud Storage into BigQuery (previously done with Dataprep, but in September 2020 its pricing changes and it becomes unaffordable for any personal project).
- BigQuery: Data Warehouse.
    - Creates different tables with the transferred output.
    - Performs a weekly query modifying tables.
- Data Studio: dashboard of results, updated weekly, linked to the BigQuery tables.

The Python ETL pipeline is based on this article:

- **https://towardsdatascience.com/creation-of-an-etl-in-google-cloud-platform-for-automated-reporting-8a0309ee8a78**

The Gdelt queries are based on knowledge shared with the Data Team of Labelium España:

- **https://medium.com/@a.vargas.pina/biqquery-and-the-gdelt-project-beyond-dreams-of-marketing-analysts-62e586cc0343**

### **TROUBLESHOOTING TIME 1:** ***Golden rules for Cloud Services***:
<details>
Click to expand
---------------------------------

- **1. Your wallet is your sacred temple:** If you won't read all the Google documentation carefully, at least go to Billing and create a budget with alerts for your project, so that if you do something wrong you receive custom alerts before spending a lot without noticing. Don't end up living under a bridge.

- **2. Your sacred temple has no protection by default:** Budget alerts won't cap payments when the limit is reached. They just warn you, so you have time to turn everything off in case of panic. It can be configured to cap, though.
- **3. Worship location:** While working on the different stages of your project, pay attention to putting all your resources in the very same region. Beware of headaches if you don't.
- **4. Don't panic:** That's all.

</details>

--------------------

### Tools used to analyse the internet in Spain:
<details>
Click to expand
-----------------------------------

***Google Trends***:

- Python.
- Pytrends library.

Google Trends is a tool that analyses the popularity of top search queries in Google Search across various regions and languages. Basically, what people are searching for on Google.

Google Trends finds the maximum over the specified period, makes that maximum 100% of the Trend Index, and scales everything else in the requested dates relative to that top. If you request information weekly, you will have a point with a 100% Trend Index each week, regardless of how popular it actually is.

- If you request a list of elements, all elements will be scaled relative to the top one.

- If you request each of your keywords separately, each keyword will be scaled over time by its own top.

So basically there are 2 ways of using it: comparing the evolution in time of different keywords, or checking the evolution in time of each keyword separately. We took the 2nd path.

- César discovered how tricky Pytrends is. The idea of requesting everything at once, but each keyword separately, is his => https://www.linkedin.com/in/cesar-castanon/

-----------------------


***Twitter***:

We used Tweepy, the Python library for Twitter requests, to generate dataframes for each query. The built-in Cursor function helped to develop a function that creates a dataframe with all tweets from the last 7 days related to a word, recording date, user, id and text for each tweet. The search function includes a complex query, geolocation and language, to get only tweets from Spanish people.

We used the Twitter Standard API, the *freemium* service, so results were limited to the last 7 days and to a maximum of mixed results from popular and recent tweets.

This helped us learn what people were saying about unemployment and its effects during COVID-19.

In the twitter folder there is a merging-datasets script; you can find the Twitter requesting script and the whole process here:

- **https://medium.com/@patricrp/buscando-contenido-con-la-api-de-twitter-f3c12994a77f**


By the way, working *seriously* with Twitter is quite expensive, and we do love programming, but our budget, like our leisure time, is limited.

![alt](pics/twitter_fees.png " ")

-------------------------------


***Gdelt Project***:

- SQL.
- BigQuery.

The Gdelt Project is a database of all the news of the world, updated every 15 minutes. It also classifies the incoming data, so you can search for topics, themes, people, people related to them... It is impressive, and available in almost 70 languages via BigQuery.

- https://www.gdeltproject.org/

</details>
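To illustrate the "one request per keyword" idea described above, here is a minimal Pytrends sketch. It is only a sketch under assumptions: the keyword list, dates and locale below are illustrative, not the project's actual configuration (the real script lives in *cloud_function/pytrends_request.py*).

```python
from pytrends.request import TrendReq
import pandas as pd

keywords = ["desempleo", "teletrabajo"]  # hypothetical examples
pytrends = TrendReq(hl="es-ES", tz=360)

frames = []
for kw in keywords:
    # One payload per keyword, so each series is scaled by its own maximum (the "2nd path").
    pytrends.build_payload([kw], timeframe="2019-01-01 2020-06-01", geo="ES")
    df = pytrends.interest_over_time()
    if not df.empty:
        df = df.drop(columns=["isPartial"]).rename(columns={kw: "trend_index"})
        df["keyword"] = kw
        frames.append(df)

# One row per (date, keyword); each keyword is averaged by its own top.
result = pd.concat(frames)
```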

-------------------------------------------

We want to monitor what Spanish news says about certain topics, so the process is the following:

1. Create a dataset with all the Spanish newspapers (paper and digital versions). Taking a glance at *bigquery/spanish_newspapers.py* you can figure out the created dataset (>150 different media).

2. Make a BigQuery table with that dataset (see the sketch after this section).

3. Select the *themes* you want to monitor in the Gdelt database. In *bigquery/gdelt_list_themes.sql* you can find how to list all available themes.

4. Make a query filtering by your goals, contrasting against the created table of Spanish media and loading the data into a new table, the *main table*.

5. Divide the info of the *main table* into different tables, one per graph in Data Studio (to get a faster visualization in Data Studio).

6. Schedule the overwriting of the *main table*, to get periodical updates.

7. Schedule the appending of the new info to the dashboard tables.

Updating tables:
![alt](pics/updating_gdelt_tables.png)

List of scheduled queries (Google Trends included):
![alt](pics/list_of_scheduled_queries.png)

8. Plot it in Data Studio.

- Alex taught me how to UNNEST and EXTRACT. Kudos for the sentiment extraction => https://www.linkedin.com/in/alexmasip/

- Jesús gave me a hand with some final SQL statements :) => https://www.linkedin.com/in/jralvarez1986/

</details>
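As a minimal sketch of step 2, this is one way of turning the newspapers dataset into a BigQuery table from Python (dataset and table names are hypothetical; requires google-cloud-bigquery and pyarrow):

```python
import pandas as pd
from google.cloud import bigquery

# csv produced by bigquery/spanish_newspapers.py
newspapers = pd.read_csv("output/spanish_media.csv")

client = bigquery.Client()
job = client.load_table_from_dataframe(
    newspapers, "my-project.gdelt_info_filtering.spanish_newspapers"
)
job.result()  # block until the load job finishes
```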
172 | 173 | 174 | ------------------------------ 175 | ------------------------------- 176 | 177 | # 2. How to: 178 | 179 |
180 | Click to expand 181 | 182 | ------------------------ 183 | 184 | ### Billing. 185 | 186 |
Click to expand

------------------------
In the Cloud environment, select your project, go to Billing, and take a glance. Just set a weekly/monthly budget for your project; remember that it is linked to your bank account, and a wrong action can cost a lot of money if you don't pay attention.

In Notification Channels you can choose among a lot of options, whatever suits you.

![alt](pics/budget_alert.png " ")


</details>
198 | 199 | -------------------- 200 | 201 | 202 | ### Create a new project and get your credentials here: 203 | 204 |
205 | Click to expand 206 | 207 | ------------------------ 208 | 209 | https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts?_ga=2.187077160.1256540572.1587286598-806021438.1579453370&pli=1&supportedpurview=project 210 | 211 | - Service Account: Give it a nice name. 212 | - Role: Storage Object Admin, Cloud Function 213 | 214 |
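A sketch of using the exported JSON key explicitly from Python, instead of relying on the default credentials (the key path and project id are hypothetical):

```python
from google.oauth2 import service_account
from google.cloud import storage

credentials = service_account.Credentials.from_service_account_file("my-key.json")
client = storage.Client(project="my-project", credentials=credentials)
print([b.name for b in client.list_buckets()])  # quick sanity check
```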
215 | 216 | ---------------- 217 | 218 | ### Create a bucket in Cloud Storage 219 | 220 |
Click to expand

------------------------

https://console.cloud.google.com/storage/create-bucket?

- Configure it: due to our requirements we will use the **Europe multi-region** location. Be careful, location can give you a serious headache, mainly if you are working with BigQuery or with data from regions other than the one you selected. Always place your buckets in the same location as all the data sources you are using for the project.

</details>
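If you prefer creating the bucket from Python rather than the console, a minimal sketch (bucket and project names are hypothetical; note the explicit location):

```python
from google.cloud import storage

client = storage.Client(project="my-project")
# "EU" is the Europe multi-region; keep it consistent with the rest of the project.
bucket = client.create_bucket("my-trends-bucket", location="EU")
print(bucket.name, bucket.location)
```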
230 | 231 | ---------------- 232 | 233 | ### Python script 234 | 235 |
Click to expand

------------------------


*The Python script I'm using is available in the cloud_function folder*.

- Make sure it runs fine on your PC before uploading it to Cloud Functions. Debugging in the cloud is a hassle.

- If you are following my steps (gcsfs library) and a *NotFoundError* arises while trying to upload to Cloud Storage, try uploading the requested csv(s) to Cloud Storage manually the first time; after that, you can automate the reading/overwriting of these csv(s).


Pytrends (keywords without accents or capital letters; script available in the cloud_function folder). The chosen keywords for this project are in the ***cloud_function/pytrends_request.py*** file.


- Google Trends:
    - Has the option of selecting the category in which you want to find your keywords. We have our reasons not to include categories in the code.
    - **Important**: when the requested period is longer than a month, the information is delivered grouped by week. That means you need to choose carefully when you will start requesting.

It is important to clarify the second point: in my script, I was requesting information weekly, from 2019-01-01 to the last Sunday, and the script was activated every Monday. The problem is the following: 2019-01-01 was a Tuesday. Because the information was delivered grouped by week, it didn't reach each last Sunday, it only reached each last Tuesday, so I was always missing 5 days of valuable information, and at the end of the week it seemed like I had a 2-week delay for some unknown reason. (I realised it while having a shower :D)

- Moreover, instead of appending new information to our tables each week, we are overwriting them intentionally. This is due to how Google processes and delivers the information from Trends.

If you want another approach, this is an awesome tutorial:

- https://searchengineland.com/learn-how-to-chart-and-track-google-trends-in-data-studio-using-python-329119

</details>
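For the upload itself, a minimal sketch of writing a dataframe straight to the bucket: pandas delegates gs:// paths to the gcsfs library, assuming it is installed (the bucket name and dataframe are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"keyword": ["desempleo"], "date": ["2020-06-01"], "trend_index": [100]}
)
# Overwrites gs://my-trends-bucket/data_pytrends.csv on every run.
df.to_csv("gs://my-trends-bucket/data_pytrends.csv", index=False)
```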
264 | 265 | ### Deploy the scripts on Google Cloud Function: 266 | 267 |
Click to expand

------------------------

1. Create a function here:

- https://console.cloud.google.com/functions/add?

2. Trigger: Pub/Sub.
3. Give a name to your Pub/Sub topic and select it.
4. Runtime: Python 3.7.
5. Stage bucket: the bucket you created in GCS.
6. Advanced options. Region: select the same region as the created bucket in GCS (**this is very important**).
7. Service account: the same one you used to export your JSON credentials.
8. Advanced options. Environment variables: in the code, for private stuff like paths or keywords, instead of writing them, do something like (check the code). Then, in this section, write PROJECT_NAME and its real value, both without the quote symbols (" "). It looks like it does not work with os.dotenv. If you want to see how to use env variables => https://github.com/albertovpd/cloud_function_example_pytrends
9. Function to execute: main.

- You can measure on your PC how long your script takes and how much memory it uses, then use those measurements for the configuration (RAM and timeout). If you get errors (being completely sure your script works fine, and paying attention to the logs), raise them until it works.

Why the function to execute is **main**, and other questions about the script structure, can be answered here:

- https://github.com/albertovpd/pytrends_cloud_function_example

</details>
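A sketch of what the entry point of a Pub/Sub-triggered Cloud Function looks like, reading the environment variable set in step 8 (the body is a placeholder, not the project's actual main.py):

```python
import os

def main(event, context):
    """Triggered by the Pub/Sub topic; GCP passes the message and its metadata."""
    project = os.environ["PROJECT_NAME"]  # set in the console, without quotes
    # ...request Google Trends here and write the csv to Cloud Storage...
    print(f"Pytrends run for {project} finished")
```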
292 | 293 | ---------------------- 294 | 295 | 296 | ### **TROUBLESHOOTING TIME 2: CF problems** 297 | 298 |
Click to expand

------------------------

- If you can deploy your function perfectly, but you get the following error when testing it:

![alt](pics/cloud_function_error.png " ")

Try debugging the following:

1. When configuring the CF, use a large timeout and a lot of memory, in that order. If it works, start reducing those parameters until you reach the minimum that still works (if you are going to work with data that grows over time, you should foresee that before assigning the minimal capacity).

2. If nothing seems to work out, you should double-check the permissions associated with the used credentials/service accounts.

I have been stuck on this error for some time. Maybe some information found here can be useful:

- https://stackoverflow.com/questions/62064082/unknown-error-has-occurred-in-cloud-functions/62119865#62119865

- https://www.reddit.com/r/googlecloud/comments/gs2gd9/an_unknown_error_has_occurred_in_cloud_functions/

- https://www.reddit.com/r/cloudfunctions/comments/gu37ye/huge_amout_of_allocated_memory_needed_for/

- Another issue that looks like black magic to me at this time: *I am sending a csv from a dataframe to Cloud Storage. If there is no file in Storage with the very same name as the csv you are trying to send, you will get an error. It is as if, instead of sending a new file, you were overwriting an existing one (I am writing this warning before checking out what is going on; I do not have all the time I want for my projects, so this is a problem for my future self).*

</details>
324 | 325 | ------------------- 326 | 327 | 328 | ### Scheduling 329 | 330 |
Click to expand

------------------------

### Go to Cloud Scheduler to schedule your topic.


- https://console.cloud.google.com/cloudscheduler/appengine

1. Select the same region as everything before.
2. Follow the instructions and run your job to make sure it works :)
    - Be careful copy-pasting your Pub/Sub topic name: "trigger" and "trigger " are obviously not the same.

### Bigquery 1:

- Create a dataset with the same location as the rest of the project.
- Create a table with the elements of the Cloud Storage bucket (this will be updated with Transfer).

</details>
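To test the trigger without waiting for Cloud Scheduler, you can publish a message to the topic by hand; a sketch with hypothetical project and topic names:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-trigger-topic")
future = publisher.publish(topic_path, b"manual test run")
print(future.result())  # the message id, once the message is accepted
```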
350 | 351 | ### Transfer: 352 | 353 |
Click to expand

------------------------

- Go to BigQuery => Transfer.

From there it is really user-friendly: find your data in Cloud Storage, choose a dataset in BigQuery, append to or overwrite the table within the dataset, and schedule it.

There is also the option of triggering it with Pub/Sub, but it is not necessary for me.

A cool option: once the data is loaded from Cloud Storage into BigQuery, you can configure Transfer to erase the Cloud Storage bucket content (I understand this can save money if you are dealing with heavy volumes of data).

In our case:
- MIRROR the bucket info to the BQ table (because, as explained before, Google Trends works weirdly and the received info varies depending on the date interval the user requests).
- Ignore 0 errors.
- "," as the field delimiter.
- Exclude the 1st row (the 1st row has the column names).
- **Do not erase the file in Storage after sending it to BigQuery.** Your Cloud Function will not work.

*If you modify the content of Storage, wait at least 1 hour before using Transfer again, as said in this thread: https://stackoverflow.com/questions/61126019/not-able-to-update-big-query-table-with-transfer-from-a-storage-file*

</details>
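The transfers can also be inspected from Python, which is handy for checking that the schedule really exists; a sketch assuming the google-cloud-bigquery-datatransfer library and hypothetical project/location ids:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = "projects/my-project/locations/europe"  # location id is an assumption
for config in client.list_transfer_configs(parent=parent):
    print(config.display_name, config.schedule, config.state)
```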
376 | 377 | ### Dataprep: 378 | 379 |
Click to expand

------------------------

**This is not part of the pipeline anymore; it has been replaced by Transfer. Nevertheless, I leave the information here in case someone finds it useful.**

Dataprep loads and transfers files from Cloud Storage to BigQuery tables in a scheduled process.

- Select your project.

- **Very important:** Go to Preferences and locate your Dataprep folder in the same region as the rest of your Google tools.

- Go to Cloud Storage and check the location of the dataprep-staging bucket. If it is not in the location of your project, create a new bucket. In Dataprep/Preferences/Profile, modify the location of the temp directory, the job run directory and the upload directory.

- Workflow: manual job (for testing):

    - Select the Cloud Storage bucket (dataset or datasets).
    - Create a recipe.
    - Select an output for the recipe: select the BigQuery table you will feed (I think it should be created in BigQuery first).
    - Select whether you want to append to or modify the existing tables in BigQuery.
    - In the workflow, use the calendar icon to schedule it (beware of time formats between Cloud Function, Dataprep, BigQuery... always use the same location if you can).

- When the job works fine, do exactly the same but with a scheduled job.

</details>
405 | 406 | -------------------------------- 407 | 408 | 409 | ### **TROUBLESHOOTING TIME 3:** 410 | 411 |
Click to expand

------------------------

Follow everything mentioned above. Dataprep processes are veeery slow, so you can waste a lot of time waiting for the job to be done and checking the log console.

In our case:

- Dataflow Execution Settings: Europe-West1 (Belgium)
- AutoZone
- n1 standard
- Truncate table in BigQuery (we'll continuously overwrite the Google Trends data, due to how Google Trends works)

</details>
426 | 427 | ------------------------------------------------- 428 | 429 | ### Bigquery 2: 430 | 431 |
Click to expand

------------------------

1. Go there.
2. Create a scheduled query requesting from the tables periodically fed by Transfer, and save the results in other tables.

The motivation for this is the following: Data Studio is cheapest when the dashboarded data is allocated in BigQuery tables. So, if you have all the needed info in a BigQuery table, it is a good option to create as many tables as graphs you will have in your Data Studio. Don't create tables randomly; just plan beforehand what you want to show in your dashboard, and how you want to show it.

In ***bigquery/tables_python_api.sql*** the SQL query for the Pytrends tables is available, in case you want to take a glance.

</details>
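The "one table per graph" idea boils down to running CREATE OR REPLACE TABLE statements like the ones in *bigquery/tables_python_api.sql*; a sketch of launching one of them from Python (project, dataset and table names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE OR REPLACE TABLE `my-project.my_dataset.videocall_searches` AS
SELECT keyword AS videocall_searches, date, trend_index
FROM `my-project.my_dataset.pytrends_raw`
WHERE keyword IN ('zoom', 'teams', 'skype')
"""
client.query(sql).result()  # wait until the table is (re)built
```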
444 | 445 | ### Data Studio: 446 |
Click to expand

------------------------
Find your tables in BigQuery and make some art.

**Warning:** I learnt the hard way not to use the Custom Query option in Data Studio: every time someone opens the dashboard, the query will be performed, and it can cost you kind of A LOT. So always create your dashboards from BigQuery tables.

Luckily, Reddit has a great BigQuery community, and it is worth taking a glance there. This was the issue, if you're curious: https://www.reddit.com/r/bigquery/comments/gdk5mo/some_help_to_understand_how_i_burnt_the_300_and/

</details>
457 | 458 | ------------------------------ 459 | 460 | **- Gdelt Project.** 461 | 462 |
Click to expand

------------------------

- Filter by country with the help of a personal table containing all the main national news media (not every national webpage with the .es domain).
- Analyse the sentiment of each filtered piece of news.

The queries are available in the ***bigquery*** folder.

473 | 474 |
475 | 476 | 477 | 478 | --------------------------- 479 | ------------------------- 480 | 481 | 482 | # 3. Results 483 | 484 | Results can be found here: 485 | 486 | https://datastudio.google.com/s/lE64HKGyR8w 487 | 488 | ------------------------- 489 | ------------------------ 490 | 491 | 492 | # 4. Analyzing results 493 |
Click to expand

------------------------


Do you remember when refugees or nationalism were the main topics on TV in Spain? Will those topics come back for other reasons, like an economic crisis across Europe? For how long is the keyword "economic war" going to be the main topic of some politicians?

Are the circumstances we live in right now temporary, or will they be part of our future?

Will people spend more money on non-essential purchases, like ordering food from home?

Will remote working last after the crisis?

At this date, Friday 05/06/2020, it is not easy to give an answer; nevertheless, this project could set the basis for understanding the short-term past, the present, and maybe the near future.

</details>
510 | 511 | ----------------------- 512 | ----------------------- 513 | 514 | # 5. Conclusion 515 | 516 |
Click to expand

------------------------


- This project has been very useful to reinforce my Google Cloud skills, which was the main goal, and it is accomplished.

- The second goal was to get more expertise working with Gdelt, performing more accurate filters. After creating my own tables, testing the Gdelt info against them, extracting, unnesting, requesting the last week regardless of the day... I think I learned a lot, and I also had fun. I do love Gdelt, and I am starting to really appreciate and understand the power of BigQuery SQL.

- It is a drawback that Google Trends delivers information in a *quite weird manner*: it is already processed, so, for instance, requesting weekly is pointless for our purposes; you have to request the whole period you want to analyse at once. Doing that also implies that the information is delivered in blocks, so if I request information every Monday from 1st January 2019, I will get all the info grouped by week, without the last week, making the dashboard look like it is not up to date.

- Working with Twitter in a professional manner is very expensive, out of a fun budget. You need to request all the tweets written in a region/time window, and then analyse that info. We did not see a cheap way of requesting just the number of times a concrete word appears, for example. The requesting is also very slow... I don't want to imagine how expensive it could be to have a VM instance requesting and processing 24/7.

</details>
531 | 532 | --------------------- 533 | --------------------- 534 | 535 | 536 | # 6. Further improvements. 537 | 538 |
Click to expand

------------------------

- Better research on themes to follow in Gdelt.
- Use Python APIs to track stock markets in time, and add them to the dashboard :)
- I am in the process of automating an ML algorithm to infer how many times the keyword ***"unemployment"*** will be searched on Google one week ahead.


</details>
549 | 550 | 551 | 552 | ---------------------------- 553 | ------------------------------- 554 | 555 | # 7. Documentation: 556 | 557 |
Click to expand

------------------------

- What Google Trends is and how it works (no code): https://www.karinakumykova.com/2019/03/calculate-search-interest-with-pytrends-api-and-python/

- List of categories: https://github.com/pat310/google-trends-api/wiki/Google-Trends-Categories

- Official documentation: https://pypi.org/project/pytrends/
- The "real" Google Trends API: https://towardsdatascience.com/google-trends-api-for-python-a84bc25db88f
- A great tutorial: https://searchengineland.com/learn-how-to-chart-and-track-google-trends-in-data-studio-using-python-329119


- https://github.com/albertovpd/pytrends_cloud_function_example


- The Python ETL pipeline is based on this article: **https://towardsdatascience.com/creation-of-an-etl-in-google-cloud-platform-for-automated-reporting-8a0309ee8a78**

- The Gdelt queries are based on knowledge shared with the Data Team of Labelium España: **https://medium.com/@a.vargas.pina/biqquery-and-the-gdelt-project-beyond-dreams-of-marketing-analysts-62e586cc0343**

- The final dashboard: https://datastudio.google.com/reporting/755f3183-dd44-4073-804e-9f7d3d993315

- Twitter code and how-to: **https://medium.com/@patricrp/buscando-contenido-con-la-api-de-twitter-f3c12994a77f**

- https://github.com/albertovpd/automated_etl_google_cloud-social_dashboard

</details>
586 | 587 | -------------------------------- 588 | ------------------------------- 589 | 590 | 591 | Many thanks to: 592 | 593 | - **Alex Masip**, Head of Data at Labelium España https://www.linkedin.com/in/alexmasip/ 594 | - **César Castañón**: https://www.linkedin.com/in/cesar-castanon/ 595 | - **Jesús Rodríguez**:https://www.linkedin.com/in/jralvarez1986/ 596 | 597 | 598 | 599 | --------------------- 600 | 601 | ![alt](pics/monguerteam.png " ") 602 | Project by **Patricia Carmona** and **Alberto Vargas**. 603 | 604 | 605 | - https://www.linkedin.com/in/carmonapulidopatricia/ 606 | 607 | - https://www.linkedin.com/in/alberto-vargas-pina/ 608 | 609 | 610 | 611 | 612 | 613 | ---------------- 614 | ---------------- 615 | 616 | # Developer notes. 617 |
Click to expand

------------------------

## Python script automation.

- Cloud Storage: bucket in the Europe multi-region (Dataprep has no Europe-London location).

- Cloud Function: testing it will modify the outcome in Cloud Storage.

- Cloud Scheduler for triggering the Cloud Function: 0 3 * * 1, Europe (Germany). It means it will run every Monday at 03:00 Berlin time (in cron, 0=Sunday => 1=Monday).

- Transfer (load from Storage to BigQuery): Germany. Weekly, on Mondays at 05:00, starting 26/10/2020.

*Be careful: if you modify the content of Cloud Storage, it will take at least 1 hour for the change to be noticed by Transfer, as said in this thread: https://stackoverflow.com/questions/61126019/not-able-to-update-big-query-table-with-transfer-from-a-storage-file. You'll get a log saying that no changes were detected, so the table is not modified. This caused a 2-week delay in the Pytrends results.*

- BigQuery Pytrends: this schedule will run every Mon at 06:00 Europe/Berlin, starting Mon Jun 08 2020.

*Processed in the EU.*

## Gdelt automation.

- Overwriting the *main table*: this schedule will run every Sun at 23:59 Europe/Paris, starting Sun Jun 21 2020.

- Appending to the dashboard tables: this schedule will run every Mon at 02:00 Europe/Paris, starting Mon Jun 22 2020.

*Processed in the US.*

**Very important**: My project is in the EU, but Gdelt is in the US. My Gdelt dataset is also in the US => don't forget to set the processing location to US.


</details>
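Since the Gdelt dataset lives in the US, queries launched from code must also pin the processing location; a sketch with the google-cloud-bigquery client (the query itself is only an example):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT COUNT(*) AS rows_today
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE DATE(_PARTITIONTIME) = CURRENT_DATE()
"""
# location="US": the job fails if it is processed in the EU while the data is in the US.
rows = client.query(sql, location="US").result()
print(list(rows))
```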

------------------------------
------------------------------
--------------------------------------------------------------------------------
/bigquery/automating_gdelt_query_overwriting_1st_table_weekly.sql:
--------------------------------------------------------------------------------
-- The first query generates the whole content. This content is loaded into different
-- tables for the sake of speed in Data Studio, so the important tables are these
-- dashboard tables.

-- Then, once the first big request is done, the main table is overwritten weekly,
-- to get just the new weekly info. That info is appended to the important tables.

CREATE OR REPLACE TABLE
  `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` AS
-- list of themes http://data.gdeltproject.org/api/v2/guides/LOOKUP-GKGTHEMES.TXT
SELECT
  -- this SELECT exists only so the WHERE at the end can remove the nulls generated in news_in_Spain
  *
FROM (
  SELECT
    EXTRACT (date
    FROM
      PARSE_TIMESTAMP('%Y%m%d%H%M%S',CAST(date AS string))) AS Date,
    SourceCommonName,
    ROUND(CAST(SPLIT(V2Tone, ",") [
      OFFSET
        (0)] AS FLOAT64),2) AS Sentiment,
    (CASE
        --WHEN V2Themes LIKE "%IDEOLOGY%" THEN "ideologia"
        -- WHEN V2Themes LIKE "%CRISISLEX_CRISISLEXREC%" THEN "use_of_social_media_for_crisis_and_disaster_response" --https://blog.gdeltproject.org/crisislex-taxonomies-now-available-in-gkg/
        -- WHEN V2Themes LIKE "%EPU_POLICY_GOVERNMENT%" THEN "epu_policy_goverment"
        --WHEN V2Themes LIKE "%ARMEDCONFLICT%" THEN "conflicto_armado"
        --WHEN V2Themes LIKE "%TRANSPARENCY%" THEN "transparencia"
        --WHEN V2Themes LIKE "%SCANDAL%" THEN "escandalo"
        -- ECONOMICAL
        WHEN V2Themes LIKE "%ECON_STOCKMARKET%" THEN "stock_market"
        WHEN V2Themes LIKE "%EPU_POLICY%" THEN "incertidumbre_economica" -- https://blog.gdeltproject.org/economic-policy-uncertainty-is-driving-economic-uncertainty-in-the-era-of-trump/
        WHEN V2Themes LIKE "%WB_1235_CENTRAL_BANKS%" THEN "banco_mundial"
        WHEN V2Themes LIKE "%UNEMPLOYMENT%" THEN "desempleo"
        WHEN V2Themes LIKE "%ECON_INFLATION%" THEN "inflacion_economica"
        WHEN v2themes LIKE "%ECON_BANKRUPTCY%" THEN "quiebra_economica"
        WHEN V2THEMES LIKE "%WB_1104_MACROECONOMIC_VULNERABILITY_AND_DEBT%" THEN "macroeconomia_deuda_y_vulnerabilidad"
        WHEN V2THEMES LIKE "%WB_2745_JOB_QUALITY_AND_LABOR_MARKET_PERFORMANCE%" THEN "job_quality_&_labor_market_performance"
        WHEN v2themes LIKE "%POVERTY%" THEN "pobreza"
        WHEN v2themes LIKE "%ECON_DEBT" THEN "deuda_publica"
        WHEN v2themes LIKE "%WB_471_ECONOMIC_GROWTH%" THEN "crecimiento_economico"
        WHEN v2themes LIKE "%WB_318_FINANCIAL_ARCHITECTURE_AND_BANKING%" THEN "finanzas_y_bancos"
        WHEN v2themes LIKE "%ECON_OILPRICE%" THEN "precio_petroleo"
        WHEN v2themes LIKE "%WB_625_HEALTH_ECONOMICS_AND_FINANCE%" THEN "prosperidad_economica_y_finanzas"
        WHEN v2themes LIKE "%ECON_FREETRADE%" THEN "libre_comercio"
        -- POLITICAL
        WHEN V2Themes LIKE "%POLITICAL_TURMOIL%" THEN "inestabilidad_politica"
        WHEN V2Themes LIKE "%REBELLION%" THEN "rebelion"
        WHEN V2Themes LIKE "%TRIAL%" THEN "juicio"
        WHEN V2Themes LIKE "%TERROR%" THEN "terrorismo"
        WHEN V2Themes LIKE "%MILITARY%" THEN "ejercito"
        WHEN V2Themes LIKE "%CORRUPTION%" THEN "corrupcion"
        WHEN V2Themes LIKE "%PROTEST%" THEN "protestas"
        WHEN V2Themes LIKE "%EXTREMISM%" THEN "extremismo"
        WHEN
V2Themes LIKE "%REFUGEES%" THEN "refugiados" 56 | WHEN V2Themes LIKE "%SURVEILLANCE%" THEN "vigilancia" 57 | WHEN v2themes LIKE "%WB_2024_ANTI_CORRUPTION_AUTHORITIES%" THEN "autoridades_anticorrupcion" 58 | WHEN v2themes LIKE "%EPU_CATS_NATIONAL_SECURITY%" THEN "seguridad_nacional" -- http://data.gdeltproject.org/dailytrendreport/GDELT_Trend_Report-2019-12-22.pdf 59 | WHEN v2themes LIKE "%democracy%" THEN "democracia" 60 | WHEN v2themes LIKE "%WB_2019_ANTI_CORRUPTION_LEGISLATION%" THEN "legislacion_anticorrupcion" 61 | WHEN v2themes LIKE "%WB_739_POLITICAL_VIOLENCE_AND_CIVIL_WAR%" THEN "violencia_politica_y_guerra_civil" 62 | -- SOCIAL 63 | WHEN v2themes LIKE "%WB_2443_RAPE_AND_SEXUAL_VIOLENCE%" THEN "agresion_sexual" 64 | WHEN v2themes LIKE "%IMMIGRATION%" THEN "inmigracion" 65 | WHEN V2Themes LIKE "%SCIENCE%" THEN "ciencia" 66 | WHEN v2themes LIKE "%EDUCATION %" THEN "educacion" 67 | WHEN v2themes LIKE "%ENV_CLIMATECHANGE" THEN "cambio_climatico" 68 | WHEN v2themes LIKE "%ECON_ENTREPRENEURSHIP%" THEN "emprendimiento" 69 | WHEN v2themes LIKE "%WB_2165_HEALTH_EMERGENCIES%" THEN "emergencia_sanitaria" 70 | WHEN v2themes LIKE "%ECON_HOUSING_PRICES%" THEN "precio_vivienda" 71 | WHEN v2themes LIKE "%WB_2167_PANDEMICS" THEN "pandemia" 72 | WHEN v2themes LIKE "%WB_525_RENEWABLE_ENERG%" THEN "energias_renovables" 73 | WHEN v2themes LIKE "%ECON_SUBSIDIES%" THEN "subsidios" 74 | WHEN v2themes LIKE "%DISCRIMINATION_RACE_RACISM%" THEN "racismo" 75 | WHEN v2themes LIKE "%HEALTH_VACCINATION%" THEN "vacunas" 76 | WHEN v2themes LIKE "%MEDIA_CENSORSHIP%" THEN "censura_en_medios" 77 | WHEN V2THEMES LIKE "%TAX_DISEASE%" THEN "enfermedades_muy_infecciosas" -- https://blog.gdeltproject.org/infectious-disease-mapping-and-early-warning-with-gdelt/ 78 | WHEN v2themes LIKE "%TAX_DISEASE_CORONAVIRUS_INFECTIONS%" THEN "numero_de_contagios_covid%" 79 | WHEN v2themes LIKE "%TAX_DISEASE_CORONAVIRUS%" THEN "fallecimiento_por_covid" 80 | END 81 | ) AS news_in_Spain, 82 | v2counts, 83 | v2locations 84 | FROM 85 | `gdelt-bq.gdeltv2.gkg_partitioned` 86 | WHERE 87 | v2counts LIKE '%#SP#%' 88 | AND counts LIKE '%#SP#%' 89 | AND V2Locations LIKE '%#SP#%' 90 | AND DATE(_PARTITIONTIME) >= DATE_TRUNC(CURRENT_DATE(), WEEK(MONDAY)) -- this is the part to overwriting weekly 91 | AND ( SourceCommonName IN ( 92 | SELECT 93 | spanish_newspapers 94 | FROM 95 | `myproject-mydataset.gdelt_info_filtering.spanish_newspapers_SourceCommonName_160620`))) 96 | WHERE 97 | news_in_Spain IS NOT NULL -------------------------------------------------------------------------------- /bigquery/automating_gdelt_query_updating_dashboard_tables.sql: -------------------------------------------------------------------------------- 1 | -- go to schedule query 2 | -- select schedule 3 | -- Run on mondays 4 | -- Append to existing related table 5 | -- The same with the other tables 6 | 7 | SELECT 8 | news_in_Spain AS social, 9 | Date, 10 | Sentiment 11 | FROM 12 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` 13 | WHERE 14 | news_in_Spain = "agresion_sexual" 15 | OR news_in_Spain = "inmigracion" 16 | OR news_in_Spain = "ciencia" 17 | OR news_in_Spain = "educacion" 18 | OR news_in_Spain = "cambio_climatico" 19 | OR news_in_Spain ="emprendimiento" 20 | OR news_in_Spain ="emergencia_sanitaria" 21 | OR news_in_Spain ="precio_vivienda" 22 | OR news_in_Spain = "energias_renovables" 23 | OR news_in_Spain ="pandemia" 24 | OR news_in_Spain = "subsidios" 25 | OR news_in_Spain ="racismo" 26 | OR news_in_Spain ="vacunas" 27 | OR news_in_Spain 
="censura_en_medios" 28 | OR news_in_Spain = "enfermedades_muy_infecciosas" -- https://blog.gdeltproject.org/infectious-disease-mapping-and-early-warning-with-gdelt/ 29 | OR news_in_Spain = "numero_de_contagios_covid%" 30 | OR news_in_Spain = "fallecimiento_por_covid"; -------------------------------------------------------------------------------- /bigquery/gdelt_1st_query_themes_and_sentiment.sql: -------------------------------------------------------------------------------- 1 | -- This 1st query request all info and save it to make a concrete graph (percentaje of news according to the selected topics) 2 | -- The 2nd query create tables dividing the Gdelt info, to make other Graphs by theme 3 | 4 | CREATE OR REPLACE TABLE 5 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` AS 6 | SELECT 7 | -- This select * avoid nulls generated by contrasting with News_in_Spain tables 8 | * 9 | FROM ( 10 | 11 | SELECT 12 | EXTRACT (date 13 | FROM 14 | PARSE_TIMESTAMP('%Y%m%d%H%M%S',CAST(date AS string))) AS Date, 15 | SourceCommonName, 16 | ROUND(CAST(SPLIT(V2Tone, ",") [ 17 | OFFSET 18 | (0)] AS FLOAT64),2) AS Sentiment, 19 | (CASE 20 | --WHEN V2Themes LIKE "%IDEOLOGY%" THEN "ideologia" 21 | -- WHEN V2Themes LIKE "%CRISISLEX_CRISISLEXREC%" THEN "use_of_social_media_for_crisis_and_disaster_response" --https://blog.gdeltproject.org/crisislex-taxonomies-now-available-in-gkg/ 22 | -- WHEN V2Themes LIKE "%EPU_POLICY_GOVERNMENT%" THEN "epu_policy_goverment" 23 | --WHEN V2Themes LIKE "%ARMEDCONFLICT%" THEN "conflicto_armado" 24 | --WHEN V2Themes LIKE "%TRANSPARENCY%" THEN "transparencia" 25 | --WHEN V2Themes LIKE "%SCANDAL%" THEN "escandalo" 26 | -- ECONOMICAL 27 | WHEN V2Themes LIKE "%ECON_STOCKMARKET%" THEN "stock_market" 28 | WHEN V2Themes LIKE "%EPU_POLICY%" THEN "incertidumbre_economica" -- https://blog.gdeltproject.org/economic-policy-uncertainty-is-driving-economic-uncertainty-in-the-era-of-trump/ 29 | WHEN V2Themes LIKE "%WB_1235_CENTRAL_BANKS%" THEN "banco_mundial" 30 | WHEN V2Themes LIKE "%UNEMPLOYMENT%" THEN "desempleo" 31 | WHEN V2Themes LIKE "%ECON_INFLATION%" THEN "inflacion_economica" 32 | WHEN v2themes LIKE "%ECON_BANKRUPTCY%" THEN "quiebra_economica" 33 | WHEN V2THEMES LIKE "%WB_1104_MACROECONOMIC_VULNERABILITY_AND_DEBT%" THEN "macroeconomia_deuda_y_vulnerabilidad" 34 | WHEN V2THEMES LIKE "%WB_2745_JOB_QUALITY_AND_LABOR_MARKET_PERFORMANCE%" THEN "job_quality_&_labor_market_performance" 35 | WHEN v2themes LIKE "%POVERTY%" THEN "pobreza" 36 | WHEN v2themes LIKE "%ECON_DEBT" THEN "deuda_publica" 37 | WHEN v2themes LIKE "%WB_471_ECONOMIC_GROWTH%" THEN "crecimiento_economico" 38 | WHEN v2themes LIKE "%WB_318_FINANCIAL_ARCHITECTURE_AND_BANKING%" THEN "finanzas_y_bancos" 39 | WHEN v2themes LIKE "%ECON_OILPRICE%" THEN "precio_petroleo" 40 | WHEN v2themes LIKE "%WB_625_HEALTH_ECONOMICS_AND_FINANCE%" THEN "prosperidad_economica_y_finanzas" 41 | WHEN v2themes LIKE "%ECON_FREETRADE%" THEN "libre_comercio" 42 | -- POLITICAL 43 | WHEN V2Themes LIKE "%POLITICAL_TURMOIL%" THEN "inestabilidad_politica" 44 | WHEN V2Themes LIKE "%REBELLION%" THEN "rebelion" 45 | WHEN V2Themes LIKE "%TRIAL%" THEN "juicio" 46 | WHEN V2Themes LIKE "%TERROR%" THEN "terrorismo" 47 | WHEN V2Themes LIKE "%MILITARY%" THEN "ejercito" 48 | WHEN V2Themes LIKE "%CORRUPTION%" THEN "corrupcion" 49 | WHEN V2Themes LIKE "%PROTEST%" THEN "protestas" 50 | WHEN V2Themes LIKE "%EXTREMISM%" THEN "extremismo" 51 | WHEN V2Themes LIKE "%REFUGEES%" THEN "refugiados" 52 | WHEN V2Themes LIKE "%SURVEILLANCE%" THEN "vigilancia" 53 | WHEN 
v2themes LIKE "%WB_2024_ANTI_CORRUPTION_AUTHORITIES%" THEN "autoridades_anticorrupcion" 54 | WHEN v2themes LIKE "%EPU_CATS_NATIONAL_SECURITY%" THEN "seguridad_nacional" -- http://data.gdeltproject.org/dailytrendreport/GDELT_Trend_Report-2019-12-22.pdf 55 | WHEN v2themes LIKE "%democracy%" THEN "democracia" 56 | WHEN v2themes LIKE "%WB_2019_ANTI_CORRUPTION_LEGISLATION%" THEN "legislacion_anticorrupcion" 57 | WHEN v2themes LIKE "%WB_739_POLITICAL_VIOLENCE_AND_CIVIL_WAR%" THEN "violencia_politica_y_guerra_civil" 58 | -- SOCIAL 59 | WHEN v2themes LIKE "%WB_2443_RAPE_AND_SEXUAL_VIOLENCE%" THEN "agresion_sexual" 60 | WHEN v2themes LIKE "%IMMIGRATION%" THEN "inmigracion" 61 | WHEN V2Themes LIKE "%SCIENCE%" THEN "ciencia" 62 | WHEN v2themes LIKE "%EDUCATION %" THEN "educacion" 63 | WHEN v2themes LIKE "%ENV_CLIMATECHANGE" THEN "cambio_climatico" 64 | WHEN v2themes LIKE "%ECON_ENTREPRENEURSHIP%" THEN "emprendimiento" 65 | WHEN v2themes LIKE "%WB_2165_HEALTH_EMERGENCIES%" THEN "emergencia_sanitaria" 66 | WHEN v2themes LIKE "%ECON_HOUSING_PRICES%" THEN "precio_vivienda" 67 | WHEN v2themes LIKE "%WB_2167_PANDEMICS" THEN "pandemia" 68 | WHEN v2themes LIKE "%WB_525_RENEWABLE_ENERG%" THEN "energias_renovables" 69 | WHEN v2themes LIKE "%ECON_SUBSIDIES%" THEN "subsidios" 70 | WHEN v2themes LIKE "%DISCRIMINATION_RACE_RACISM%" THEN "racismo" 71 | WHEN v2themes LIKE "%HEALTH_VACCINATION%" THEN "vacunas" 72 | WHEN v2themes LIKE "%MEDIA_CENSORSHIP%" THEN "censura_en_medios" 73 | WHEN V2THEMES LIKE "%TAX_DISEASE%" THEN "enfermedades_muy_infecciosas" -- https://blog.gdeltproject.org/infectious-disease-mapping-and-early-warning-with-gdelt/ 74 | WHEN v2themes LIKE "%TAX_DISEASE_CORONAVIRUS_INFECTIONS%" THEN "numero_de_contagios_covid%" 75 | WHEN v2themes LIKE "%TAX_DISEASE_CORONAVIRUS%" THEN "fallecimiento_por_covid" 76 | END 77 | ) AS news_in_Spain, 78 | v2counts, 79 | v2locations 80 | FROM 81 | `gdelt-bq.gdeltv2.gkg_partitioned` 82 | WHERE 83 | v2counts LIKE '%#SP#%' 84 | AND counts LIKE '%#SP#%' 85 | AND V2Locations LIKE '%#SP#%' 86 | AND DATE(_PARTITIONTIME) >= "2019-01-01" 87 | AND DATE(_PARTITIONTIME) <="2020-06-14" 88 | AND ( SourceCommonName IN ( 89 | SELECT 90 | spanish_newspapers 91 | FROM 92 | `myproject-mydataset.gdelt_info_filtering.spanish_newspapers_SourceCommonName_160620`))) 93 | 94 | WHERE 95 | news_in_Spain IS NOT NULL -------------------------------------------------------------------------------- /bigquery/gdelt_2nd_query_creating_tables.sql: -------------------------------------------------------------------------------- 1 | -- The 1st query request all info and save it to make a concrete graph in Dashboard 2 | -- This 2nd query create tables dividing the Gdelt info, to make other Graphs by theme 3 | 4 | 5 | CREATE OR REPLACE TABLE 6 | `myproject-mydataset.gdelt_info_filtering.dashboard_spanish_news_economical` AS 7 | SELECT 8 | news_in_Spain AS economical, 9 | Date, 10 | Sentiment 11 | FROM 12 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` 13 | WHERE 14 | news_in_Spain = "stock_market" 15 | OR news_in_Spain = "incertidumbre_economica" 16 | OR news_in_Spain = "banco_mundial" 17 | OR news_in_Spain = "desempleo" 18 | OR news_in_Spain ="inflacion_economica" 19 | OR news_in_Spain ="quiebra_economica" 20 | OR news_in_Spain ="macroeconomia_deuda_y_vulnerabilidad" 21 | OR news_in_Spain ="job_quality_&_labor_market_performance" 22 | OR news_in_Spain = "pobreza" 23 | OR news_in_Spain = "deuda_publica" 24 | OR news_in_Spain ="crecimiento_economico" 25 | OR news_in_Spain = 
"finanzas_y_bancos" 26 | OR news_in_Spain ="prosperidad_economica_y_finanzas" 27 | OR news_in_Spain ="libre_comercio" 28 | OR news_in_Spain ="precio_petroleo" ; 29 | -- ------------------------------------ 30 | CREATE OR REPLACE TABLE 31 | `myproject-mydataset.gdelt_info_filtering.dashboard_spanish_news_political` AS 32 | SELECT 33 | news_in_Spain AS political, 34 | Date, 35 | Sentiment 36 | FROM 37 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` 38 | WHERE 39 | news_in_Spain = "inestabilidad_politica" 40 | OR news_in_Spain = "rebelion" 41 | OR news_in_Spain = "juicio" 42 | OR news_in_Spain = "terrorismo" 43 | OR news_in_Spain = "ejercito" 44 | OR news_in_Spain ="corrupcion" 45 | OR news_in_Spain ="protestas" 46 | OR news_in_Spain ="extremismo" 47 | OR news_in_Spain = "refugiados" 48 | OR news_in_Spain ="vigilancia" 49 | OR news_in_Spain = "autoridades_anticorrupcion" 50 | OR news_in_Spain ="seguridad_nacional" 51 | OR news_in_Spain ="democracia" 52 | OR news_in_Spain ="precio_petroleo" 53 | OR news_in_Spain ="violencia_politica_y_guerra_civil"; 54 | -- ------------------------- 55 | CREATE OR REPLACE TABLE 56 | `myproject-mydataset.gdelt_info_filtering.dashboard_spanish_news_social` AS 57 | SELECT 58 | news_in_Spain AS social, 59 | Date, 60 | Sentiment 61 | FROM 62 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` 63 | WHERE 64 | news_in_Spain = "agresion_sexual" 65 | OR news_in_Spain = "inmigracion" 66 | OR news_in_Spain = "ciencia" 67 | OR news_in_Spain = "educacion" 68 | OR news_in_Spain = "cambio_climatico" 69 | OR news_in_Spain ="emprendimiento" 70 | OR news_in_Spain ="emergencia_sanitaria" 71 | OR news_in_Spain ="precio_vivienda" 72 | OR news_in_Spain = "energias_renovables" 73 | OR news_in_Spain ="pandemia" 74 | OR news_in_Spain = "subsidios" 75 | OR news_in_Spain ="racismo" 76 | OR news_in_Spain ="vacunas" 77 | OR news_in_Spain ="censura_en_medios" 78 | OR news_in_Spain = "enfermedades_muy_infecciosas" -- https://blog.gdeltproject.org/infectious-disease-mapping-and-early-warning-with-gdelt/ 79 | OR news_in_Spain = "numero_de_contagios_covid%" 80 | OR news_in_Spain = "fallecimiento_por_covid"; -------------------------------------------------------------------------------- /bigquery/gdelt_list_themes.sql: -------------------------------------------------------------------------------- 1 | WITH nested AS ( 2 | SELECT SPLIT(RTRIM(Themes,';'),';') themes FROM `gdelt-bq.gdeltv2.gkg_partitioned` WHERE _PARTITIONTIME >= "2019-09-04 00:00:00" AND _PARTITIONTIME < "2019-09-05 00:00:00" and length(Themes) > 1 3 | ) select theme, count(1) cnt from nested, UNNEST(themes) as theme group by theme order by cnt desc -------------------------------------------------------------------------------- /bigquery/spanish_newspapers.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | # this is for creating the dataset of spanish newspaperes. 
3 | # https://www.enterat.com/actualidad/periodicos-digitales-espana.php 4 | raw_list=[ 5 | "https://www.abc.es/", 6 | "https://www.elmundo.es/", 7 | "https://elpais.com/", 8 | "https://www.larazon.es/", 9 | "https://www.cuartopoder.es/", 10 | "https://diario16.com/", 11 | "https://www.diariocritico.com/", 12 | "https://www.elconfidencial.com/", 13 | "https://www.elconfidencialdigital.com/", 14 | "https://www.eldiario.es/", 15 | "https://www.elespanol.com/", 16 | "https://www.huffingtonpost.es/", 17 | "https://www.elimparcial.es/", 18 | "https://www.elindependiente.com/", 19 | "https://www.elplural.com/", 20 | "https://www.elsaltodiario.com/", 21 | "https://www.esdiario.com/", 22 | "https://www.estrelladigital.es/", 23 | "https://www.hispanidad.com/", 24 | "https://www.infolibre.es/", 25 | "https://www.gaceta.es/", 26 | "https://www.lainformacion.com/", 27 | "https://www.libertaddigital.com/", 28 | "https://www.mediterraneodigital.com/", 29 | "https://www.moncloa.com/", 30 | "https://www.mundiario.com/", 31 | "https://www.niusdiario.es/", 32 | "https://www.nuevatribuna.es/", 33 | "https://okdiario.com/", 34 | "https://www.periodistadigital.com/", 35 | "https://www.publico.es/", 36 | "https://www.que.es/", 37 | "https://www.republica.com/", 38 | "https://www.vozpopuli.com/", 39 | 40 | #--diarios regionales 41 | 42 | "https://www.diarioinformacion.com/", 43 | "https://www.latribunadealbacete.es/", 44 | "https://www.diariodealmeria.es/", 45 | "https://www.ideal.es/", 46 | "https://www.lavozdealmeria.com/", 47 | "https://www.berria.eus/", 48 | "https://www.noticiasdealava.eus/", 49 | "https://www.elcorreo.com/alava/", 50 | "https://www.naiz.eus/", 51 | "https://www.naiz.eus/", 52 | "https://www.elcomercio.es/", 53 | "https://www.lne.es/", 54 | "https://www.diariodeavila.es/", 55 | "https://www.elperiodicoextremadura.com/", 56 | "https://www.hoy.es/", 57 | "https://www.lacronicabadajoz.com/", 58 | "https://www.ara.cat/", 59 | "https://www.diaridesabadell.com/", 60 | "http://www.diarideterrassa.es/", 61 | "https://www.elperiodico.com/", 62 | "https://www.elperiodico.cat/", 63 | "http://www.elpuntavui.cat/", 64 | "https://www.lavanguardia.com/", 65 | "https://www.regio7.cat/", 66 | "https://www.20minutos.es/", 67 | "https://www.berria.eus/", 68 | "https://www.deia.eus/", 69 | "https://www.elcorreo.com/", 70 | "https://www.naiz.eus", 71 | "http://www.elnervion.com/", 72 | "https://www.naiz.eus/", 73 | "https://www.diariodeburgos.es/", 74 | "https://elcorreodeburgos.elmundo.es/", 75 | "https://www.elperiodicoextremadura.com/", 76 | "https://www.hoy.es/", 77 | "https://www.diariodecadiz.es/", 78 | "https://www.diariodejerez.es/", 79 | "https://www.europasur.es/", 80 | "https://www.lavozdigital.es/", 81 | "https://andaluciainformacion.es/", 82 | "https://www.eldiarioalerta.com/", 83 | "https://www.eldiariomontanes.es/", 84 | "https://www.elperiodicomediterraneo.com/", 85 | "https://elfarodeceuta.es/", 86 | "https://elpueblodeceuta.es/", 87 | "https://www.latribunadeciudadreal.es/", 88 | "https://www.diariocordoba.com/", 89 | "https://www.eldiadecordoba.es/", 90 | "https://www.elcorreogallego.es/", 91 | "https://www.elidealgallego.com/", 92 | "https://www.laopinioncoruna.es/", 93 | "https://www.lavozdegalicia.es/", 94 | "https://www.nosdiario.gal/", 95 | "https://www.latribunadecuenca.es/", 96 | "https://www.berria.eus/", 97 | "https://www.elcorreo.com/gipuzkoa/", 98 | "https://www.diariovasco.com/", 99 | "https://www.naiz.eus/", 100 | "https://www.noticiasdegipuzkoa.eus/", 101 | 
"https://www.naiz.eus/", 102 | "https://www.ara.cat/", 103 | "https://www.diaridegirona.cat/", 104 | "https://www.elperiodico.com/es/", 105 | "https://www.elperiodico.cat/", 106 | "http://www.elpuntavui.cat/", 107 | "https://www.lavanguardia.com/", 108 | "https://www.regio7.cat/", 109 | "https://www.granadahoy.com/", 110 | "https://www.ideal.es/almeria/", 111 | "http://huelva24.com/", 112 | "https://www.huelvainformacion.es/", 113 | "https://andaluciainformacion.es/", 114 | "https://andaluciainformacion.es/", 115 | "https://www.diariodelaltoaragon.es/", 116 | "https://www.elperiodicodearagon.com/", 117 | "https://www.heraldo.es/", 118 | "https://www.arabalears.cat/", 119 | "https://www.diariodeibiza.es/", 120 | "https://www.diariodemallorca.es/", 121 | "https://www.menorca.info/", 122 | "https://www.diariodeleon.es/", 123 | "https://www.ara.cat/", 124 | "http://www.elpuntavui.cat/", 125 | "https://www.lamanyana.cat/", 126 | "https://www.lavanguardia.com/", 127 | "https://www.segre.com/", 128 | "https://www.elperiodico.com/", 129 | "https://www.lanuevacronica.com/", 130 | "https://www.elperiodico.cat/", 131 | "https://www.diariojaen.es/", 132 | "https://www.ideal.es/", 133 | "https://www.ultimahora.es/", 134 | "https://www.periodicodeibiza.es/", 135 | "https://www.diariodeleon.es/", 136 | "https://www.lanuevacronica.com/", 137 | "https://www.20minutos.es/", 138 | "https://www.elprogreso.es/", 139 | "https://www.lavozdegalicia.es/", 140 | "https://www.diariosur.es/", 141 | "https://www.nosdiario.gal/", 142 | "https://www.arabalears.cat/", 143 | "https://www.diariodeibiza.es/", 144 | "https://www.diariodemallorca.es/", 145 | "https://www.menorca.info/", 146 | "https://www.periodicodeibiza.es/", 147 | "https://www.ultimahora.es/", 148 | "https://andaluciainformacion.es/", 149 | "https://www.diariojaen.es/", 150 | "https://www.diariodeleon.es/", 151 | "https://www.lanuevacronica.com/", 152 | "https://www.elprogreso.es/", 153 | "https://www.lavozdegalicia.es/", 154 | "https://www.nosdiario.gal/", 155 | "https://www.diariosur.es/", 156 | "https://www.laopiniondemalaga.es/", 157 | "https://www.malagahoy.es/", 158 | "https://www.melillahoy.es/", 159 | "https://elfarodemelilla.es/", 160 | "https://www.laopiniondemurcia.es/", 161 | "https://www.laverdad.es/murcia/cartagena/", 162 | "https://www.laverdad.es/", 163 | "https://www.noticiasdenavarra.com/", 164 | "https://www.diariodenavarra.es/", 165 | "https://www.farodevigo.es/", 166 | "https://www.laregion.es/", 167 | "https://www.lavozdegalicia.es/", 168 | "https://www.nosdiario.gal/", 169 | "https://www.elnortedecastilla.es/", 170 | "https://www.diariopalentino.es/", 171 | "https://www.canarias7.es/", 172 | "https://www.laprovincia.es/", 173 | "https://www.atlantico.net/", 174 | "https://www.diariodepontevedra.es/", 175 | "https://www.farodevigo.es/", 176 | "https://www.lavozdegalicia.es/", 177 | "https://www.nosdiario.gal/", 178 | "https://www.larioja.com/", 179 | "https://www.eldia.es/", 180 | "https://diariodeavisos.elespanol.com/", 181 | "https://www.eladelantado.com/", 182 | "https://www.lagacetadesalamanca.es/", 183 | "https://sevilla.abc.es/", 184 | "https://heraldodiariodesoria.elmundo.es/", 185 | "https://www.diariodesevilla.es/", 186 | "https://www.diaridetarragona.com/", 187 | "https://www.ara.cat/", 188 | "http://www.elpuntavui.cat/", 189 | "https://www.lavanguardia.com/", 190 | "https://www.diarimes.com/", 191 | "https://www.elperiodicodearagon.com/", 192 | "https://www.diariodeteruel.es/", 193 | "https://www.heraldo.es/", 194 | 
"https://www.latribunadetoledo.es/", 195 | "https://www.lasprovincias.es/", 196 | "https://www.levante-emv.com/", 197 | "https://valenciaextra.com/", 198 | "https://diariodevalladolid.elmundo.es/", 199 | "https://www.eldiadevalladolid.com/", 200 | "https://www.laopiniondezamora.es/", 201 | "https://www.elperiodicodearagon.com/", 202 | "https://www.heraldo.es/", 203 | 204 | # just online ones 205 | 206 | "https://alicanteplaza.es/", 207 | "https://www.elperiodic.com/", 208 | "https://www.clm24.es/", 209 | "https://eldiadigital.es/", 210 | "http://www.eldigitalcastillalamancha.es/", 211 | "https://www.elperiodicodeaqui.com/", 212 | "https://www.eldigitaldealbacete.com/", 213 | "https://www.masquealba.com/", 214 | "https://lamarinaplaza.com/", 215 | "https://lacerca.com/", 216 | "https://www.encastillalamancha.es/", 217 | "https://almeria360.com/", 218 | "https://www.gasteizhoy.com/", 219 | "https://nortexpres.com/", 220 | "https://www.lavozdeasturias.es/", 221 | "https://www.noticiascyl.com/", 222 | "https://avilared.com/", 223 | "http://www.regiondigital.com/", 224 | "https://www.cronicabalear.es/", 225 | "https://www.mallorcadiario.com/", 226 | "https://www.menorca.info/", 227 | "https://menorcaaldia.com/", 228 | "https://www.noudiari.es/", 229 | "https://www.capgros.com/", 230 | "https://catalunyadiari.com/", 231 | "https://www.eixdiari.cat/", 232 | "https://www.elcatalan.es/", 233 | "https://elmon.cat/", 234 | "https://www.elnacional.cat/", 235 | "https://www.e-noticies.cat/", 236 | "https://www.isabadell.cat/", 237 | "https://www.larepublica.cat/", 238 | "https://www.metropoliabierta.com/", 239 | "https://www.naciodigital.cat/", 240 | "https://www.racocatala.cat/", 241 | "https://www.vilaweb.cat/", 242 | "https://www.burgosconecta.es/", 243 | "https://www.burgosnoticias.com/", 244 | "https://www.diariodelaribera.net/", 245 | "https://www.noticiascyl.com/", 246 | "https://www.regiondigital.com/", 247 | "https://www.algecirasalminuto.com/", 248 | "https://www.cadizdirecto.com/", 249 | "https://www.lavozdelsur.es/", 250 | "https://rotaaldia.com/", 251 | "https://elmira.es/", 252 | "https://www.eldiariocantabria.es/", 253 | "https://castellonplaza.com/", 254 | "https://www.elperiodic.com/", 255 | "https://www.elperiodicodeaqui.com/", 256 | "https://www.levante-emv.com/castello/", 257 | "https://www.clm24.es/", 258 | "https://eldiadigital.es/", 259 | "http://www.eldigitalcastillalamancha.es/", 260 | "https://www.encastillalamancha.es/", 261 | "https://www.miciudadreal.es/", 262 | "http://manchainformacion.com/", 263 | "https://www.lanzadigital.com/", 264 | "https://lacerca.com/", 265 | "https://cordopolis.es/", 266 | "https://www.lucenahoy.com/", 267 | "https://www.galiciaconfidencial.com/", 268 | "https://www.clm24.es/", 269 | "https://eldiadigital.es/", 270 | "http://www.eldigitalcastillalamancha.es/", 271 | "https://www.encastillalamancha.es/", 272 | "https://lacerca.com/", 273 | "https://www.lasnoticiasdecuenca.es/", 274 | "https://www.vocesdecuenca.com/", 275 | "https://www.donostitik.com/", 276 | "http://www.durangon.com/", 277 | "https://goiena.eus/", 278 | "https://catalunyadiari.com/", 279 | "https://www.elcatalan.es/", 280 | "https://www.elnacional.cat/", 281 | "https://www.e-noticies.cat/", 282 | "https://www.larepublica.cat/", 283 | "https://www.naciodigital.cat/", 284 | "https://www.racocatala.cat/", 285 | "https://www.vilaweb.cat/", 286 | "https://www.granadadigital.es/", 287 | "https://www.clm24.es/", 288 | "https://eldiadigital.es/", 289 | 
"http://www.eldigitalcastillalamancha.es/", 290 | "https://www.encastillalamancha.es/", 291 | "https://nuevaalcarria.com/", 292 | "https://www.lacronica.net/", 293 | "https://www.aragondigital.es/", 294 | "https://www.diariodevalderrueda.es/", 295 | "https://digitaldeleon.com/", 296 | "https://www.elbierzodigital.com/", 297 | "https://www.ileon.com/", 298 | "https://www.infobierzo.com/", 299 | "https://www.noticiascyl.com/", 300 | "https://www.leonoticias.com/", 301 | "https://catalunyadiari.com/", 302 | "https://www.elcatalan.es/", 303 | "https://www.elnacional.cat/", 304 | "https://www.e-noticies.cat/", 305 | "https://www.larepublica.cat/", 306 | "https://lleidadiari.cat/", 307 | "https://www.vilaweb.cat/", 308 | "https://www.racocatala.cat/", 309 | "https://www.naciodigital.cat/", 310 | "https://www.galiciaconfidencial.com/", 311 | "https://www.alcalahoy.es/", 312 | "https://elcorreodeespana.com/", 313 | "https://www.madridiario.es/", 314 | "https://www.marbella24horas.es/", 315 | "https://cartagenadiario.es/", 316 | "https://navarra.elespanol.com/", 317 | "https://pamplonaactual.com/", 318 | "https://www.galiciaconfidencial.com/", 319 | "https://www.diariodevalderrueda.es/", 320 | "https://www.noticiascyl.com/", 321 | "https://www.eldiario.es/", 322 | "https://www.infonortedigital.com/", 323 | "https://www.lancelotdigital.com/", 324 | "https://www.lavozdelanzarote.com/", 325 | "https://www.galiciaconfidencial.com/", 326 | "https://pontevedraviva.com/", 327 | "https://www.rioja2.com/", 328 | "https://nuevecuatrouno.com/", 329 | "https://www.noticiascyl.com/", 330 | "https://www.salamanca24horas.com/", 331 | "https://salamancartvaldia.es/", 332 | "https://www.noticiascyl.com/", 333 | "https://elcorreoweb.es/", 334 | "https://elmira.es/", 335 | "http://www.sevillaactualidad.com/", 336 | "https://www.desdesoria.es/", 337 | "https://elmirondesoria.es/", 338 | "https://www.noticiascyl.com/", 339 | "https://sorianoticias.com/", 340 | "https://catalunyadiari.com/", 341 | "https://www.eixdiari.cat/", 342 | "https://www.elcatalan.es/", 343 | "https://www.elnacional.cat/", 344 | "https://www.e-noticies.cat/", 345 | "https://www.larepublica.cat/", 346 | "https://www.vilaweb.cat/", 347 | "http://reusdigital.cat/", 348 | "https://www.racocatala.cat/", 349 | "https://www.naciodigital.cat/", 350 | "https://www.aragondigital.es/", 351 | "https://ecodeteruel.tv/", 352 | "https://www.lacomarca.net/", 353 | "https://www.clm24.es/", 354 | "https://eldiadigital.es/", 355 | "http://www.eldigitalcastillalamancha.es/", 356 | "https://www.encastillalamancha.es/", 357 | "https://www.lavozdeltajo.com/", 358 | "https://lacerca.com/", 359 | "https://www.elperiodic.com/", 360 | "https://www.elperiodicodeaqui.com/", 361 | "https://valenciaplaza.com/", 362 | "https://www.tribunavalladolid.com/", 363 | "https://www.noticiascyl.com/", 364 | "https://interbenavente.es/", 365 | "https://www.noticiascyl.com/", 366 | "https://www.zamora24horas.com/", 367 | "https://zamoranews.com/", 368 | "https://www.aragondigital.es/", 369 | "https://cincodias.elpais.com/", 370 | "https://www.eleconomista.es/", 371 | "https://www.expansion.com/", 372 | "https://www.bolsamania.com/", 373 | "https://www.economiadigital.es/", 374 | "https://www.elboletin.com/", 375 | "https://www.finanzas.com/", 376 | "https://intereconomia.com/", 377 | "https://www.elespanol.com/", 378 | "https://www.libremercado.com/", 379 | "https://www.leconomic.cat/" 380 | 381 | ] 382 | 383 | 384 | 385 | spanish_media=set(raw_list) 386 | #print("initial list: ", 
len(raw_list),"\n", "removing duplicates: ", len(spanish_media)) 387 | 388 | newspapers=pd.DataFrame(sorted(spanish_media), columns=["spanish_media"]) # pandas rejects a raw set, so sort it into a list first (also keeps the csv deterministic) 389 | newspapers.to_csv('./output/spanish_media.csv') 390 |
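An aside, not a file from the repo: GDELT's SourceCommonName field identifies an outlet by its bare domain rather than its full URL, so the list above still needs flattening before it can be matched against GKG records — presumably what the gitignored spanish_newspapers_SourceCommonName.py does. A minimal sketch of that step; bare_domain and domains are my names, not the repo's:

import pandas as pd
from urllib.parse import urlparse

# reduce each URL to its bare domain: "https://www.elperiodico.com/es/" -> "elperiodico.com"
def bare_domain(url):
    netloc = urlparse(url).netloc
    return netloc[4:] if netloc.startswith("www.") else netloc

domains = sorted({bare_domain(u) for u in raw_list})  # raw_list comes from the script above
print(len(domains), domains[:3])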
"yoga" 127 | OR keyword = "meditacion" 128 | OR keyword ="cursos online"); 129 | SELECT 130 | * 131 | FROM 132 | your_table_name.habit_searches; -------------------------------------------------------------------------------- /cloud_function/main.py: -------------------------------------------------------------------------------- 1 | import os 2 | from os import listdir 3 | from os.path import isfile, join 4 | 5 | def main (data,context): 6 | 7 | processes= ["pytrends_request.py","upload_gcs_real.py","remove_files.py"] 8 | 9 | for p in processes: 10 | print(p) 11 | exec(open(p).read()) 12 | 13 | if __name__ == "__main__": 14 | 15 | main('data','context') 16 | -------------------------------------------------------------------------------- /cloud_function/pytrends_request.py: -------------------------------------------------------------------------------- 1 | # import os 2 | # import gcsfs 3 | import pandas as pd 4 | import random 5 | import pytrends 6 | from pytrends.request import TrendReq 7 | import time as timer 8 | import datetime 9 | from datetime import datetime, date, time, timedelta 10 | 11 | 12 | 13 | 14 | #-------------------------------------- 15 | # date 16 | #yesterday=(datetime.now()-timedelta(days=1)).date() 17 | dates="2019-01-06"+" "+str(datetime.now().date()) 18 | print(dates) 19 | 20 | keywords=[ 21 | "zoom","skype","hangouts", 22 | 23 | "refugiados","inmigracion","nacionalismo","corrupcion","estado de alarma", 24 | "comparecencia","independentismo","crisis política","barometro","crisis economica", 25 | "protesta","manifestacion", 26 | 27 | "bildu","ciudadanos","compromis","erc","mas pais", "pnv","podemos","pp","psoe","vox", 28 | 29 | "teletrabajo","remoto","cursos online","productividad","autonomo", 30 | "negocio online","emprendimiento", "formacion", 31 | 32 | "erte","paro","sepe","desempleo","deshaucio","comedor social", 33 | "banco alimentos", "cruz roja", "caritas", 34 | 35 | "ayuda alquiler","compartir piso","divorcio","embarazo","hipoteca", "idealista", "badi", "piso barato", 36 | 37 | "coronavirus","pandemia","infeccion","medico","residencia ancianos", "desescalada", 38 | 39 | "clases online","examenes","menu escolar","bullying", 40 | 41 | "netflix","hbo","steam","glovo","just eat","deliveroo","uber eats","hacer deporte","en casa", 42 | "yoga","meditacion","videollamada","videoconferencia","tinder","meetic", "disney","amazon", "cabify", 43 | "uber","taxi", "en familia" 44 | 45 | ] 46 | #-------------------------------------- 47 | # the function 48 | #-------------------------------------- 49 | pytrends = TrendReq(hl='ES', tz=0) 50 | future_dataframe={} 51 | c=1 52 | for k in keywords: 53 | 54 | try: 55 | #print("Requesting ",[k]) 56 | pytrends.build_payload([k], timeframe=dates, geo='ES', gprop='') 57 | future_dataframe[c]=pytrends.interest_over_time() 58 | future_dataframe[c].drop(['isPartial'], axis=1,inplace=True) 59 | c+=1 60 | result = pd.concat(future_dataframe, axis=1) 61 | except: 62 | print("***","\n","Error with ",k,"or not enough trending percentaje","\n","***") 63 | 64 | result.columns = result.columns.droplevel(0) 65 | df1=result.unstack(level=-1) 66 | df2=pd.DataFrame(df1) 67 | df2.reset_index(inplace=True) 68 | df2.columns = ["keyword","date","trend_index"] 69 | df2.to_csv("../tmp/data_pytrends.csv") -------------------------------------------------------------------------------- /cloud_function/remove_files.py: -------------------------------------------------------------------------------- 1 | import os 2 | from os import listdir 3 | from os.path 
-------------------------------------------------------------------------------- /cloud_function/remove_files.py: -------------------------------------------------------------------------------- 1 | import os 2 | from os import listdir 3 | from os.path import isfile, join 4 | # list the csv files left in ../tmp 5 | only_csv = [f for f in listdir("../tmp") if isfile(join("../tmp", f))] 6 | 7 | for e in only_csv: os.remove(join("../tmp", e)) 8 | print(only_csv, "removed, the ../tmp folder is clean") -------------------------------------------------------------------------------- /cloud_function/requirements.txt: -------------------------------------------------------------------------------- 1 | gcsfs==0.6.1 2 | pandas==1.0.3 3 | pytrends==4.7.2 4 | importlib-metadata==1.6.0 5 | 6 | 7 | -------------------------------------------------------------------------------- /cloud_function/upload_gcs_example.py: -------------------------------------------------------------------------------- 1 | import os 2 | import gcsfs 3 | import pandas as pd 4 | 5 | # CREDENTIALS 6 | # - project = Project name in Google Cloud, NOT the project ID. 7 | # - fs.open("your bucket/name of your future csv.csv") 8 | # - fs.upload("where the csv is temporarily stored in the cloud function", "your bucket/name of your future csv.csv") 9 | 10 | # .ENV structure: 11 | # TOKEN_NAME='your credentials file with object access to gcs.json' 12 | # PROJECT_NAME='name of your project, NOT the project ID' 13 | # PROJECT_PATH='your bucket in gcs/your future.csv' 14 | # PROJECT_TMP='../tmp/name of the csv requested in pytrends_request.py.csv' 15 | 16 | fs = gcsfs.GCSFileSystem(token=os.getenv('TOKEN_NAME'), project=os.getenv('PROJECT_NAME')) 17 | 18 | # sanity check: open the current csv in the bucket, then overwrite it with the fresh one 19 | with fs.open(os.getenv('PROJECT_PATH')) as mergecomp: 20 | df = pd.read_csv(mergecomp) 21 | fs.upload(os.getenv('PROJECT_TMP'),os.getenv('PROJECT_PATH'))
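To try the uploader locally, the .ENV variables described above must reach os.getenv first. python-dotenv is one common way to load them — an extra dependency that is not in requirements.txt, so this is just a sketch:

import os
from dotenv import load_dotenv

load_dotenv()  # reads TOKEN_NAME, PROJECT_NAME, PROJECT_PATH, PROJECT_TMP from .env
assert os.getenv("TOKEN_NAME"), "TOKEN_NAME missing, check your .env file"
# with the environment in place, run the gcsfs calls from upload_gcs_example.py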
-------------------------------------------------------------------------------- /pics/automated_dashboard_preview.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/automated_dashboard_preview.gif -------------------------------------------------------------------------------- /pics/budget_alert.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/budget_alert.png -------------------------------------------------------------------------------- /pics/cloud_function_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/cloud_function_error.png -------------------------------------------------------------------------------- /pics/gdelt_sketch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/gdelt_sketch.png -------------------------------------------------------------------------------- /pics/interactive_dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/interactive_dashboard.png -------------------------------------------------------------------------------- /pics/list_of_scheduled_queries.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/list_of_scheduled_queries.png -------------------------------------------------------------------------------- /pics/monguerteam.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/monguerteam.png -------------------------------------------------------------------------------- /pics/mrdogscience.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/mrdogscience.jpeg -------------------------------------------------------------------------------- /pics/trends_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/trends_example.png -------------------------------------------------------------------------------- /pics/twitter_fees.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/twitter_fees.png -------------------------------------------------------------------------------- /pics/updating_gdelt_tables.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/updating_gdelt_tables.png -------------------------------------------------------------------------------- /twitter_folder/twitter_merging.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import datetime 3 | 4 | # Working with the Twitter API to monitor keywords weekly is 5 | # really expensive and out of scope for this project 6 | # (we are mere mortals with a limited budget). Nevertheless, 7 | # the code for monitoring Twitter with Python will be 8 | # available soon, thanks to Patri 9 | 10 | 11 | # This is just getting the data into the right shape for 12 | # Data Studio. 13 | 14 | twitter_deshaucio=pd.read_csv("./input/defdesa.csv") 15 | deshaucio=twitter_deshaucio[["date","query"]].copy() 16 | 17 | twitter_comedor=pd.read_csv("./input/dfcom.csv") 18 | comedor=twitter_comedor[["date","query"]].copy() 19 | 20 | twitter_desempleo=pd.read_csv("./input/dfdes.csv") 21 | desempleo=twitter_desempleo[["date","query"]].copy() 22 | 23 | twitter_erte=pd.read_csv("./input/dferte.csv") 24 | erte=twitter_erte[["date","query"]].copy() 25 | 26 | twitter_sepe=pd.read_csv("./input/dfsepe.csv") 27 | sepe=twitter_sepe[["date","query"]].copy() 28 | 29 | twitter_final=pd.concat([deshaucio, comedor, desempleo, erte, sepe]) # stack the five keyword extracts into one frame 30 | twitter_final.to_csv("./input/twitter_final.csv") --------------------------------------------------------------------------------
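Once twitter_final.csv exists, a quick sanity check that it has a plausible shape for Data Studio (one row per tweet, a parseable date, the query term) could look like the snippet below. Just a sketch: tweets_per_day is my name, and the daily count is one plausible granularity for the dashboard, not necessarily the one used:

import pandas as pd

# count rows (tweets) per day and per query term
df = pd.read_csv("./input/twitter_final.csv", parse_dates=["date"])
tweets_per_day = (
    df.groupby([df["date"].dt.date, "query"])
      .size()
      .reset_index(name="tweet_count")
)
print(tweets_per_day.head())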