├── .gitignore
├── README.md
├── bigquery
│   ├── automating_gdelt_query_overwriting_1st_table_weekly.sql
│   ├── automating_gdelt_query_updating_dashboard_tables.sql
│   ├── gdelt_1st_query_themes_and_sentiment.sql
│   ├── gdelt_2nd_query_creating_tables.sql
│   ├── gdelt_list_themes.sql
│   ├── spanish_newspapers.py
│   └── tables_python_api.sql
├── cloud_function
│   ├── backup
│   │   └── data_pytrends_backup.csv
│   ├── main.py
│   ├── pytrends_request.py
│   ├── remove_files.py
│   ├── requirements.txt
│   └── upload_gcs_example.py
├── pics
│   ├── automated_dashboard_preview.gif
│   ├── budget_alert.png
│   ├── cloud_function_error.png
│   ├── gdelt_sketch.png
│   ├── interactive_dashboard.png
│   ├── list_of_scheduled_queries.png
│   ├── monguerteam.png
│   ├── mrdogscience.jpeg
│   ├── trends_example.png
│   ├── twitter_fees.png
│   └── updating_gdelt_tables.png
└── twitter_folder
    └── twitter_merging.py

/.gitignore:
--------------------------------------------------------------------------------
ipynb_checkpoints/
twitter_folder/.ipynb_checkpoints/
*.ipynb
.env
cloud_function_backup
tmp
output/
input/
*.json
*.zip
upload_gcs_real.py
spanish_newspapers_SourceCommonName.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Socioeconomic Portrait Project.

![alt](pics/automated_dashboard_preview.gif "looking good")

- The buttons above enable a better visualization: select the keywords you want to study and exclude the others:
![alt](pics/interactive_dashboard.png " ")

The final dashboard:
- https://datastudio.google.com/reporting/755f3183-dd44-4073-804e-9f7d3d993315


-----------------------------

Motivation:

Is there a way of monitoring some aspects of the global crisis in Spain? I believe so, and that is the motivation for developing this **automated** ETL process in Google Cloud, involving **Google Trends**, sentiment analysis and influence in the news through the **Gdelt Project**, and **Twitter**, from raw data acquisition to the final dashboard. Thanks to it, I have been fighting with credentials, permissions, storage locations, processing locations, 3rd-party authentications, Cloud Functions, pipelines, trigger schedulers with different time formats, Dataprep global updates, etc... And I learned a lot along the way. Quarantine fun! :D

Thanks to Patricia, who worked on very interesting methodologies with Twitter. Here you can find her content => https://medium.com/@patricrp/buscando-contenido-con-la-api-de-twitter-f3c12994a77f

-------------------------------

1. **Introduction**
2. **How to use**
3. **Results**
4. **Analyzing results**
5. **Conclusion**
6. **Further improvements**
7. **Documentation**

------------------------------

# 1. Introduction.
<details>
Click to expand

-----------------

Let's show a social point of view of the pandemic's impact in Spain, through an automated ETL involving the following tools in Google Cloud:

- Billing: create a maximum budget to avoid unexpected fees (10€, 20€, 300€... whatever suits your project).
- Cloud Function: Python script sending data to Cloud Storage.
    - Requests data from the Google Trends API.
- Pub/Sub topic: triggers (activates) the Cloud Function.
- Cloud Scheduler: schedules a Pub/Sub topic.
- Cloud Storage: Data Lake. Gets fed periodically by the Python script.
- Transfer: loads data periodically from Cloud Storage into BigQuery (previously done with Dataprep, but in September 2020 its pricing changes and it becomes unaffordable for any personal project).
- BigQuery: Data Warehouse.
    - Creates different tables with the transferred output.
    - Performs a weekly query modifying tables.
- Data Studio: dashboard of results, updated weekly, linked to the BigQuery tables.

The Python ETL pipeline is based on this article:

- **https://towardsdatascience.com/creation-of-an-etl-in-google-cloud-platform-for-automated-reporting-8a0309ee8a78**

The Gdelt queries are based on knowledge shared with the Data Team of Labelium España:

- **https://medium.com/@a.vargas.pina/biqquery-and-the-gdelt-project-beyond-dreams-of-marketing-analysts-62e586cc0343**

### **TROUBLESHOOTING TIME 1:** ***Golden rules for Cloud Services***:
<details>
Click to expand
---------------------------------

- **1. Your wallet is your sacred temple:** If you won't read all the Google documentation carefully, at least go to Billing and create a budget with alerts for your project, so that if you do something wrong you receive custom alerts before spending a lot without noticing. Don't end up living under a bridge.

- **2. Your sacred temple has no protection by default:** Budget alerts won't cap payments when the limit is reached. They just warn you, so you have time to turn everything off in case of panic. It can be configured to cap, though.
- **3. Worship location:** While working on the different stages of your project, pay attention to putting all your resources in the very same region. Beware of headaches if you don't.
- **4. Don't panic:** That's all.

</details>

--------------------

### Tools used to analyse the internet in Spain:
<details>
Click to expand
-----------------------------------

***Google Trends***:

- Python.
- Pytrends library.

Google Trends is a tool that analyses the popularity of top search queries in Google Search across various regions and languages. Basically, what people are searching for on Google.

Google Trends finds the maximum over the specified period, makes that maximum 100% of the Trend Index, and scales everything else in the requested dates relative to that top. If you request information weekly, you will have a point with a 100% Trend Index each week, regardless of how popular it actually is.

- If you request a list of elements, all elements will be scaled relative to the top one.

- If you request each of your keywords separately, each keyword will be scaled over time by its own top.

So basically there are 2 ways of using it: comparing the evolution in time of different keywords, or checking the evolution in time of each keyword separately. We took the 2nd path.

- César discovered how tricky Pytrends is. The idea of requesting everything at once, but each keyword separately, is his => https://www.linkedin.com/in/cesar-castanon/

-----------------------


***Twitter***:

We used Tweepy, the Python library for Twitter requests, to generate dataframes for each query. The built-in Cursor function helped to develop a function that creates a dataframe with all tweets from the last 7 days related to a word, recording date, user, id and text for each tweet. The search function includes a complex query, geolocation and language, to get only tweets from Spanish people.

We used the Twitter Standard API, the *freemium* service, so results were limited to the last 7 days and to a maximum of mixed results from popular and recent tweets.

This helped us learn what people were saying about unemployment and its effects during COVID-19.

In the twitter folder there is a merging-datasets script; you can find the Twitter requesting script and the whole process here:

- **https://medium.com/@patricrp/buscando-contenido-con-la-api-de-twitter-f3c12994a77f**


By the way, working *seriously* with Twitter is quite expensive, and we do love programming, but our budget, like our leisure time, is limited.

![alt](pics/twitter_fees.png " ")

-------------------------------


***Gdelt Project***:

- SQL.
- BigQuery.

The Gdelt Project is a database of all the news of the world, updated every 15 minutes. It also classifies the incoming data, so you can search for topics, themes, people, people related to them... It is impressive, and available in almost 70 languages via BigQuery.

- https://www.gdeltproject.org/

</details>
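To illustrate the "one request per keyword" idea described above, here is a minimal Pytrends sketch. It is only a sketch under assumptions: the keyword list, dates and locale below are illustrative, not the project's actual configuration (the real script lives in *cloud_function/pytrends_request.py*).

```python
from pytrends.request import TrendReq
import pandas as pd

keywords = ["desempleo", "teletrabajo"]  # hypothetical examples
pytrends = TrendReq(hl="es-ES", tz=360)

frames = []
for kw in keywords:
    # One payload per keyword, so each series is scaled by its own maximum (the "2nd path").
    pytrends.build_payload([kw], timeframe="2019-01-01 2020-06-01", geo="ES")
    df = pytrends.interest_over_time()
    if not df.empty:
        df = df.drop(columns=["isPartial"]).rename(columns={kw: "trend_index"})
        df["keyword"] = kw
        frames.append(df)

# One row per (date, keyword); each keyword is averaged by its own top.
result = pd.concat(frames)
```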

-------------------------------------------

We want to monitor what Spanish news says about certain topics, so the process is the following:

1. Create a dataset with all the Spanish newspapers (paper and digital versions). Taking a glance at *bigquery/spanish_newspapers.py* you can figure out the created dataset (>150 different media).

2. Make a BigQuery table with that dataset (see the sketch after this section).

3. Select the *themes* you want to monitor in the Gdelt database. In *bigquery/gdelt_list_themes.sql* you can find how to list all available themes.

4. Make a query filtering by your goals, contrasting against the created table of Spanish media and loading the data into a new table, the *main table*.

5. Divide the info of the *main table* into different tables, one per graph in Data Studio (to get a faster visualization in Data Studio).

6. Schedule the overwriting of the *main table*, to get periodical updates.

7. Schedule the appending of the new info to the dashboard tables.

Updating tables:
![alt](pics/updating_gdelt_tables.png)

List of scheduled queries (Google Trends included):
![alt](pics/list_of_scheduled_queries.png)

8. Plot it in Data Studio.

- Alex taught me how to UNNEST and EXTRACT. Kudos for the sentiment extraction => https://www.linkedin.com/in/alexmasip/

- Jesús gave me a hand with some final SQL statements :) => https://www.linkedin.com/in/jralvarez1986/

</details>
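As a minimal sketch of step 2, this is one way of turning the newspapers dataset into a BigQuery table from Python (dataset and table names are hypothetical; requires google-cloud-bigquery and pyarrow):

```python
import pandas as pd
from google.cloud import bigquery

# csv produced by bigquery/spanish_newspapers.py
newspapers = pd.read_csv("output/spanish_media.csv")

client = bigquery.Client()
job = client.load_table_from_dataframe(
    newspapers, "my-project.gdelt_info_filtering.spanish_newspapers"
)
job.result()  # block until the load job finishes
```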
172 | 173 | 174 | ------------------------------ 175 | ------------------------------- 176 | 177 | # 2. How to: 178 | 179 |
180 | Click to expand 181 | 182 | ------------------------ 183 | 184 | ### Billing. 185 | 186 |
Click to expand

------------------------
In the Cloud environment, select your project, go to Billing, and take a glance. Just set a weekly/monthly budget for your project; remember that it is linked to your bank account, and a wrong action can cost a lot of money if you don't pay attention.

In Notification Channels you can choose among a lot of options, whatever suits you.

![alt](pics/budget_alert.png " ")


</details>
198 | 199 | -------------------- 200 | 201 | 202 | ### Create a new project and get your credentials here: 203 | 204 |
205 | Click to expand 206 | 207 | ------------------------ 208 | 209 | https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts?_ga=2.187077160.1256540572.1587286598-806021438.1579453370&pli=1&supportedpurview=project 210 | 211 | - Service Account: Give it a nice name. 212 | - Role: Storage Object Admin, Cloud Function 213 | 214 |
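A sketch of using the exported JSON key explicitly from Python, instead of relying on the default credentials (the key path and project id are hypothetical):

```python
from google.oauth2 import service_account
from google.cloud import storage

credentials = service_account.Credentials.from_service_account_file("my-key.json")
client = storage.Client(project="my-project", credentials=credentials)
print([b.name for b in client.list_buckets()])  # quick sanity check
```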
215 | 216 | ---------------- 217 | 218 | ### Create a bucket in Cloud Storage 219 | 220 |
Click to expand

------------------------

https://console.cloud.google.com/storage/create-bucket?

- Configure it: due to our requirements we will use the **Europe multi-region** location. Be careful, location can give you a serious headache, mainly if you are working with BigQuery or with data from regions other than the one you selected. Always place your buckets in the same location as all the data sources you are using for the project.

</details>
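If you prefer creating the bucket from Python rather than the console, a minimal sketch (bucket and project names are hypothetical; note the explicit location):

```python
from google.cloud import storage

client = storage.Client(project="my-project")
# "EU" is the Europe multi-region; keep it consistent with the rest of the project.
bucket = client.create_bucket("my-trends-bucket", location="EU")
print(bucket.name, bucket.location)
```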
230 | 231 | ---------------- 232 | 233 | ### Python script 234 | 235 |
Click to expand

------------------------


*The Python script I'm using is available in the cloud_function folder*.

- Make sure it runs fine on your PC before uploading it to Cloud Functions. Debugging in the cloud is a hassle.

- If you are following my steps (gcsfs library) and a *NotFoundError* arises while trying to upload to Cloud Storage, try uploading the requested csv(s) to Cloud Storage manually the first time; after that, you can automate the reading/overwriting of these csv(s).


Pytrends (keywords without accents or capital letters; script available in the cloud_function folder). The chosen keywords for this project are in the ***cloud_function/pytrends_request.py*** file.


- Google Trends:
    - Has the option of selecting the category in which you want to find your keywords. We have our reasons not to include categories in the code.
    - **Important**: when the requested period is longer than a month, the information is delivered grouped by week. That means you need to choose carefully when you will start requesting.

It is important to clarify the second point: in my script, I was requesting information weekly, from 2019-01-01 to the last Sunday, and the script was activated every Monday. The problem is the following: 2019-01-01 was a Tuesday. Because the information was delivered grouped by week, it didn't reach each last Sunday, it only reached each last Tuesday, so I was always missing 5 days of valuable information, and at the end of the week it seemed like I had a 2-week delay for some unknown reason. (I realised it while having a shower :D)

- Moreover, instead of appending new information to our tables each week, we are overwriting them intentionally. This is due to how Google processes and delivers the information from Trends.

If you want another approach, this is an awesome tutorial:

- https://searchengineland.com/learn-how-to-chart-and-track-google-trends-in-data-studio-using-python-329119

</details>
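For the upload itself, a minimal sketch of writing a dataframe straight to the bucket: pandas delegates gs:// paths to the gcsfs library, assuming it is installed (the bucket name and dataframe are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"keyword": ["desempleo"], "date": ["2020-06-01"], "trend_index": [100]}
)
# Overwrites gs://my-trends-bucket/data_pytrends.csv on every run.
df.to_csv("gs://my-trends-bucket/data_pytrends.csv", index=False)
```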
264 | 265 | ### Deploy the scripts on Google Cloud Function: 266 | 267 |
Click to expand

------------------------

1. Create a function here:

- https://console.cloud.google.com/functions/add?

2. Trigger: Pub/Sub.
3. Give a name to your Pub/Sub topic and select it.
4. Runtime: Python 3.7.
5. Stage bucket: the bucket you created in GCS.
6. Advanced options. Region: select the same region as the created bucket in GCS (**this is very important**).
7. Service account: the same one you used to export your JSON credentials.
8. Advanced options. Environment variables: in the code, for private stuff like paths or keywords, instead of writing them, do something like (check the code). Then, in this section, write PROJECT_NAME and its real value, both without the quote symbols (" "). It looks like it does not work with os.dotenv. If you want to see how to use env variables => https://github.com/albertovpd/cloud_function_example_pytrends
9. Function to execute: main.

- You can measure on your PC how long your script takes and how much memory it uses, then use those measurements for the configuration (RAM and timeout). If you get errors (being completely sure your script works fine, and paying attention to the logs), raise them until it works.

Why the function to execute is **main**, and other questions about the script structure, can be answered here:

- https://github.com/albertovpd/pytrends_cloud_function_example

</details>
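A sketch of what the entry point of a Pub/Sub-triggered Cloud Function looks like, reading the environment variable set in step 8 (the body is a placeholder, not the project's actual main.py):

```python
import os

def main(event, context):
    """Triggered by the Pub/Sub topic; GCP passes the message and its metadata."""
    project = os.environ["PROJECT_NAME"]  # set in the console, without quotes
    # ...request Google Trends here and write the csv to Cloud Storage...
    print(f"Pytrends run for {project} finished")
```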
292 | 293 | ---------------------- 294 | 295 | 296 | ### **TROUBLESHOOTING TIME 2: CF problems** 297 | 298 |
Click to expand

------------------------

- If you can deploy your function perfectly, but you get the following error when testing it:

![alt](pics/cloud_function_error.png " ")

Try debugging the following:

1. When configuring the CF, use a large timeout and a lot of memory, in that order. If it works, start reducing those parameters until you reach the minimum that still works (if you are going to work with data that grows over time, you should foresee that before assigning the minimal capacity).

2. If nothing seems to work out, you should double-check the permissions associated with the used credentials/service accounts.

I have been stuck on this error for some time. Maybe some information found here can be useful:

- https://stackoverflow.com/questions/62064082/unknown-error-has-occurred-in-cloud-functions/62119865#62119865

- https://www.reddit.com/r/googlecloud/comments/gs2gd9/an_unknown_error_has_occurred_in_cloud_functions/

- https://www.reddit.com/r/cloudfunctions/comments/gu37ye/huge_amout_of_allocated_memory_needed_for/

- Another issue that looks like black magic to me at this time: *I am sending a csv from a dataframe to Cloud Storage. If there is no file in Storage with the very same name as the csv you are trying to send, you will get an error. It is as if, instead of sending a new file, you were overwriting an existing one (I am writing this warning before checking out what is going on; I do not have all the time I want for my projects, so this is a problem for my future self).*

</details>
324 | 325 | ------------------- 326 | 327 | 328 | ### Scheduling 329 | 330 |
Click to expand

------------------------

### Go to Cloud Scheduler to schedule your topic.


- https://console.cloud.google.com/cloudscheduler/appengine

1. Select the same region as everything before.
2. Follow the instructions and run your job to make sure it works :)
    - Be careful copy-pasting your Pub/Sub topic name: "trigger" and "trigger " are obviously not the same.

### Bigquery 1:

- Create a dataset with the same location as the rest of the project.
- Create a table with the elements of the Cloud Storage bucket (this will be updated with Transfer).

</details>
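To test the trigger without waiting for Cloud Scheduler, you can publish a message to the topic by hand; a sketch with hypothetical project and topic names:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-trigger-topic")
future = publisher.publish(topic_path, b"manual test run")
print(future.result())  # the message id, once the message is accepted
```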
350 | 351 | ### Transfer: 352 | 353 |
Click to expand

------------------------

- Go to BigQuery => Transfer.

From there it is really user-friendly: find your data in Cloud Storage, choose a dataset in BigQuery, append to or overwrite the table within the dataset, and schedule it.

There is also the option of triggering it with Pub/Sub, but it is not necessary for me.

A cool option: once the data is loaded from Cloud Storage into BigQuery, you can configure Transfer to erase the Cloud Storage bucket content (I understand this can save money if you are dealing with heavy volumes of data).

In our case:
- MIRROR the bucket info to the BQ table (because, as explained before, Google Trends works weirdly and the received info varies depending on the date interval the user requests).
- Ignore 0 errors.
- "," as the field delimiter.
- Exclude the 1st row (the 1st row has the column names).
- **Do not erase the file in Storage after sending it to BigQuery.** Your Cloud Function will not work.

*If you modify the content of Storage, wait at least 1 hour before using Transfer again, as said in this thread: https://stackoverflow.com/questions/61126019/not-able-to-update-big-query-table-with-transfer-from-a-storage-file*

</details>
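The transfers can also be inspected from Python, which is handy for checking that the schedule really exists; a sketch assuming the google-cloud-bigquery-datatransfer library and hypothetical project/location ids:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = "projects/my-project/locations/europe"  # location id is an assumption
for config in client.list_transfer_configs(parent=parent):
    print(config.display_name, config.schedule, config.state)
```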
376 | 377 | ### Dataprep: 378 | 379 |
Click to expand

------------------------

**This is not part of the pipeline anymore; it has been replaced by Transfer. Nevertheless, I leave the information here in case someone finds it useful.**

Dataprep loads and transfers files from Cloud Storage to BigQuery tables in a scheduled process.

- Select your project.

- **Very important:** Go to Preferences and locate your Dataprep folder in the same region as the rest of your Google tools.

- Go to Cloud Storage and check the location of the dataprep-staging bucket. If it is not in the location of your project, create a new bucket. In Dataprep/Preferences/Profile, modify the location of the temp directory, the job run directory and the upload directory.

- Workflow: manual job (for testing):

    - Select the Cloud Storage bucket (dataset or datasets).
    - Create a recipe.
    - Select an output for the recipe: select the BigQuery table you will feed (I think it should be created in BigQuery first).
    - Select whether you want to append to or modify the existing tables in BigQuery.
    - In the workflow, use the calendar icon to schedule it (beware of time formats between Cloud Function, Dataprep, BigQuery... always use the same location if you can).

- When the job works fine, do exactly the same but with a scheduled job.

</details>
405 | 406 | -------------------------------- 407 | 408 | 409 | ### **TROUBLESHOOTING TIME 3:** 410 | 411 |
Click to expand

------------------------

Follow everything mentioned above. Dataprep processes are veeery slow, so you can waste a lot of time waiting for the job to be done and checking the log console.

In our case:

- Dataflow Execution Settings: Europe-West1 (Belgium)
- AutoZone
- n1 standard
- Truncate table in BigQuery (we'll continuously overwrite the Google Trends data, due to how Google Trends works)

</details>
426 | 427 | ------------------------------------------------- 428 | 429 | ### Bigquery 2: 430 | 431 |
Click to expand

------------------------

1. Go there.
2. Create a scheduled query requesting from the tables periodically fed by Transfer, and save the results in other tables.

The motivation for this is the following: Data Studio is cheapest when the dashboarded data is allocated in BigQuery tables. So, if you have all the needed info in a BigQuery table, it is a good option to create as many tables as graphs you will have in your Data Studio. Don't create tables randomly; just plan beforehand what you want to show in your dashboard, and how you want to show it.

In ***bigquery/tables_python_api.sql*** the SQL query for the Pytrends tables is available, in case you want to take a glance.

</details>
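The "one table per graph" idea boils down to running CREATE OR REPLACE TABLE statements like the ones in *bigquery/tables_python_api.sql*; a sketch of launching one of them from Python (project, dataset and table names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE OR REPLACE TABLE `my-project.my_dataset.videocall_searches` AS
SELECT keyword AS videocall_searches, date, trend_index
FROM `my-project.my_dataset.pytrends_raw`
WHERE keyword IN ('zoom', 'teams', 'skype')
"""
client.query(sql).result()  # wait until the table is (re)built
```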
444 | 445 | ### Data Studio: 446 |
Click to expand

------------------------
Find your tables in BigQuery and make some art.

**Warning:** I learnt the hard way not to use the Custom Query option in Data Studio: every time someone opens the dashboard, the query will be performed, and it can cost you kind of A LOT. So always create your dashboards from BigQuery tables.

Luckily, Reddit has a great BigQuery community, and it is worth taking a glance there. This was the issue, if you're curious: https://www.reddit.com/r/bigquery/comments/gdk5mo/some_help_to_understand_how_i_burnt_the_300_and/

</details>
457 | 458 | ------------------------------ 459 | 460 | **- Gdelt Project.** 461 | 462 |
Click to expand

------------------------

- Filter by country with the help of a personal table containing all the main national news media (not every national webpage with the .es domain).
- Analyse the sentiment of each filtered piece of news.

The queries are available in the ***bigquery*** folder.

473 | 474 |
475 | 476 | 477 | 478 | --------------------------- 479 | ------------------------- 480 | 481 | 482 | # 3. Results 483 | 484 | Results can be found here: 485 | 486 | https://datastudio.google.com/s/lE64HKGyR8w 487 | 488 | ------------------------- 489 | ------------------------ 490 | 491 | 492 | # 4. Analyzing results 493 |
Click to expand

------------------------


Do you remember when refugees or nationalism were the main topics on TV in Spain? Will those topics come back for other reasons, like an economic crisis across Europe? For how long is the keyword "economic war" going to be the main topic of some politicians?

Are the circumstances we live in right now temporary, or will they be part of our future?

Will people spend more money on non-essential purchases, like ordering food from home?

Will remote working last after the crisis?

At this date, Friday 05/06/2020, it is not easy to give an answer; nevertheless, this project could set the basis for understanding the short-term past, the present, and maybe the near future.

</details>
510 | 511 | ----------------------- 512 | ----------------------- 513 | 514 | # 5. Conclusion 515 | 516 |
Click to expand

------------------------


- This project has been very useful to reinforce my Google Cloud skills, which was the main goal, and it is accomplished.

- The second goal was to get more expertise working with Gdelt, performing more accurate filters. After creating my own tables, testing the Gdelt info against them, extracting, unnesting, requesting the last week regardless of the day... I think I learned a lot, and I also had fun. I do love Gdelt, and I am starting to really appreciate and understand the power of BigQuery SQL.

- It is a drawback that Google Trends delivers information in a *quite weird manner*: it is already processed, so, for instance, requesting weekly is pointless for our purposes; you have to request the whole period you want to analyse at once. Doing that also implies that the information is delivered in blocks, so if I request information every Monday from 1st January 2019, I will get all the info grouped by week, without the last week, making the dashboard look like it is not up to date.

- Working with Twitter in a professional manner is very expensive, out of a fun budget. You need to request all the tweets written in a region/time window, and then analyse that info. We did not see a cheap way of requesting just the number of times a concrete word appears, for example. The requesting is also very slow... I don't want to imagine how expensive it could be to have a VM instance requesting and processing 24/7.

</details>
531 | 532 | --------------------- 533 | --------------------- 534 | 535 | 536 | # 6. Further improvements. 537 | 538 |
Click to expand

------------------------

- Better research on themes to follow in Gdelt.
- Use Python APIs to track stock markets in time, and add them to the dashboard :)
- I am in the process of automating an ML algorithm to infer how many times the keyword ***"unemployment"*** will be searched on Google one week ahead.


</details>
549 | 550 | 551 | 552 | ---------------------------- 553 | ------------------------------- 554 | 555 | # 7. Documentation: 556 | 557 |
Click to expand

------------------------

- What Google Trends is and how it works (no code): https://www.karinakumykova.com/2019/03/calculate-search-interest-with-pytrends-api-and-python/

- List of categories: https://github.com/pat310/google-trends-api/wiki/Google-Trends-Categories

- Official documentation: https://pypi.org/project/pytrends/
- The "real" Google Trends API: https://towardsdatascience.com/google-trends-api-for-python-a84bc25db88f
- A great tutorial: https://searchengineland.com/learn-how-to-chart-and-track-google-trends-in-data-studio-using-python-329119


- https://github.com/albertovpd/pytrends_cloud_function_example


- The Python ETL pipeline is based on this article: **https://towardsdatascience.com/creation-of-an-etl-in-google-cloud-platform-for-automated-reporting-8a0309ee8a78**

- The Gdelt queries are based on knowledge shared with the Data Team of Labelium España: **https://medium.com/@a.vargas.pina/biqquery-and-the-gdelt-project-beyond-dreams-of-marketing-analysts-62e586cc0343**

- The final dashboard: https://datastudio.google.com/reporting/755f3183-dd44-4073-804e-9f7d3d993315

- Twitter code and how-to: **https://medium.com/@patricrp/buscando-contenido-con-la-api-de-twitter-f3c12994a77f**

- https://github.com/albertovpd/automated_etl_google_cloud-social_dashboard

</details>
586 | 587 | -------------------------------- 588 | ------------------------------- 589 | 590 | 591 | Many thanks to: 592 | 593 | - **Alex Masip**, Head of Data at Labelium España https://www.linkedin.com/in/alexmasip/ 594 | - **César Castañón**: https://www.linkedin.com/in/cesar-castanon/ 595 | - **Jesús Rodríguez**:https://www.linkedin.com/in/jralvarez1986/ 596 | 597 | 598 | 599 | --------------------- 600 | 601 | ![alt](pics/monguerteam.png " ") 602 | Project by **Patricia Carmona** and **Alberto Vargas**. 603 | 604 | 605 | - https://www.linkedin.com/in/carmonapulidopatricia/ 606 | 607 | - https://www.linkedin.com/in/alberto-vargas-pina/ 608 | 609 | 610 | 611 | 612 | 613 | ---------------- 614 | ---------------- 615 | 616 | # Developer notes. 617 |
Click to expand

------------------------

## Python script automation.

- Cloud Storage: bucket in the Europe multi-region (Dataprep has no Europe-London location).

- Cloud Function: testing it will modify the outcome in Cloud Storage.

- Cloud Scheduler for triggering the Cloud Function: 0 3 * * 1, Europe (Germany). It means it will run every Monday at 03:00 Berlin time (in cron, 0=Sunday => 1=Monday).

- Transfer (load from Storage to BigQuery): Germany. Weekly, on Mondays at 05:00, starting 26/10/2020.

*Be careful: if you modify the content of Cloud Storage, it will take at least 1 hour for the change to be noticed by Transfer, as said in this thread: https://stackoverflow.com/questions/61126019/not-able-to-update-big-query-table-with-transfer-from-a-storage-file. You'll get a log saying that no changes were detected, so the table is not modified. This caused a 2-week delay in the Pytrends results.*

- BigQuery Pytrends: this schedule will run every Mon at 06:00 Europe/Berlin, starting Mon Jun 08 2020.

*Processed in the EU.*

## Gdelt automation.

- Overwriting the *main table*: this schedule will run every Sun at 23:59 Europe/Paris, starting Sun Jun 21 2020.

- Appending to the dashboard tables: this schedule will run every Mon at 02:00 Europe/Paris, starting Mon Jun 22 2020.

*Processed in the US.*

**Very important**: My project is in the EU, but Gdelt is in the US. My Gdelt dataset is also in the US => don't forget to set the processing location to US.


</details>
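Since the Gdelt dataset lives in the US, queries launched from code must also pin the processing location; a sketch with the google-cloud-bigquery client (the query itself is only an example):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT COUNT(*) AS rows_today
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE DATE(_PARTITIONTIME) = CURRENT_DATE()
"""
# location="US": the job fails if it is processed in the EU while the data is in the US.
rows = client.query(sql, location="US").result()
print(list(rows))
```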

------------------------------
------------------------------
--------------------------------------------------------------------------------
/bigquery/automating_gdelt_query_overwriting_1st_table_weekly.sql:
--------------------------------------------------------------------------------
-- The first query generates the whole content. This content is loaded into different
-- tables for the sake of speed in Data Studio, so the important tables are these
-- dashboard tables.

-- Then, once the first big request is done, the main table is overwritten weekly,
-- to get just the new weekly info. That info is appended to the important tables.

CREATE OR REPLACE TABLE
  `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` AS
-- list of themes http://data.gdeltproject.org/api/v2/guides/LOOKUP-GKGTHEMES.TXT
SELECT
  -- this SELECT exists only so the WHERE at the end can remove the nulls generated in news_in_Spain
  *
FROM (
  SELECT
    EXTRACT (date
    FROM
      PARSE_TIMESTAMP('%Y%m%d%H%M%S',CAST(date AS string))) AS Date,
    SourceCommonName,
    ROUND(CAST(SPLIT(V2Tone, ",") [
      OFFSET
        (0)] AS FLOAT64),2) AS Sentiment,
    (CASE
        --WHEN V2Themes LIKE "%IDEOLOGY%" THEN "ideologia"
        -- WHEN V2Themes LIKE "%CRISISLEX_CRISISLEXREC%" THEN "use_of_social_media_for_crisis_and_disaster_response" --https://blog.gdeltproject.org/crisislex-taxonomies-now-available-in-gkg/
        -- WHEN V2Themes LIKE "%EPU_POLICY_GOVERNMENT%" THEN "epu_policy_goverment"
        --WHEN V2Themes LIKE "%ARMEDCONFLICT%" THEN "conflicto_armado"
        --WHEN V2Themes LIKE "%TRANSPARENCY%" THEN "transparencia"
        --WHEN V2Themes LIKE "%SCANDAL%" THEN "escandalo"
        -- ECONOMICAL
        WHEN V2Themes LIKE "%ECON_STOCKMARKET%" THEN "stock_market"
        WHEN V2Themes LIKE "%EPU_POLICY%" THEN "incertidumbre_economica" -- https://blog.gdeltproject.org/economic-policy-uncertainty-is-driving-economic-uncertainty-in-the-era-of-trump/
        WHEN V2Themes LIKE "%WB_1235_CENTRAL_BANKS%" THEN "banco_mundial"
        WHEN V2Themes LIKE "%UNEMPLOYMENT%" THEN "desempleo"
        WHEN V2Themes LIKE "%ECON_INFLATION%" THEN "inflacion_economica"
        WHEN v2themes LIKE "%ECON_BANKRUPTCY%" THEN "quiebra_economica"
        WHEN V2THEMES LIKE "%WB_1104_MACROECONOMIC_VULNERABILITY_AND_DEBT%" THEN "macroeconomia_deuda_y_vulnerabilidad"
        WHEN V2THEMES LIKE "%WB_2745_JOB_QUALITY_AND_LABOR_MARKET_PERFORMANCE%" THEN "job_quality_&_labor_market_performance"
        WHEN v2themes LIKE "%POVERTY%" THEN "pobreza"
        WHEN v2themes LIKE "%ECON_DEBT" THEN "deuda_publica"
        WHEN v2themes LIKE "%WB_471_ECONOMIC_GROWTH%" THEN "crecimiento_economico"
        WHEN v2themes LIKE "%WB_318_FINANCIAL_ARCHITECTURE_AND_BANKING%" THEN "finanzas_y_bancos"
        WHEN v2themes LIKE "%ECON_OILPRICE%" THEN "precio_petroleo"
        WHEN v2themes LIKE "%WB_625_HEALTH_ECONOMICS_AND_FINANCE%" THEN "prosperidad_economica_y_finanzas"
        WHEN v2themes LIKE "%ECON_FREETRADE%" THEN "libre_comercio"
        -- POLITICAL
        WHEN V2Themes LIKE "%POLITICAL_TURMOIL%" THEN "inestabilidad_politica"
        WHEN V2Themes LIKE "%REBELLION%" THEN "rebelion"
        WHEN V2Themes LIKE "%TRIAL%" THEN "juicio"
        WHEN V2Themes LIKE "%TERROR%" THEN "terrorismo"
        WHEN V2Themes LIKE "%MILITARY%" THEN "ejercito"
        WHEN V2Themes LIKE "%CORRUPTION%" THEN "corrupcion"
        WHEN V2Themes LIKE "%PROTEST%" THEN "protestas"
        WHEN V2Themes LIKE "%EXTREMISM%" THEN "extremismo"
        WHEN
V2Themes LIKE "%REFUGEES%" THEN "refugiados" 56 | WHEN V2Themes LIKE "%SURVEILLANCE%" THEN "vigilancia" 57 | WHEN v2themes LIKE "%WB_2024_ANTI_CORRUPTION_AUTHORITIES%" THEN "autoridades_anticorrupcion" 58 | WHEN v2themes LIKE "%EPU_CATS_NATIONAL_SECURITY%" THEN "seguridad_nacional" -- http://data.gdeltproject.org/dailytrendreport/GDELT_Trend_Report-2019-12-22.pdf 59 | WHEN v2themes LIKE "%democracy%" THEN "democracia" 60 | WHEN v2themes LIKE "%WB_2019_ANTI_CORRUPTION_LEGISLATION%" THEN "legislacion_anticorrupcion" 61 | WHEN v2themes LIKE "%WB_739_POLITICAL_VIOLENCE_AND_CIVIL_WAR%" THEN "violencia_politica_y_guerra_civil" 62 | -- SOCIAL 63 | WHEN v2themes LIKE "%WB_2443_RAPE_AND_SEXUAL_VIOLENCE%" THEN "agresion_sexual" 64 | WHEN v2themes LIKE "%IMMIGRATION%" THEN "inmigracion" 65 | WHEN V2Themes LIKE "%SCIENCE%" THEN "ciencia" 66 | WHEN v2themes LIKE "%EDUCATION %" THEN "educacion" 67 | WHEN v2themes LIKE "%ENV_CLIMATECHANGE" THEN "cambio_climatico" 68 | WHEN v2themes LIKE "%ECON_ENTREPRENEURSHIP%" THEN "emprendimiento" 69 | WHEN v2themes LIKE "%WB_2165_HEALTH_EMERGENCIES%" THEN "emergencia_sanitaria" 70 | WHEN v2themes LIKE "%ECON_HOUSING_PRICES%" THEN "precio_vivienda" 71 | WHEN v2themes LIKE "%WB_2167_PANDEMICS" THEN "pandemia" 72 | WHEN v2themes LIKE "%WB_525_RENEWABLE_ENERG%" THEN "energias_renovables" 73 | WHEN v2themes LIKE "%ECON_SUBSIDIES%" THEN "subsidios" 74 | WHEN v2themes LIKE "%DISCRIMINATION_RACE_RACISM%" THEN "racismo" 75 | WHEN v2themes LIKE "%HEALTH_VACCINATION%" THEN "vacunas" 76 | WHEN v2themes LIKE "%MEDIA_CENSORSHIP%" THEN "censura_en_medios" 77 | WHEN V2THEMES LIKE "%TAX_DISEASE%" THEN "enfermedades_muy_infecciosas" -- https://blog.gdeltproject.org/infectious-disease-mapping-and-early-warning-with-gdelt/ 78 | WHEN v2themes LIKE "%TAX_DISEASE_CORONAVIRUS_INFECTIONS%" THEN "numero_de_contagios_covid%" 79 | WHEN v2themes LIKE "%TAX_DISEASE_CORONAVIRUS%" THEN "fallecimiento_por_covid" 80 | END 81 | ) AS news_in_Spain, 82 | v2counts, 83 | v2locations 84 | FROM 85 | `gdelt-bq.gdeltv2.gkg_partitioned` 86 | WHERE 87 | v2counts LIKE '%#SP#%' 88 | AND counts LIKE '%#SP#%' 89 | AND V2Locations LIKE '%#SP#%' 90 | AND DATE(_PARTITIONTIME) >= DATE_TRUNC(CURRENT_DATE(), WEEK(MONDAY)) -- this is the part to overwriting weekly 91 | AND ( SourceCommonName IN ( 92 | SELECT 93 | spanish_newspapers 94 | FROM 95 | `myproject-mydataset.gdelt_info_filtering.spanish_newspapers_SourceCommonName_160620`))) 96 | WHERE 97 | news_in_Spain IS NOT NULL -------------------------------------------------------------------------------- /bigquery/automating_gdelt_query_updating_dashboard_tables.sql: -------------------------------------------------------------------------------- 1 | -- go to schedule query 2 | -- select schedule 3 | -- Run on mondays 4 | -- Append to existing related table 5 | -- The same with the other tables 6 | 7 | SELECT 8 | news_in_Spain AS social, 9 | Date, 10 | Sentiment 11 | FROM 12 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` 13 | WHERE 14 | news_in_Spain = "agresion_sexual" 15 | OR news_in_Spain = "inmigracion" 16 | OR news_in_Spain = "ciencia" 17 | OR news_in_Spain = "educacion" 18 | OR news_in_Spain = "cambio_climatico" 19 | OR news_in_Spain ="emprendimiento" 20 | OR news_in_Spain ="emergencia_sanitaria" 21 | OR news_in_Spain ="precio_vivienda" 22 | OR news_in_Spain = "energias_renovables" 23 | OR news_in_Spain ="pandemia" 24 | OR news_in_Spain = "subsidios" 25 | OR news_in_Spain ="racismo" 26 | OR news_in_Spain ="vacunas" 27 | OR news_in_Spain 
="censura_en_medios" 28 | OR news_in_Spain = "enfermedades_muy_infecciosas" -- https://blog.gdeltproject.org/infectious-disease-mapping-and-early-warning-with-gdelt/ 29 | OR news_in_Spain = "numero_de_contagios_covid%" 30 | OR news_in_Spain = "fallecimiento_por_covid"; -------------------------------------------------------------------------------- /bigquery/gdelt_1st_query_themes_and_sentiment.sql: -------------------------------------------------------------------------------- 1 | -- This 1st query request all info and save it to make a concrete graph (percentaje of news according to the selected topics) 2 | -- The 2nd query create tables dividing the Gdelt info, to make other Graphs by theme 3 | 4 | CREATE OR REPLACE TABLE 5 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` AS 6 | SELECT 7 | -- This select * avoid nulls generated by contrasting with News_in_Spain tables 8 | * 9 | FROM ( 10 | 11 | SELECT 12 | EXTRACT (date 13 | FROM 14 | PARSE_TIMESTAMP('%Y%m%d%H%M%S',CAST(date AS string))) AS Date, 15 | SourceCommonName, 16 | ROUND(CAST(SPLIT(V2Tone, ",") [ 17 | OFFSET 18 | (0)] AS FLOAT64),2) AS Sentiment, 19 | (CASE 20 | --WHEN V2Themes LIKE "%IDEOLOGY%" THEN "ideologia" 21 | -- WHEN V2Themes LIKE "%CRISISLEX_CRISISLEXREC%" THEN "use_of_social_media_for_crisis_and_disaster_response" --https://blog.gdeltproject.org/crisislex-taxonomies-now-available-in-gkg/ 22 | -- WHEN V2Themes LIKE "%EPU_POLICY_GOVERNMENT%" THEN "epu_policy_goverment" 23 | --WHEN V2Themes LIKE "%ARMEDCONFLICT%" THEN "conflicto_armado" 24 | --WHEN V2Themes LIKE "%TRANSPARENCY%" THEN "transparencia" 25 | --WHEN V2Themes LIKE "%SCANDAL%" THEN "escandalo" 26 | -- ECONOMICAL 27 | WHEN V2Themes LIKE "%ECON_STOCKMARKET%" THEN "stock_market" 28 | WHEN V2Themes LIKE "%EPU_POLICY%" THEN "incertidumbre_economica" -- https://blog.gdeltproject.org/economic-policy-uncertainty-is-driving-economic-uncertainty-in-the-era-of-trump/ 29 | WHEN V2Themes LIKE "%WB_1235_CENTRAL_BANKS%" THEN "banco_mundial" 30 | WHEN V2Themes LIKE "%UNEMPLOYMENT%" THEN "desempleo" 31 | WHEN V2Themes LIKE "%ECON_INFLATION%" THEN "inflacion_economica" 32 | WHEN v2themes LIKE "%ECON_BANKRUPTCY%" THEN "quiebra_economica" 33 | WHEN V2THEMES LIKE "%WB_1104_MACROECONOMIC_VULNERABILITY_AND_DEBT%" THEN "macroeconomia_deuda_y_vulnerabilidad" 34 | WHEN V2THEMES LIKE "%WB_2745_JOB_QUALITY_AND_LABOR_MARKET_PERFORMANCE%" THEN "job_quality_&_labor_market_performance" 35 | WHEN v2themes LIKE "%POVERTY%" THEN "pobreza" 36 | WHEN v2themes LIKE "%ECON_DEBT" THEN "deuda_publica" 37 | WHEN v2themes LIKE "%WB_471_ECONOMIC_GROWTH%" THEN "crecimiento_economico" 38 | WHEN v2themes LIKE "%WB_318_FINANCIAL_ARCHITECTURE_AND_BANKING%" THEN "finanzas_y_bancos" 39 | WHEN v2themes LIKE "%ECON_OILPRICE%" THEN "precio_petroleo" 40 | WHEN v2themes LIKE "%WB_625_HEALTH_ECONOMICS_AND_FINANCE%" THEN "prosperidad_economica_y_finanzas" 41 | WHEN v2themes LIKE "%ECON_FREETRADE%" THEN "libre_comercio" 42 | -- POLITICAL 43 | WHEN V2Themes LIKE "%POLITICAL_TURMOIL%" THEN "inestabilidad_politica" 44 | WHEN V2Themes LIKE "%REBELLION%" THEN "rebelion" 45 | WHEN V2Themes LIKE "%TRIAL%" THEN "juicio" 46 | WHEN V2Themes LIKE "%TERROR%" THEN "terrorismo" 47 | WHEN V2Themes LIKE "%MILITARY%" THEN "ejercito" 48 | WHEN V2Themes LIKE "%CORRUPTION%" THEN "corrupcion" 49 | WHEN V2Themes LIKE "%PROTEST%" THEN "protestas" 50 | WHEN V2Themes LIKE "%EXTREMISM%" THEN "extremismo" 51 | WHEN V2Themes LIKE "%REFUGEES%" THEN "refugiados" 52 | WHEN V2Themes LIKE "%SURVEILLANCE%" THEN "vigilancia" 53 | WHEN 
v2themes LIKE "%WB_2024_ANTI_CORRUPTION_AUTHORITIES%" THEN "autoridades_anticorrupcion" 54 | WHEN v2themes LIKE "%EPU_CATS_NATIONAL_SECURITY%" THEN "seguridad_nacional" -- http://data.gdeltproject.org/dailytrendreport/GDELT_Trend_Report-2019-12-22.pdf 55 | WHEN v2themes LIKE "%democracy%" THEN "democracia" 56 | WHEN v2themes LIKE "%WB_2019_ANTI_CORRUPTION_LEGISLATION%" THEN "legislacion_anticorrupcion" 57 | WHEN v2themes LIKE "%WB_739_POLITICAL_VIOLENCE_AND_CIVIL_WAR%" THEN "violencia_politica_y_guerra_civil" 58 | -- SOCIAL 59 | WHEN v2themes LIKE "%WB_2443_RAPE_AND_SEXUAL_VIOLENCE%" THEN "agresion_sexual" 60 | WHEN v2themes LIKE "%IMMIGRATION%" THEN "inmigracion" 61 | WHEN V2Themes LIKE "%SCIENCE%" THEN "ciencia" 62 | WHEN v2themes LIKE "%EDUCATION %" THEN "educacion" 63 | WHEN v2themes LIKE "%ENV_CLIMATECHANGE" THEN "cambio_climatico" 64 | WHEN v2themes LIKE "%ECON_ENTREPRENEURSHIP%" THEN "emprendimiento" 65 | WHEN v2themes LIKE "%WB_2165_HEALTH_EMERGENCIES%" THEN "emergencia_sanitaria" 66 | WHEN v2themes LIKE "%ECON_HOUSING_PRICES%" THEN "precio_vivienda" 67 | WHEN v2themes LIKE "%WB_2167_PANDEMICS" THEN "pandemia" 68 | WHEN v2themes LIKE "%WB_525_RENEWABLE_ENERG%" THEN "energias_renovables" 69 | WHEN v2themes LIKE "%ECON_SUBSIDIES%" THEN "subsidios" 70 | WHEN v2themes LIKE "%DISCRIMINATION_RACE_RACISM%" THEN "racismo" 71 | WHEN v2themes LIKE "%HEALTH_VACCINATION%" THEN "vacunas" 72 | WHEN v2themes LIKE "%MEDIA_CENSORSHIP%" THEN "censura_en_medios" 73 | WHEN V2THEMES LIKE "%TAX_DISEASE%" THEN "enfermedades_muy_infecciosas" -- https://blog.gdeltproject.org/infectious-disease-mapping-and-early-warning-with-gdelt/ 74 | WHEN v2themes LIKE "%TAX_DISEASE_CORONAVIRUS_INFECTIONS%" THEN "numero_de_contagios_covid%" 75 | WHEN v2themes LIKE "%TAX_DISEASE_CORONAVIRUS%" THEN "fallecimiento_por_covid" 76 | END 77 | ) AS news_in_Spain, 78 | v2counts, 79 | v2locations 80 | FROM 81 | `gdelt-bq.gdeltv2.gkg_partitioned` 82 | WHERE 83 | v2counts LIKE '%#SP#%' 84 | AND counts LIKE '%#SP#%' 85 | AND V2Locations LIKE '%#SP#%' 86 | AND DATE(_PARTITIONTIME) >= "2019-01-01" 87 | AND DATE(_PARTITIONTIME) <="2020-06-14" 88 | AND ( SourceCommonName IN ( 89 | SELECT 90 | spanish_newspapers 91 | FROM 92 | `myproject-mydataset.gdelt_info_filtering.spanish_newspapers_SourceCommonName_160620`))) 93 | 94 | WHERE 95 | news_in_Spain IS NOT NULL -------------------------------------------------------------------------------- /bigquery/gdelt_2nd_query_creating_tables.sql: -------------------------------------------------------------------------------- 1 | -- The 1st query request all info and save it to make a concrete graph in Dashboard 2 | -- This 2nd query create tables dividing the Gdelt info, to make other Graphs by theme 3 | 4 | 5 | CREATE OR REPLACE TABLE 6 | `myproject-mydataset.gdelt_info_filtering.dashboard_spanish_news_economical` AS 7 | SELECT 8 | news_in_Spain AS economical, 9 | Date, 10 | Sentiment 11 | FROM 12 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` 13 | WHERE 14 | news_in_Spain = "stock_market" 15 | OR news_in_Spain = "incertidumbre_economica" 16 | OR news_in_Spain = "banco_mundial" 17 | OR news_in_Spain = "desempleo" 18 | OR news_in_Spain ="inflacion_economica" 19 | OR news_in_Spain ="quiebra_economica" 20 | OR news_in_Spain ="macroeconomia_deuda_y_vulnerabilidad" 21 | OR news_in_Spain ="job_quality_&_labor_market_performance" 22 | OR news_in_Spain = "pobreza" 23 | OR news_in_Spain = "deuda_publica" 24 | OR news_in_Spain ="crecimiento_economico" 25 | OR news_in_Spain = 
"finanzas_y_bancos" 26 | OR news_in_Spain ="prosperidad_economica_y_finanzas" 27 | OR news_in_Spain ="libre_comercio" 28 | OR news_in_Spain ="precio_petroleo" ; 29 | -- ------------------------------------ 30 | CREATE OR REPLACE TABLE 31 | `myproject-mydataset.gdelt_info_filtering.dashboard_spanish_news_political` AS 32 | SELECT 33 | news_in_Spain AS political, 34 | Date, 35 | Sentiment 36 | FROM 37 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` 38 | WHERE 39 | news_in_Spain = "inestabilidad_politica" 40 | OR news_in_Spain = "rebelion" 41 | OR news_in_Spain = "juicio" 42 | OR news_in_Spain = "terrorismo" 43 | OR news_in_Spain = "ejercito" 44 | OR news_in_Spain ="corrupcion" 45 | OR news_in_Spain ="protestas" 46 | OR news_in_Spain ="extremismo" 47 | OR news_in_Spain = "refugiados" 48 | OR news_in_Spain ="vigilancia" 49 | OR news_in_Spain = "autoridades_anticorrupcion" 50 | OR news_in_Spain ="seguridad_nacional" 51 | OR news_in_Spain ="democracia" 52 | OR news_in_Spain ="precio_petroleo" 53 | OR news_in_Spain ="violencia_politica_y_guerra_civil"; 54 | -- ------------------------- 55 | CREATE OR REPLACE TABLE 56 | `myproject-mydataset.gdelt_info_filtering.dashboard_spanish_news_social` AS 57 | SELECT 58 | news_in_Spain AS social, 59 | Date, 60 | Sentiment 61 | FROM 62 | `myproject-mydataset.gdelt_info_filtering.filtered_spanish_news` 63 | WHERE 64 | news_in_Spain = "agresion_sexual" 65 | OR news_in_Spain = "inmigracion" 66 | OR news_in_Spain = "ciencia" 67 | OR news_in_Spain = "educacion" 68 | OR news_in_Spain = "cambio_climatico" 69 | OR news_in_Spain ="emprendimiento" 70 | OR news_in_Spain ="emergencia_sanitaria" 71 | OR news_in_Spain ="precio_vivienda" 72 | OR news_in_Spain = "energias_renovables" 73 | OR news_in_Spain ="pandemia" 74 | OR news_in_Spain = "subsidios" 75 | OR news_in_Spain ="racismo" 76 | OR news_in_Spain ="vacunas" 77 | OR news_in_Spain ="censura_en_medios" 78 | OR news_in_Spain = "enfermedades_muy_infecciosas" -- https://blog.gdeltproject.org/infectious-disease-mapping-and-early-warning-with-gdelt/ 79 | OR news_in_Spain = "numero_de_contagios_covid%" 80 | OR news_in_Spain = "fallecimiento_por_covid"; -------------------------------------------------------------------------------- /bigquery/gdelt_list_themes.sql: -------------------------------------------------------------------------------- 1 | WITH nested AS ( 2 | SELECT SPLIT(RTRIM(Themes,';'),';') themes FROM `gdelt-bq.gdeltv2.gkg_partitioned` WHERE _PARTITIONTIME >= "2019-09-04 00:00:00" AND _PARTITIONTIME < "2019-09-05 00:00:00" and length(Themes) > 1 3 | ) select theme, count(1) cnt from nested, UNNEST(themes) as theme group by theme order by cnt desc -------------------------------------------------------------------------------- /bigquery/spanish_newspapers.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | # this is for creating the dataset of spanish newspaperes. 
3 | # https://www.enterat.com/actualidad/periodicos-digitales-espana.php 4 | raw_list=[ 5 | "https://www.abc.es/", 6 | "https://www.elmundo.es/", 7 | "https://elpais.com/", 8 | "https://www.larazon.es/", 9 | "https://www.cuartopoder.es/", 10 | "https://diario16.com/", 11 | "https://www.diariocritico.com/", 12 | "https://www.elconfidencial.com/", 13 | "https://www.elconfidencialdigital.com/", 14 | "https://www.eldiario.es/", 15 | "https://www.elespanol.com/", 16 | "https://www.huffingtonpost.es/", 17 | "https://www.elimparcial.es/", 18 | "https://www.elindependiente.com/", 19 | "https://www.elplural.com/", 20 | "https://www.elsaltodiario.com/", 21 | "https://www.esdiario.com/", 22 | "https://www.estrelladigital.es/", 23 | "https://www.hispanidad.com/", 24 | "https://www.infolibre.es/", 25 | "https://www.gaceta.es/", 26 | "https://www.lainformacion.com/", 27 | "https://www.libertaddigital.com/", 28 | "https://www.mediterraneodigital.com/", 29 | "https://www.moncloa.com/", 30 | "https://www.mundiario.com/", 31 | "https://www.niusdiario.es/", 32 | "https://www.nuevatribuna.es/", 33 | "https://okdiario.com/", 34 | "https://www.periodistadigital.com/", 35 | "https://www.publico.es/", 36 | "https://www.que.es/", 37 | "https://www.republica.com/", 38 | "https://www.vozpopuli.com/", 39 | 40 | #--diarios regionales 41 | 42 | "https://www.diarioinformacion.com/", 43 | "https://www.latribunadealbacete.es/", 44 | "https://www.diariodealmeria.es/", 45 | "https://www.ideal.es/", 46 | "https://www.lavozdealmeria.com/", 47 | "https://www.berria.eus/", 48 | "https://www.noticiasdealava.eus/", 49 | "https://www.elcorreo.com/alava/", 50 | "https://www.naiz.eus/", 51 | "https://www.naiz.eus/", 52 | "https://www.elcomercio.es/", 53 | "https://www.lne.es/", 54 | "https://www.diariodeavila.es/", 55 | "https://www.elperiodicoextremadura.com/", 56 | "https://www.hoy.es/", 57 | "https://www.lacronicabadajoz.com/", 58 | "https://www.ara.cat/", 59 | "https://www.diaridesabadell.com/", 60 | "http://www.diarideterrassa.es/", 61 | "https://www.elperiodico.com/", 62 | "https://www.elperiodico.cat/", 63 | "http://www.elpuntavui.cat/", 64 | "https://www.lavanguardia.com/", 65 | "https://www.regio7.cat/", 66 | "https://www.20minutos.es/", 67 | "https://www.berria.eus/", 68 | "https://www.deia.eus/", 69 | "https://www.elcorreo.com/", 70 | "https://www.naiz.eus", 71 | "http://www.elnervion.com/", 72 | "https://www.naiz.eus/", 73 | "https://www.diariodeburgos.es/", 74 | "https://elcorreodeburgos.elmundo.es/", 75 | "https://www.elperiodicoextremadura.com/", 76 | "https://www.hoy.es/", 77 | "https://www.diariodecadiz.es/", 78 | "https://www.diariodejerez.es/", 79 | "https://www.europasur.es/", 80 | "https://www.lavozdigital.es/", 81 | "https://andaluciainformacion.es/", 82 | "https://www.eldiarioalerta.com/", 83 | "https://www.eldiariomontanes.es/", 84 | "https://www.elperiodicomediterraneo.com/", 85 | "https://elfarodeceuta.es/", 86 | "https://elpueblodeceuta.es/", 87 | "https://www.latribunadeciudadreal.es/", 88 | "https://www.diariocordoba.com/", 89 | "https://www.eldiadecordoba.es/", 90 | "https://www.elcorreogallego.es/", 91 | "https://www.elidealgallego.com/", 92 | "https://www.laopinioncoruna.es/", 93 | "https://www.lavozdegalicia.es/", 94 | "https://www.nosdiario.gal/", 95 | "https://www.latribunadecuenca.es/", 96 | "https://www.berria.eus/", 97 | "https://www.elcorreo.com/gipuzkoa/", 98 | "https://www.diariovasco.com/", 99 | "https://www.naiz.eus/", 100 | "https://www.noticiasdegipuzkoa.eus/", 101 | 
"https://www.naiz.eus/", 102 | "https://www.ara.cat/", 103 | "https://www.diaridegirona.cat/", 104 | "https://www.elperiodico.com/es/", 105 | "https://www.elperiodico.cat/", 106 | "http://www.elpuntavui.cat/", 107 | "https://www.lavanguardia.com/", 108 | "https://www.regio7.cat/", 109 | "https://www.granadahoy.com/", 110 | "https://www.ideal.es/almeria/", 111 | "http://huelva24.com/", 112 | "https://www.huelvainformacion.es/", 113 | "https://andaluciainformacion.es/", 114 | "https://andaluciainformacion.es/", 115 | "https://www.diariodelaltoaragon.es/", 116 | "https://www.elperiodicodearagon.com/", 117 | "https://www.heraldo.es/", 118 | "https://www.arabalears.cat/", 119 | "https://www.diariodeibiza.es/", 120 | "https://www.diariodemallorca.es/", 121 | "https://www.menorca.info/", 122 | "https://www.diariodeleon.es/", 123 | "https://www.ara.cat/", 124 | "http://www.elpuntavui.cat/", 125 | "https://www.lamanyana.cat/", 126 | "https://www.lavanguardia.com/", 127 | "https://www.segre.com/", 128 | "https://www.elperiodico.com/", 129 | "https://www.lanuevacronica.com/", 130 | "https://www.elperiodico.cat/", 131 | "https://www.diariojaen.es/", 132 | "https://www.ideal.es/", 133 | "https://www.ultimahora.es/", 134 | "https://www.periodicodeibiza.es/", 135 | "https://www.diariodeleon.es/", 136 | "https://www.lanuevacronica.com/", 137 | "https://www.20minutos.es/", 138 | "https://www.elprogreso.es/", 139 | "https://www.lavozdegalicia.es/", 140 | "https://www.diariosur.es/", 141 | "https://www.nosdiario.gal/", 142 | "https://www.arabalears.cat/", 143 | "https://www.diariodeibiza.es/", 144 | "https://www.diariodemallorca.es/", 145 | "https://www.menorca.info/", 146 | "https://www.periodicodeibiza.es/", 147 | "https://www.ultimahora.es/", 148 | "https://andaluciainformacion.es/", 149 | "https://www.diariojaen.es/", 150 | "https://www.diariodeleon.es/", 151 | "https://www.lanuevacronica.com/", 152 | "https://www.elprogreso.es/", 153 | "https://www.lavozdegalicia.es/", 154 | "https://www.nosdiario.gal/", 155 | "https://www.diariosur.es/", 156 | "https://www.laopiniondemalaga.es/", 157 | "https://www.malagahoy.es/", 158 | "https://www.melillahoy.es/", 159 | "https://elfarodemelilla.es/", 160 | "https://www.laopiniondemurcia.es/", 161 | "https://www.laverdad.es/murcia/cartagena/", 162 | "https://www.laverdad.es/", 163 | "https://www.noticiasdenavarra.com/", 164 | "https://www.diariodenavarra.es/", 165 | "https://www.farodevigo.es/", 166 | "https://www.laregion.es/", 167 | "https://www.lavozdegalicia.es/", 168 | "https://www.nosdiario.gal/", 169 | "https://www.elnortedecastilla.es/", 170 | "https://www.diariopalentino.es/", 171 | "https://www.canarias7.es/", 172 | "https://www.laprovincia.es/", 173 | "https://www.atlantico.net/", 174 | "https://www.diariodepontevedra.es/", 175 | "https://www.farodevigo.es/", 176 | "https://www.lavozdegalicia.es/", 177 | "https://www.nosdiario.gal/", 178 | "https://www.larioja.com/", 179 | "https://www.eldia.es/", 180 | "https://diariodeavisos.elespanol.com/", 181 | "https://www.eladelantado.com/", 182 | "https://www.lagacetadesalamanca.es/", 183 | "https://sevilla.abc.es/", 184 | "https://heraldodiariodesoria.elmundo.es/", 185 | "https://www.diariodesevilla.es/", 186 | "https://www.diaridetarragona.com/", 187 | "https://www.ara.cat/", 188 | "http://www.elpuntavui.cat/", 189 | "https://www.lavanguardia.com/", 190 | "https://www.diarimes.com/", 191 | "https://www.elperiodicodearagon.com/", 192 | "https://www.diariodeteruel.es/", 193 | "https://www.heraldo.es/", 194 | 
"https://www.latribunadetoledo.es/", 195 | "https://www.lasprovincias.es/", 196 | "https://www.levante-emv.com/", 197 | "https://valenciaextra.com/", 198 | "https://diariodevalladolid.elmundo.es/", 199 | "https://www.eldiadevalladolid.com/", 200 | "https://www.laopiniondezamora.es/", 201 | "https://www.elperiodicodearagon.com/", 202 | "https://www.heraldo.es/", 203 | 204 | # just online ones 205 | 206 | "https://alicanteplaza.es/", 207 | "https://www.elperiodic.com/", 208 | "https://www.clm24.es/", 209 | "https://eldiadigital.es/", 210 | "http://www.eldigitalcastillalamancha.es/", 211 | "https://www.elperiodicodeaqui.com/", 212 | "https://www.eldigitaldealbacete.com/", 213 | "https://www.masquealba.com/", 214 | "https://lamarinaplaza.com/", 215 | "https://lacerca.com/", 216 | "https://www.encastillalamancha.es/", 217 | "https://almeria360.com/", 218 | "https://www.gasteizhoy.com/", 219 | "https://nortexpres.com/", 220 | "https://www.lavozdeasturias.es/", 221 | "https://www.noticiascyl.com/", 222 | "https://avilared.com/", 223 | "http://www.regiondigital.com/", 224 | "https://www.cronicabalear.es/", 225 | "https://www.mallorcadiario.com/", 226 | "https://www.menorca.info/", 227 | "https://menorcaaldia.com/", 228 | "https://www.noudiari.es/", 229 | "https://www.capgros.com/", 230 | "https://catalunyadiari.com/", 231 | "https://www.eixdiari.cat/", 232 | "https://www.elcatalan.es/", 233 | "https://elmon.cat/", 234 | "https://www.elnacional.cat/", 235 | "https://www.e-noticies.cat/", 236 | "https://www.isabadell.cat/", 237 | "https://www.larepublica.cat/", 238 | "https://www.metropoliabierta.com/", 239 | "https://www.naciodigital.cat/", 240 | "https://www.racocatala.cat/", 241 | "https://www.vilaweb.cat/", 242 | "https://www.burgosconecta.es/", 243 | "https://www.burgosnoticias.com/", 244 | "https://www.diariodelaribera.net/", 245 | "https://www.noticiascyl.com/", 246 | "https://www.regiondigital.com/", 247 | "https://www.algecirasalminuto.com/", 248 | "https://www.cadizdirecto.com/", 249 | "https://www.lavozdelsur.es/", 250 | "https://rotaaldia.com/", 251 | "https://elmira.es/", 252 | "https://www.eldiariocantabria.es/", 253 | "https://castellonplaza.com/", 254 | "https://www.elperiodic.com/", 255 | "https://www.elperiodicodeaqui.com/", 256 | "https://www.levante-emv.com/castello/", 257 | "https://www.clm24.es/", 258 | "https://eldiadigital.es/", 259 | "http://www.eldigitalcastillalamancha.es/", 260 | "https://www.encastillalamancha.es/", 261 | "https://www.miciudadreal.es/", 262 | "http://manchainformacion.com/", 263 | "https://www.lanzadigital.com/", 264 | "https://lacerca.com/", 265 | "https://cordopolis.es/", 266 | "https://www.lucenahoy.com/", 267 | "https://www.galiciaconfidencial.com/", 268 | "https://www.clm24.es/", 269 | "https://eldiadigital.es/", 270 | "http://www.eldigitalcastillalamancha.es/", 271 | "https://www.encastillalamancha.es/", 272 | "https://lacerca.com/", 273 | "https://www.lasnoticiasdecuenca.es/", 274 | "https://www.vocesdecuenca.com/", 275 | "https://www.donostitik.com/", 276 | "http://www.durangon.com/", 277 | "https://goiena.eus/", 278 | "https://catalunyadiari.com/", 279 | "https://www.elcatalan.es/", 280 | "https://www.elnacional.cat/", 281 | "https://www.e-noticies.cat/", 282 | "https://www.larepublica.cat/", 283 | "https://www.naciodigital.cat/", 284 | "https://www.racocatala.cat/", 285 | "https://www.vilaweb.cat/", 286 | "https://www.granadadigital.es/", 287 | "https://www.clm24.es/", 288 | "https://eldiadigital.es/", 289 | 
"http://www.eldigitalcastillalamancha.es/", 290 | "https://www.encastillalamancha.es/", 291 | "https://nuevaalcarria.com/", 292 | "https://www.lacronica.net/", 293 | "https://www.aragondigital.es/", 294 | "https://www.diariodevalderrueda.es/", 295 | "https://digitaldeleon.com/", 296 | "https://www.elbierzodigital.com/", 297 | "https://www.ileon.com/", 298 | "https://www.infobierzo.com/", 299 | "https://www.noticiascyl.com/", 300 | "https://www.leonoticias.com/", 301 | "https://catalunyadiari.com/", 302 | "https://www.elcatalan.es/", 303 | "https://www.elnacional.cat/", 304 | "https://www.e-noticies.cat/", 305 | "https://www.larepublica.cat/", 306 | "https://lleidadiari.cat/", 307 | "https://www.vilaweb.cat/", 308 | "https://www.racocatala.cat/", 309 | "https://www.naciodigital.cat/", 310 | "https://www.galiciaconfidencial.com/", 311 | "https://www.alcalahoy.es/", 312 | "https://elcorreodeespana.com/", 313 | "https://www.madridiario.es/", 314 | "https://www.marbella24horas.es/", 315 | "https://cartagenadiario.es/", 316 | "https://navarra.elespanol.com/", 317 | "https://pamplonaactual.com/", 318 | "https://www.galiciaconfidencial.com/", 319 | "https://www.diariodevalderrueda.es/", 320 | "https://www.noticiascyl.com/", 321 | "https://www.eldiario.es/", 322 | "https://www.infonortedigital.com/", 323 | "https://www.lancelotdigital.com/", 324 | "https://www.lavozdelanzarote.com/", 325 | "https://www.galiciaconfidencial.com/", 326 | "https://pontevedraviva.com/", 327 | "https://www.rioja2.com/", 328 | "https://nuevecuatrouno.com/", 329 | "https://www.noticiascyl.com/", 330 | "https://www.salamanca24horas.com/", 331 | "https://salamancartvaldia.es/", 332 | "https://www.noticiascyl.com/", 333 | "https://elcorreoweb.es/", 334 | "https://elmira.es/", 335 | "http://www.sevillaactualidad.com/", 336 | "https://www.desdesoria.es/", 337 | "https://elmirondesoria.es/", 338 | "https://www.noticiascyl.com/", 339 | "https://sorianoticias.com/", 340 | "https://catalunyadiari.com/", 341 | "https://www.eixdiari.cat/", 342 | "https://www.elcatalan.es/", 343 | "https://www.elnacional.cat/", 344 | "https://www.e-noticies.cat/", 345 | "https://www.larepublica.cat/", 346 | "https://www.vilaweb.cat/", 347 | "http://reusdigital.cat/", 348 | "https://www.racocatala.cat/", 349 | "https://www.naciodigital.cat/", 350 | "https://www.aragondigital.es/", 351 | "https://ecodeteruel.tv/", 352 | "https://www.lacomarca.net/", 353 | "https://www.clm24.es/", 354 | "https://eldiadigital.es/", 355 | "http://www.eldigitalcastillalamancha.es/", 356 | "https://www.encastillalamancha.es/", 357 | "https://www.lavozdeltajo.com/", 358 | "https://lacerca.com/", 359 | "https://www.elperiodic.com/", 360 | "https://www.elperiodicodeaqui.com/", 361 | "https://valenciaplaza.com/", 362 | "https://www.tribunavalladolid.com/", 363 | "https://www.noticiascyl.com/", 364 | "https://interbenavente.es/", 365 | "https://www.noticiascyl.com/", 366 | "https://www.zamora24horas.com/", 367 | "https://zamoranews.com/", 368 | "https://www.aragondigital.es/", 369 | "https://cincodias.elpais.com/", 370 | "https://www.eleconomista.es/", 371 | "https://www.expansion.com/", 372 | "https://www.bolsamania.com/", 373 | "https://www.economiadigital.es/", 374 | "https://www.elboletin.com/", 375 | "https://www.finanzas.com/", 376 | "https://intereconomia.com/", 377 | "https://www.elespanol.com/", 378 | "https://www.libremercado.com/", 379 | "https://www.leconomic.cat/" 380 | 381 | ] 382 | 383 | 384 | 385 | spanish_media=set(raw_list) 386 | #print("initial list: ", 
len(raw_list),"\n", "removing duplicates: ", len(spanish_media)) 387 | 388 | newspapers=pd.DataFrame(sorted(spanish_media), columns=["spanish_media"]) # pandas rejects a raw set, so sort it into a list first (also keeps the csv deterministic) 389 | newspapers.to_csv('./output/spanish_media.csv') 390 |
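An aside, not a file from the repo: GDELT's SourceCommonName field identifies an outlet by its bare domain rather than its full URL, so the list above still needs flattening before it can be matched against GKG records — presumably what the gitignored spanish_newspapers_SourceCommonName.py does. A minimal sketch of that step; bare_domain and domains are my names, not the repo's:

import pandas as pd
from urllib.parse import urlparse

# reduce each URL to its bare domain: "https://www.elperiodico.com/es/" -> "elperiodico.com"
def bare_domain(url):
    netloc = urlparse(url).netloc
    return netloc[4:] if netloc.startswith("www.") else netloc

domains = sorted({bare_domain(u) for u in raw_list})  # raw_list comes from the script above
print(len(domains), domains[:3])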
"yoga" 127 | OR keyword = "meditacion" 128 | OR keyword ="cursos online"); 129 | SELECT 130 | * 131 | FROM 132 | your_table_name.habit_searches; -------------------------------------------------------------------------------- /cloud_function/main.py: -------------------------------------------------------------------------------- 1 | import os 2 | from os import listdir 3 | from os.path import isfile, join 4 | 5 | def main (data,context): 6 | 7 | processes= ["pytrends_request.py","upload_gcs_real.py","remove_files.py"] 8 | 9 | for p in processes: 10 | print(p) 11 | exec(open(p).read()) 12 | 13 | if __name__ == "__main__": 14 | 15 | main('data','context') 16 | -------------------------------------------------------------------------------- /cloud_function/pytrends_request.py: -------------------------------------------------------------------------------- 1 | # import os 2 | # import gcsfs 3 | import pandas as pd 4 | import random 5 | import pytrends 6 | from pytrends.request import TrendReq 7 | import time as timer 8 | import datetime 9 | from datetime import datetime, date, time, timedelta 10 | 11 | 12 | 13 | 14 | #-------------------------------------- 15 | # date 16 | #yesterday=(datetime.now()-timedelta(days=1)).date() 17 | dates="2019-01-06"+" "+str(datetime.now().date()) 18 | print(dates) 19 | 20 | keywords=[ 21 | "zoom","skype","hangouts", 22 | 23 | "refugiados","inmigracion","nacionalismo","corrupcion","estado de alarma", 24 | "comparecencia","independentismo","crisis política","barometro","crisis economica", 25 | "protesta","manifestacion", 26 | 27 | "bildu","ciudadanos","compromis","erc","mas pais", "pnv","podemos","pp","psoe","vox", 28 | 29 | "teletrabajo","remoto","cursos online","productividad","autonomo", 30 | "negocio online","emprendimiento", "formacion", 31 | 32 | "erte","paro","sepe","desempleo","deshaucio","comedor social", 33 | "banco alimentos", "cruz roja", "caritas", 34 | 35 | "ayuda alquiler","compartir piso","divorcio","embarazo","hipoteca", "idealista", "badi", "piso barato", 36 | 37 | "coronavirus","pandemia","infeccion","medico","residencia ancianos", "desescalada", 38 | 39 | "clases online","examenes","menu escolar","bullying", 40 | 41 | "netflix","hbo","steam","glovo","just eat","deliveroo","uber eats","hacer deporte","en casa", 42 | "yoga","meditacion","videollamada","videoconferencia","tinder","meetic", "disney","amazon", "cabify", 43 | "uber","taxi", "en familia" 44 | 45 | ] 46 | #-------------------------------------- 47 | # the function 48 | #-------------------------------------- 49 | pytrends = TrendReq(hl='ES', tz=0) 50 | future_dataframe={} 51 | c=1 52 | for k in keywords: 53 | 54 | try: 55 | #print("Requesting ",[k]) 56 | pytrends.build_payload([k], timeframe=dates, geo='ES', gprop='') 57 | future_dataframe[c]=pytrends.interest_over_time() 58 | future_dataframe[c].drop(['isPartial'], axis=1,inplace=True) 59 | c+=1 60 | result = pd.concat(future_dataframe, axis=1) 61 | except: 62 | print("***","\n","Error with ",k,"or not enough trending percentaje","\n","***") 63 | 64 | result.columns = result.columns.droplevel(0) 65 | df1=result.unstack(level=-1) 66 | df2=pd.DataFrame(df1) 67 | df2.reset_index(inplace=True) 68 | df2.columns = ["keyword","date","trend_index"] 69 | df2.to_csv("../tmp/data_pytrends.csv") -------------------------------------------------------------------------------- /cloud_function/remove_files.py: -------------------------------------------------------------------------------- 1 | import os 2 | from os import listdir 3 | from os.path 
-------------------------------------------------------------------------------- /cloud_function/remove_files.py: -------------------------------------------------------------------------------- 1 | import os 2 | from os import listdir 3 | from os.path import isfile, join 4 | # list the csv files left in ../tmp 5 | only_csv = [f for f in listdir("../tmp") if isfile(join("../tmp", f))] 6 | 7 | for e in only_csv: os.remove(join("../tmp", e)) 8 | print(only_csv, "removed, the ../tmp folder is clean") -------------------------------------------------------------------------------- /cloud_function/requirements.txt: -------------------------------------------------------------------------------- 1 | gcsfs==0.6.1 2 | pandas==1.0.3 3 | pytrends==4.7.2 4 | importlib-metadata==1.6.0 5 | 6 | 7 | -------------------------------------------------------------------------------- /cloud_function/upload_gcs_example.py: -------------------------------------------------------------------------------- 1 | import os 2 | import gcsfs 3 | import pandas as pd 4 | 5 | # CREDENTIALS 6 | # - project = Project name in Google Cloud, NOT the project ID. 7 | # - fs.open("your bucket/name of your future csv.csv") 8 | # - fs.upload("where the csv is temporarily stored in the cloud function", "your bucket/name of your future csv.csv") 9 | 10 | # .ENV structure: 11 | # TOKEN_NAME='your credentials file with object access to gcs.json' 12 | # PROJECT_NAME='name of your project, NOT the project ID' 13 | # PROJECT_PATH='your bucket in gcs/your future.csv' 14 | # PROJECT_TMP='../tmp/name of the csv requested in pytrends_request.py.csv' 15 | 16 | fs = gcsfs.GCSFileSystem(token=os.getenv('TOKEN_NAME'), project=os.getenv('PROJECT_NAME')) 17 | 18 | # sanity check: open the current csv in the bucket, then overwrite it with the fresh one 19 | with fs.open(os.getenv('PROJECT_PATH')) as mergecomp: 20 | df = pd.read_csv(mergecomp) 21 | fs.upload(os.getenv('PROJECT_TMP'),os.getenv('PROJECT_PATH'))
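To try the uploader locally, the .ENV variables described above must reach os.getenv first. python-dotenv is one common way to load them — an extra dependency that is not in requirements.txt, so this is just a sketch:

import os
from dotenv import load_dotenv

load_dotenv()  # reads TOKEN_NAME, PROJECT_NAME, PROJECT_PATH, PROJECT_TMP from .env
assert os.getenv("TOKEN_NAME"), "TOKEN_NAME missing, check your .env file"
# with the environment in place, run the gcsfs calls from upload_gcs_example.py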
-------------------------------------------------------------------------------- /pics/automated_dashboard_preview.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/automated_dashboard_preview.gif -------------------------------------------------------------------------------- /pics/budget_alert.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/budget_alert.png -------------------------------------------------------------------------------- /pics/cloud_function_error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/cloud_function_error.png -------------------------------------------------------------------------------- /pics/gdelt_sketch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/gdelt_sketch.png -------------------------------------------------------------------------------- /pics/interactive_dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/interactive_dashboard.png -------------------------------------------------------------------------------- /pics/list_of_scheduled_queries.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/list_of_scheduled_queries.png -------------------------------------------------------------------------------- /pics/monguerteam.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/monguerteam.png -------------------------------------------------------------------------------- /pics/mrdogscience.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/mrdogscience.jpeg -------------------------------------------------------------------------------- /pics/trends_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/trends_example.png -------------------------------------------------------------------------------- /pics/twitter_fees.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/twitter_fees.png -------------------------------------------------------------------------------- /pics/updating_gdelt_tables.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/albertovpd/automated_etl_google_cloud-social_dashboard/e3899a2d19b27c1f8af67e1d4c52305a479247ac/pics/updating_gdelt_tables.png -------------------------------------------------------------------------------- /twitter_folder/twitter_merging.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import datetime 3 | 4 | # Working with the Twitter API to monitor keywords weekly is 5 | # really expensive and out of scope for this project 6 | # (we are mere mortals with a limited budget). Nevertheless, 7 | # the code for monitoring Twitter with Python will be 8 | # available soon, thanks to Patri 9 | 10 | 11 | # This is just getting the data into the right shape for 12 | # Data Studio. 13 | 14 | twitter_deshaucio=pd.read_csv("./input/defdesa.csv") 15 | deshaucio=twitter_deshaucio[["date","query"]].copy() 16 | 17 | twitter_comedor=pd.read_csv("./input/dfcom.csv") 18 | comedor=twitter_comedor[["date","query"]].copy() 19 | 20 | twitter_desempleo=pd.read_csv("./input/dfdes.csv") 21 | desempleo=twitter_desempleo[["date","query"]].copy() 22 | 23 | twitter_erte=pd.read_csv("./input/dferte.csv") 24 | erte=twitter_erte[["date","query"]].copy() 25 | 26 | twitter_sepe=pd.read_csv("./input/dfsepe.csv") 27 | sepe=twitter_sepe[["date","query"]].copy() 28 | 29 | twitter_final=pd.concat([deshaucio, comedor, desempleo, erte, sepe]) # stack the five keyword extracts into one frame 30 | twitter_final.to_csv("./input/twitter_final.csv") --------------------------------------------------------------------------------
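Once twitter_final.csv exists, a quick sanity check that it has a plausible shape for Data Studio (one row per tweet, a parseable date, the query term) could look like the snippet below. Just a sketch: tweets_per_day is my name, and the daily count is one plausible granularity for the dashboard, not necessarily the one used:

import pandas as pd

# count rows (tweets) per day and per query term
df = pd.read_csv("./input/twitter_final.csv", parse_dates=["date"])
tweets_per_day = (
    df.groupby([df["date"].dt.date, "query"])
      .size()
      .reset_index(name="tweet_count")
)
print(tweets_per_day.head())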