├── Article 14 One.JPG
├── Article 14 Two.JPG
├── README.md
└── Untitled14.ipynb

/Article 14 One.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LSEG-API-Samples/Article.EikonAPI.Python.NewsSentimentAnalysis/48c65e15292c3ff1cf2610a61587565fc985fd3b/Article 14 One.JPG

--------------------------------------------------------------------------------
/Article 14 Two.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LSEG-API-Samples/Article.EikonAPI.Python.NewsSentimentAnalysis/48c65e15292c3ff1cf2610a61587565fc985fd3b/Article 14 Two.JPG

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Introduction to News Sentiment Analysis with Eikon Data APIs - a Python example

This article will demonstrate how we can conduct a simple sentiment analysis of news delivered via our new [Eikon Data APIs](https://developers.thomsonreuters.com/eikon-data-apis). Natural Language Processing (NLP) is a big area of interest for those looking to gain insight and new sources of value from the vast quantities of unstructured data out there. The field is complex, and there are many online resources, packages and approaches that can help you familiarise yourself with it. Whilst these are beyond the scope of this article, I will walk through a simple implementation that gives you a swift introduction and a practical codebase for further exploration and learning.

**Pre-requisites:**

**Thomson Reuters Eikon** with access to the new [Eikon Data APIs](https://developers.thomsonreuters.com/eikon-data-apis)

**Python 2.x/3.x**

**Required Python Packages:** eikon, pandas, numpy, beautifulsoup4, textblob, datetime

**Required corpora download:** `python -m textblob.download_corpora` (this is required by the sentiment engine to generate sentiment)

### Introduction

NLP is a field which enables computers to understand human language (voice or text). It is a large area of research, and a little enquiry on your part will quickly reveal the complexities of the problem set. Here we will be focussing on one application of it, *Sentiment Analysis*. In our case we will be taking news articles (unstructured text) for a particular company, **IBM**, and we will attempt to grade each article as positive, negative or neutral. We will then try to see whether this news has had an impact on the share price of **IBM**.

To do this really well is a non-trivial task, and most universities and financial companies have departments and teams looking at it. We ourselves provide machine-readable news products with News Analytics (such as sentiment) over our **Elektron** platform in real time at very low latency - these products are consumed mainly by *algorithmic applications* rather than *humans*.

We will try to do a similar thing as simply as possible to illustrate the key elements - our task is significantly eased by not having to work in a low-latency environment. We will delegate most of the complexities of actually analysing the text to various packages.
You can then easily replace modules such as the sentiment engine to improve your results as your understanding increases.

So let's get started. First let's load the packages that we will need and set our app key.


```python
import eikon as ek
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from textblob import TextBlob
import datetime
from datetime import time
import warnings
warnings.filterwarnings("ignore")
ek.set_app_key('YOUR APP KEY HERE')
```

There are two API calls for news:

**get_news_headlines** : returns a list of news headlines satisfying a query

**get_news_story** : returns an HTML representation of the full news article

We will need to use both - thankfully they are really straightforward to use. We start with the **get_news_headlines** API call to request a list of headlines. The first parameter for this call is a query. You don't really need to know this query language, as you can generate it using the **News Monitor App** (type **NEWS** into the Eikon search bar) in **Eikon**.

You can see here I have just typed in 2 search terms: **IBM**, for the company, and **English**, for the language I am interested in (in our example we will only be able to analyse English-language text, though there are corpora, packages and methods you can employ to target other languages; these are beyond the scope of this article). You can of course use any search terms you wish.

![News App 1](Article%2014%20One.JPG)

After you have typed in what you want to search for, simply click in the search box and this will generate the query text, which we can then copy and paste into the API call below. It's easy for us to change logical operations such as **AND** to **OR**, or **NOT**, to suit our query.

![News App 2](Article%2014%20Two.JPG)

So the line of code below gets us 100 news headlines for **IBM** in English prior to 4th Dec 2017, and stores them in a dataframe, df, for us.


```python
df = ek.get_news_headlines('R:IBM.N AND Language:LEN', date_to="2017-12-04", count=100)
df.head()
```

| | versionCreated | text | storyId | sourceCode |
|---|---|---|---|---|
| 2017-12-01 23:11:47.374 | 2017-12-01 23:11:47.374 | Reuters Insider - FM Final Trade: HAL, TWTR & ... | urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 | NS:CNBC |
| 2017-12-01 19:19:20.279 | 2017-12-01 19:19:20.279 | IBM ST: the upside prevails as long as 150.2 i... | urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 | NS:GURU |
| 2017-12-01 18:12:41.143 | 2017-12-01 18:12:41.143 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 | NS:EDG |
| 2017-12-01 18:12:41.019 | 2017-12-01 18:12:41.019 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 | NS:EDG |
| 2017-12-01 18:06:03.633 | 2017-12-01 18:06:03.633 | Moody's Affirms Seven Classes of GSMS 2016-GS4 | urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 | NS:RTRS |
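A single **get_news_headlines** request gives us at most the 100 headlines we asked for here. If you later want a bigger sample (see the Observations section below), one simple approach is to page backwards through time, reusing the oldest timestamp of each batch as the next `date_to`. This is only a sketch, under the assumption that `date_to` also accepts full ISO-style timestamps rather than just dates; `get_more_headlines` and `batches` are illustrative names:

```python
import pandas as pd

def get_more_headlines(query, batches=5, date_to='2017-12-04'):
    # page backwards through the archive, up to 100 headlines per call
    frames = []
    for _ in range(batches):
        batch = ek.get_news_headlines(query, date_to=date_to, count=100)
        if batch.empty:
            break
        frames.append(batch)
        # next request ends at the oldest headline we have seen so far
        date_to = batch['versionCreated'].min().strftime('%Y-%m-%dT%H:%M:%S')
    return pd.concat(frames).drop_duplicates(subset='storyId')

# e.g. bigger_df = get_more_headlines('R:IBM.N AND Language:LEN')
```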
I will just add 3 new columns, which we will use later to store some variables.


```python
df['Polarity'] = np.nan
df['Subjectivity'] = np.nan
df['Score'] = np.nan
```

So we have our frame with the most recent 100 news headline items. The headline is stored in the **text** column, and the storyId, which we will now use to pull down the actual articles themselves, is stored in the **storyId** column.

We will now iterate through the headline dataframe and pull down the news articles using the second of our news API calls, get_news_story. We simply pass the **storyId** to this API call and are returned an HTML representation of the article, which allows you to render it nicely. For our purposes, however, we want to strip out the HTML tags and be left with just the plain text, as we don't want to analyse HTML tags for sentiment. We will do this using the excellent **BeautifulSoup** package.

Once we have the text of these articles we can pass them to our sentiment engine, which will give us a sentiment score for each article. So what is our sentiment engine? We will be using the simple **TextBlob** package to demo a rudimentary process and show you how things work. **TextBlob** is a higher-level abstraction package that sits on top of **NLTK** (Natural Language Toolkit), a widely used package for this type of task.

**NLTK** is quite a complex package which gives you a lot of control over the whole analytical process - but the cost of that is complexity and the knowledge required of the steps involved. **TextBlob** shields us from this complexity, but we should at some stage understand what is going on under the hood. Thankfully there is plenty of information to guide us in this. We will be implementing the default **PatternAnalyzer**, which is based on the popular **Pattern** library, though there is also a **NaiveBayesAnalyzer**, an **NLTK** classifier trained on a movie-review corpus.

All of this can be achieved in just a few lines of code. This is quite a dense codeblock, so I have commented the key steps.


```python
for idx, storyId in enumerate(df['storyId'].values):    # for each row in our df dataframe
    newsText = ek.get_news_story(storyId)               # get the news story
    if newsText:
        soup = BeautifulSoup(newsText, "lxml")          # create a BeautifulSoup object from our HTML news article
        sentA = TextBlob(soup.get_text())               # pass the text-only article to TextBlob to analyse
        df['Polarity'].iloc[idx] = sentA.sentiment.polarity            # write sentiment polarity back to df
        df['Subjectivity'].iloc[idx] = sentA.sentiment.subjectivity    # write sentiment subjectivity back to df
        if sentA.sentiment.polarity >= 0.05:            # attribute a bucket to the sentiment polarity
            score = 'positive'
        elif -0.05 < sentA.sentiment.polarity < 0.05:
            score = 'neutral'
        else:
            score = 'negative'
        df['Score'].iloc[idx] = score                   # write score back to df
df.head()
```

| | versionCreated | text | storyId | sourceCode | Polarity | Subjectivity | Score |
|---|---|---|---|---|---|---|---|
| 2017-12-01 23:11:47.374 | 2017-12-01 23:11:47.374 | Reuters Insider - FM Final Trade: HAL, TWTR & ... | urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 | NS:CNBC | 0.066667 | 0.566667 | positive |
| 2017-12-01 19:19:20.279 | 2017-12-01 19:19:20.279 | IBM ST: the upside prevails as long as 150.2 i... | urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 | NS:GURU | 0.055260 | 0.320844 | positive |
| 2017-12-01 18:12:41.143 | 2017-12-01 18:12:41.143 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 | NS:EDG | 0.000000 | 0.000000 | neutral |
| 2017-12-01 18:12:41.019 | 2017-12-01 18:12:41.019 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 | NS:EDG | 0.000000 | 0.000000 | neutral |
| 2017-12-01 18:06:03.633 | 2017-12-01 18:06:03.633 | Moody's Affirms Seven Classes of GSMS 2016-GS4 | urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 | NS:RTRS | 0.175000 | 0.325000 | positive |
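As mentioned above, the sentiment engine is easy to swap out. For instance, **TextBlob** also ships the **NaiveBayesAnalyzer** (trained on a movie-review corpus). A minimal sketch of the swap - note that its `.sentiment` returns a classification with `p_pos`/`p_neg` probabilities rather than a polarity score, so the bucketing logic above would need adapting; the sample sentence is purely illustrative:

```python
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

nb = NaiveBayesAnalyzer()   # build once and reuse; it trains on first use
blob = TextBlob("IBM shares rallied after strong results.", analyzer=nb)
print(blob.sentiment)       # Sentiment(classification='pos'|'neg', p_pos=..., p_neg=...)
```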
Looking at our dataframe we can now see 3 new columns on the right: *Polarity*, *Subjectivity* and *Score*. As we have seen, *Polarity* is the actual sentiment polarity returned from **TextBlob** (ranging from -1 (negative) to +1 (positive)), *Subjectivity* is a measure of how subjective the text is (ranging from 0, very objective, to 1, very subjective), and *Score* is simply a positive, negative or neutral rating based on the strength of the polarity.

We would now like to see what, if any, impact this news has had on the share price of **IBM**. There are many ways of doing this - but to keep things simple, I would like to see what the average return is at various points in time **AFTER** the news has broken. I want to check whether there are *aggregate differences* in the *average returns* of the Positive, Neutral and Negative buckets we created earlier.


```python
start = df['versionCreated'].min().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
end = df['versionCreated'].max().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
Minute = ek.get_timeseries(["IBM.N"], start_date=start, interval="minute")
Minute.tail()
```

| Date (IBM.N) | HIGH | LOW | OPEN | CLOSE | COUNT | VOLUME |
|---|---|---|---|---|---|---|
| 2018-01-05 15:21:00 | 162.32 | 162.18 | 162.22 | 162.31 | 23.0 | 3073.0 |
| 2018-01-05 15:22:00 | 162.42 | 162.29 | 162.31 | 162.42 | 23.0 | 2442.0 |
| 2018-01-05 15:23:00 | 162.46 | 162.43 | 162.45 | 162.46 | 11.0 | 960.0 |
| 2018-01-05 15:24:00 | 162.46 | 162.40 | 162.46 | 162.40 | 5.0 | 505.0 |
| 2018-01-05 15:25:00 | 162.39 | 162.31 | 162.36 | 162.33 | 12.0 | 1060.0 |
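The return measure we are about to compute is just the simple percentage change from the OPEN of the minute bar in which the news arrived to the CLOSE of the bar *n* minutes later. As a worked example using the bars above, taking 15:21 as the "news" minute and a 2-minute horizon:

```python
t0 = 162.22                    # OPEN of the 2018-01-05 15:21:00 bar
t2 = 162.46                    # CLOSE of the 15:23:00 bar, two minutes later
print((t2 / t0 - 1) * 100)     # ~0.148, i.e. roughly +0.15% over two minutes
```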
We will need to create some new columns for the next part of this analysis.


```python
df['twoM'] = np.nan
df['fiveM'] = np.nan
df['tenM'] = np.nan
df['thirtyM'] = np.nan
df.head(2)
```

| | versionCreated | text | storyId | sourceCode | Polarity | Subjectivity | Score | twoM | fiveM | tenM | thirtyM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-12-01 23:11:47.374 | 2017-12-01 23:11:47.374 | Reuters Insider - FM Final Trade: HAL, TWTR & ... | urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 | NS:CNBC | 0.066667 | 0.566667 | positive | NaN | NaN | NaN | NaN |
| 2017-12-01 19:19:20.279 | 2017-12-01 19:19:20.279 | IBM ST: the upside prevails as long as 150.2 i... | urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 | NS:GURU | 0.055260 | 0.320844 | positive | NaN | NaN | NaN | NaN |
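A quick illustration of the timestamp handling described next: we truncate each news time to the minute, and that truncated stamp either exists in the minute-bar index (the market was open) or it does not (out-of-hours news, which ends up being skipped). A small sketch using the CNBC headline time above, whose returns are NaN for exactly this reason:

```python
sample = pd.Timestamp('2017-12-01 23:11:47.374')        # the CNBC headline above
minute_stamp = sample.replace(second=0, microsecond=0)  # -> 2017-12-01 23:11:00
print(minute_stamp in Minute.index)                     # False - no bar outside market hours
```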
OK, so now I just need to get the timestamp of each news item, truncate it to minute granularity (i.e. remove the second and microsecond components), get the base share price of **IBM** at that time and at several intervals after that time - in our case *t+2 mins, t+5 mins, t+10 mins, t+30 mins* - and calculate the % change for each interval.

An important point to bear in mind here is that news can be generated at any time, 24 hours a day, including outside normal market hours. So for news generated outside normal market hours for **IBM**, we would have to wait until the next market opening to conduct our calculations. Of course there are a number of issues here concerning our ability to attribute price movement to our news item in isolation (basically, we cannot). That said, there might be other ways of approaching this - for example looking at **GDRs/ADRs** or surrogates - but these are beyond the scope of this introductory article. In our example, these news items are simply discarded.

We will now loop through each news item in the dataframe, calculate (where possible) the derived performance numbers, and store them in the columns we created earlier: twoM...thirtyM.


```python
for idx, newsDate in enumerate(df['versionCreated'].values):
    sTime = df['versionCreated'][idx]
    sTime = sTime.replace(second=0, microsecond=0)
    try:
        t0 = Minute.iloc[Minute.index.get_loc(sTime), 2]    # OPEN of the news minute (column 2)
        df['twoM'][idx] = ((Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=2)), 3] / t0 - 1) * 100)
        df['fiveM'][idx] = ((Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=5)), 3] / t0 - 1) * 100)
        df['tenM'][idx] = ((Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=10)), 3] / t0 - 1) * 100)
        df['thirtyM'][idx] = ((Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=30)), 3] / t0 - 1) * 100)
    except KeyError:    # minute bar not found, e.g. news outside market hours - discard
        pass
df.head()
```

| | versionCreated | text | storyId | sourceCode | Polarity | Subjectivity | Score | twoM | fiveM | tenM | thirtyM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-12-01 23:11:47.374 | 2017-12-01 23:11:47.374 | Reuters Insider - FM Final Trade: HAL, TWTR & ... | urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 | NS:CNBC | 0.066667 | 0.566667 | positive | NaN | NaN | NaN | NaN |
| 2017-12-01 19:19:20.279 | 2017-12-01 19:19:20.279 | IBM ST: the upside prevails as long as 150.2 i... | urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 | NS:GURU | 0.055260 | 0.320844 | positive | 0.071119 | 0.084050 | 0.000000 | -0.109911 |
| 2017-12-01 18:12:41.143 | 2017-12-01 18:12:41.143 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 | NS:EDG | 0.000000 | 0.000000 | neutral | 0.012944 | -0.090609 | -0.032360 | 0.148858 |
| 2017-12-01 18:12:41.019 | 2017-12-01 18:12:41.019 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 | NS:EDG | 0.000000 | 0.000000 | neutral | 0.012944 | -0.090609 | -0.032360 | 0.148858 |
| 2017-12-01 18:06:03.633 | 2017-12-01 18:06:03.633 | Moody's Affirms Seven Classes of GSMS 2016-GS4 | urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 | NS:RTRS | 0.175000 | 0.325000 | positive | 0.097238 | 0.155581 | 0.097238 | 0.246337 |
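One caveat on the codeblock above: writes of the form `df['twoM'][idx] = ...` are chained assignment, which pandas warns about (one reason warnings were silenced at the top) and which newer versions of pandas may not write back at all. A more robust equivalent, sketched here for the two-minute column using the same loop variables:

```python
twoM_pos = df.columns.get_loc('twoM')    # positional index of the target column
# inside the loop, instead of df['twoM'][idx] = ...:
df.iat[idx, twoM_pos] = (Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=2)), 3] / t0 - 1) * 100
```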
Fantastic - we have now completed the analytical part of our study. Finally, we just need to aggregate our results by *Score* bucket in order to draw some conclusions.


```python
grouped = df.groupby(['Score']).mean()
grouped
```

| Score | Polarity | Subjectivity | twoM | fiveM | tenM | thirtyM |
|---|---|---|---|---|---|---|
| negative | -0.146508 | 0.316746 | NaN | NaN | NaN | NaN |
| neutral | 0.006436 | 0.175766 | -0.004829 | -0.009502 | 0.028979 | 0.137544 |
| positive | 0.129260 | 0.406868 | 0.012089 | 0.012776 | 0.035936 | 0.047345 |
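Before reading too much into these averages, it is worth checking how many stories actually sit in each bucket, and how many of those survived the market-hours filter - the negative row shows NaN returns above precisely because none of its stories had usable price data:

```python
print(df['Score'].value_counts())             # headlines per sentiment bucket
print(df.groupby('Score')['twoM'].count())    # per bucket, how many had usable price data
```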
### Observations

From our initial results, it would appear that there might be some small directional differences in returns between the positive and neutral groups over shorter time frames (twoM and fiveM) after news broke. This is a reasonable basis for further investigation. So where could we go from here?

We have a relatively small *n* here, so we might want to increase the size of the study.

We might also want to try to separate out more positive or negative news - i.e. change the thresholds of the buckets - to try to identify articles with more prominent sentiment; maybe those could have more of an impact on performance.

In terms of capturing news impact, we threw out a lot of news articles because they happened outside of market hours, where it is more complex to ascertain impact. We might try to find a way of including some of these in our analysis - I mentioned looking at overseas listings, **GDRs/ADRs** or surrogates above. Alternatively, we could use **EXACTLY** the same process to look at all news for an index future - say the **S&P 500 e-mini** - as this trades on Globex pretty much round the clock, so we would be throwing out far fewer news articles. Great, I hear you cry - but would each news article be able to influence a whole index? Are index futures more sensitive to some types of articles than others? Is there a temporal element to this? These are all excellent questions. Or what about cryptocurrencies, which trade 24/7? And so on.

We could also investigate what is going on with our sentiment engine. We might be able to generate more meaningful results by tinkering with the underlying processes and parameters. Using a different, more domain-specific corpus might help us to generate more relevant scores.

You will see there is plenty of scope to get much more involved here.

This article was intended as an introduction to this most interesting of areas. I hope to have demystified it for you somewhat and shown how it is possible to get started with this type of complex analysis - using only a few lines of code, a simple, easy-to-use yet powerful API and some really fantastic packages - to generate some meaningful results.

--------------------------------------------------------------------------------
/Untitled14.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# Introduction to News Sentiment Analysis with Eikon Data APIs - a Python example\n",
8 |     "\n",
9 |     "This article will demonstrate how we can conduct a simple sentiment analysis of news delivered via our new [Eikon Data APIs](https://developers.thomsonreuters.com/eikon-data-apis). Natural Language Processing (NLP) is a big area of interest for those looking to gain insight and new sources of value from the vast quantities of unstructured data out there. The area is quite complex and there are many resources online that can help you familiarise yourself with this very interesting area. There are also many different packages that can help you as well as many different approaches to this problem.
Whilst these are beyond the scope of this article - I will go through a simple implementation which will give you a swift enough introduction and practical codebase for further exploration and learning.\n",
10 |     "\n",
11 |     "**Pre-requisites:** \n",
12 |     "\n",
13 |     "**Thomson Reuters Eikon** with access to new [Eikon Data APIs](https://developers.thomsonreuters.com/eikon-data-apis)\n",
14 |     "\n",
15 |     "**Python 2.x/3.x**\n",
16 |     "\n",
17 |     "**Required Python Packages:** eikon, pandas, numpy, beautifulsoup4, textblob, datetime \n",
18 |     "\n",
19 |     "**Required corpora download:** >>>python -m textblob.download_corpora (this is required by the sentiment engine to generate sentiment)\n",
20 |     "\n",
21 |     "### Introduction\n",
22 |     "\n",
23 |     "NLP is a field which enables computers to understand human language (voice or text). This is quite a big area of research and a little enquiry on your part will furnish you with the complexities of this problem set. Here we will be focussing on one application of this called *Sentiment Analysis*. In our case we will be taking news articles (unstructured text) for a particular company, **IBM**, and we will attempt to grade this news to see how positive, negative or neutral it is. We will then try to see if this news has had an impact on the share price of **IBM**. \n",
24 |     "\n",
25 |     "To do this really well is a non-trivial task, and most universities and financial companies will have departments and teams looking at this. We ourselves provide machine readable news products with News Analytics (such as sentiment) over our **Elektron** platform in realtime at very low latency - these products are essentially consumed by *algorithmic applications* as opposed to *humans*. \n",
26 |     "\n",
27 |     "We will try to do a similar thing as simply as possible to illustrate the key elements - our task is significantly eased by not having to do this in a low latency environment. We will be abstracting most of the complexities to do with the mechanics of actually analysing the text to various packages. You can then easily replace the modules such as the sentiment engine etc to improve your results as your understanding increases. \n",
28 |     "\n",
29 |     "So let's get started. First let's load the packages that we will need to use and set our app key. "
30 |    ]
31 |   },
32 |   {
33 |    "cell_type": "code",
34 |    "execution_count": 1,
35 |    "metadata": {
36 |     "collapsed": true
37 |    },
38 |    "outputs": [],
39 |    "source": [
40 |     "import eikon as ek\n",
41 |     "import pandas as pd\n",
42 |     "import numpy as np\n",
43 |     "from bs4 import BeautifulSoup\n",
44 |     "from textblob import TextBlob\n",
45 |     "import datetime\n",
46 |     "from datetime import time\n",
47 |     "import warnings\n",
48 |     "warnings.filterwarnings(\"ignore\")\n",
49 |     "ek.set_app_key('YOUR APP KEY HERE')"
50 |    ]
51 |   },
52 |   {
53 |    "cell_type": "markdown",
54 |    "metadata": {},
55 |    "source": [
56 |     "There are two API calls for news:\n",
57 |     "\n",
58 |     "**get_news_headlines** : returns a list of news headlines satisfying a query\n",
59 |     "\n",
60 |     "**get_news_story** : returns an HTML representation of the full news article\n",
61 |     "\n",
62 |     "We will need to use both - thankfully they are really straightforward to use. We will need to use the **get_news_headlines** API call to request a list of headlines. The first parameter for this call is a query. You don't really need to know this query language as you can generate it using the **News Monitor App** (type **NEWS** into Eikon search bar) in **Eikon**.
\n", 63 | "\n", 64 | "You can see here I have just typed in 2 search terms, **IBM**, for the company, and, **English**, for the language I am interested in (in our example we will only be able to analyse English language text - though there are corpora, packages, methods you can employ to target other languages - though these are beyond the scope of this article). You can of course use any search terms you wish.\n", 65 | "\n", 66 | "![News App 1](Article 14 One.jpg)\n", 67 | "\n", 68 | "After you have typed in what you want to search for - we can simply click in the search box and this will then generate the query text which we can then copy and paste into the API call below. Its easy for us to change logical operations such as **AND** to **OR**, **NOT** to suit our query. \n", 69 | "\n", 70 | "![News App 2](Article 14 Two.png)\n", 71 | "\n", 72 | "So the line of code below gets us 100 news headlines for **IBM** in english prior to 4th Dec 2017, and stores them in a dataframe, df for us." 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 2, 78 | "metadata": { 79 | "collapsed": false 80 | }, 81 | "outputs": [ 82 | { 83 | "data": { 84 | "text/html": [ 85 | "
\n", 86 | "\n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | "
versionCreatedtextstoryIdsourceCode
2017-12-01 23:11:47.3742017-12-01 23:11:47.374Reuters Insider - FM Final Trade: HAL, TWTR & ...urn:newsml:reuters.com:20171201:nRTV8KBb1N:1NS:CNBC
2017-12-01 19:19:20.2792017-12-01 19:19:20.279IBM ST: the upside prevails as long as 150.2 i...urn:newsml:reuters.com:20171201:nGUR2R6xQ:1NS:GURU
2017-12-01 18:12:41.1432017-12-01 18:12:41.143INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL6JJfRT:1NS:EDG
2017-12-01 18:12:41.0192017-12-01 18:12:41.019INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL3YHcVY:1NS:EDG
2017-12-01 18:06:03.6332017-12-01 18:06:03.633Moody's Affirms Seven Classes of GSMS 2016-GS4urn:newsml:reuters.com:20171201:nMDY7wGNTP:1NS:RTRS
\n", 134 | "
" 135 | ], 136 | "text/plain": [ 137 | " versionCreated \\\n", 138 | "2017-12-01 23:11:47.374 2017-12-01 23:11:47.374 \n", 139 | "2017-12-01 19:19:20.279 2017-12-01 19:19:20.279 \n", 140 | "2017-12-01 18:12:41.143 2017-12-01 18:12:41.143 \n", 141 | "2017-12-01 18:12:41.019 2017-12-01 18:12:41.019 \n", 142 | "2017-12-01 18:06:03.633 2017-12-01 18:06:03.633 \n", 143 | "\n", 144 | " text \\\n", 145 | "2017-12-01 23:11:47.374 Reuters Insider - FM Final Trade: HAL, TWTR & ... \n", 146 | "2017-12-01 19:19:20.279 IBM ST: the upside prevails as long as 150.2 i... \n", 147 | "2017-12-01 18:12:41.143 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 148 | "2017-12-01 18:12:41.019 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 149 | "2017-12-01 18:06:03.633 Moody's Affirms Seven Classes of GSMS 2016-GS4 \n", 150 | "\n", 151 | " storyId \\\n", 152 | "2017-12-01 23:11:47.374 urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 \n", 153 | "2017-12-01 19:19:20.279 urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 \n", 154 | "2017-12-01 18:12:41.143 urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 \n", 155 | "2017-12-01 18:12:41.019 urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 \n", 156 | "2017-12-01 18:06:03.633 urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 \n", 157 | "\n", 158 | " sourceCode \n", 159 | "2017-12-01 23:11:47.374 NS:CNBC \n", 160 | "2017-12-01 19:19:20.279 NS:GURU \n", 161 | "2017-12-01 18:12:41.143 NS:EDG \n", 162 | "2017-12-01 18:12:41.019 NS:EDG \n", 163 | "2017-12-01 18:06:03.633 NS:RTRS " 164 | ] 165 | }, 166 | "execution_count": 2, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "df = ek.get_news_headlines('R:IBM.N AND Language:LEN', date_to = \"2017-12-04\", count=100)\n", 173 | "df.head()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "I will just add 3 new columns which we will need to store some variables in later." 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 3, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "df['Polarity'] = np.nan\n", 192 | "df['Subjectivity'] = np.nan\n", 193 | "df['Score'] = np.nan" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "So we have our frame with the most recent 100 news headline items. The headline is stored in the **text** column and the storyID which we will now use to pull down the actual articles themselves, is stored in the **storyID** column. \n", 201 | "\n", 202 | "We will now iterate through the headline dataframe and pull down the news articles using the second of our news API calls, get_news_story. We simply pass the **storyID** to this API call and we are returned a HTML representation of the article - which allows you to render them nicely etc - however for our purposes we want to strip the HTML tags etc out and just be left with the plain text - as we dont want to analyse HTML tags for sentiment. We will do this using the excellent **BeautifulSoup** package.\n", 203 | "\n", 204 | "Once we have the text of these articles we can pass them to our sentiment engine which will give us a sentiment score for each article. So what is our sentiment engine? We will be using the simple **TextBlob** package to demo a rudimentary process to show you how things work. 
**TextBlob** is a higher level abstraction package that sits on top of **NLTK** (Natural Language Toolkit) which is a widely used package for this type of task. \n",
205 |     "\n",
206 |     "**NLTK** is quite a complex package which gives you a lot of control over the whole analytical process - but the cost of that is complexity and required knowledge of the steps involved. **TextBlob** shields us from this complexity, but we should at some stage understand what is going on under the hood. Thankfully there is plenty of information to guide us in this. We will be implementing the default **PatternAnalyzer**, which is based on the popular **Pattern** library, though there is also a **NaiveBayesAnalyzer** which is an **NLTK** classifier based on a movie review corpus. \n",
207 |     "\n",
208 |     "All of this can be achieved in just a few lines of code. This is quite a dense codeblock - so I have commented the key steps. "
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "code",
213 |    "execution_count": 4,
214 |    "metadata": {
215 |     "collapsed": false
216 |    },
217 |    "outputs": [
218 |     {
219 |      "data": {
220 |       "text/html": [
221 |       "
\n", 222 | "\n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | "
versionCreatedtextstoryIdsourceCodePolaritySubjectivityScore
2017-12-01 23:11:47.3742017-12-01 23:11:47.374Reuters Insider - FM Final Trade: HAL, TWTR & ...urn:newsml:reuters.com:20171201:nRTV8KBb1N:1NS:CNBC0.0666670.566667positive
2017-12-01 19:19:20.2792017-12-01 19:19:20.279IBM ST: the upside prevails as long as 150.2 i...urn:newsml:reuters.com:20171201:nGUR2R6xQ:1NS:GURU0.0552600.320844positive
2017-12-01 18:12:41.1432017-12-01 18:12:41.143INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL6JJfRT:1NS:EDG0.0000000.000000neutral
2017-12-01 18:12:41.0192017-12-01 18:12:41.019INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL3YHcVY:1NS:EDG0.0000000.000000neutral
2017-12-01 18:06:03.6332017-12-01 18:06:03.633Moody's Affirms Seven Classes of GSMS 2016-GS4urn:newsml:reuters.com:20171201:nMDY7wGNTP:1NS:RTRS0.1750000.325000positive
\n", 288 | "
" 289 | ], 290 | "text/plain": [ 291 | " versionCreated \\\n", 292 | "2017-12-01 23:11:47.374 2017-12-01 23:11:47.374 \n", 293 | "2017-12-01 19:19:20.279 2017-12-01 19:19:20.279 \n", 294 | "2017-12-01 18:12:41.143 2017-12-01 18:12:41.143 \n", 295 | "2017-12-01 18:12:41.019 2017-12-01 18:12:41.019 \n", 296 | "2017-12-01 18:06:03.633 2017-12-01 18:06:03.633 \n", 297 | "\n", 298 | " text \\\n", 299 | "2017-12-01 23:11:47.374 Reuters Insider - FM Final Trade: HAL, TWTR & ... \n", 300 | "2017-12-01 19:19:20.279 IBM ST: the upside prevails as long as 150.2 i... \n", 301 | "2017-12-01 18:12:41.143 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 302 | "2017-12-01 18:12:41.019 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 303 | "2017-12-01 18:06:03.633 Moody's Affirms Seven Classes of GSMS 2016-GS4 \n", 304 | "\n", 305 | " storyId \\\n", 306 | "2017-12-01 23:11:47.374 urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 \n", 307 | "2017-12-01 19:19:20.279 urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 \n", 308 | "2017-12-01 18:12:41.143 urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 \n", 309 | "2017-12-01 18:12:41.019 urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 \n", 310 | "2017-12-01 18:06:03.633 urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 \n", 311 | "\n", 312 | " sourceCode Polarity Subjectivity Score \n", 313 | "2017-12-01 23:11:47.374 NS:CNBC 0.066667 0.566667 positive \n", 314 | "2017-12-01 19:19:20.279 NS:GURU 0.055260 0.320844 positive \n", 315 | "2017-12-01 18:12:41.143 NS:EDG 0.000000 0.000000 neutral \n", 316 | "2017-12-01 18:12:41.019 NS:EDG 0.000000 0.000000 neutral \n", 317 | "2017-12-01 18:06:03.633 NS:RTRS 0.175000 0.325000 positive " 318 | ] 319 | }, 320 | "execution_count": 4, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "for idx, storyId in enumerate(df['storyId'].values): #for each row in our df dataframe\n", 327 | " newsText = ek.get_news_story(storyId) #get the news story\n", 328 | " if newsText:\n", 329 | " soup = BeautifulSoup(newsText,\"lxml\") #create a BeautifulSoup object from our HTML news article\n", 330 | " sentA = TextBlob(soup.get_text()) #pass the text only article to TextBlob to anaylse\n", 331 | " df['Polarity'].iloc[idx] = sentA.sentiment.polarity #write sentiment polarity back to df\n", 332 | " df['Subjectivity'].iloc[idx] = sentA.sentiment.subjectivity #write sentiment subjectivity score back to df\n", 333 | " if sentA.sentiment.polarity >= 0.05: # attribute bucket to sentiment polartiy\n", 334 | " score = 'positive'\n", 335 | " elif -.05 < sentA.sentiment.polarity < 0.05:\n", 336 | " score = 'neutral'\n", 337 | " else:\n", 338 | " score = 'negative'\n", 339 | " df['Score'].iloc[idx] = score #write score back to df\n", 340 | "df.head()" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "Looking at our dataframe we can now see 3 new columns on the right, *Polarity*, *Subjectivity* and *Score*. As we have seen *Polarity* is the actual sentiment polarity returned from **TextBlob** (ranging from -1(negative) to +1(positive), *Subjectivity* is a measure (ranging from 0 to 1) where 0 is very objective and 1 is very subjective, and *Score* is simply a Positive, Negative or Neutral rating based on the strength of the polarities. \n", 348 | "\n", 349 | "We would now like to see what, if any, impact this news has had on the shareprice of **IBM**. 
There are many ways of doing this - but to make things simple, I would like to see what the average return is at various points in time **AFTER** the news has broken. I want to check if there are *aggregate differences* in the *average returns* from the Positive, Neutral and Negative buckets we created earlier." 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 5, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/html": [ 362 | "
\n", 363 | "\n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | "
IBM.NHIGHLOWOPENCLOSECOUNTVOLUME
Date
2018-01-05 15:21:00162.32162.18162.22162.3123.03073.0
2018-01-05 15:22:00162.42162.29162.31162.4223.02442.0
2018-01-05 15:23:00162.46162.43162.45162.4611.0960.0
2018-01-05 15:24:00162.46162.40162.46162.405.0505.0
2018-01-05 15:25:00162.39162.31162.36162.3312.01060.0
\n", 432 | "
" 433 | ], 434 | "text/plain": [ 435 | "IBM.N HIGH LOW OPEN CLOSE COUNT VOLUME\n", 436 | "Date \n", 437 | "2018-01-05 15:21:00 162.32 162.18 162.22 162.31 23.0 3073.0\n", 438 | "2018-01-05 15:22:00 162.42 162.29 162.31 162.42 23.0 2442.0\n", 439 | "2018-01-05 15:23:00 162.46 162.43 162.45 162.46 11.0 960.0\n", 440 | "2018-01-05 15:24:00 162.46 162.40 162.46 162.40 5.0 505.0\n", 441 | "2018-01-05 15:25:00 162.39 162.31 162.36 162.33 12.0 1060.0" 442 | ] 443 | }, 444 | "execution_count": 5, 445 | "metadata": {}, 446 | "output_type": "execute_result" 447 | } 448 | ], 449 | "source": [ 450 | "start = df['versionCreated'].min().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')\n", 451 | "end = df['versionCreated'].max().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')\n", 452 | "Minute = ek.get_timeseries([\"IBM.N\"], start_date=start, interval=\"minute\")\n", 453 | "Minute.tail()" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "We will need to create some new columns for the next part of this analysis." 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 6, 466 | "metadata": { 467 | "collapsed": false 468 | }, 469 | "outputs": [ 470 | { 471 | "data": { 472 | "text/html": [ 473 | "
\n", 474 | "\n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | "
versionCreatedtextstoryIdsourceCodePolaritySubjectivityScoretwoMfiveMtenMthirtyM
2017-12-01 23:11:47.3742017-12-01 23:11:47.374Reuters Insider - FM Final Trade: HAL, TWTR & ...urn:newsml:reuters.com:20171201:nRTV8KBb1N:1NS:CNBC0.0666670.566667positiveNaNNaNNaNNaN
2017-12-01 19:19:20.2792017-12-01 19:19:20.279IBM ST: the upside prevails as long as 150.2 i...urn:newsml:reuters.com:20171201:nGUR2R6xQ:1NS:GURU0.0552600.320844positiveNaNNaNNaNNaN
\n", 522 | "
" 523 | ], 524 | "text/plain": [ 525 | " versionCreated \\\n", 526 | "2017-12-01 23:11:47.374 2017-12-01 23:11:47.374 \n", 527 | "2017-12-01 19:19:20.279 2017-12-01 19:19:20.279 \n", 528 | "\n", 529 | " text \\\n", 530 | "2017-12-01 23:11:47.374 Reuters Insider - FM Final Trade: HAL, TWTR & ... \n", 531 | "2017-12-01 19:19:20.279 IBM ST: the upside prevails as long as 150.2 i... \n", 532 | "\n", 533 | " storyId \\\n", 534 | "2017-12-01 23:11:47.374 urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 \n", 535 | "2017-12-01 19:19:20.279 urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 \n", 536 | "\n", 537 | " sourceCode Polarity Subjectivity Score twoM \\\n", 538 | "2017-12-01 23:11:47.374 NS:CNBC 0.066667 0.566667 positive NaN \n", 539 | "2017-12-01 19:19:20.279 NS:GURU 0.055260 0.320844 positive NaN \n", 540 | "\n", 541 | " fiveM tenM thirtyM \n", 542 | "2017-12-01 23:11:47.374 NaN NaN NaN \n", 543 | "2017-12-01 19:19:20.279 NaN NaN NaN " 544 | ] 545 | }, 546 | "execution_count": 6, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "df['twoM'] = np.nan\n", 553 | "df['fiveM'] = np.nan\n", 554 | "df['tenM'] = np.nan\n", 555 | "df['thirtyM'] = np.nan\n", 556 | "df.head(2)" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "OK so I now just need to get the timestamp of each news item, truncate it to minute data (ie remove second and microsecond components) and get the base shareprice of **IBM** at that time, and at several itervals after that time, in our case *t+2 mins,t+5 mins, t+10 mins, t+30 mins*, calculating the % change for each interval. \n", 564 | "\n", 565 | "An important point to bear in mind here is that news can be generated at anytime - 24 hours a day - outside of normal market hours. So for news generated outside normal market hours for **IBM** in our case, we would have to wait until the next market opening to conduct our calculations. Of course there are a number of issues here concerning our ability to attribute price movement to our news item in isolation (basically we cannot). That said, there might be other ways of doing this - for example looking at **GDRs/ADRs** or surrogates etc - these are beyond the scope of this introductory article. In our example, these news items are simply discarded. \n", 566 | "\n", 567 | "We will now loop through each news item in the dataframe, calculate (where possible) and store the derived performance numbers in the columns we created earlier: twoM...thirtyM." 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": 7, 573 | "metadata": { 574 | "collapsed": false 575 | }, 576 | "outputs": [ 577 | { 578 | "data": { 579 | "text/html": [ 580 | "
\n", 581 | "\n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | "
versionCreatedtextstoryIdsourceCodePolaritySubjectivityScoretwoMfiveMtenMthirtyM
2017-12-01 23:11:47.3742017-12-01 23:11:47.374Reuters Insider - FM Final Trade: HAL, TWTR & ...urn:newsml:reuters.com:20171201:nRTV8KBb1N:1NS:CNBC0.0666670.566667positiveNaNNaNNaNNaN
2017-12-01 19:19:20.2792017-12-01 19:19:20.279IBM ST: the upside prevails as long as 150.2 i...urn:newsml:reuters.com:20171201:nGUR2R6xQ:1NS:GURU0.0552600.320844positive0.0711190.0840500.000000-0.109911
2017-12-01 18:12:41.1432017-12-01 18:12:41.143INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL6JJfRT:1NS:EDG0.0000000.000000neutral0.012944-0.090609-0.0323600.148858
2017-12-01 18:12:41.0192017-12-01 18:12:41.019INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL3YHcVY:1NS:EDG0.0000000.000000neutral0.012944-0.090609-0.0323600.148858
2017-12-01 18:06:03.6332017-12-01 18:06:03.633Moody's Affirms Seven Classes of GSMS 2016-GS4urn:newsml:reuters.com:20171201:nMDY7wGNTP:1NS:RTRS0.1750000.325000positive0.0972380.1555810.0972380.246337
\n", 671 | "
" 672 | ], 673 | "text/plain": [ 674 | " versionCreated \\\n", 675 | "2017-12-01 23:11:47.374 2017-12-01 23:11:47.374 \n", 676 | "2017-12-01 19:19:20.279 2017-12-01 19:19:20.279 \n", 677 | "2017-12-01 18:12:41.143 2017-12-01 18:12:41.143 \n", 678 | "2017-12-01 18:12:41.019 2017-12-01 18:12:41.019 \n", 679 | "2017-12-01 18:06:03.633 2017-12-01 18:06:03.633 \n", 680 | "\n", 681 | " text \\\n", 682 | "2017-12-01 23:11:47.374 Reuters Insider - FM Final Trade: HAL, TWTR & ... \n", 683 | "2017-12-01 19:19:20.279 IBM ST: the upside prevails as long as 150.2 i... \n", 684 | "2017-12-01 18:12:41.143 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 685 | "2017-12-01 18:12:41.019 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 686 | "2017-12-01 18:06:03.633 Moody's Affirms Seven Classes of GSMS 2016-GS4 \n", 687 | "\n", 688 | " storyId \\\n", 689 | "2017-12-01 23:11:47.374 urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 \n", 690 | "2017-12-01 19:19:20.279 urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 \n", 691 | "2017-12-01 18:12:41.143 urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 \n", 692 | "2017-12-01 18:12:41.019 urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 \n", 693 | "2017-12-01 18:06:03.633 urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 \n", 694 | "\n", 695 | " sourceCode Polarity Subjectivity Score \\\n", 696 | "2017-12-01 23:11:47.374 NS:CNBC 0.066667 0.566667 positive \n", 697 | "2017-12-01 19:19:20.279 NS:GURU 0.055260 0.320844 positive \n", 698 | "2017-12-01 18:12:41.143 NS:EDG 0.000000 0.000000 neutral \n", 699 | "2017-12-01 18:12:41.019 NS:EDG 0.000000 0.000000 neutral \n", 700 | "2017-12-01 18:06:03.633 NS:RTRS 0.175000 0.325000 positive \n", 701 | "\n", 702 | " twoM fiveM tenM thirtyM \n", 703 | "2017-12-01 23:11:47.374 NaN NaN NaN NaN \n", 704 | "2017-12-01 19:19:20.279 0.071119 0.084050 0.000000 -0.109911 \n", 705 | "2017-12-01 18:12:41.143 0.012944 -0.090609 -0.032360 0.148858 \n", 706 | "2017-12-01 18:12:41.019 0.012944 -0.090609 -0.032360 0.148858 \n", 707 | "2017-12-01 18:06:03.633 0.097238 0.155581 0.097238 0.246337 " 708 | ] 709 | }, 710 | "execution_count": 7, 711 | "metadata": {}, 712 | "output_type": "execute_result" 713 | } 714 | ], 715 | "source": [ 716 | "for idx, newsDate in enumerate(df['versionCreated'].values):\n", 717 | " sTime = df['versionCreated'][idx]\n", 718 | " sTime = sTime.replace(second=0,microsecond=0)\n", 719 | " try:\n", 720 | " t0 = Minute.iloc[Minute.index.get_loc(sTime),2]\n", 721 | " df['twoM'][idx] = ((Minute.iloc[Minute.index.get_loc((sTime + datetime.timedelta(minutes=2))),3]/(t0)-1)*100)\n", 722 | " df['fiveM'][idx] = ((Minute.iloc[Minute.index.get_loc((sTime + datetime.timedelta(minutes=5))),3]/(t0)-1)*100)\n", 723 | " df['tenM'][idx] = ((Minute.iloc[Minute.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100) \n", 724 | " df['thirtyM'][idx] = ((Minute.iloc[Minute.index.get_loc((sTime + datetime.timedelta(minutes=30))),3]/(t0)-1)*100)\n", 725 | " except:\n", 726 | " pass\n", 727 | "df.head()" 728 | ] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "metadata": {}, 733 | "source": [ 734 | "Fantastic - we have now completed the analytical part of our study. Finally, we just need to aggregate our results by *Score* bucket in order to draw some conclusions. " 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 8, 740 | "metadata": { 741 | "collapsed": false 742 | }, 743 | "outputs": [ 744 | { 745 | "data": { 746 | "text/html": [ 747 | "
\n", 748 | "\n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | "
PolaritySubjectivitytwoMfiveMtenMthirtyM
Score
negative-0.1465080.316746NaNNaNNaNNaN
neutral0.0064360.175766-0.004829-0.0095020.0289790.137544
positive0.1292600.4068680.0120890.0127760.0359360.047345
\n", 799 | "
" 800 | ], 801 | "text/plain": [ 802 | " Polarity Subjectivity twoM fiveM tenM thirtyM\n", 803 | "Score \n", 804 | "negative -0.146508 0.316746 NaN NaN NaN NaN\n", 805 | "neutral 0.006436 0.175766 -0.004829 -0.009502 0.028979 0.137544\n", 806 | "positive 0.129260 0.406868 0.012089 0.012776 0.035936 0.047345" 807 | ] 808 | }, 809 | "execution_count": 8, 810 | "metadata": {}, 811 | "output_type": "execute_result" 812 | } 813 | ], 814 | "source": [ 815 | "grouped = df.groupby(['Score']).mean()\n", 816 | "grouped" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": {}, 822 | "source": [ 823 | "### Observations\n", 824 | "\n", 825 | "From our initial results - it would appear that there might be some small directional differences in returns between the positive and neutral groups over shorter time frames (twoM and fiveM) after news broke. This is a pretty good basis for further investigation. So where could we go from here?\n", 826 | "\n", 827 | "We have a relatively small *n* here so we might want to increase the size of the study. \n", 828 | "\n", 829 | "We might also want to try to seperate out more positive or negative news - ie change the threshold of the buckets to try to identify more prominent sentiment articles - maybe that could have more of an impact on performance. \n", 830 | "\n", 831 | "In terms of capturing news impact - we have thrown a lot of news articles out as they happened outside of market hours - as it is more complex to ascertain impact - we might try to find a way of including some of this in our analysis - I mentioned looking at overseas listings **GDR/ADRs** or surrogates above. Alternatively, we could using **EXACTLY** the same process looking at all news for an index future - say the **S&P500 emini** - as this trades on Globex pretty much round the clock - so we would be throwing out a lot less of the news articles? Great I hear you cry - but would each news article be able to influence a whole index? Are index futures more sensitive to some types of articles than others? Is there a temporal element to this? These are all excellent questions. Or what about cryptocrurrencies? They trade 24/7? and so on.\n", 832 | "\n", 833 | "We could also investigate what is going on with our sentiment engine. We might be able to generate more meaningful results by tinkering with the underlyng processes and parameters. Using a different, more domain-specific corpora might help us to generate more relevant scores. \n", 834 | "\n", 835 | "You will see there is plenty of scope to get much more involved here. \n", 836 | "\n", 837 | "This article was intended as an introduction to this most interesting of areas. I hope to have de-mystified this area for you somewhat and shown how it is possible to get started with this type of complex analysis using only a few lines of code, a simple easy to use yet powerfull API and some really fantastic packages, to generate some meaningful results." 
838 | ] 839 | } 840 | ], 841 | "metadata": { 842 | "kernelspec": { 843 | "display_name": "Python 2", 844 | "language": "python", 845 | "name": "python2" 846 | }, 847 | "language_info": { 848 | "codemirror_mode": { 849 | "name": "ipython", 850 | "version": 2 851 | }, 852 | "file_extension": ".py", 853 | "mimetype": "text/x-python", 854 | "name": "python", 855 | "nbconvert_exporter": "python", 856 | "pygments_lexer": "ipython2", 857 | "version": "2.7.11" 858 | } 859 | }, 860 | "nbformat": 4, 861 | "nbformat_minor": 0 862 | } 863 | --------------------------------------------------------------------------------